@evo-hq/pi-evo 0.4.4 → 0.5.0-alpha.10

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@evo-hq/pi-evo",
- "version": "0.4.4",
+ "version": "0.5.0-alpha.10",
  "description": "Evo plugin for pi-coding-agent: optimize/discover/subagent skills + mid-run inject extension.",
  "publishConfig": {
  "access": "public"
@@ -2,13 +2,89 @@
  name: discover
  description: Initialize evo for the current repository by exploring the codebase, proposing unexplored optimization dimensions, constructing the benchmark inside a baseline worktree, and running the first experiment. Use when the user invokes /evo:discover, mentions setting up evo, wants to instrument a codebase for autonomous optimization, or asks to start a new evo run on a project.
  argument-hint: <optional context about what to optimize>
- evo_version: 0.4.4
+ evo_version: 0.5.0-alpha.10
  ---

  # Discover

  Internal procedure for `evo:discover`. The user only sees the user-facing prompts, the dashboard URL, and the baseline score -- everything else is the agent's choreography.

+ ## Evo surface
+
+ General guidance on the skills and tools available in evo. Each line is a triggering condition: if you're about to do X, pull/dispatch/read this. Don't preload -- act when the trigger fires.
+
+ **Always have a sense of the skill before jumping into its references.** A skill body carries the decision-making; references are concrete contracts that assume a decision has been made.
+
+ ```
+ evo plugin
+
+ ├── Main thread (the orchestrator -- you, inside /evo:discover or /evo:optimize)
+ │   │
+ │   ├── Skills (Skill tool)
+ │   │   ├── evo:discover     starting a new evo workspace / instrumenting a project
+ │   │   ├── evo:optimize     after discover commits the baseline -- drives the loop.
+ │   │   │                    Args: subagents=N (read sizing-the-round FIRST),
+ │   │   │                    autonomous, subagents-only, budget=N, stall=N
+ │   │   ├── evo:finetuning   task is finetuning / post-training / training a model
+ │   │   └── evo:infra-setup  need a remote backend, pooled workspaces, lease/slot
+ │   │                        management, or specific provider auth/setup
+ │   │
+ │   └── Subagents to dispatch (Task tool, subagent_type=...)
+ │       ├── evo:benchmark-reviewer  before the baseline run, or whenever the
+ │       │                           benchmark command / harness changes
+ │       └── evo:ideator             stalled, or every ~5 committed experiments.
+ │                                   One subagent per brief:
+ │                                   failure_analysis, literature, frontier_extrapolation
+
+ ├── Subagent thread (each subagent spawned by /optimize step 5)
+ │   │
+ │   ├── Skills (the subagent loads this on first turn -- the brief's first
+ │   │   sentence mandates it; not auto-loaded by the host)
+ │   │   └── evo:subagent     load FIRST -- defines the iteration protocol
+ │   │                        + brief field shape the subagent operates under
+ │   │
+ │   └── Subagents to dispatch (Task tool, subagent_type=...)
+ │       └── evo:verifier     ALWAYS dispatch pre AND post every evo run.
+ │                            Pre: ~30s static analysis before the experiment runs.
+ │                            Post: result-validity audit after it commits.
+ │                            Not optional. Not ad-hoc.
+
+ └── Key references (Read tool, on demand)
+     ├── discover/references/
+     │   ├── constructing-benchmark.md      designing + assembling a benchmark from scratch
+     │   ├── sdk_python.py / sdk_node.js    wiring per-task instrumentation -- preferred path
+     │   ├── inline_instrumentation.py      inline fallback when SDK can't be used.
+     │   │                                  Copy as-is; do not reimplement (file header
+     │   │                                  explains why)
+     │   ├── sizing-the-round.md            BEFORE invoking /evo:optimize with any
+     │   │                                  specific subagents=N. Single-GPU /
+     │   │                                  single-exclusive-resource -> subagents=1
+     │   ├── proposing-dimensions.md        choosing what to optimize when not obvious
+     │   └── instrumentation-contract.md    the format evo reads (result + traces shapes)
+
+     ├── finetuning/references/
+     │   ├── glue.md              writing train.py -- I/O contract evo expects
+     │   ├── diagnostics.md       per-failure-mode diagnostics
+     │   ├── false-progress.md    what doesn't count as improvement
+     │   ├── trace-schema.md      per-task trace JSON schema for training runs
+     │   ├── rl/                  RL framework references
+     │   │   └── art.md           ART (Algorithm-Refined Training)
+     │   ├── sft/                 SFT framework references
+     │   │   └── tinker.md        Tinker SFT
+     │   └── serving/             eval-time inference references
+     │       └── vllm.md          vLLM serving config + LoRA-multi
+
+     ├── infra-setup/references/
+     │   └── provider-matrix.md   provider/backend summary (auth, setup, costs)
+
+     └── references/ (shared across skills)
+         ├── evo-wait.md              any time you need to wait without burning
+         │                            context (subagent completion, training,
+         │                            ideators, GPU activity, any long-running)
+         ├── agent-sdk-reference.md   SDK API surface
+         └── cli-quick-reference.md   CLI subcommand cheat sheet
+ ```
+
  ## Host conventions

  This skill runs on any host that implements the Agent Skills spec. When the body uses generic phrases, apply the host's best-fit equivalent:
@@ -40,20 +116,20 @@ evo --version
  The output must be exactly:

  ```
- evo-hq-cli 0.4.4
+ evo-hq-cli 0.5.0-alpha.10
  ```

  Three outcomes:

  1. **Matches exactly** — continue to step 1.
  2. **Reports a different version** (`evo-hq-cli 0.4.2`, etc.) — the host refetched a newer/older skill bundle than the CLI on PATH. Drift breaks skills silently. Stop and tell the user:
- > Your installed evo CLI is on a different version than this skill (`0.4.4`). Run:
+ > Your installed evo CLI is on a different version than this skill (`0.5.0-alpha.10`). Run:
  > ```
- > uv tool install --force evo-hq-cli==0.4.4
+ > uv tool install --force evo-hq-cli==0.5.0-alpha.10
  > ```
  > Then re-invoke this skill.
  3. **`command not found`, or reports a different package** (commonly `evo 1.x` — the unrelated SLAM tool) — the CLI isn't installed. Tell the user:
- > `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.4.4` (or `pipx install evo-hq-cli==0.4.4`). Then re-invoke this skill.
+ > `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.5.0-alpha.10` (or `pipx install evo-hq-cli==0.5.0-alpha.10`). Then re-invoke this skill.

  Do not try to auto-install. Host sandbox + network policy may block it; leaving the install as a user action keeps failure modes clear.
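
The three outcomes in this hunk amount to a small classifier over the output of `evo --version`. A minimal Python sketch, assuming the caller passes `None` when the command is not found; the function name `classify_version_output` and the labels `match`, `drift`, and `missing` are illustrative and not part of evo:

```python
from __future__ import annotations

# Expected banner for this skill bundle, taken from the diff above.
EXPECTED = "evo-hq-cli 0.5.0-alpha.10"


def classify_version_output(output: str | None, expected: str = EXPECTED) -> str:
    """Map `evo --version` output to 'match', 'drift', or 'missing'.

    `output` is None when the shell reported `command not found`.
    """
    if output is None:
        return "missing"
    line = output.strip()
    if line == expected:
        return "match"
    package = expected.split()[0]  # "evo-hq-cli"
    if line.startswith(package + " "):
        # Right package, wrong version: skill/CLI drift.
        return "drift"
    # A different package entirely (e.g. the unrelated `evo 1.x` tool)
    # is treated the same as "not installed".
    return "missing"
```

Only `match` lets the skill continue; the other two labels correspond to the two user-facing messages quoted in the hunk.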
 
@@ -180,7 +256,8 @@ build/
  evo init --name "<short project name>" \
    --target <file> --benchmark "<command using {worktree} and {target}>" --metric <max|min> \
    --host <claude-code|codex|opencode|openclaw|hermes|pi|generic> \
-   --instrumentation-mode <sdk|inline> [--gate "<gate command>"] \
+   --instrumentation-mode <sdk|inline> \
+   --per-exp-timeout <seconds> [--gate "<gate command>"] \
    [--commit-strategy <all|tracked-only>]
  ```
 
@@ -188,6 +265,8 @@ evo init --name "<short project name>" \

  **`--name` should be a short human-readable project label** for dashboard display, chosen from the repository/product context. Existing workspaces without a name fall back to the repo directory name; do not hand-edit config just to migrate them.

+ **`--per-exp-timeout` is required.** Wall-clock seconds for each `evo run` invocation. Becomes the workspace default; override per-call with `evo run --timeout N`. Pick based on what the benchmark actually costs end-to-end on this hardware -- if you don't know yet, time the benchmark once locally and use ~2x that. Typical ranges: a unit-test-style benchmark is 300-900s; a small-model SFT + eval cycle is 1800-3600s; a large-model train run is several hours. Set conservatively -- a too-tight value kills experiments mid-flight; a too-loose value wastes budget only when something actually hangs. Update later with `evo config set per-exp-timeout <seconds>`.
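
One way to apply the "time the benchmark once and use ~2x" advice from the added paragraph. This is a sketch only: `time_command`, `suggest_timeout`, and the rounding up to whole minutes are assumptions made for illustration, not evo behavior.

```python
from __future__ import annotations

import math
import subprocess
import time


def time_command(argv: list[str]) -> float:
    """Run the benchmark command once and return its wall-clock seconds."""
    start = time.monotonic()
    subprocess.run(argv, check=True)
    return time.monotonic() - start


def suggest_timeout(measured_seconds: float, factor: float = 2.0, round_to: int = 60) -> int:
    """Scale a measured duration by `factor`, rounded up to a multiple of `round_to`."""
    if measured_seconds <= 0:
        raise ValueError("measured_seconds must be positive")
    return int(math.ceil(measured_seconds * factor / round_to) * round_to)
```

For example, a benchmark measured at 450 seconds yields `suggest_timeout(450) == 900`, which would be the value passed to `--per-exp-timeout`.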
+
  **`--commit-strategy` is optional.** Default is `all`. Override with `--commit-strategy tracked-only` only when you want the stricter shisa-kanko flow where new files must be staged explicitly and acknowledged at `evo run` time.

  **Placeholder semantics.** Benchmark and gate commands support two placeholders, resolved lazily at run time by `evo run` / gate evaluation:
@@ -205,7 +284,8 @@ evo init \
    --target agent/solve.py \
    --benchmark "python3 {worktree}/benchmark.py --target {target}" \
    --metric max \
-   --host claude-code
+   --host claude-code \
+   --per-exp-timeout 1800
  ```

  Use the same runtime entry point the project already uses, but make sure the command does not assume uncommitted runtime state exists inside the worktree. Worktrees are git checkouts; untracked directories such as local virtualenvs, build caches, and downloaded models are usually not present there. If the benchmark needs setup or a package-manager runner, configure evo's runtime recipe instead of baking local paths into the benchmark command:
@@ -225,6 +305,14 @@ Dashboard live: http://127.0.0.1:8080 (pid 12345)

  **Relay that line back to the user verbatim.** If port 8080 is busy, evo auto-increments -- show whatever port prints. The URL is how the user watches the run.

+ **Benchmark commands must be eval-only.** Do NOT wrap training and evaluation into a single benchmark command. If your benchmark command runs training before scoring, every gate revalidation and every `evo run --check` retrains from scratch, and the experiment budget burns on duplicated training instead of new experiments. Training is a separate step the agent invokes BEFORE `evo run`:
+
+ 1. The agent makes changes (data curation, hyperparameter selection, technique choice, training code edits).
+ 2. The agent runs training to produce a checkpoint at the experiment worktree's `final_model/` (or wherever the technique's recipe in `evo:finetuning/references/glue.md` specifies).
+ 3. THEN `evo run <exp_id>` invokes the registered benchmark command, which loads the produced checkpoint and emits a score.
+
+ The registered benchmark command should call `evaluate.py`, `run_eval.py`, or equivalent -- NOT `train.py`. If the project's only existing evaluation tool runs train+eval together with no eval-only mode, wrap it: add a `--skip-train` flag, or have the wrapper detect an existing checkpoint at `final_model/` and short-circuit the train step. Without this, evo's gate-recheck and re-score mechanics retrain repeatedly and the budget evaporates.
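
The wrapping advice above (a `--skip-train` flag, or short-circuiting when a checkpoint already exists at `final_model/`) could look roughly like this. It is a sketch under assumptions: `train` and `evaluate` are hypothetical callables standing in for the project's own tooling, and "a non-empty directory" is only one possible definition of "checkpoint exists".

```python
from __future__ import annotations

import argparse
from pathlib import Path
from typing import Callable, Sequence


def needs_training(checkpoint_dir: Path, skip_train: bool) -> bool:
    """Train only when not told to skip and no checkpoint is present yet."""
    if skip_train:
        return False
    # Assumption: any non-empty directory counts as an existing checkpoint.
    return not (checkpoint_dir.is_dir() and any(checkpoint_dir.iterdir()))


def main(
    argv: Sequence[str],
    train: Callable[[Path], None],
    evaluate: Callable[[Path], float],
) -> float:
    parser = argparse.ArgumentParser(description="eval-only wrapper (sketch)")
    parser.add_argument("--skip-train", action="store_true")
    parser.add_argument("--checkpoint-dir", type=Path, default=Path("final_model"))
    args = parser.parse_args(argv)

    if needs_training(args.checkpoint_dir, args.skip_train):
        train(args.checkpoint_dir)  # runs at most once per checkpoint
    return evaluate(args.checkpoint_dir)  # always runs; this is what gets scored
```

With this shape, repeated invocations of the registered benchmark command only re-run the evaluation step once a checkpoint is in place.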
+
  **Runtime environment.** If the benchmark needs keys or other runtime variables, configure them through evo rather than copying `.env` into worktrees or hand-editing `config.json`:

  ```bash
@@ -318,7 +406,31 @@ Paths below are relative to this `SKILL.md` file (resolve them against the skill
  - Benchmark is Python or Node: paste `references/inline_instrumentation.py` (or `.js`) and call `log_task` / `logTask` per task, `write_result` / `writeResult` once at the end.
  - Benchmark is any other language: port the contract from `references/instrumentation-contract.md` directly into that language (~10-15 lines: read the env vars, write each task trace, atomically publish the result). Do not add a Python/JS wrapper around it.
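
For a port to another language, the "atomically publish the result" step is the part most easily gotten wrong. The sketch below shows the usual write-to-temp-then-rename technique in Python purely as an illustration; it is not a replacement for the reference files. The `{"score": ...}` payload shape and the helper name `publish_result` are assumptions; the authoritative format is in `references/instrumentation-contract.md`.

```python
from __future__ import annotations

import json
import os
import tempfile


def publish_result(score: float, result_path: str | None = None) -> str:
    """Write the result JSON atomically to EVO_RESULT_PATH, or print it if unset."""
    payload = json.dumps({"score": score})  # assumed minimal payload shape
    result_path = result_path or os.environ.get("EVO_RESULT_PATH")
    if not result_path:
        print(payload)
        return payload

    directory = os.path.dirname(os.path.abspath(result_path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, result_path)  # atomic rename: readers never see a partial file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return payload
```

A reader polling the result path sees either no file or the complete file, never a half-written one.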

- ### 10d. Cheap validation run
+ **Per-task emission is load-bearing.** If your benchmark evaluates N independent items (per-question math, per-test-case unit tests, per-document QA, per-sample reasoning trace), emit ONE `log_task` / `report` / trace file PER ITEM -- not one aggregate. Include the item's input, expected output, model output, and any per-item metadata as extras; that detail is what the dashboard's per-task panel + the verifier's reproducibility spot-check + the ideator's failure-clustering all rely on. Wrappers that compute the average score themselves and emit a single aggregate task entry look like they work but lose every diagnostic capability evo provides. The reference files have explicit USAGE EXAMPLES showing the per-item loop AND an ANTI-PATTERN block showing what NOT to do (see `references/inline_instrumentation.{py,js}` and `references/sdk_{python,node}.{py,js}`). Follow them. Single-aggregate emission is only valid when the benchmark really is one indivisible measurement (one e2e workflow, one perf number) -- and even then, attach every observable as extras.
+
+ **Runner-library wrappers are the common failure mode.** When the selected benchmark wraps a runner library (e.g. `inspect_evals`, `lm-eval-harness`, `evals`), the per-item loop is hidden inside the runner. The wrapper script's natural shape is to call the runner, read its aggregate output, and write a single `{"score": X}` to `result.json`. This is the anti-pattern. **Even when the runner library handles the per-sample loop internally, the wrapper script MUST parse the runner's per-sample output JSON and emit one `log_task(item_id, score=..., extras={...})` per item.** The runner typically already writes per-sample data to disk (most of these libraries do — `inspect_evals` writes per-sample logs, `lm-eval-harness` writes per-doc records); the wrapper just hasn't been forwarding it to evo. Without per-item forwarding, the dashboard's Tasks tab is empty, the verifier can't spot-check, and the agent has no way to diagnose WHICH items the model fails on — which is required for any RL-on-failures or curriculum strategy.
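
A sketch of the per-item forwarding described above, assuming the runner has already written one JSON object per line to a file. The record keys (`id`, `score`, `input`, `target`, `output`) are hypothetical; each real runner has its own per-sample format and needs its own parser. `log_task` is passed in as a parameter and called in the keyword-extras style used by the usage example in `references/inline_instrumentation.py`.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable, Iterator


def iter_samples(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL per-sample log."""
    with path.open() as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def forward_samples(path: Path, log_task: Callable[..., None]) -> int:
    """Emit one log_task call per runner sample; return how many were forwarded."""
    count = 0
    for rec in iter_samples(path):
        log_task(
            str(rec["id"]),                    # stable id per item
            score=float(rec["score"]),         # per-item score, not the aggregate
            input=rec.get("input"),
            expected=rec.get("target"),
            model_output=rec.get("output"),
        )
        count += 1
    return count
```

The wrapper then finishes by calling `write_result()` with no argument, so the aggregate comes from the forwarded per-item scores.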
+
+ **Write traces for the future reader.** The only consumers of a committed experiment's traces look BACK at them -- a future orchestrator picking the next move, a verifier auditing for false progress, an ideator clustering failure modes. A trace that records `{score: 0}` and nothing else is unrecoverable: the future reader cannot tell whether the model produced wrong reasoning, unparseable format, or no output at all. Two rules of thumb:
+
+ - **`log_task` extras carry context** -- the item's input, expected output, the model's actual output (or first ~500 chars of it), parse outcome, any error. Cost: a few KB per task. Payoff: diagnosis is possible later without re-running the eval.
+ - **`evo annotate <exp_id>` is the human-readable summary** -- one or two factual lines after a run commits: which data sources were used, what the score actually represented, the failure mode if any. Annotations get loaded into future orchestrator decisions and ideator briefs, so write them as facts ("model emitted `\boxed{}`, eval prompt requested `ANSWER:`, format mismatch suspected"), not as plans or feelings.
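
The first rule of thumb can be captured in a small helper that builds the extras for one item. This is a sketch: the key names and the `build_extras` helper are illustrative choices, and the 500-character cap simply mirrors the "first ~500 chars" guidance above.

```python
from __future__ import annotations


def build_extras(
    question: str,
    expected: str,
    model_output: str | None,
    parsed: str | None,
    error: str | None = None,
    max_chars: int = 500,
) -> dict:
    """Collect the per-item context a future reader needs to diagnose a score."""
    raw = model_output or ""
    return {
        "input": question,
        "expected": expected,
        "model_output": raw[:max_chars],          # bounded so traces stay a few KB
        "output_truncated": len(raw) > max_chars,
        "parse_ok": parsed is not None,           # separates "wrong" from "unparseable"
        "parsed": parsed,
        "error": error,
    }
```

The resulting dict would be spread into the per-item call, for example `log_task(item_id, score=score, **build_extras(...))`.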
+
+ ### 10d. Audit with benchmark-reviewer (mandatory before baseline)
+
+ Before `evo run` is invoked for the first time, audit the harness via the bundled subagent:
+
+ ```
+ Task(subagent_type="evo:benchmark-reviewer",
+      prompt="workspace=<absolute path to dir containing .evo/>\nbenchmark_command=<the literal --benchmark string from evo init>\nunit=<one-line description of an item, e.g. 'AIME problem', 'HumanEval task', 'BFCL turn'>")
+ ```
+
+ The subagent returns a JSON report with `passed`, `findings[]`. Each finding has `severity: block|warn|note`.
+
+ **Gate the baseline on this.** If `passed: false`, address every `block` finding before re-invoking the reviewer. Do **not** proceed to `evo run` until the report comes back clean. Typical `block` findings: aggregate-only emission (most common -- see step 10c), training source that overlaps the held-out set, no real gate registered for a constructed benchmark, harness silently writes `{"score": 0.0}` on error instead of crashing.
+
+ `warn` and `note` findings are informational -- record them in `.evo/project.md` and proceed.
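
The gating rule can be expressed as a short triage over the reviewer's report. A minimal sketch, assuming the report is a single JSON object; only `passed`, `findings`, and `severity` come from the text above, and the `message` field used in the usage note is hypothetical.

```python
from __future__ import annotations

import json


def triage_report(report_json: str) -> tuple[bool, list[dict], list[dict]]:
    """Split reviewer findings into blockers and informational items.

    Returns (may_proceed, blockers, informational).
    """
    report = json.loads(report_json)
    findings = report.get("findings", [])
    blockers = [f for f in findings if f.get("severity") == "block"]
    informational = [f for f in findings if f.get("severity") in ("warn", "note")]
    # Proceed only when the reviewer passed AND no block finding remains.
    may_proceed = bool(report.get("passed")) and not blockers
    return may_proceed, blockers, informational
```

Blockers are fixed before the reviewer is invoked again; the informational list is what gets recorded in `.evo/project.md`, for instance one line per finding built from its `message`.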
+
+ ### 10e. Cheap validation run

  Before the full baseline, validate the toolchain with the cheapest possible end-to-end run (single task, smallest split, dry-run flag -- whatever is fastest). Run the check from the main repo root:
 
@@ -339,7 +451,7 @@ This is the authoritative wiring check, and it is language-agnostic -- it runs t

  Fix any issues and re-validate before proceeding.

- ### 10e. Commit inside the worktree
+ ### 10f. Commit inside the worktree

  Logical commits are ideal but not required. Minimal acceptable:
 
@@ -405,9 +517,33 @@ End the skill by reporting in chat:
  - The baseline experiment ID and score
  - The chosen optimization dimension and why
  - A one-liner on next steps: "Run `/evo:optimize` to start the optimization loop."
+
+ **Do not run experiments outside `/evo:optimize`.** Even if the workspace's resource profile forces serial execution (e.g. exclusive-GPU, width 1), you still go through `/evo:optimize` with `subagents=1`. The optimize loop's value isn't just parallelism -- it's the structured loop around every experiment (scan-subagent cross-cutting analysis, brief writing, verifier pre/post hooks, ideator spawning on stall, frontier reconciliation). Bypassing optimize to "drive experiments directly" loses all of that and reverts to ad-hoc iteration. If you are tempted to skip optimize because the workload is serial, read `plugins/evo/skills/optimize/SKILL.md` for how to configure it for serial work -- the answer is `subagents=1`, NOT "don't run optimize."
+
  - **Resume after crash:** if the host, the shell, or the machine restarts mid-flow, re-invoke `evo:optimize`. Evo reads `.evo/` and resumes from the last committed experiment -- no special restore procedure.
  - **State is local to this machine:** experiment commits on branches like `evo/run_0000/exp_*` survive `git push --all`, but orchestration state (graph, annotations, project notes) lives only in `.evo/`. If that history matters to you, back up `.evo/` separately (e.g., `tar -czf evo-state-$(date +%F).tar.gz .evo/`).

+ ## Polling discipline
+
+ When waiting on a long-running background process (training, evaluation, batch generation, any externally-spawned long task), do NOT use `while true; do sleep N; tail file; done`. That loop never exits when the underlying process crashes -- the tail keeps reading the same dead file, the agent interprets "no growth" as "still working," and the agent blocks indefinitely.
+
+ Use `evo wait`. The CLI is the bounded, structured replacement:
+
+ ```bash
+ # wait for a training subprocess to exit, OR its log to stall, OR the GPU to go idle,
+ # whichever first; 60-minute ceiling; structured JSON on stdout
+ evo wait --for process=$TRAIN_PID \
+   --for log-growth=$TRAIN_LOG \
+   --for gpu-idle \
+   --timeout 60m --stall-threshold 5m --json
+ ```
+
+ Multiple `--for` flags combine; the wait returns on the first matching condition. The JSON output's `exit_reason` and `triggered_by` identify which condition fired (process-exited / log-stalled / gpu-idle / timed-out). Process / log-growth / gpu-* watches do not require an evo workspace; the workspace-anchored watches (`--for experiments`, `--for ideators`) still work for ideator-proposal and commit waits.
+
+ Full surface, exit codes, JSON shape, and examples in `references/evo-wait.md`.
+
+ If `evo wait` is not available for some reason (older CLI on PATH), fall back to a bounded poll loop that checks all three signals -- process liveness via `kill -0 $PID`, log growth via `wc -c` delta, GPU via `nvidia-smi --query-gpu=utilization.gpu` -- and exits on any one going negative. NEVER unbounded `while true`.
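
A sketch of the bounded fallback loop described above, written in Python for POSIX systems. It covers process liveness and log growth; the GPU probe is left out because it depends on `nvidia-smi` being present, and would be a third check inside the same loop. The function names are illustrative, and the return labels merely mirror the reason names listed for `evo wait`.

```python
from __future__ import annotations

import os
import time


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe, the equivalent of `kill -0 $PID`."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    # Note: an exited child that has not been reaped still counts as alive here.
    return True


def log_size(path: str) -> int:
    return os.path.getsize(path) if os.path.exists(path) else 0


def bounded_wait(pid: int, log_path: str, timeout_s: float, stall_s: float,
                 poll_s: float = 1.0) -> str:
    """Return 'process-exited', 'log-stalled', or 'timed-out'; never loops forever."""
    deadline = time.monotonic() + timeout_s
    last_size = log_size(log_path)
    last_growth = time.monotonic()
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return "process-exited"
        size, now = log_size(log_path), time.monotonic()
        if size != last_size:
            last_size, last_growth = size, now   # log grew: reset the stall clock
        elif now - last_growth >= stall_s:
            return "log-stalled"
        time.sleep(poll_s)
    return "timed-out"
```

The hard deadline is what distinguishes this from the `while true` loop the section warns against: every exit path is reached in bounded time.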
+
  ## Inspection commands (for debugging, reference only)

  ```bash
@@ -6,6 +6,13 @@
  * - Reads EVO_TRACES_DIR, EVO_EXPERIMENT_ID, EVO_RESULT_PATH from process.env.
  * - Writes traces/task_<id>.json per task.
  * - Writes the final result JSON to EVO_RESULT_PATH, or stdout if unset.
+ *
+ * **Per-task emission is the load-bearing discipline.** If your benchmark
+ * evaluates N independent items (per-test-case, per-document, per-sample),
+ * call logTask ONCE PER ITEM with as much detail as you have (input,
+ * expected, model output, timings). Do NOT roll up to one aggregate call --
+ * the dashboard's per-task panel and the verifier's reproducibility check
+ * both rely on per-item traces. See the USAGE EXAMPLE at the bottom.
  */

  import {
@@ -95,3 +102,41 @@ export function writeResult(score) {
    }
    return score;
  }
+
+
+ // === USAGE EXAMPLE (copy + adapt) ===
+ //
+ // For a benchmark scoring N independent items, emit one logTask per item.
+ // writeResult() with no arg aggregates from the per-task scores.
+ //
+ // async function main() {
+ //   const cases = await loadTestCases();  // N items
+ //   for (const [i, c] of cases.entries()) {
+ //     const output = await runUnderTest(c.input);
+ //     const correct = output === c.expected;
+ //     logTask(
+ //       `case_${String(i).padStart(2, "0")}`,  // stable id per item
+ //       correct ? 1.0 : 0.0,  // per-item score
+ //       {
+ //         input: c.input,  // extras land in the trace JSON
+ //         expected: c.expected,  // for diagnosis later
+ //         actual: output,
+ //         elapsed_ms: c.elapsed,
+ //       }
+ //     );
+ //   }
+ //   // No arg -> writeResult averages the per-task scores
+ //   const final = writeResult();
+ //   console.log(`final: ${final.toFixed(4)}`);
+ // }
+ //
+ // Anti-pattern: one logTask or writeResult with the aggregate score. Loses
+ // the per-item detail the dashboard and verifier rely on. Don't do this:
+ //
+ //   const score = passed / total;
+ //   logTask("eval_total", score);  // <-- aggregate; useless
+ //   writeResult(score);
+ //
+ // Exception: if the benchmark really is a single indivisible measurement
+ // (one e2e workflow, one perf number), emit one task with that score AND
+ // pass every observable as extras (timings, allocations, error log).
@@ -5,6 +5,18 @@ Contract:
  - Reads EVO_TRACES_DIR, EVO_EXPERIMENT_ID, EVO_RESULT_PATH from env.
  - Writes traces/task_<id>.json per task.
  - Writes the final result JSON to EVO_RESULT_PATH, or stdout if unset.
+
+ **Per-task emission is the load-bearing discipline.** If your benchmark
+ evaluates N independent items (per-question math, per-test-case unit
+ tests, per-document QA, per-sample reasoning trace), call `log_task`
+ ONCE PER ITEM with as much detail as you have (problem text, model
+ output, expected answer, intermediate reasoning) -- not one aggregate
+ call with the rolled-up score. The dashboard's per-task panel and the
+ verifier's reproducibility spot-check both rely on per-item traces;
+ emitting one aggregate `log_task("eval_total", score)` makes both
+ useless and loses the diagnostic value of the run. `write_result()`
+ aggregates from the per-task scores automatically -- you don't need to
+ roll up yourself. See the USAGE EXAMPLE at the bottom of this file.
  """

  from __future__ import annotations
@@ -107,3 +119,41 @@ def write_result(score: float | None = None) -> float:
      else:
          print(payload)
      return score
+
+
+ # === USAGE EXAMPLE (copy + adapt) ===
+ #
+ # For a benchmark scoring N independent items, emit one log_task per item.
+ # write_result() with no arg aggregates _SCORES into the final score.
+ #
+ # def main():
+ #     problems = load_aime_problems()  # list of N items
+ #     model = load_model()
+ #     for i, problem in enumerate(problems):
+ #         output = model.generate(problem.question)
+ #         correct = extract_answer(output) == problem.expected
+ #         log_task(
+ #             f"aime_q{i:02d}",  # stable id per item
+ #             score=1.0 if correct else 0.0,  # per-item score
+ #             question=problem.question,  # everything else goes as **extra
+ #             expected=problem.expected,  # and lands in the trace JSON
+ #             model_output=output,  # for diagnosis later
+ #             tokens_used=len(model.last_tokens),
+ #         )
+ #     # write_result() with no arg returns mean(_SCORES) over all logged tasks --
+ #     # no need to compute the average yourself
+ #     final_score = write_result()
+ #     print(f"final: {final_score:.4f}")
+ #
+ # Anti-pattern: ONE log_task or write_result call with the aggregate. The
+ # dashboard's per-task panel + the verifier's reproducibility spot-check
+ # both need per-item traces; aggregate-only emission makes them useless.
+ #
+ #     # DO NOT do this:
+ #     score = sum(per_problem_correct) / len(per_problem_correct)
+ #     log_task("eval_total", score)  # <-- aggregate; loses detail
+ #     write_result(score)
+ #
+ # Exception: if the benchmark really is a single indivisible measurement
+ # (one e2e workflow, one perf number), emit one task with that score AND
+ # include every observable as **extra (timings, allocations, error log).
@@ -3,6 +3,12 @@
  // The SDK auto-reads $EVO_TRACES_DIR, $EVO_EXPERIMENT_ID, and
  // $EVO_RESULT_PATH. Traces flush on each report() so the dashboard can
  // stream progress live.
+ //
+ // **Per-task emission is the load-bearing discipline.** Loop over your N
+ // independent items and call run.report(id, {score, ...}) ONCE PER ITEM.
+ // Do NOT roll up to one aggregate `run.report("eval_total", ...)` call --
+ // dashboard panel + verifier reproducibility check both rely on per-item
+ // traces. The Anti-pattern at the bottom is what to avoid.

  import { Run, Gate } from '@evo-hq/evo-agent';
 
@@ -26,3 +32,16 @@ for (const task of criticalTasks) {
    gate.check(task.id, { score: result.score });
  }
  await gate.finish();
+
+ // ---- ANTI-PATTERN (do NOT do this) ----
+ //
+ // Reporting one aggregate task entry loses every diagnostic value of
+ // per-item traces. SDK aggregates from per-task reports automatically.
+ //
+ //   // WRONG:
+ //   const scores = await Promise.all(tasks.map(t => evaluate(t).then(r => r.score)));
+ //   run.report("eval_total", { score: scores.reduce((a, b) => a + b) / scores.length });
+ //   await run.finish();
+ //
+ // Exception: if your benchmark really is a single indivisible measurement,
+ // report one task AND attach every observable as extras.
@@ -5,6 +5,13 @@ Install `evo-hq-agent` with this project's package manager/runtime, for example

  The SDK auto-reads $EVO_TRACES_DIR, $EVO_EXPERIMENT_ID, and $EVO_RESULT_PATH.
  Traces flush on each report() so the dashboard can stream progress live.
+
+ **Per-task emission is the load-bearing discipline.** Loop over your N
+ independent items and call `run.report(task_id, score=..., ...)` ONCE PER
+ ITEM. Do NOT roll up to one aggregate `run.report("eval_total", score=avg)`
+ call -- the dashboard's per-task panel and the verifier's reproducibility
+ spot-check both rely on per-item traces. The example below shows the right
+ pattern. The Anti-pattern at the bottom is what to avoid.
  """

  from evo_agent import Run, Gate
@@ -41,3 +48,21 @@ with Gate() as gate:
      for task in critical_tasks:
          result = evaluate(task, agent)
          gate.check(task["id"], score=result.score)
+
+
+ # ---- ANTI-PATTERN (do NOT do this) ----
+ #
+ # Computing the aggregate yourself and reporting it as a single task
+ # entry loses every diagnostic value of the per-item traces -- dashboard
+ # can't show which problems failed, verifier can't spot-check, ideator
+ # can't cluster failure modes. The SDK aggregates from per-task reports
+ # automatically.
+ #
+ #     # WRONG:
+ #     scores = [evaluate(t, agent).score for t in tasks]
+ #     run.report("eval_total", score=sum(scores) / len(scores))
+ #     run.finish()
+ #
+ # Exception: if your benchmark really is a single indivisible measurement
+ # (one e2e workflow with one score number), report one task AND attach
+ # every observable as extras.
@@ -30,6 +30,14 @@ Read `.evo/project.md`'s resource-profile line first (discover records it). If i
 
  When a run needs an exclusive resource, serializing benchmark *execution* (width 1) is correct even though the *edits* are independent — on the worktree backend `evo run` executes the benchmark in-place, so concurrent `evo run` means concurrent benchmark processes on that one resource.
 
+ **Latency / timing / throughput benchmarks deserve a per-workspace judgment call, not a fixed answer.** When the metric IS time, jitter, or rate, sibling-process CPU/cache/memory-bandwidth pressure can BIAS the measurement (not just add noise) — and the orchestrator may then promote a "winner" that's just a contention artifact. But this doesn't always happen, and harness softeners (warmup, min-over-N batches, outlier rejection) reduce the risk. Things to weigh case-by-case before picking width:
+ - How big is the optimization's expected effect vs. the variance the harness reports under parallel runs? If the effect is much larger than measurement jitter, modest parallelism is fine.
+ - How much of the benchmark's wall-clock is the actual timed section? Long edit/compile phases overlap safely; only the timed section needs isolation.
+ - Can a winner be cheaply re-confirmed solo before being promoted? If yes, going wider for exploration with a solo-confirm gate is reasonable.
+ - Does the harness already filter contention (e.g., reject batches with outlier jitter)?
+
+ If unsure, start narrower and widen once you've confirmed measurements are stable. Width 1 is the safe default for *unknown* timing-sensitive benchmarks; don't apply it reflexively when the workspace has data that says otherwise.
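
The first question in the list (expected effect versus variance under parallel runs) can be checked with a few repeated timings of the same unchanged build, taken once solo and once with siblings running. A rough sketch: the helper name `parallel_width_ok` and the 3x margin are assumptions chosen for illustration, not a threshold evo defines.

```python
from __future__ import annotations

from statistics import mean, stdev
from typing import Sequence


def parallel_width_ok(
    solo_s: Sequence[float],
    parallel_s: Sequence[float],
    expected_effect_s: float,
    margin: float = 3.0,
) -> bool:
    """True when the expected improvement clearly exceeds contention bias and jitter.

    `solo_s` and `parallel_s` are repeated timings (seconds) of the SAME build,
    each with at least two samples.
    """
    bias = abs(mean(parallel_s) - mean(solo_s))   # systematic shift from contention
    jitter = stdev(parallel_s)                    # run-to-run noise under contention
    return expected_effect_s >= margin * max(bias, jitter)
```

If the check fails, the text's own advice applies: start narrower, or explore wide and re-confirm any candidate winner solo before promoting it.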
+
  ## 3. Depth — `budget` (iterations per subagent within its branch)
 
  Depth trades exploration against spend, and keys off cost per run, not concurrency:
@@ -1,8 +1,7 @@
  ---
  name: infra-setup
  description: Non-user-invocable provider/setup reference for evo backend switching, prerequisite checks, and auth/install guidance.
- disable-model-invocation: true
- evo_version: 0.4.4
+ evo_version: 0.5.0-alpha.10
  ---

  # Infra Setup