@evo-hq/pi-evo 0.4.5 → 0.5.0-alpha.11
- package/package.json +1 -1
- package/skills/discover/SKILL.md +145 -9
- package/skills/discover/references/inline_instrumentation.js +45 -0
- package/skills/discover/references/inline_instrumentation.py +50 -0
- package/skills/discover/references/sdk_node.js +19 -0
- package/skills/discover/references/sdk_python.py +25 -0
- package/skills/{optimize → discover}/references/sizing-the-round.md +8 -0
- package/skills/infra-setup/SKILL.md +1 -2
- package/skills/optimize/SKILL.md +167 -8
- package/skills/optimize/workflows/evo-optimize.js +674 -0
- package/skills/report/SKILL.md +1 -1
- package/skills/subagent/SKILL.md +89 -7
package/package.json
CHANGED
package/skills/discover/SKILL.md
CHANGED
|
@@ -2,13 +2,89 @@
|
|
|
2
2
|
name: discover
|
|
3
3
|
description: Initialize evo for the current repository by exploring the codebase, proposing unexplored optimization dimensions, constructing the benchmark inside a baseline worktree, and running the first experiment. Use when the user invokes /evo:discover, mentions setting up evo, wants to instrument a codebase for autonomous optimization, or asks to start a new evo run on a project.
|
|
4
4
|
argument-hint: <optional context about what to optimize>
|
|
5
|
-
evo_version: 0.
|
|
5
|
+
evo_version: 0.5.0-alpha.11
|
|
6
6
|
---
|
|
7
7
|
|
|
8
8
|
# Discover
|
|
9
9
|
|
|
10
10
|
Internal procedure for `evo:discover`. The user only sees the user-facing prompts, the dashboard URL, and the baseline score -- everything else is the agent's choreography.
|
|
11
11
|
|
|
12
|
+
## Evo surface
|
|
13
|
+
|
|
14
|
+
General guidance on the skills and tools available in evo. Each line is a triggering condition: if you're about to do X, pull/dispatch/read this. Don't preload -- act when the trigger fires.
|
|
15
|
+
|
|
16
|
+
**Always have a sense of the skill before jumping into its references.** A skill body carries the decision-making; references are concrete contracts that assume a decision has been made.
|
|
17
|
+
|
|
18
|
+
```
|
|
19
|
+
evo plugin
|
|
20
|
+
│
|
|
21
|
+
├── Main thread (the orchestrator -- you, inside /evo:discover or /evo:optimize)
|
|
22
|
+
│ │
|
|
23
|
+
│ ├── Skills (Skill tool)
|
|
24
|
+
│ │ ├── evo:discover starting a new evo workspace / instrumenting a project
|
|
25
|
+
│ │ ├── evo:optimize after discover commits the baseline -- drives the loop.
|
|
26
|
+
│ │ │ Args: subagents=N (read sizing-the-round FIRST),
|
|
27
|
+
│ │ │ autonomous, subagents-only, budget=N, stall=N
|
|
28
|
+
│ │ ├── evo:finetuning task is finetuning / post-training / training a model
|
|
29
|
+
│ │ └── evo:infra-setup need a remote backend, pooled workspaces, lease/slot
|
|
30
|
+
│ │ management, or specific provider auth/setup
|
|
31
|
+
│ │
|
|
32
|
+
│ └── Subagents to dispatch (Task tool, subagent_type=...)
|
|
33
|
+
│ ├── evo:benchmark-reviewer before the baseline run, or whenever the
|
|
34
|
+
│ │ benchmark command / harness changes
|
|
35
|
+
│ └── evo:ideator stalled, or every ~5 committed experiments.
|
|
36
|
+
│ One subagent per brief:
|
|
37
|
+
│ failure_analysis, literature, frontier_extrapolation
|
|
38
|
+
│
|
|
39
|
+
├── Subagent thread (each subagent spawned by /optimize step 5)
|
|
40
|
+
│ │
|
|
41
|
+
│ ├── Skills (the subagent loads this on first turn -- the brief's first
|
|
42
|
+
│ │ sentence mandates it; not auto-loaded by the host)
|
|
43
|
+
│ │ └── evo:subagent load FIRST -- defines the iteration protocol
|
|
44
|
+
│ │ + brief field shape the subagent operates under
|
|
45
|
+
│ │
|
|
46
|
+
│ └── Subagents to dispatch (Task tool, subagent_type=...)
|
|
47
|
+
│ └── evo:verifier ALWAYS dispatch pre AND post every evo run.
|
|
48
|
+
│ Pre: ~30s static analysis before the experiment runs.
|
|
49
|
+
│ Post: result-validity audit after it commits.
|
|
50
|
+
│ Not optional. Not ad-hoc.
|
|
51
|
+
│
|
|
52
|
+
└── Key references (Read tool, on demand)
|
|
53
|
+
├── discover/references/
|
|
54
|
+
│ ├── constructing-benchmark.md designing + assembling a benchmark from scratch
|
|
55
|
+
│ ├── sdk_python.py / sdk_node.js wiring per-task instrumentation -- preferred path
|
|
56
|
+
│ ├── inline_instrumentation.py inline fallback when SDK can't be used.
|
|
57
|
+
│ │ Copy as-is; do not reimplement (file header
|
|
58
|
+
│ │ explains why)
|
|
59
|
+
│ ├── sizing-the-round.md BEFORE invoking /evo:optimize with any
|
|
60
|
+
│ │ specific subagents=N. Single-GPU /
|
|
61
|
+
│ │ single-exclusive-resource -> subagents=1
|
|
62
|
+
│ ├── proposing-dimensions.md choosing what to optimize when not obvious
|
|
63
|
+
│ └── instrumentation-contract.md the format evo reads (result + traces shapes)
|
|
64
|
+
│
|
|
65
|
+
├── finetuning/references/
|
|
66
|
+
│ ├── glue.md writing train.py -- I/O contract evo expects
|
|
67
|
+
│ ├── diagnostics.md per-failure-mode diagnostics
|
|
68
|
+
│ ├── false-progress.md what doesn't count as improvement
|
|
69
|
+
│ ├── trace-schema.md per-task trace JSON schema for training runs
|
|
70
|
+
│ ├── rl/ RL framework references
|
|
71
|
+
│ │ └── art.md ART (Algorithm-Refined Training)
|
|
72
|
+
│ ├── sft/ SFT framework references
|
|
73
|
+
│ │ └── tinker.md Tinker SFT
|
|
74
|
+
│ └── serving/ eval-time inference references
|
|
75
|
+
│ └── vllm.md vLLM serving config + LoRA-multi
|
|
76
|
+
│
|
|
77
|
+
├── infra-setup/references/
|
|
78
|
+
│ └── provider-matrix.md provider/backend summary (auth, setup, costs)
|
|
79
|
+
│
|
|
80
|
+
└── references/ (shared across skills)
|
|
81
|
+
├── evo-wait.md any time you need to wait without burning
|
|
82
|
+
│ context (subagent completion, training,
|
|
83
|
+
│ ideators, GPU activity, any long-running)
|
|
84
|
+
├── agent-sdk-reference.md SDK API surface
|
|
85
|
+
└── cli-quick-reference.md CLI subcommand cheat sheet
|
|
86
|
+
```
|
|
87
|
+
|
|
12
88
|
## Host conventions
|
|
13
89
|
|
|
14
90
|
This skill runs on any host that implements the Agent Skills spec. When the body uses generic phrases, apply the host's best-fit equivalent:
|
|
@@ -40,20 +116,20 @@ evo --version
|
|
|
40
116
|
The output must be exactly:
|
|
41
117
|
|
|
42
118
|
```
|
|
43
|
-
evo-hq-cli 0.
|
|
119
|
+
evo-hq-cli 0.5.0-alpha.11
|
|
44
120
|
```
|
|
45
121
|
|
|
46
122
|
Three outcomes:
|
|
47
123
|
|
|
48
124
|
1. **Matches exactly** — continue to step 1.
|
|
49
125
|
2. **Reports a different version** (`evo-hq-cli 0.4.2`, etc.) — the host refetched a newer/older skill bundle than the CLI on PATH. Drift breaks skills silently. Stop and tell the user:
|
|
50
|
-
> Your installed evo CLI is on a different version than this skill (`0.
|
|
126
|
+
> Your installed evo CLI is on a different version than this skill (`0.5.0-alpha.11`). Run:
|
|
51
127
|
> ```
|
|
52
|
-
> uv tool install --force evo-hq-cli==0.
|
|
128
|
+
> uv tool install --force evo-hq-cli==0.5.0-alpha.11
|
|
53
129
|
> ```
|
|
54
130
|
> Then re-invoke this skill.
|
|
55
131
|
3. **`command not found`, or reports a different package** (commonly `evo 1.x` — the unrelated SLAM tool) — the CLI isn't installed. Tell the user:
|
|
56
|
-
> `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.
|
|
132
|
+
> `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.5.0-alpha.11` (or `pipx install evo-hq-cli==0.5.0-alpha.11`). Then re-invoke this skill.
|
|
57
133
|
|
|
58
134
|
Do not try to auto-install. Host sandbox + network policy may block it; leaving the install as a user action keeps failure modes clear.
|
|
59
135
|
|
|
@@ -180,7 +256,8 @@ build/
|
|
|
180
256
|
evo init --name "<short project name>" \
|
|
181
257
|
--target <file> --benchmark "<command using {worktree} and {target}>" --metric <max|min> \
|
|
182
258
|
--host <claude-code|codex|opencode|openclaw|hermes|pi|generic> \
|
|
183
|
-
--instrumentation-mode <sdk|inline>
|
|
259
|
+
--instrumentation-mode <sdk|inline> \
|
|
260
|
+
--per-exp-timeout <seconds> [--gate "<gate command>"] \
|
|
184
261
|
[--commit-strategy <all|tracked-only>]
|
|
185
262
|
```
|
|
186
263
|
|
|
@@ -188,6 +265,8 @@ evo init --name "<short project name>" \
|
|
|
188
265
|
|
|
189
266
|
**`--name` should be a short human-readable project label** for dashboard display, chosen from the repository/product context. Existing workspaces without a name fall back to the repo directory name; do not hand-edit config just to migrate them.
|
|
190
267
|
|
|
268
|
+
**`--per-exp-timeout` is required.** Wall-clock seconds for each `evo run` invocation. Becomes the workspace default; override per-call with `evo run --timeout N`. Pick based on what the benchmark actually costs end-to-end on this hardware -- if you don't know yet, time the benchmark once locally and use ~2x that. Typical ranges: a unit-test-style benchmark is 300-900s; a small-model SFT + eval cycle is 1800-3600s; a large-model train run is several hours. Set conservatively -- a too-tight value kills experiments mid-flight; a too-loose value wastes budget only when something actually hangs. Update later with `evo config set per-exp-timeout <seconds>`.
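
The "time it once and use ~2x" rule above can be sketched as a small helper. This is a minimal sketch, not part of evo: `time_command`, `suggest_timeout`, and the `safety_factor` parameter are hypothetical names, and only the ~2x multiplier comes from the guidance above.

```python
import math
import shlex
import subprocess
import time


def suggest_timeout(elapsed_seconds: float, safety_factor: float = 2.0) -> int:
    """Scale a measured wall-clock duration and round up to whole seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return math.ceil(elapsed_seconds * safety_factor)


def time_command(command: str) -> float:
    """Run the benchmark command once and return its wall-clock duration in seconds."""
    start = time.monotonic()
    # check=True: a failing benchmark should surface here, not yield a bogus timing
    subprocess.run(shlex.split(command), check=True)
    return time.monotonic() - start
```

One possible use: time the real benchmark command once with `time_command(...)`, pass the result to `suggest_timeout(...)`, and hand the integer to `--per-exp-timeout`.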
|
|
269
|
+
|
|
191
270
|
**`--commit-strategy` is optional.** Default is `all`. Override with `--commit-strategy tracked-only` only when you want the stricter shisa-kanko flow where new files must be staged explicitly and acknowledged at `evo run` time.
|
|
192
271
|
|
|
193
272
|
**Placeholder semantics.** Benchmark and gate commands support two placeholders, resolved lazily at run time by `evo run` / gate evaluation:
|
|
@@ -205,7 +284,8 @@ evo init \
|
|
|
205
284
|
--target agent/solve.py \
|
|
206
285
|
--benchmark "python3 {worktree}/benchmark.py --target {target}" \
|
|
207
286
|
--metric max \
|
|
208
|
-
--host claude-code
|
|
287
|
+
--host claude-code \
|
|
288
|
+
--per-exp-timeout 1800
|
|
209
289
|
```
|
|
210
290
|
|
|
211
291
|
Use the same runtime entry point the project already uses, but make sure the command does not assume uncommitted runtime state exists inside the worktree. Worktrees are git checkouts; untracked directories such as local virtualenvs, build caches, and downloaded models are usually not present there. If the benchmark needs setup or a package-manager runner, configure evo's runtime recipe instead of baking local paths into the benchmark command:
|
|
@@ -225,6 +305,14 @@ Dashboard live: http://127.0.0.1:8080 (pid 12345)
|
|
|
225
305
|
|
|
226
306
|
**Relay that line back to the user verbatim.** If port 8080 is busy, evo auto-increments -- show whatever port prints. The URL is how the user watches the run.
|
|
227
307
|
|
|
308
|
+
**Benchmark commands must be eval-only.** Do NOT wrap training and evaluation into a single benchmark command. If your benchmark command runs training before scoring, every gate revalidation and every `evo run --check` retrains from scratch, and the experiment budget burns on duplicated training instead of new experiments. Training is a separate step the agent invokes BEFORE `evo run`:
|
|
309
|
+
|
|
310
|
+
1. The agent makes changes (data curation, hyperparameter selection, technique choice, training code edits).
|
|
311
|
+
2. The agent runs training to produce a checkpoint at the experiment worktree's `final_model/` (or wherever the technique's recipe in `evo:finetuning/references/glue.md` specifies).
|
|
312
|
+
3. THEN `evo run <exp_id>` invokes the registered benchmark command, which loads the produced checkpoint and emits a score.
|
|
313
|
+
|
|
314
|
+
The registered benchmark command should call `evaluate.py`, `run_eval.py`, or equivalent -- NOT `train.py`. If the project's only existing evaluation tool runs train+eval together with no eval-only mode, wrap it: add a `--skip-train` flag, or have the wrapper detect an existing checkpoint at `final_model/` and short-circuit the train step. Without this, evo's gate-recheck and re-score mechanics retrain repeatedly and the budget evaporates.
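
The wrapper described above might look like the following. It is a sketch under assumptions: `run_benchmark`, `train`, and `evaluate` are hypothetical stand-ins for the project's own entry points, and "checkpoint exists" is approximated as "the directory is non-empty". Only the `--skip-train` idea and the `final_model/` location come from the text.

```python
from pathlib import Path
from typing import Callable


def _has_checkpoint(checkpoint_dir: Path) -> bool:
    """Treat any non-empty directory as a produced checkpoint (an assumption)."""
    return checkpoint_dir.is_dir() and any(checkpoint_dir.iterdir())


def run_benchmark(
    checkpoint_dir: Path,
    train: Callable[[Path], None],
    evaluate: Callable[[Path], float],
    skip_train: bool = False,
) -> float:
    """Score a checkpoint, short-circuiting the train step when one already exists."""
    if not skip_train and not _has_checkpoint(checkpoint_dir):
        train(checkpoint_dir)
    if not _has_checkpoint(checkpoint_dir):
        # Crash loudly rather than emit a silent zero score.
        raise FileNotFoundError(f"no checkpoint found at {checkpoint_dir}")
    return evaluate(checkpoint_dir)
```

With `skip_train=True` (what a `--skip-train` flag would set) the wrapper never trains, so repeated gate rechecks only pay for evaluation.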
|
|
315
|
+
|
|
228
316
|
**Runtime environment.** If the benchmark needs keys or other runtime variables, configure them through evo rather than copying `.env` into worktrees or hand-editing `config.json`:
|
|
229
317
|
|
|
230
318
|
```bash
|
|
@@ -318,7 +406,31 @@ Paths below are relative to this `SKILL.md` file (resolve them against the skill
|
|
|
318
406
|
- Benchmark is Python or Node: paste `references/inline_instrumentation.py` (or `.js`) and call `log_task` / `logTask` per task, `write_result` / `writeResult` once at the end.
|
|
319
407
|
- Benchmark is any other language: port the contract from `references/instrumentation-contract.md` directly into that language (~10-15 lines: read the env vars, write each task trace, atomically publish the result). Do not add a Python/JS wrapper around it.
|
|
320
408
|
|
|
321
|
-
|
|
409
|
+
**Per-task emission is load-bearing.** If your benchmark evaluates N independent items (per-question math, per-test-case unit tests, per-document QA, per-sample reasoning trace), emit ONE `log_task` / `report` / trace file PER ITEM -- not one aggregate. Include the item's input, expected output, model output, and any per-item metadata as extras; that detail is what the dashboard's per-task panel + the verifier's reproducibility spot-check + the ideator's failure-clustering all rely on. Wrappers that compute the average score themselves and emit a single aggregate task entry look like they work but lose every diagnostic capability evo provides. The reference files have explicit USAGE EXAMPLES showing the per-item loop AND an ANTI-PATTERN block showing what NOT to do (see `references/inline_instrumentation.{py,js}` and `references/sdk_{python,node}.{py,js}`). Follow them. Single-aggregate emission is only valid when the benchmark really is one indivisible measurement (one e2e workflow, one perf number) -- and even then, attach every observable as extras.
|
|
410
|
+
|
|
411
|
+
**Runner-library wrappers are the common failure mode.** When the selected benchmark wraps a runner library (e.g. `inspect_evals`, `lm-eval-harness`, `evals`), the per-item loop is hidden inside the runner. The wrapper script's natural shape is to call the runner, read its aggregate output, and write a single `{"score": X}` to `result.json`. This is the anti-pattern. **Even when the runner library handles the per-sample loop internally, the wrapper script MUST parse the runner's per-sample output JSON and emit one `log_task(item_id, score=..., extras={...})` per item.** The runner typically already writes per-sample data to disk (most of these libraries do — `inspect_evals` writes per-sample logs, `lm-eval-harness` writes per-doc records); the wrapper just hasn't been forwarding it to evo. Without per-item forwarding, the dashboard's Tasks tab is empty, the verifier can't spot-check, and the agent has no way to diagnose WHICH items the model fails on — which is required for any RL-on-failures or curriculum strategy.
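
Per-item forwarding from a runner's output can be sketched as below. Everything about the input format is an assumption: the runner is imagined to write one JSON object per line with `id`, `score`, `input`, `target`, and `output` keys, which will differ per library. `log_task` is passed in rather than reimplemented, and is called with keyword extras in the style of the Python reference file's usage example.

```python
import json
from pathlib import Path
from typing import Any, Callable


def forward_samples(samples_path: Path, log_task: Callable[..., Any]) -> int:
    """Emit one log_task call per record in a per-sample JSONL file."""
    forwarded = 0
    with samples_path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines between records
            record = json.loads(line)
            log_task(
                str(record["id"]),                     # stable id per item
                score=float(record["score"]),          # per-item score
                input=record.get("input"),             # hypothetical field names
                expected=record.get("target"),
                model_output=str(record.get("output", ""))[:500],
            )
            forwarded += 1
    if forwarded == 0:
        # An empty file means the runner produced nothing to diagnose; fail loudly.
        raise ValueError(f"no per-sample records found in {samples_path}")
    return forwarded
```

After forwarding, the wrapper would finish with the reference file's `write_result()` so the aggregate is derived from the per-task scores rather than computed by hand.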
|
|
412
|
+
|
|
413
|
+
**Write traces for the future reader.** The only consumers of a committed experiment's traces look BACK at them -- a future orchestrator picking the next move, a verifier auditing for false progress, an ideator clustering failure modes. A trace that records `{score: 0}` and nothing else is unrecoverable: the future reader cannot tell whether the model produced wrong reasoning, unparseable format, or no output at all. Two rules of thumb:
|
|
414
|
+
|
|
415
|
+
- **`log_task` extras carry context** -- the item's input, expected output, the model's actual output (or first ~500 chars of it), parse outcome, any error. Cost: a few KB per task. Payoff: diagnosis is possible later without re-running the eval.
|
|
416
|
+
- **`evo annotate <exp_id>` is the human-readable summary** -- one or two factual lines after a run commits: which data sources were used, what the score actually represented, the failure mode if any. Annotations get loaded into future orchestrator decisions and ideator briefs, so write them as facts ("model emitted `\boxed{}`, eval prompt requested `ANSWER:`, format mismatch suspected"), not as plans or feelings.
|
|
417
|
+
|
|
418
|
+
### 10d. Audit with benchmark-reviewer (mandatory before baseline)
|
|
419
|
+
|
|
420
|
+
Before `evo run` is invoked for the first time, audit the harness via the bundled subagent:
|
|
421
|
+
|
|
422
|
+
```
|
|
423
|
+
Task(subagent_type="evo:benchmark-reviewer",
|
|
424
|
+
prompt="workspace=<absolute path to dir containing .evo/>\nbenchmark_command=<the literal --benchmark string from evo init>\nunit=<one-line description of an item, e.g. 'AIME problem', 'HumanEval task', 'BFCL turn'>")
|
|
425
|
+
```
|
|
426
|
+
|
|
427
|
+
The subagent returns a JSON report with `passed`, `findings[]`. Each finding has `severity: block|warn|note`.
|
|
428
|
+
|
|
429
|
+
**Gate the baseline on this.** If `passed: false`, address every `block` finding before re-invoking the reviewer. Do **not** proceed to `evo run` until the report comes back clean. Typical `block` findings: aggregate-only emission (most common -- see step 10c), training source that overlaps the held-out set, no real gate registered for a constructed benchmark, harness silently writes `{"score": 0.0}` on error instead of crashing.
|
|
430
|
+
|
|
431
|
+
`warn` and `note` findings are informational -- record them in `.evo/project.md` and proceed.
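
The gating rule above can be sketched as a small triage step over the reviewer's JSON. Only `passed`, `findings`, and `severity` are taken from the description above; `triage_report` is a hypothetical helper, and any other fields a finding carries are passed through untouched.

```python
import json
from typing import Any, Dict, List, Tuple

Finding = Dict[str, Any]


def triage_report(report_json: str) -> Tuple[bool, List[Finding], List[Finding]]:
    """Split a reviewer report into (may_proceed, blocking, informational)."""
    report = json.loads(report_json)
    findings = report.get("findings", [])
    blocking = [f for f in findings if f.get("severity") == "block"]
    informational = [f for f in findings if f.get("severity") in ("warn", "note")]
    # Proceed only when the report passed AND nothing is marked block.
    may_proceed = bool(report.get("passed")) and not blocking
    return may_proceed, blocking, informational
```

When `may_proceed` is false, fix every entry in `blocking` and re-invoke the reviewer; the `informational` list is what gets recorded in `.evo/project.md`.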
|
|
432
|
+
|
|
433
|
+
### 10e. Cheap validation run
|
|
322
434
|
|
|
323
435
|
Before the full baseline, validate the toolchain with the cheapest possible end-to-end run (single task, smallest split, dry-run flag -- whatever is fastest). Run the check from the main repo root:
|
|
324
436
|
|
|
@@ -339,7 +451,7 @@ This is the authoritative wiring check, and it is language-agnostic -- it runs t
|
|
|
339
451
|
|
|
340
452
|
Fix any issues and re-validate before proceeding.
|
|
341
453
|
|
|
342
|
-
###
|
|
454
|
+
### 10f. Commit inside the worktree
|
|
343
455
|
|
|
344
456
|
Logical commits are ideal but not required. Minimal acceptable:
|
|
345
457
|
|
|
@@ -405,9 +517,33 @@ End the skill by reporting in chat:
|
|
|
405
517
|
- The baseline experiment ID and score
|
|
406
518
|
- The chosen optimization dimension and why
|
|
407
519
|
- A one-liner on next steps: "Run `/evo:optimize` to start the optimization loop."
|
|
520
|
+
|
|
521
|
+
**Do not run experiments outside `/evo:optimize`.** Even if the workspace's resource profile forces serial execution (e.g. exclusive-GPU, width 1), you still go through `/evo:optimize` with `subagents=1`. The optimize loop's value isn't just parallelism -- it's the structured loop around every experiment (scan-subagent cross-cutting analysis, brief writing, verifier pre/post hooks, ideator spawning on stall, frontier reconciliation). Bypassing optimize to "drive experiments directly" loses all of that and reverts to ad-hoc iteration. If you are tempted to skip optimize because the workload is serial, read `plugins/evo/skills/optimize/SKILL.md` for how to configure it for serial work -- the answer is `subagents=1`, NOT "don't run optimize."
|
|
522
|
+
|
|
408
523
|
- **Resume after crash:** if the host, the shell, or the machine restarts mid-flow, re-invoke `evo:optimize`. Evo reads `.evo/` and resumes from the last committed experiment -- no special restore procedure.
|
|
409
524
|
- **State is local to this machine:** experiment commits on branches like `evo/run_0000/exp_*` survive `git push --all`, but orchestration state (graph, annotations, project notes) lives only in `.evo/`. If that history matters to you, back up `.evo/` separately (e.g., `tar -czf evo-state-$(date +%F).tar.gz .evo/`).
|
|
410
525
|
|
|
526
|
+
## Polling discipline
|
|
527
|
+
|
|
528
|
+
When waiting on a long-running background process (training, evaluation, batch generation, any externally-spawned long task), do NOT use `while true; do sleep N; tail file; done`. That loop never exits when the underlying process crashes -- the tail keeps reading the same dead file, the agent interprets "no growth" as "still working," and the agent blocks indefinitely.
|
|
529
|
+
|
|
530
|
+
Use `evo wait`. The CLI is the bounded, structured replacement:
|
|
531
|
+
|
|
532
|
+
```bash
|
|
533
|
+
# wait for a training subprocess to exit, OR its log to stall, OR the GPU to go idle,
|
|
534
|
+
# whichever first; 60-minute ceiling; structured JSON on stdout
|
|
535
|
+
evo wait --for process=$TRAIN_PID \
|
|
536
|
+
--for log-growth=$TRAIN_LOG \
|
|
537
|
+
--for gpu-idle \
|
|
538
|
+
--timeout 60m --stall-threshold 5m --json
|
|
539
|
+
```
|
|
540
|
+
|
|
541
|
+
Multiple `--for` flags combine; the wait returns on the first matching condition. The JSON output's `exit_reason` and `triggered_by` identify which condition fired (process-exited / log-stalled / gpu-idle / timed-out). Process / log-growth / gpu-* watches do not require an evo workspace; the workspace-anchored watches (`--for experiments`, `--for ideators`) still work for ideator-proposal and commit waits.
|
|
542
|
+
|
|
543
|
+
Full surface, exit codes, JSON shape, and examples in `references/evo-wait.md`.
|
|
544
|
+
|
|
545
|
+
If `evo wait` is not available for some reason (older CLI on PATH), fall back to a bounded poll loop that checks all three signals -- process liveness via `kill -0 $PID`, log growth via `wc -c` delta, GPU via `nvidia-smi --query-gpu=utilization.gpu` -- and exits as soon as any one of them shows the work has stopped (process gone, log no longer growing, GPU idle) or the time ceiling is reached. NEVER unbounded `while true`.
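
A bounded fallback of that kind might be sketched as follows. This is not evo's implementation: `bounded_wait` and its parameters are hypothetical, it is POSIX-only (`os.kill(pid, 0)` mirrors `kill -0`), and the returned labels merely echo the condition names listed earlier. As a simplification, GPU idleness is reported on the first zero-utilization reading.

```python
import os
import subprocess
import time
from pathlib import Path
from typing import Callable, Optional


def process_alive(pid: int) -> bool:
    """Like `kill -0 $PID`: signal 0 tests existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def gpu_utilization() -> Optional[int]:
    """Highest utilization across GPUs, or None when nvidia-smi can't be read."""
    cmd = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=10)
        return max(int(value) for value in out.stdout.split())
    except (OSError, subprocess.SubprocessError, ValueError):
        return None


def bounded_wait(
    pid: int,
    log_path: Path,
    timeout_s: float,
    stall_s: float,
    poll_s: float = 5.0,
    gpu_probe: Callable[[], Optional[int]] = gpu_utilization,
) -> str:
    """Return the reason the wait ended; never runs past timeout_s."""
    deadline = time.monotonic() + timeout_s
    last_size = -1
    last_growth = time.monotonic()
    while time.monotonic() < deadline:
        if not process_alive(pid):
            return "process-exited"
        size = log_path.stat().st_size if log_path.exists() else 0
        if size != last_size:
            last_size, last_growth = size, time.monotonic()
        elif time.monotonic() - last_growth >= stall_s:
            return "log-stalled"
        if gpu_probe() == 0:
            return "gpu-idle"
        time.sleep(poll_s)
    return "timed-out"
```

The key property is the deadline: every path out of the loop is either a detected stop condition or the ceiling. Note that a child that has exited but not yet been reaped by its parent still counts as alive to signal 0.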
|
|
546
|
+
|
|
411
547
|
## Inspection commands (for debugging, reference only)
|
|
412
548
|
|
|
413
549
|
```bash
|
package/skills/discover/references/inline_instrumentation.js
CHANGED
|
@@ -6,6 +6,13 @@
|
|
|
6
6
|
* - Reads EVO_TRACES_DIR, EVO_EXPERIMENT_ID, EVO_RESULT_PATH from process.env.
|
|
7
7
|
* - Writes traces/task_<id>.json per task.
|
|
8
8
|
* - Writes the final result JSON to EVO_RESULT_PATH, or stdout if unset.
|
|
9
|
+
*
|
|
10
|
+
* **Per-task emission is the load-bearing discipline.** If your benchmark
|
|
11
|
+
* evaluates N independent items (per-test-case, per-document, per-sample),
|
|
12
|
+
* call logTask ONCE PER ITEM with as much detail as you have (input,
|
|
13
|
+
* expected, model output, timings). Do NOT roll up to one aggregate call --
|
|
14
|
+
* the dashboard's per-task panel and the verifier's reproducibility check
|
|
15
|
+
* both rely on per-item traces. See the USAGE EXAMPLE at the bottom.
|
|
9
16
|
*/
|
|
10
17
|
|
|
11
18
|
import {
|
|
@@ -95,3 +102,41 @@ export function writeResult(score) {
|
|
|
95
102
|
}
|
|
96
103
|
return score;
|
|
97
104
|
}
|
|
105
|
+
|
|
106
|
+
|
|
107
|
+
// === USAGE EXAMPLE (copy + adapt) ===
|
|
108
|
+
//
|
|
109
|
+
// For a benchmark scoring N independent items, emit one logTask per item.
|
|
110
|
+
// writeResult() with no arg aggregates from the per-task scores.
|
|
111
|
+
//
|
|
112
|
+
// async function main() {
|
|
113
|
+
// const cases = await loadTestCases(); // N items
|
|
114
|
+
// for (const [i, c] of cases.entries()) {
|
|
115
|
+
// const output = await runUnderTest(c.input);
|
|
116
|
+
// const correct = output === c.expected;
|
|
117
|
+
// logTask(
|
|
118
|
+
// `case_${String(i).padStart(2, "0")}`, // stable id per item
|
|
119
|
+
// correct ? 1.0 : 0.0, // per-item score
|
|
120
|
+
// {
|
|
121
|
+
// input: c.input, // extras land in the trace JSON
|
|
122
|
+
// expected: c.expected, // for diagnosis later
|
|
123
|
+
// actual: output,
|
|
124
|
+
// elapsed_ms: c.elapsed,
|
|
125
|
+
// }
|
|
126
|
+
// );
|
|
127
|
+
// }
|
|
128
|
+
// // No arg -> writeResult averages the per-task scores
|
|
129
|
+
// const final = writeResult();
|
|
130
|
+
// console.log(`final: ${final.toFixed(4)}`);
|
|
131
|
+
// }
|
|
132
|
+
//
|
|
133
|
+
// Anti-pattern: one logTask or writeResult with the aggregate score. Loses
|
|
134
|
+
// the per-item detail the dashboard and verifier rely on. Don't do this:
|
|
135
|
+
//
|
|
136
|
+
// const score = passed / total;
|
|
137
|
+
// logTask("eval_total", score); // <-- aggregate; useless
|
|
138
|
+
// writeResult(score);
|
|
139
|
+
//
|
|
140
|
+
// Exception: if the benchmark really is a single indivisible measurement
|
|
141
|
+
// (one e2e workflow, one perf number), emit one task with that score AND
|
|
142
|
+
// pass every observable as extras (timings, allocations, error log).
|
package/skills/discover/references/inline_instrumentation.py
CHANGED
|
@@ -5,6 +5,18 @@ Contract:
|
|
|
5
5
|
- Reads EVO_TRACES_DIR, EVO_EXPERIMENT_ID, EVO_RESULT_PATH from env.
|
|
6
6
|
- Writes traces/task_<id>.json per task.
|
|
7
7
|
- Writes the final result JSON to EVO_RESULT_PATH, or stdout if unset.
|
|
8
|
+
|
|
9
|
+
**Per-task emission is the load-bearing discipline.** If your benchmark
|
|
10
|
+
evaluates N independent items (per-question math, per-test-case unit
|
|
11
|
+
tests, per-document QA, per-sample reasoning trace), call `log_task`
|
|
12
|
+
ONCE PER ITEM with as much detail as you have (problem text, model
|
|
13
|
+
output, expected answer, intermediate reasoning) -- not one aggregate
|
|
14
|
+
call with the rolled-up score. The dashboard's per-task panel and the
|
|
15
|
+
verifier's reproducibility spot-check both rely on per-item traces;
|
|
16
|
+
emitting one aggregate `log_task("eval_total", score)` makes both
|
|
17
|
+
useless and loses the diagnostic value of the run. `write_result()`
|
|
18
|
+
aggregates from the per-task scores automatically -- you don't need to
|
|
19
|
+
roll up yourself. See the USAGE EXAMPLE at the bottom of this file.
|
|
8
20
|
"""
|
|
9
21
|
|
|
10
22
|
from __future__ import annotations
|
|
@@ -107,3 +119,41 @@ def write_result(score: float | None = None) -> float:
|
|
|
107
119
|
else:
|
|
108
120
|
print(payload)
|
|
109
121
|
return score
|
|
122
|
+
|
|
123
|
+
|
|
124
|
+
# === USAGE EXAMPLE (copy + adapt) ===
|
|
125
|
+
#
|
|
126
|
+
# For a benchmark scoring N independent items, emit one log_task per item.
|
|
127
|
+
# write_result() with no arg aggregates _SCORES into the final score.
|
|
128
|
+
#
|
|
129
|
+
# def main():
|
|
130
|
+
# problems = load_aime_problems() # list of N items
|
|
131
|
+
# model = load_model()
|
|
132
|
+
# for i, problem in enumerate(problems):
|
|
133
|
+
# output = model.generate(problem.question)
|
|
134
|
+
# correct = extract_answer(output) == problem.expected
|
|
135
|
+
# log_task(
|
|
136
|
+
# f"aime_q{i:02d}", # stable id per item
|
|
137
|
+
# score=1.0 if correct else 0.0, # per-item score
|
|
138
|
+
# question=problem.question, # everything else goes as **extra
|
|
139
|
+
# expected=problem.expected, # and lands in the trace JSON
|
|
140
|
+
# model_output=output, # for diagnosis later
|
|
141
|
+
# tokens_used=len(model.last_tokens),
|
|
142
|
+
# )
|
|
143
|
+
# # write_result() with no arg returns mean(_SCORES) over all logged tasks --
|
|
144
|
+
# # no need to compute the average yourself
|
|
145
|
+
# final_score = write_result()
|
|
146
|
+
# print(f"final: {final_score:.4f}")
|
|
147
|
+
#
|
|
148
|
+
# Anti-pattern: ONE log_task or write_result call with the aggregate. The
|
|
149
|
+
# dashboard's per-task panel + the verifier's reproducibility spot-check
|
|
150
|
+
# both need per-item traces; aggregate-only emission makes them useless.
|
|
151
|
+
#
|
|
152
|
+
# # DO NOT do this:
|
|
153
|
+
# score = sum(per_problem_correct) / len(per_problem_correct)
|
|
154
|
+
# log_task("eval_total", score) # <-- aggregate; loses detail
|
|
155
|
+
# write_result(score)
|
|
156
|
+
#
|
|
157
|
+
# Exception: if the benchmark really is a single indivisible measurement
|
|
158
|
+
# (one e2e workflow, one perf number), emit one task with that score AND
|
|
159
|
+
# include every observable as **extra (timings, allocations, error log).
|
package/skills/discover/references/sdk_node.js
CHANGED
|
@@ -3,6 +3,12 @@
|
|
|
3
3
|
// The SDK auto-reads $EVO_TRACES_DIR, $EVO_EXPERIMENT_ID, and
|
|
4
4
|
// $EVO_RESULT_PATH. Traces flush on each report() so the dashboard can
|
|
5
5
|
// stream progress live.
|
|
6
|
+
//
|
|
7
|
+
// **Per-task emission is the load-bearing discipline.** Loop over your N
|
|
8
|
+
// independent items and call run.report(id, {score, ...}) ONCE PER ITEM.
|
|
9
|
+
// Do NOT roll up to one aggregate `run.report("eval_total", ...)` call --
|
|
10
|
+
// dashboard panel + verifier reproducibility check both rely on per-item
|
|
11
|
+
// traces. The Anti-pattern at the bottom is what to avoid.
|
|
6
12
|
|
|
7
13
|
import { Run, Gate } from '@evo-hq/evo-agent';
|
|
8
14
|
|
|
@@ -26,3 +32,16 @@ for (const task of criticalTasks) {
|
|
|
26
32
|
gate.check(task.id, { score: result.score });
|
|
27
33
|
}
|
|
28
34
|
await gate.finish();
|
|
35
|
+
|
|
36
|
+
// ---- ANTI-PATTERN (do NOT do this) ----
|
|
37
|
+
//
|
|
38
|
+
// Reporting one aggregate task entry loses every diagnostic value of
|
|
39
|
+
// per-item traces. SDK aggregates from per-task reports automatically.
|
|
40
|
+
//
|
|
41
|
+
// // WRONG:
|
|
42
|
+
// const scores = await Promise.all(tasks.map(t => evaluate(t).then(r => r.score)));
|
|
43
|
+
// run.report("eval_total", { score: scores.reduce((a, b) => a + b) / scores.length });
|
|
44
|
+
// await run.finish();
|
|
45
|
+
//
|
|
46
|
+
// Exception: if your benchmark really is a single indivisible measurement,
|
|
47
|
+
// report one task AND attach every observable as extras.
|
package/skills/discover/references/sdk_python.py
CHANGED
|
@@ -5,6 +5,13 @@ Install `evo-hq-agent` with this project's package manager/runtime, for example
|
|
|
5
5
|
|
|
6
6
|
The SDK auto-reads $EVO_TRACES_DIR, $EVO_EXPERIMENT_ID, and $EVO_RESULT_PATH.
|
|
7
7
|
Traces flush on each report() so the dashboard can stream progress live.
|
|
8
|
+
|
|
9
|
+
**Per-task emission is the load-bearing discipline.** Loop over your N
|
|
10
|
+
independent items and call `run.report(task_id, score=..., ...)` ONCE PER
|
|
11
|
+
ITEM. Do NOT roll up to one aggregate `run.report("eval_total", score=avg)`
|
|
12
|
+
call -- the dashboard's per-task panel and the verifier's reproducibility
|
|
13
|
+
spot-check both rely on per-item traces. The example below shows the right
|
|
14
|
+
pattern. The Anti-pattern at the bottom is what to avoid.
|
|
8
15
|
"""
|
|
9
16
|
|
|
10
17
|
from evo_agent import Run, Gate
|
|
@@ -41,3 +48,21 @@ with Gate() as gate:
|
|
|
41
48
|
for task in critical_tasks:
|
|
42
49
|
result = evaluate(task, agent)
|
|
43
50
|
gate.check(task["id"], score=result.score)
|
|
51
|
+
|
|
52
|
+
|
|
53
|
+
# ---- ANTI-PATTERN (do NOT do this) ----
|
|
54
|
+
#
|
|
55
|
+
# Computing the aggregate yourself and reporting it as a single task
|
|
56
|
+
# entry loses every diagnostic value of the per-item traces -- dashboard
|
|
57
|
+
# can't show which problems failed, verifier can't spot-check, ideator
|
|
58
|
+
# can't cluster failure modes. The SDK aggregates from per-task reports
|
|
59
|
+
# automatically.
|
|
60
|
+
#
|
|
61
|
+
# # WRONG:
|
|
62
|
+
# scores = [evaluate(t, agent).score for t in tasks]
|
|
63
|
+
# run.report("eval_total", score=sum(scores) / len(scores))
|
|
64
|
+
# run.finish()
|
|
65
|
+
#
|
|
66
|
+
# Exception: if your benchmark really is a single indivisible measurement
|
|
67
|
+
# (one e2e workflow with one score number), report one task AND attach
|
|
68
|
+
# every observable as extras.
|
package/skills/{optimize → discover}/references/sizing-the-round.md
CHANGED
|
@@ -30,6 +30,14 @@ Read `.evo/project.md`'s resource-profile line first (discover records it). If i
|
|
|
30
30
|
|
|
31
31
|
When a run needs an exclusive resource, serializing benchmark *execution* (width 1) is correct even though the *edits* are independent — on the worktree backend `evo run` executes the benchmark in-place, so concurrent `evo run` means concurrent benchmark processes on that one resource.
|
|
32
32
|
|
|
33
|
+
**Latency / timing / throughput benchmarks deserve a per-workspace judgment call, not a fixed answer.** When the metric IS time, jitter, or rate, sibling-process CPU/cache/memory-bandwidth pressure can BIAS the measurement (not just add noise) — and the orchestrator may then promote a "winner" that's just a contention artifact. But this doesn't always happen, and harness softeners (warmup, min-over-N batches, outlier rejection) reduce the risk. Things to weigh case-by-case before picking width:
|
|
34
|
+
- How big is the optimization's expected effect vs. the variance the harness reports under parallel runs? If the effect is much larger than measurement jitter, modest parallelism is fine.
|
|
35
|
+
- How much of the benchmark's wall-clock is the actual timed section? Long edit/compile phases overlap safely; only the timed section needs isolation.
|
|
36
|
+
- Can a winner be cheaply re-confirmed solo before being promoted? If yes, going wider for exploration with a solo-confirm gate is reasonable.
|
|
37
|
+
- Does the harness already filter contention (e.g., reject batches with outlier jitter)?
|
|
38
|
+
|
|
39
|
+
If unsure, start narrower and widen once you've confirmed measurements are stable. Width 1 is the safe default for *unknown* timing-sensitive benchmarks; don't apply it reflexively when the workspace has data that says otherwise.
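
The first question in the list above (effect size vs. variance under parallel runs) can be turned into a rough check. This is one possible heuristic, not a rule evo applies: `effect_dominates_jitter` and the `margin` of 3 are hypothetical, and the inputs are timings you would collect yourself by running the same benchmark solo and alongside siblings.

```python
import statistics
from typing import Sequence


def effect_dominates_jitter(
    solo: Sequence[float],
    parallel: Sequence[float],
    expected_effect: float,
    margin: float = 3.0,
) -> bool:
    """True when the expected improvement clearly exceeds contention bias and noise."""
    if len(solo) < 2 or len(parallel) < 2:
        raise ValueError("need at least two measurements per condition")
    # Bias: how far contention shifts the mean, not just how much it scatters.
    bias = abs(statistics.mean(parallel) - statistics.mean(solo))
    # Noise: sample standard deviation of the measurements taken under contention.
    noise = statistics.stdev(parallel)
    return abs(expected_effect) >= margin * max(bias, noise)
```

A `False` result suggests staying narrow or adding a solo re-confirmation step before promoting a winner. A `True` result is weak evidence that modest parallelism is tolerable for this workspace.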
|
|
40
|
+
|
|
33
41
|
## 3. Depth — `budget` (iterations per subagent within its branch)
|
|
34
42
|
|
|
35
43
|
Depth trades exploration against spend, and keys off cost per run, not concurrency:
|
package/skills/infra-setup/SKILL.md
CHANGED
|
@@ -1,8 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: infra-setup
|
|
3
3
|
description: Non-user-invocable provider/setup reference for evo backend switching, prerequisite checks, and auth/install guidance.
|
|
4
|
-
|
|
5
|
-
evo_version: 0.4.5
|
|
4
|
+
evo_version: 0.5.0-alpha.11
|
|
6
5
|
---
|
|
7
6
|
|
|
8
7
|
# Infra Setup
|