@evo-hq/pi-evo 0.4.4-alpha.5 → 0.4.4-alpha.6
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +1 -1
- package/skills/discover/SKILL.md +20 -40
- package/skills/discover/references/constructing-benchmark.md +1 -1
- package/skills/discover/references/instrumentation-contract.md +89 -0
- package/skills/infra-setup/SKILL.md +1 -1
- package/skills/optimize/SKILL.md +16 -14
- package/skills/optimize/references/sizing-the-round.md +47 -0
- package/skills/report/SKILL.md +1 -1
- package/skills/subagent/SKILL.md +1 -1
package/package.json
CHANGED
package/skills/discover/SKILL.md
CHANGED
|
@@ -2,7 +2,7 @@
|
|
|
2
2
|
name: discover
|
|
3
3
|
description: Initialize evo for the current repository by exploring the codebase, proposing unexplored optimization dimensions, constructing the benchmark inside a baseline worktree, and running the first experiment. Use when the user invokes /evo:discover, mentions setting up evo, wants to instrument a codebase for autonomous optimization, or asks to start a new evo run on a project.
|
|
4
4
|
argument-hint: <optional context about what to optimize>
|
|
5
|
-
evo_version: 0.4.4-alpha.
|
|
5
|
+
evo_version: 0.4.4-alpha.6
|
|
6
6
|
---
|
|
7
7
|
|
|
8
8
|
# Discover
|
|
@@ -40,20 +40,20 @@ evo --version
|
|
|
40
40
|
The output must be exactly:
|
|
41
41
|
|
|
42
42
|
```
|
|
43
|
-
evo-hq-cli 0.4.4-alpha.
|
|
43
|
+
evo-hq-cli 0.4.4-alpha.6
|
|
44
44
|
```
|
|
45
45
|
|
|
46
46
|
Three outcomes:
|
|
47
47
|
|
|
48
48
|
1. **Matches exactly** — continue to step 1.
|
|
49
49
|
2. **Reports a different version** (`evo-hq-cli 0.4.2`, etc.) — the host refetched a newer/older skill bundle than the CLI on PATH. Drift breaks skills silently. Stop and tell the user:
|
|
50
|
-
> Your installed evo CLI is on a different version than this skill (`0.4.4-alpha.
|
|
50
|
+
> Your installed evo CLI is on a different version than this skill (`0.4.4-alpha.6`). Run:
|
|
51
51
|
> ```
|
|
52
|
-
> uv tool install --force evo-hq-cli==0.4.4-alpha.
|
|
52
|
+
> uv tool install --force evo-hq-cli==0.4.4-alpha.6
|
|
53
53
|
> ```
|
|
54
54
|
> Then re-invoke this skill.
|
|
55
55
|
3. **`command not found`, or reports a different package** (commonly `evo 1.x` — the unrelated SLAM tool) — the CLI isn't installed. Tell the user:
|
|
56
|
-
> `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.4.4-alpha.
|
|
56
|
+
> `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.4.4-alpha.6` (or `pipx install evo-hq-cli==0.4.4-alpha.6`). Then re-invoke this skill.
|
|
57
57
|
|
|
58
58
|
Do not try to auto-install. Host sandbox + network policy may block it; leaving the install as a user action keeps failure modes clear.
|
|
59
59
|
|
|
@@ -123,8 +123,14 @@ For cases 2 and 3, ask once:
|
|
|
123
123
|
|
|
124
124
|
> "I can wire up the benchmark in one of two ways:
|
|
125
125
|
>
|
|
126
|
-
> 1. **SDK mode** -- install the evo agent SDK with this project's package manager/runtime (
|
|
127
|
-
> 2. **Inline mode** --
|
|
126
|
+
> 1. **SDK mode** -- install the evo agent SDK with this project's package manager/runtime (`uv add --dev evo-hq-agent`, `python -m pip install evo-hq-agent`, or `npm install @evo-hq/evo-agent`). ~5 lines of user code, with incremental per-task logging handled for you. **Python and Node only** -- the SDK ships for those two runtimes.
|
|
127
|
+
> 2. **Inline mode** -- implement the trace/result contract directly in the benchmark, in the benchmark's own language. Zero new dependencies, same data.
|
|
128
|
+
>
|
|
129
|
+
> Recommended: SDK mode."
|
|
130
|
+
|
|
131
|
+
Inline mode is language-native and lives entirely in the user's setup: whatever the benchmark is written in is what the instrumentation is written in, with no evo package added to the project. Do not introduce a Python (or any other) sidecar script to wrap a benchmark written in another language -- that is the friction this avoids. For a Python or Node benchmark, the ready-made paste-in helper (`references/inline_instrumentation.py` / `.js`) is the inline implementation. For any other language, port the ~10-15 line contract from `references/instrumentation-contract.md` into that language. Either way the mode is `inline`.
|
|
132
|
+
|
|
133
|
+
Order the options SDK first, inline second, and suggest SDK as the recommended default when it's available -- it's the managed path with per-task logging handled for you. Inline stays a first-class choice with the same data contract, though: if the user declines the SDK for any reason -- they don't want a new dependency, can't add evo to their project's tree, internal policy, or plain preference -- honor it without pushback. SDK mode is only *available* when the benchmark runs on Python or Node; when the benchmark's language has no SDK there's nothing to suggest, so go straight to inline.
|
|
128
134
|
|
|
129
135
|
Pass the answer to `evo init` via `--instrumentation-mode <sdk|inline>` in step 7. **Never install packages without this confirmation.** If you skip the question (case 1), still pass the detected mode to `evo init` so optimize/subagent runs see a consistent value.
|
|
130
136
|
|
|
@@ -303,14 +309,14 @@ Patterns to scan for:
|
|
|
303
309
|
|
|
304
310
|
### 10c. Apply instrumentation
|
|
305
311
|
|
|
306
|
-
|
|
312
|
+
Both modes satisfy the same file-and-env contract: per-task `task_<id>.json` written to `$EVO_TRACES_DIR`, a result JSON with a numeric `score` written to `$EVO_RESULT_PATH` (stdout if unset). `evo run` reads those files; it never inspects the language. The full spec is `references/instrumentation-contract.md`.
|
|
307
313
|
|
|
308
314
|
Paths below are relative to this `SKILL.md` file (resolve them against the skill directory).
|
|
309
315
|
|
|
310
|
-
- **SDK mode
|
|
311
|
-
- **Inline mode**:
|
|
312
|
-
|
|
313
|
-
|
|
316
|
+
- **SDK mode** (Python/Node only): add `from evo_agent import Run` (Python) or `import { Run } from '@evo-hq/evo-agent'` (Node) to the benchmark script. Wrap the eval loop per `references/sdk_python.py` or `references/sdk_node.js`.
|
|
317
|
+
- **Inline mode**: implement the contract in the benchmark's own language.
|
|
318
|
+
- Benchmark is Python or Node: paste `references/inline_instrumentation.py` (or `.js`) and call `log_task` / `logTask` per task, `write_result` / `writeResult` once at the end.
|
|
319
|
+
- Benchmark is any other language: port the contract from `references/instrumentation-contract.md` directly into that language (~10-15 lines: read the env vars, write each task trace, atomically publish the result). Do not add a Python/JS wrapper around it.
|
|
314
320
|
|
|
315
321
|
### 10d. Cheap validation run
|
|
316
322
|
|
|
@@ -325,7 +331,7 @@ evo gate check exp_0000
|
|
|
325
331
|
|
|
326
332
|
Use `evo gate check <exp_id>` when only gate wiring changed or when you need to validate inherited gates without running the benchmark. It writes a `gate_check.json` artifact under the same checks directory and also does not mutate experiment state.
|
|
327
333
|
|
|
328
|
-
The check asserts `result.json` exists, is non-empty, and is a JSON object with a numeric `score`. Also verify:
|
|
334
|
+
This is the authoritative wiring check, and it is language-agnostic -- it runs the real benchmark command and inspects the JSON artifacts, so a native inline implementation in any language is validated the same way a Python one is. The check asserts `result.json` exists, is non-empty, and is a JSON object with a numeric `score`. Also verify:
|
|
329
335
|
|
|
330
336
|
- All dependencies resolve and the command completes.
|
|
331
337
|
- Traces appear in `$EVO_TRACES_DIR` (if applicable).
|
|
@@ -386,37 +392,11 @@ Document:
|
|
|
386
392
|
- `uses LLMs with temp=0` -- expected to be deterministic in practice; flag if it isn't
|
|
387
393
|
- `sampling-based, variance expected` -- inherent noise; optimize will need multi-run strategies
|
|
388
394
|
- Environment requirements discovered during validation
|
|
395
|
+
- **Resource profile (for run sizing)** -- the binding resource one benchmark run needs (exclusive GPU/accelerator, peak memory, an exclusive port, a shared DB/fixture, or an external API rate limit / $-per-run), whether concurrent benchmark runs are safe on this backend or must serialize, and rough time/cost per run. `/evo:optimize` reads this to size each round (see `plugins/evo/skills/optimize/references/sizing-the-round.md`)
|
|
389
396
|
- What each gate protects
|
|
390
397
|
- Benchmark gaming risks identified during the Goodhart check
|
|
391
398
|
- Future experiment candidates (the non-picked dimensions from step 3)
|
|
392
399
|
|
|
393
|
-
## 12a. Confirm how the optimize loop should run
|
|
394
|
-
|
|
395
|
-
Ask the user once how they want `/evo:optimize` to behave. These are run-behavior defaults stored on the workspace; they don't affect discover itself. Ask as a single, light question (use your host's structured multi-choice tool if you have one; otherwise plain text), and make clear both are optional — the defaults apply if the user has no preference:
|
|
396
|
-
|
|
397
|
-
- **Autonomous loop** — should evo's internal wiring keep the loop running on its own, re-engaging the agent at every turn boundary until the run stalls (`autonomous`)? Default off: evo does not auto-continue the loop.
|
|
398
|
-
- **Orchestrator edits** — push every edit through subagents, steering the orchestrator away from editing directly (`subagents-only`)? Default off: the orchestrator may also edit directly if it chooses.
|
|
399
|
-
|
|
400
|
-
**Pre-fill from the user's remembered choice.** Before asking, read their cross-project defaults and use each as the suggested answer (so a returning user just confirms):
|
|
401
|
-
|
|
402
|
-
```bash
|
|
403
|
-
evo defaults get autonomous --json # → true | false | null
|
|
404
|
-
evo defaults get subagents-only --json
|
|
405
|
-
```
|
|
406
|
-
|
|
407
|
-
If a value is non-null, present it as the default in the question (e.g. "autonomous was on last time — keep it?"). Always still ask — never apply a remembered value silently.
|
|
408
|
-
|
|
409
|
-
Persist the answer to both the workspace (this project) and the user-level store (remembered for next project):
|
|
410
|
-
|
|
411
|
-
```bash
|
|
412
|
-
evo config set default-autonomous on|off
|
|
413
|
-
evo config set default-subagents-only on|off
|
|
414
|
-
evo defaults set autonomous on|off
|
|
415
|
-
evo defaults set subagents-only on|off
|
|
416
|
-
```
|
|
417
|
-
|
|
418
|
-
If the user has no opinion and no remembered value exists, or you skip the question, leave both off — the defaults: the loop stops naturally after each round, and the orchestrator may edit directly. Do NOT infer these from the user's earlier free-form messages; only set `on` when the user clearly chooses it here. `/evo:optimize` reads these defaults at startup (workspace first, then user-level), and a bare-word `autonomous` / `subagents-only` on the invocation overrides the stored default for that run.
|
|
419
|
-
|
|
420
400
|
## 13. Report to the user
|
|
421
401
|
|
|
422
402
|
End the skill by reporting in chat:
|
|
@@ -54,7 +54,7 @@ Output contract (same as existing evo benchmarks):
|
|
|
54
54
|
- **stdout / stderr:** free for user output (logs, progress, debug)
|
|
55
55
|
- **exit code:** 0 on successful completion (even if score is low); non-zero only on infrastructure failure (import error, missing data, etc.)
|
|
56
56
|
|
|
57
|
-
Use the SDK or inline instrumentation depending on the user's earlier choice (recorded in `.evo/meta.json` as `instrumentation_mode`).
|
|
57
|
+
Use the SDK or inline instrumentation depending on the user's earlier choice (recorded in `.evo/meta.json` as `instrumentation_mode`). Write the harness in the language that drives the target -- a Rust target gets a Rust harness, a Go target a Go harness. Inline mode is language-native: implement the trace/result contract in that same language (`instrumentation-contract.md`), don't wrap the benchmark in a Python or JS sidecar. SDK mode is Python/Node only.
|
|
58
58
|
|
|
59
59
|
## 4. Goodhart check
|
|
60
60
|
|
|
@@ -0,0 +1,89 @@
|
|
|
1
|
+
# Instrumentation contract
|
|
2
|
+
|
|
3
|
+
Inline instrumentation is a file-and-env contract, not a library. `evo run` sets two environment variables, runs your benchmark command as a subprocess in whatever language it's written in, and reads back JSON files. Any language that can read an env var and write a JSON file satisfies it — implement it directly in the benchmark's own language.
|
|
4
|
+
|
|
5
|
+
The `inline_instrumentation.py` and `inline_instrumentation.js` helpers in this directory are ready-made reference implementations of this contract. For a Python or Node benchmark, paste one in. For any other language (Go, Rust, Ruby, Java, C++, shell, ...), implement the contract below in that language — it's ~10-15 lines.
|
|
6
|
+
|
|
7
|
+
## Environment (set by `evo run`)
|
|
8
|
+
|
|
9
|
+
| Variable | Meaning | If unset |
|
|
10
|
+
|---|---|---|
|
|
11
|
+
| `EVO_RESULT_PATH` | Absolute path to write the final result JSON | Print the result JSON to stdout instead |
|
|
12
|
+
| `EVO_TRACES_DIR` | Directory to write per-task traces into | Skip traces (score still works; the dashboard just shows less) |
|
|
13
|
+
| `EVO_EXPERIMENT_ID` | The experiment id, for stamping traces | Use `"unknown"` |
|
|
14
|
+
|
|
15
|
+
## Result JSON (required)
|
|
16
|
+
|
|
17
|
+
Write one JSON object — to `$EVO_RESULT_PATH`, or stdout if that's unset:
|
|
18
|
+
|
|
19
|
+
```json
|
|
20
|
+
{
|
|
21
|
+
"score": 0.6,
|
|
22
|
+
"tasks": {"t1": 0.8, "t2": 0.4},
|
|
23
|
+
"started_at": "2026-05-30T09:00:00+00:00",
|
|
24
|
+
"ended_at": "2026-05-30T09:01:00+00:00"
|
|
25
|
+
}
|
|
26
|
+
```
|
|
27
|
+
|
|
28
|
+
- `score` (number) is the only required field. `evo run --check` fails if the result is missing, empty, or has no numeric `score`.
|
|
29
|
+
- `tasks` (map of task id -> score) drives per-task selection strategies. Include it when the benchmark has discrete tasks.
|
|
30
|
+
- `started_at` / `ended_at` are optional ISO-8601 timestamps.
|
|
31
|
+
- Add `tasks_meta` only when a task's optimization direction differs from the benchmark's top-level `--metric`: `"tasks_meta": {"latency_ms": {"direction": "min"}}`.
|
|
32
|
+
|
|
33
|
+
**Exit code:** 0 on successful completion, even when the score is low. Exit non-zero only on infrastructure failure (missing data, import/compile error). A low score is data, not a crash. (Score-threshold *gates* are the exception — see SKILL.md step 8.)
|
|
34
|
+
|
|
35
|
+
## Per-task trace (recommended)
|
|
36
|
+
|
|
37
|
+
For each task, write `$EVO_TRACES_DIR/task_<id>.json` as the task finishes, so the dashboard streams progress live:
|
|
38
|
+
|
|
39
|
+
```json
|
|
40
|
+
{
|
|
41
|
+
"experiment_id": "exp_0000",
|
|
42
|
+
"task_id": "t1",
|
|
43
|
+
"score": 0.8,
|
|
44
|
+
"ended_at": "2026-05-30T09:00:30+00:00"
|
|
45
|
+
}
|
|
46
|
+
```
|
|
47
|
+
|
|
48
|
+
- `task_id` and `score` are the substance; `experiment_id` and `ended_at` are for display.
|
|
49
|
+
- Optional fields, include what helps debugging: `status` (a `"passed"`/`"failed"` display label — set it from the benchmark's own pass criterion, whatever that is for this score scale; the engine does not read it or impose a threshold), `summary`, `failure_reason`, `log`, `direction`.
|
|
50
|
+
|
|
51
|
+
## Atomic result publish (the one non-obvious rule)
|
|
52
|
+
|
|
53
|
+
Publish the result file with claim-then-rename, not a plain write:
|
|
54
|
+
|
|
55
|
+
1. Atomically create-exclusive the result path to claim it (`O_EXCL` / open mode `"wx"`). If it already exists, fail loudly — a second writer in the same attempt is a wiring bug, not something to overwrite.
|
|
56
|
+
2. Write the payload to `<result_path>.tmp`.
|
|
57
|
+
3. Rename `.tmp` over the result path.
|
|
58
|
+
|
|
59
|
+
This makes duplicate writers fail fast and ensures a crash mid-publish leaves an empty claimed file (which `load_result` treats as a failed run) rather than a half-written JSON that parses wrong. The `.py`/`.js` helpers show the exact calls. Per-task traces don't need this — they're write-once-per-id.
|
|
60
|
+
|
|
61
|
+
## Validate the wiring
|
|
62
|
+
|
|
63
|
+
The contract is language-agnostic, and so is the check. From the main repo root:
|
|
64
|
+
|
|
65
|
+
```bash
|
|
66
|
+
evo run <exp_id> --check
|
|
67
|
+
```
|
|
68
|
+
|
|
69
|
+
This runs the real benchmark command and asserts the result artifact exists, is non-empty, and carries a numeric `score` — regardless of what language produced it. It's the authoritative "is the wiring correct" test; passing it means evo can read your harness. Inspect the artifacts with `evo show <exp_id>`.
|
|
70
|
+
|
|
71
|
+
## Minimal shell reference
|
|
72
|
+
|
|
73
|
+
A complete inline implementation for a shell harness, to show how small the contract is:
|
|
74
|
+
|
|
75
|
+
```bash
|
|
76
|
+
# ... benchmark runs, computes per-task scores ...
|
|
77
|
+
mkdir -p "$EVO_TRACES_DIR"
|
|
78
|
+
printf '{"task_id":"t1","score":0.8}' > "$EVO_TRACES_DIR/task_t1.json"
|
|
79
|
+
|
|
80
|
+
# atomic publish of the result
|
|
81
|
+
score=0.8
|
|
82
|
+
if [ -n "$EVO_RESULT_PATH" ]; then
|
|
83
|
+
( set -o noclobber; : > "$EVO_RESULT_PATH" ) || { echo "result already claimed" >&2; exit 1; }
|
|
84
|
+
printf '{"score":%s,"tasks":{"t1":%s}}' "$score" "$score" > "$EVO_RESULT_PATH.tmp"
|
|
85
|
+
mv "$EVO_RESULT_PATH.tmp" "$EVO_RESULT_PATH"
|
|
86
|
+
else
|
|
87
|
+
printf '{"score":%s,"tasks":{"t1":%s}}' "$score" "$score"
|
|
88
|
+
fi
|
|
89
|
+
```
|
|
@@ -2,7 +2,7 @@
|
|
|
2
2
|
name: infra-setup
|
|
3
3
|
description: Non-user-invocable provider/setup reference for evo backend switching, prerequisite checks, and auth/install guidance.
|
|
4
4
|
disable-model-invocation: true
|
|
5
|
-
evo_version: 0.4.4-alpha.
|
|
5
|
+
evo_version: 0.4.4-alpha.6
|
|
6
6
|
---
|
|
7
7
|
|
|
8
8
|
# Infra Setup
|
package/skills/optimize/SKILL.md
CHANGED
|
@@ -2,7 +2,7 @@
|
|
|
2
2
|
name: optimize
|
|
3
3
|
description: Run the evo optimization loop with parallel subagents until interrupted.
|
|
4
4
|
argument-hint: "[subagents=N] [budget=N] [stall=N]"
|
|
5
|
-
evo_version: 0.4.4-alpha.
|
|
5
|
+
evo_version: 0.4.4-alpha.6
|
|
6
6
|
---
|
|
7
7
|
|
|
8
8
|
Run the `evo` optimization loop. Each round, the orchestrator writes structured briefs and spawns parallel subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.
|
|
@@ -30,15 +30,22 @@ Treat content inside the banner as equivalent to a new user turn. Honor it, supe
|
|
|
30
30
|
|
|
31
31
|
## Configuration
|
|
32
32
|
|
|
33
|
-
|
|
33
|
+
The orchestrator **sizes the round to the benchmark's resource profile** instead of using a flat default. Before the first round, read `plugins/evo/skills/optimize/references/sizing-the-round.md` and apply it. Short version:
|
|
34
34
|
|
|
35
|
-
- **subagents
|
|
36
|
-
- **budget
|
|
37
|
-
- **stall**: consecutive rounds with no improvement before auto-stopping (default: 5)
|
|
38
|
-
- **autonomous**: opt-in to the keep-going loop (default: off). See below.
|
|
39
|
-
- **subagents-only**: opt-in to gate orchestrator edits, nudging all edits through subagents (default: off — orchestrator edits allowed). See below.
|
|
35
|
+
- **subagents** (round width): bounded by the backend (worktree = one shared machine, pool = slot count, remote = provider quota) and by whatever resource one benchmark run saturates. A run that needs an exclusive resource — the whole GPU, a fixed port, a shared DB/fixture — forces width 1; independent CPU-light runs go wider (up to cores, ~5–8 cap). Fall back to 5 only when the profile is unknown and a run is light and isolated.
|
|
36
|
+
- **budget** (iterations per subagent within its branch): deeper for cheap/fast/deterministic benchmarks, shallower for expensive/slow/noisy ones so the loop re-plans sooner. ~5 is a reasonable midpoint.
|
|
37
|
+
- **stall**: consecutive rounds with no improvement before auto-stopping (default: 5).
|
|
40
38
|
|
|
41
|
-
|
|
39
|
+
A user can override any of these with `/optimize [subagents=N] [budget=N] [stall=N]`; an explicit value always wins over the heuristic.
|
|
40
|
+
|
|
41
|
+
- **autonomous**: the keep-going loop. **Default: on** — evo is autoresearch; it runs unattended. Turn off for a run with `evo autonomous off`.
|
|
42
|
+
- **subagents-only**: gate orchestrator edits, pushing all edits through subagents. **Default: on**. Turn off for a run with `evo subagents-only off`.
|
|
43
|
+
|
|
44
|
+
**Resolving autonomous / subagents-only at startup.** Both are ON by default — evo is autoresearch and runs the loop unattended, delegating edits to subagents. Do not ask the user about either. Resolve each through a cascade, most specific first:
|
|
45
|
+
|
|
46
|
+
1. **An explicit instruction from the user this run wins — on or off.** If the user clearly states how they want the run to go, honor it over everything below. "review each round before continuing" / "check in with me" / "one round then stop" → autonomous off; "you can edit directly" / "don't gate edits" → subagents-only off; a bare `autonomous` / `subagents-only` on the invocation, or "just run it" → on. Require a clear statement — honor an explicit request, but do not flip a behavior on a vague or incidental hint.
|
|
47
|
+
2. **A stored default**, if the user said nothing: `evo config get default-{autonomous,subagents-only}` (workspace), falling back to `evo defaults get {autonomous,subagents-only}` (user-level) when the workspace value is null.
|
|
48
|
+
3. **Otherwise → on** — the framework default.
|
|
42
49
|
|
|
43
50
|
```bash
|
|
44
51
|
evo config get default-autonomous --json # workspace → true | false | null
|
|
@@ -47,12 +54,7 @@ evo config get default-subagents-only --json
|
|
|
47
54
|
evo defaults get subagents-only --json
|
|
48
55
|
```
|
|
49
56
|
|
|
50
|
-
|
|
51
|
-
|
|
52
|
-
- `autonomous` resolved on → run `evo autonomous on`.
|
|
53
|
-
- `subagents-only` resolved on → run `evo subagents-only on`.
|
|
54
|
-
|
|
55
|
-
If a value comes from a stored default (not a bare word on this invocation), say so in your opening message — e.g. "autonomous on (from your saved default)" — so an inherited setting is never invisible. Never infer either from the user's free-form task description; only the invocation argument or a stored default may turn them on.
|
|
57
|
+
As your **very first actions, before the loop**, resolve each and arm it: run `evo autonomous on` / `evo subagents-only on` when it resolves on, or `evo autonomous off` / `evo subagents-only off` when an explicit instruction or stored default turned it off. If a behavior resolves off — whether from the user's instruction this run or a stored default — say so in your opening message (e.g. "autonomous off — running one round at a time, as you asked") so it's never invisible.
|
|
56
58
|
|
|
57
59
|
**Autonomous mode.** Off lets you stop naturally at a turn boundary — finish a round, report, and stop. On arms the stop-nudge: at every turn boundary you are re-prompted to keep driving the loop until the **stall** limit is hit or the user interrupts. Without it, the loop does NOT force-continue across turn boundaries. To stop an autonomous run, the user runs `evo autonomous off` or `evo exit-optimize-mode`.
|
|
58
60
|
|
|
@@ -0,0 +1,47 @@
|
|
|
1
|
+
# Sizing the round
|
|
2
|
+
|
|
3
|
+
How to pick **round width** (`subagents`) and **per-branch depth** (`budget`) for a round. Apply this before the first round and re-check it if the backend or benchmark changes.
|
|
4
|
+
|
|
5
|
+
The governing idea: **round width is resource-bound.** Spawning N subagents means up to N benchmark runs executing at once. If one run already saturates the binding resource, more runs don't go faster — they contend, thrash, or OOM. So:
|
|
6
|
+
|
|
7
|
+
```
|
|
8
|
+
width = min(backend_ceiling, resource_width)
|
|
9
|
+
```
|
|
10
|
+
|
|
11
|
+
## 1. Backend ceiling
|
|
12
|
+
|
|
13
|
+
Check `evo config backend show`.
|
|
14
|
+
|
|
15
|
+
- **worktree** (default): every experiment runs on the *same machine*. The ceiling is what that one machine can run concurrently without contention — set by `resource_width` below, not by a fixed number.
|
|
16
|
+
- **pool**: hard ceiling = slot count (`evo workspace status`). Exceeding it makes later `evo new` calls fail with `PoolExhausted`. Never set width above the slots.
|
|
17
|
+
- **remote**: each experiment runs in its own container, so the local machine isn't the limit. Ceiling is provider quota / concurrency cap / cost tolerance.
|
|
18
|
+
|
|
19
|
+
## 2. Resource width — what one benchmark run saturates
|
|
20
|
+
|
|
21
|
+
Read `.evo/project.md`'s resource-profile line first (discover records it). If it's missing or thin, infer from the benchmark command, the target's imports, and a quick probe (`nvidia-smi`, core count, the benchmark's own docs). Then size to whatever one run consumes:
|
|
22
|
+
|
|
23
|
+
| Binding resource of one run | Round width on a shared machine |
|
|
24
|
+
|---|---|
|
|
25
|
+
| Exclusive accelerator — needs the whole GPU/TPU, or a pinned device | **1** with one device; **K** with K devices that can be pinned per worktree |
|
|
26
|
+
| Memory-heavy | `floor(total_RAM / per_run_peak)` with headroom |
|
|
27
|
+
| Exclusive port, singleton service, shared DB, or a shared mutable fixture | **1**, unless the harness parameterizes that resource per worktree |
|
|
28
|
+
| External API rate limit or real $-per-run | cap width to stay under the limit / within budget |
|
|
29
|
+
| CPU-light, in-memory, fully isolated | wider — up to core count, capped ~5–8 to keep the round legible |
|
|
30
|
+
|
|
31
|
+
When a run needs an exclusive resource, serializing benchmark *execution* (width 1) is correct even though the *edits* are independent — on the worktree backend `evo run` executes the benchmark in-place, so concurrent `evo run` means concurrent benchmark processes on that one resource.
|
|
32
|
+
|
|
33
|
+
## 3. Depth — `budget` (iterations per subagent within its branch)
|
|
34
|
+
|
|
35
|
+
Depth trades exploration against spend, and keys off cost per run, not concurrency:
|
|
36
|
+
|
|
37
|
+
- Cheap, fast, deterministic benchmark → larger budget; let a promising branch iterate several times before the orchestrator re-plans.
|
|
38
|
+
- Expensive, slow, or noisy benchmark → smaller budget; re-plan sooner so spend tracks signal. For noisy benchmarks, deep single-branch iteration over-fits to lucky runs.
|
|
39
|
+
- ~5 is a reasonable midpoint when nothing argues otherwise.
|
|
40
|
+
|
|
41
|
+
## 4. Default and override
|
|
42
|
+
|
|
43
|
+
- **Unknown profile, light isolated run** → fall back to width 5, budget 5.
|
|
44
|
+
- **Unsure, but a shared exclusive resource is plausible** (the benchmark touches a GPU, a port, a DB) → serialize (width 1). Under-subscribing wastes a little wall-clock; over-subscribing corrupts results or crashes the round.
|
|
45
|
+
- **The user gave an explicit value** (`/optimize subagents=N budget=N`, or said it in plain language) → honor it, over the heuristic. They know their hardware.
|
|
46
|
+
|
|
47
|
+
State the width/budget you chose and the one-line reason in your opening message, so the sizing is visible (e.g. "width 1 — benchmark needs the single GPU; budget 6 — runs are cheap and deterministic").
|
package/skills/report/SKILL.md
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: report
|
|
3
3
|
description: Print the dashboard's dot chart (score over experiment order, status colors, best-path stair) inline in the terminal for every run in the workspace. Use when the user invokes /evo:report, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output.
|
|
4
|
-
evo_version: 0.4.4-alpha.
|
|
4
|
+
evo_version: 0.4.4-alpha.6
|
|
5
5
|
---
|
|
6
6
|
|
|
7
7
|
# Report
|
package/skills/subagent/SKILL.md
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: subagent
|
|
3
3
|
description: Internal protocol for evo optimization subagents. Loaded by subagents spawned from /optimize via their host's skill loader. Not for orchestrator use.
|
|
4
|
-
evo_version: 0.4.4-alpha.
|
|
4
|
+
evo_version: 0.4.4-alpha.6
|
|
5
5
|
---
|
|
6
6
|
|
|
7
7
|
# Evo Subagent Protocol
|