eve 0.6.0-beta.10 → 0.6.0-beta.11
- package/CHANGELOG.md +13 -0
- package/README.md +1 -1
- package/dist/docs/evals-v2-plan.md +108 -108
- package/dist/docs/public/advanced/evals.md +51 -23
- package/dist/docs/public/reference/cli.md +17 -17
- package/dist/docs/public/reference/typescript-api.md +2 -2
- package/dist/src/chunks/{use-eve-agent-DCZbkLG7.js → use-eve-agent-BD79SFqg.js} +82 -29
- package/dist/src/chunks/{use-eve-agent-DoheC4_o.js → use-eve-agent-DT2A6VJb.js} +82 -29
- package/dist/src/cli/run.d.ts +1 -1
- package/dist/src/cli/run.js +1 -1
- package/dist/src/client/file-parts.d.ts +18 -0
- package/dist/src/client/file-parts.js +1 -0
- package/dist/src/client/index.d.ts +3 -2
- package/dist/src/client/index.js +1 -1
- package/dist/src/client/message-response.js +1 -1
- package/dist/src/client/open-stream.d.ts +6 -0
- package/dist/src/client/open-stream.js +1 -1
- package/dist/src/client/session-utils.d.ts +5 -0
- package/dist/src/client/session-utils.js +1 -1
- package/dist/src/client/session.js +1 -1
- package/dist/src/client/types.d.ts +5 -1
- package/dist/src/evals/checks/checks.d.ts +1 -1
- package/dist/src/evals/checks/index.d.ts +1 -1
- package/dist/src/evals/checks/match.d.ts +1 -1
- package/dist/src/evals/cli/eval.d.ts +2 -2
- package/dist/src/evals/cli/eval.js +1 -1
- package/dist/src/evals/{define-eval-suite.d.ts → define-eval.d.ts} +6 -6
- package/dist/src/evals/define-eval.js +1 -0
- package/dist/src/evals/index.d.ts +3 -2
- package/dist/src/evals/index.js +1 -1
- package/dist/src/evals/runner/artifacts.d.ts +6 -6
- package/dist/src/evals/runner/artifacts.js +1 -1
- package/dist/src/evals/runner/discover.d.ts +10 -10
- package/dist/src/evals/runner/discover.js +1 -1
- package/dist/src/evals/runner/execute-case.d.ts +4 -7
- package/dist/src/evals/runner/execute-case.js +1 -1
- package/dist/src/evals/runner/{execute-suite.d.ts → execute-eval.d.ts} +6 -6
- package/dist/src/evals/runner/execute-eval.js +1 -0
- package/dist/src/evals/runner/reporters/braintrust.d.ts +4 -4
- package/dist/src/evals/runner/reporters/braintrust.js +2 -2
- package/dist/src/evals/runner/reporters/console.d.ts +3 -3
- package/dist/src/evals/runner/reporters/console.js +1 -1
- package/dist/src/evals/runner/reporters/types.d.ts +6 -6
- package/dist/src/evals/scorers/autoevals.d.ts +7 -7
- package/dist/src/evals/scorers/autoevals.js +1 -1
- package/dist/src/evals/scorers/model-marker.d.ts +3 -3
- package/dist/src/evals/session.d.ts +46 -0
- package/dist/src/evals/session.js +1 -0
- package/dist/src/evals/types.d.ts +154 -57
- package/dist/src/execution/tool-auth.js +1 -1
- package/dist/src/internal/application/package.js +1 -1
- package/dist/src/protocol/message.d.ts +8 -0
- package/dist/src/protocol/message.js +2 -2
- package/dist/src/runtime/connections/mcp-client.js +1 -1
- package/dist/src/runtime/connections/scoped-authorization.d.ts +9 -5
- package/dist/src/runtime/connections/scoped-authorization.js +1 -1
- package/dist/src/runtime/connections/types.d.ts +24 -0
- package/dist/src/setup/boxes/deploy-project.js +1 -1
- package/dist/src/setup/scaffold/create/project.js +1 -1
- package/dist/src/setup/scaffold/update/channels.js +2 -2
- package/dist/src/setup/scaffold/update/connections.js +1 -1
- package/dist/src/svelte/index.js +1 -1
- package/dist/src/svelte/use-eve-agent.js +1 -1
- package/dist/src/vue/index.js +1 -1
- package/dist/src/vue/use-eve-agent.js +1 -1
- package/package.json +1 -1
- package/dist/src/evals/define-eval-suite.js +0 -1
- package/dist/src/evals/runner/execute-suite.js +0 -1
package/CHANGELOG.md
CHANGED
@@ -1,5 +1,18 @@
 # eve
 
+## 0.6.0-beta.11
+
+### Minor Changes
+
+- 9bb6371: Adds the evals interaction API: scripted cases, `task.run`, `EveEvalSession` helpers for HITL responses and file attachments, multi-session event capture, and client-level HITL request results with retry handling for stream registration races.
+- 22dda94: Rename the eval authoring helper from `defineEvalSuite` to `defineEval` and update the public eval types, reporter hooks, CLI wording, and JSON result shape to use eval terminology consistently.
+
+### Patch Changes
+
+- a0ca3bb: Scaffolded projects now pin `@vercel/connect@0.2.2`. The previously pinned 0.1.1 predates the `@vercel/connect/eve` entrypoint, so generated channels and connections failed to build on deploy with `Package subpath './eve' is not defined by "exports"`.
+- a0ca3bb: Fix interactive post-setup deploys failing with "The `projectSettings` object is required for new projects". The setup deploy now passes `--yes` to `vercel deploy --prod` in interactive runs too, so the Vercel API auto-detects framework settings for projects that eve provisioned through the projects API.
+- 380693d: Token eviction on a rejected bearer now cascades to the authorization strategy's own cache, not just Eve's per-step cache. `AuthorizationDefinition` gains an optional `evict(opts)` hook, and the shared `evictScopedToken` path (used by both authored tools and MCP connections) invokes it after dropping the per-step entry. This lets `@vercel/connect`-backed connections purge their in-process token cache so a revoked-but-unexpired grant is genuinely re-fetched instead of re-read from a lower cache layer.
+
 ## 0.6.0-beta.10
 
 ### Minor Changes
|
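The 380693d changelog entry describes eviction cascading from the per-step cache into the authorization strategy's own cache. The following is a minimal sketch of that flow, not eve's implementation: only the names `AuthorizationDefinition`, `evict(opts)`, and `evictScopedToken` come from the changelog, while the `EvictOptions` shape, the `getToken` method, the `Map`-based caches, and `createCachingStrategy` are hypothetical stand-ins.

```typescript
// Hypothetical shapes: only `AuthorizationDefinition`, `evict`, and
// `evictScopedToken` are named in the changelog; the rest is assumed.
interface EvictOptions {
  readonly scopeKey: string;
}

interface AuthorizationDefinition {
  getToken(scopeKey: string): string;
  /** Optional hook: drop the strategy's own cached token for this scope. */
  evict?(opts: EvictOptions): void;
}

/** Per-step cache layer (assumed here to be a plain Map). */
const perStepCache = new Map<string, string>();

function evictScopedToken(auth: AuthorizationDefinition, scopeKey: string): void {
  // 1. Drop the per-step entry first.
  perStepCache.delete(scopeKey);
  // 2. Cascade to the strategy's cache so the next read is a real re-fetch.
  auth.evict?.({ scopeKey });
}

// A strategy with its own in-process cache, standing in for a
// connection-backed strategy that caches tokens below the per-step layer.
function createCachingStrategy(fetchToken: () => string): AuthorizationDefinition {
  const cache = new Map<string, string>();
  return {
    getToken(scopeKey) {
      let token = cache.get(scopeKey);
      if (token === undefined) {
        token = fetchToken();
        cache.set(scopeKey, token);
      }
      return token;
    },
    evict({ scopeKey }) {
      cache.delete(scopeKey);
    },
  };
}
```

Without the `evict` call in step 2, `getToken` would keep returning the strategy's cached value, which is the "re-read from a lower cache layer" failure the entry describes.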
package/README.md
CHANGED
@@ -52,7 +52,7 @@ Every authored directory has a typed helper. Import each from the matching subpa
 | `eveChannel(...)`, `slackChannel(...)`, `vercelOidc(...)` | `eve/channels/eve`, `/slack`, `/auth` | reused from `channels/<name>.ts` |
 | `defineSandbox(...)` | `eve/sandbox` | `sandbox.ts` (or `sandbox/sandbox.ts`) |
 | `defineSchedule(...)` | `eve/schedules` | `schedules/<name>.ts` (or `schedules/<name>.md`) |
-| `
+| `defineEval(...)` | `eve/evals` | `evals/<name>.eval.ts` |
 
 Runtime accessors live on the subpath that owns the concern:
 
|
package/dist/docs/evals-v2-plan.md
CHANGED
|
@@ -6,7 +6,7 @@ Scope: `packages/eve` (evals, client, CLI), `e2e/`, CI
|
|
|
6
6
|
|
|
7
7
|
## Summary
|
|
8
8
|
|
|
9
|
-
Eve's
|
|
9
|
+
Eve's evals (`defineEval` + `eve eval`) and the hand-rolled e2e smoke
|
|
10
10
|
surface (`e2e/tests/**` + `ExampleClient`) are two implementations of the same
|
|
11
11
|
idea: drive a real agent over HTTP and judge what happened. The eval runner has
|
|
12
12
|
the right bones — filesystem discovery, target resolution, stream capture,
|
|
@@ -23,16 +23,16 @@ missing:
|
|
|
23
23
|
|
|
24
24
|
1. **An imperative interaction API** — `run(ctx)` with a typed `EvalSession`
|
|
25
25
|
driver for multi-turn control flow, HITL responses, approvals, structured
|
|
26
|
-
output, attachments, and multi-session scenarios. Available at
|
|
26
|
+
output, attachments, and multi-session scenarios. Available at eval level
|
|
27
27
|
(`task.run`, the shared default for dataset evals) and per case
|
|
28
|
-
(`case.run`), so one
|
|
28
|
+
(`case.run`), so one eval groups many distinct scripted behaviors the way
|
|
29
29
|
a test file groups `it` blocks.
|
|
30
30
|
2. **A hard-assertion tier** — `checks`, distinct from `scores`. Check failures
|
|
31
31
|
fail the case and the process; scores remain soft, thresholded data.
|
|
32
32
|
3. **A URL-shaped target model with verified requirements** — a target is
|
|
33
33
|
always just a URL. `eve eval` obtains one (boots the dev server, as today)
|
|
34
34
|
or `--url` brings your own (a built server in CI, a preview deployment).
|
|
35
|
-
|
|
35
|
+
Evals never declare _how_ to provision the agent — they declare
|
|
36
36
|
`requires` assumptions (mock models, dev routes, sidecar env) that the
|
|
37
37
|
runner verifies against the live target and skips or errors on, visibly.
|
|
38
38
|
Provisioning (build, start, env injection, sidecars, secondary agents)
|
|
@@ -41,7 +41,7 @@ missing:
|
|
|
41
41
|
`e2e/tests/basic-runtime/evals.ts` already proves out today.
|
|
42
42
|
|
|
43
43
|
The end state: every behavior currently proven by `e2e/tests/**` (except the
|
|
44
|
-
TUI tests, see Non-goals) is an eval
|
|
44
|
+
TUI tests, see Non-goals) is an eval in a fixture app's `evals/`
|
|
45
45
|
directory, CI runs `eve eval --strict` per fixture app per matrix mode, and
|
|
46
46
|
`e2e/lib/client.ts` plus the per-file harness code are deleted. Users get the
|
|
47
47
|
same machinery for their own agents: smoke-test your agent the way Eve
|
|
@@ -49,7 +49,7 @@ smoke-tests itself.
|
|
|
49
49
|
|
|
50
50
|
## Why not a `describe`/`it` runner
|
|
51
51
|
|
|
52
|
-
We considered replacing `
|
|
52
|
+
We considered replacing `defineEval` with a vitest/`node:test`-shaped API.
|
|
53
53
|
Rejected, for these reasons:
|
|
54
54
|
|
|
55
55
|
1. **Evals have semantics tests don't.** Scores in `[0,1]`, per-scorer
|
|
@@ -58,8 +58,8 @@ Rejected, for these reasons:
|
|
|
58
58
|
pass/fail; we would immediately reinvent scorers and reporters _inside_ `it`
|
|
59
59
|
blocks and lose the structured result model that powers `--json` and the
|
|
60
60
|
Braintrust reporter.
|
|
61
|
-
2. **
|
|
62
|
-
`scripts/guard-invariants.mjs`). One
|
|
61
|
+
2. **Eval identity is path-derived** (repo principle 5, enforced by
|
|
62
|
+
`scripts/guard-invariants.mjs`). One eval per `evals/<path>.eval.ts` file
|
|
63
63
|
maps cleanly onto the filesystem-first philosophy; nested describe blocks
|
|
64
64
|
fight it.
|
|
65
65
|
3. **The dataset model is the right default.** "Load 200 cases from YAML, fan
|
|
@@ -67,7 +67,7 @@ Rejected, for these reasons:
|
|
|
67
67
|
A block-based runner makes that the awkward case (loops generating `it`s).
|
|
68
68
|
4. **What e2e actually needs is not blocks** — it is control flow _inside a
|
|
69
69
|
case_, a typed session driver, and assertion helpers over the event stream.
|
|
70
|
-
All three fit inside the existing
|
|
70
|
+
All three fit inside the existing eval shape: `run` at the task and case
|
|
71
71
|
level provides the control flow, and scripted cases give the same grouping
|
|
72
72
|
a `describe` file gives `it` blocks (see "Scripted cases" below).
|
|
73
73
|
5. **Wrapping vitest violates principle 3** (wrap third-party deps; don't
|
|
@@ -88,15 +88,15 @@ first-class rather than by switching paradigms.
|
|
|
88
88
|
arbitrary event predicates.
|
|
89
89
|
- Hard pass/fail semantics suitable for CI gating, coexisting with soft scores.
|
|
90
90
|
- One target model: any URL — the runner-booted dev server, a locally built
|
|
91
|
-
`eve start` process, or a deployed instance — with
|
|
92
|
-
requirements verified against the live target instead of
|
|
91
|
+
`eve start` process, or a deployed instance — with eval-declared
|
|
92
|
+
requirements verified against the live target instead of eval-owned
|
|
93
93
|
provisioning.
|
|
94
94
|
- Drive surfaces beyond the session route: channels (webhook ingress) and
|
|
95
|
-
schedules (dev dispatch), with stream consumption for sessions the
|
|
95
|
+
schedules (dev dispatch), with stream consumption for sessions the eval did
|
|
96
96
|
not create.
|
|
97
97
|
- One HTTP client: `eve/client` absorbs everything `e2e/lib/client.ts` does;
|
|
98
98
|
`ExampleClient` is deleted.
|
|
99
|
-
- Replace `e2e/tests/**` (minus TUI) with
|
|
99
|
+
- Replace `e2e/tests/**` (minus TUI) with evals; CI becomes
|
|
100
100
|
`eve eval --strict` runs.
|
|
101
101
|
|
|
102
102
|
## Non-goals
|
|
@@ -108,13 +108,13 @@ first-class rather than by switching paradigms.
|
|
|
108
108
|
tier only.
|
|
109
109
|
- **Braintrust/autoevals integration** is unchanged in shape (still wrapped
|
|
110
110
|
per principle 3).
|
|
111
|
-
- **No authored
|
|
111
|
+
- **No authored eval `id`/`name`.** Identity stays path-derived; the
|
|
112
112
|
invariant guard keeps enforcing it.
|
|
113
113
|
|
|
114
114
|
## Current state (abridged; see code for detail)
|
|
115
115
|
|
|
116
|
-
-
|
|
117
|
-
reporters, ... })` in `packages/eve/src/evals/define-eval
|
|
116
|
+
- Eval API: `defineEval({ cases | load, task, scores, model, thresholds,
|
|
117
|
+
reporters, ... })` in `packages/eve/src/evals/define-eval.ts` and
|
|
118
118
|
`types.ts`. `task` is `prompt(case)` or `messages(case) => string[]` — a
|
|
119
119
|
static list, no branching, no HITL (`runner/execute-case.ts:38` only ever
|
|
120
120
|
sends `{ message }`).
|
|
@@ -140,17 +140,17 @@ reporters, ... })` in `packages/eve/src/evals/define-eval-suite.ts` and
|
|
|
140
140
|
|
|
141
141
|
## The v2 API
|
|
142
142
|
|
|
143
|
-
###
|
|
143
|
+
### Eval shape
|
|
144
144
|
|
|
145
145
|
```ts
|
|
146
|
-
import {
|
|
146
|
+
import { defineEval } from "eve/evals";
|
|
147
147
|
import { Checks } from "eve/evals/checks";
|
|
148
148
|
import { Run, Text } from "eve/evals/scores";
|
|
149
149
|
|
|
150
|
-
export default
|
|
150
|
+
export default defineEval({
|
|
151
151
|
description: "HITL approval flows: park, approve, deny, persist.",
|
|
152
152
|
|
|
153
|
-
//
|
|
153
|
+
// Eval-level checks apply to every case.
|
|
154
154
|
checks: [Checks.didNotFail()],
|
|
155
155
|
scores: [Run.didNotFail()],
|
|
156
156
|
|
|
@@ -180,20 +180,20 @@ export default defineEvalSuite({
|
|
|
180
180
|
});
|
|
181
181
|
```
|
|
182
182
|
|
|
183
|
-
Changes to `
|
|
183
|
+
Changes to `EveEvalInput` (`packages/eve/src/evals/types.ts`):
|
|
184
184
|
|
|
185
|
-
| Field | Change
|
|
186
|
-
| ---------------- |
|
|
187
|
-
| `task.run` | New third task variant, mutually exclusive with `prompt`/`messages`. `prompt` and `messages` become sugar implemented on top of `run`.
|
|
188
|
-
| scripted cases | `EveEvalCase` becomes a union: a **data case** (`input`/`expected`, run through the
|
|
189
|
-
| `checks` | New optional array of `EveEvalCheck`, at
|
|
190
|
-
| `requires` | New optional requirement list (
|
|
191
|
-
| `model` | **Now optional.** Required only when a model-backed scorer is present; validation moves from "always required" to "required if any scorer declares it needs a judge" (built-in autoevals scorers carry a marker; custom scorers that read `args.model` get `undefined` unless the
|
|
192
|
-
| `trials` | New optional `number` (default 1). Runs each case N times; per-trial results are reported, a case passes only if every trial's checks pass, and scores aggregate as mean. For nondeterministic model-backed
|
|
193
|
-
| `tags` filtering | Case and
|
|
185
|
+
| Field | Change |
|
|
186
|
+
| ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
187
|
+
| `task.run` | New third task variant, mutually exclusive with `prompt`/`messages`. `prompt` and `messages` become sugar implemented on top of `run`. |
|
|
188
|
+
| scripted cases | `EveEvalCase` becomes a union: a **data case** (`input`/`expected`, run through the eval `task` — today's shape) or a **scripted case** (`run`, no `input` required). `case.run` overrides `eval.task`. See "Scripted cases" below. |
|
|
189
|
+
| `checks` | New optional array of `EveEvalCheck`, at eval level and per case (case-level appends to eval-level). Hard assertions; any failure marks the case failed and flips the CLI exit code. |
|
|
190
|
+
| `requires` | New optional requirement list (eval- and case-level): `"mockModels"`, `"devRoutes"`, `"env:<NAME>"`. Verified against the live target; cases the target cannot support are skipped with a reported verdict, so one eval runs against the dev server, a local build, and a deployed URL. See "Targets". Evals own no provisioning config — no kind/env/setup. |
|
|
191
|
+
| `model` | **Now optional.** Required only when a model-backed scorer is present; validation moves from "always required" to "required if any scorer declares it needs a judge" (built-in autoevals scorers carry a marker; custom scorers that read `args.model` get `undefined` unless the eval provides one). Kills the `"eve-bootstrap-model"` dummy-model workaround. |
|
|
192
|
+
| `trials` | New optional `number` (default 1). Runs each case N times; per-trial results are reported, a case passes only if every trial's checks pass, and scores aggregate as mean. For nondeterministic model-backed evals. |
|
|
193
|
+
| `tags` filtering | Case and eval `tags` become functional (CLI `--tag`). |
|
|
194
194
|
|
|
195
195
|
Everything else (`cases`/`load`, `scores`, `thresholds`, `reporters`,
|
|
196
|
-
`maxConcurrency`, `timeoutMs`, `metadata`) is unchanged. Existing
|
|
196
|
+
`maxConcurrency`, `timeoutMs`, `metadata`) is unchanged. Existing evals keep
|
|
197
197
|
working except where noted in Breaking changes.
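The `model` row in the table above moves validation from "always required" to "required if any scorer declares it needs a judge". A minimal sketch of that rule follows, assuming a boolean marker on each scorer; the `needsModel` field name and the `ScorerLike` and `EvalInputLike` types are hypothetical and are not the package's real marker or types.

```typescript
// Hypothetical shapes for illustration only.
interface ScorerLike {
  readonly name: string;
  /** Assumed marker: true when the scorer needs a judge model. */
  readonly needsModel?: boolean;
}

interface EvalInputLike {
  readonly scores: readonly ScorerLike[];
  readonly model?: string;
}

/** Returns one message per scorer that needs a model the eval does not provide. */
function validateModel(input: EvalInputLike): string[] {
  if (input.model !== undefined) return [];
  return input.scores
    .filter((scorer) => scorer.needsModel === true)
    .map((scorer) => `scorer "${scorer.name}" needs a judge model, but the eval declares no model`);
}
```

An eval that uses only deterministic scorers validates with no `model` at all, which is what removes the need for a dummy model value.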
|
|
198
198
|
|
|
199
199
|
### `task.run(ctx)` and the run context
|
|
@@ -214,7 +214,7 @@ export interface EveEvalRunContext {
|
|
|
214
214
|
newSession(): EveEvalSession;
|
|
215
215
|
/** Handle to the agent server under test (channels, schedules, raw routes). */
|
|
216
216
|
readonly target: EveEvalTargetHandle;
|
|
217
|
-
/** Case timeout signal (from
|
|
217
|
+
/** Case timeout signal (from eval/CLI `timeoutMs`). */
|
|
218
218
|
readonly signal: AbortSignal;
|
|
219
219
|
/** Structured logger; lines land in the case artifact, and on stdout with `--verbose`. */
|
|
220
220
|
readonly log: (message: string) => void;
|
|
@@ -234,13 +234,13 @@ Semantics:
|
|
|
234
234
|
whole interaction. `derived` facts are computed over the primary session by
|
|
235
235
|
default, with per-session access on the result.
|
|
236
236
|
|
|
237
|
-
### Scripted cases: making the
|
|
237
|
+
### Scripted cases: making the eval a real grouping
|
|
238
238
|
|
|
239
|
-
|
|
239
|
+
An eval-level `task` alone means "one interaction script × many data rows" —
|
|
240
240
|
the dataset shape. The e2e surface is the inverse: many distinct scripts, one
|
|
241
241
|
execution each (`tool-approval`, `tool-denial`, and `ask-question-flow` are
|
|
242
242
|
three different `run` functions, not three rows). Forcing each script into its
|
|
243
|
-
own
|
|
243
|
+
own eval file with a single dummy-input case would make "eval" a misnomer
|
|
244
244
|
and multiply provisioning cost (a provisioned target per file instead of per
|
|
245
245
|
behavior family).
|
|
246
246
|
|
|
@@ -249,13 +249,13 @@ So `EveEvalCase` becomes a union:
|
|
|
249
249
|
```ts
|
|
250
250
|
export type EveEvalCase = EveEvalDataCase | EveEvalScriptedCase;
|
|
251
251
|
|
|
252
|
-
/** Today's shape: data routed through the
|
|
252
|
+
/** Today's shape: data routed through the eval-level task. */
|
|
253
253
|
export interface EveEvalDataCase {
|
|
254
254
|
readonly id: string;
|
|
255
255
|
readonly input: string | Record<string, unknown>;
|
|
256
256
|
readonly expected?: unknown;
|
|
257
|
-
readonly checks?: readonly EveEvalCheck[]; // appended to
|
|
258
|
-
readonly scores?: readonly EveEvalScorer[]; // appended to
|
|
257
|
+
readonly checks?: readonly EveEvalCheck[]; // appended to eval-level
|
|
258
|
+
readonly scores?: readonly EveEvalScorer[]; // appended to eval-level
|
|
259
259
|
readonly tags?: readonly string[];
|
|
260
260
|
readonly metadata?: Readonly<Record<string, unknown>>;
|
|
261
261
|
}
|
|
@@ -274,26 +274,26 @@ export interface EveEvalScriptedCase {
|
|
|
274
274
|
|
|
275
275
|
Resolution rules:
|
|
276
276
|
|
|
277
|
-
- `case.run` wins over `
|
|
277
|
+
- `case.run` wins over `eval.task`; a data case without an eval `task` falls
|
|
278
278
|
back to today's default (send `input` verbatim).
|
|
279
|
-
- Case-level `checks`/`scores` **append** to
|
|
279
|
+
- Case-level `checks`/`scores` **append** to eval-level ones; eval-level
|
|
280
280
|
expresses invariants ("never fails"), case-level expresses the specific
|
|
281
281
|
behavior under test.
|
|
282
|
-
-
|
|
283
|
-
quality
|
|
282
|
+
- An eval may freely mix data cases and scripted cases, though in practice
|
|
283
|
+
quality evals are all-data and smoke evals are all-scripted.
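The resolution rules above can be sketched as a small pure function. This is an illustration under stated assumptions, not the runner's code: `RunFn`, `EvalDef`, `CaseDef`, and `resolveCase` are hypothetical names, checks are represented by plain strings, and `run` is reduced to a function that returns the payloads it would send.

```typescript
// Hypothetical, simplified shapes; the real types carry far more detail.
type Check = string; // a check name stands in for a real check object
type RunFn = (input: unknown) => readonly unknown[]; // payloads to send

interface EvalDef {
  readonly task?: { readonly run: RunFn };
  readonly checks?: readonly Check[];
}

interface CaseDef {
  readonly id: string;
  readonly input?: unknown;
  readonly run?: RunFn; // present on scripted cases
  readonly checks?: readonly Check[];
}

/** Default when neither the case nor the eval defines a script: send input verbatim. */
const sendVerbatim: RunFn = (input) => [input];

function resolveCase(
  evalDef: EvalDef,
  caseDef: CaseDef,
): { run: RunFn; checks: readonly Check[] } {
  // case.run wins over eval.task; a data case without a task falls back.
  const run = caseDef.run ?? evalDef.task?.run ?? sendVerbatim;
  // Case-level checks append to eval-level ones; they never replace them.
  const checks = [...(evalDef.checks ?? []), ...(caseDef.checks ?? [])];
  return { run, checks };
}
```

Keeping the eval-level checks first means invariants are evaluated for every case, whichever script ends up running.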
|
|
284
284
|
|
|
285
|
-
The conceptual model this lands on: **the
|
|
285
|
+
The conceptual model this lands on: **the eval file is the `describe`, cases
|
|
286
286
|
are the `it`s** — grouping, one shared target, shared baseline checks and
|
|
287
|
-
requirements — without a block API, and with path-derived identity intact. The
|
|
287
|
+
requirements — without a block API, and with path-derived identity intact. The eval-level `task` keeps its role as the shared default for
|
|
288
288
|
dataset evals; it is no longer the only way to define behavior.
|
|
289
289
|
|
|
290
290
|
Execution semantics are uniform across both case kinds:
|
|
291
291
|
|
|
292
292
|
- **Concurrency**: scripted cases join the same bounded pool as data cases
|
|
293
|
-
(each owns its sessions, so they parallelize safely).
|
|
293
|
+
(each owns its sessions, so they parallelize safely). Evals whose cases
|
|
294
294
|
mutate shared target state (e.g. `defineState` persistence tests) set
|
|
295
295
|
`maxConcurrency: 1`.
|
|
296
|
-
- **Timeout**:
|
|
296
|
+
- **Timeout**: eval/CLI `timeoutMs` applies per case per trial; the signal is
|
|
297
297
|
`ctx.signal` inside `run` and aborts in-flight sends.
|
|
298
298
|
- **Trials**: apply identically — a scripted case under `trials: 3` runs its
|
|
299
299
|
`run` three times against three fresh primary sessions.
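The trial semantics described in this plan (a case passes only if every trial's checks pass, and scores aggregate as the mean) can be sketched as follows. The `TrialResult` and `CaseAggregate` shapes and the `aggregateTrials` name are assumptions made for illustration.

```typescript
// Hypothetical result shapes for illustration only.
interface TrialResult {
  readonly checksPassed: boolean;
  readonly scores: Readonly<Record<string, number>>;
}

interface CaseAggregate {
  readonly passed: boolean;
  readonly scores: Record<string, number>;
}

function aggregateTrials(trials: readonly TrialResult[]): CaseAggregate {
  if (trials.length === 0) {
    throw new Error("at least one trial is required");
  }
  // A single failing trial fails the whole case.
  const passed = trials.every((trial) => trial.checksPassed);

  // Sum each scorer's values, then divide by how often it reported.
  const sums: Record<string, number> = {};
  const counts: Record<string, number> = {};
  for (const trial of trials) {
    for (const [name, value] of Object.entries(trial.scores)) {
      sums[name] = (sums[name] ?? 0) + value;
      counts[name] = (counts[name] ?? 0) + 1;
    }
  }
  const scores: Record<string, number> = {};
  for (const name of Object.keys(sums)) {
    scores[name] = (sums[name] ?? 0) / (counts[name] ?? 1);
  }
  return { passed, scores };
}
```

Checks stay strict (all-or-nothing) while scores are averaged, which mirrors the split between hard assertions and soft, thresholded data.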
|
|
@@ -358,7 +358,7 @@ Notes:
|
|
|
358
358
|
|
|
359
359
|
- `send` does **not** throw on `turn.failed`/`session.failed` by default —
|
|
360
360
|
failure handling belongs to checks (`Checks.completed()`) or explicit
|
|
361
|
-
`turn.expectOk()`. This keeps negative-path
|
|
361
|
+
`turn.expectOk()`. This keeps negative-path evals (e.g. today's
|
|
362
362
|
`remote-agent-start-failure.ts`, `tool-throw-recover.ts`) natural to write.
|
|
363
363
|
- HITL coverage: `needsApproval` approvals (`approve`/`deny` option ids),
|
|
364
364
|
framework `ask_question` selects (`optionId`), freeform answers
|
|
@@ -401,7 +401,7 @@ Built-ins, exported from `eve/evals/checks`:
|
|
|
401
401
|
| Check | Asserts |
|
|
402
402
|
| ------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
403
403
|
| `Checks.completed()` | Final status is `"completed"` (not `"waiting"`, not `"failed"`) |
|
|
404
|
-
| `Checks.waiting()` | Final status is `"waiting"` (for park-shaped
|
|
404
|
+
| `Checks.waiting()` | Final status is `"waiting"` (for park-shaped evals) |
|
|
405
405
|
| `Checks.didNotFail()` | Status is not `"failed"` and no `turn.failed`/`step.failed` events |
|
|
406
406
|
| `Checks.messageIncludes(token)` | Joined `message.completed` text contains `token` (string or RegExp) |
|
|
407
407
|
| `Checks.outputEquals(value)` / `Checks.outputMatches(schema)` | Deep-equal / Standard Schema validation of `result.output` |
|
|
@@ -419,7 +419,7 @@ Pass/fail policy:
|
|
|
419
419
|
- Failed cases always produce a non-zero `eve eval` exit code.
|
|
420
420
|
- Scores keep today's semantics: thresholded, reported, never gate the exit
|
|
421
421
|
code — **unless** `--strict` is passed, which additionally fails the process
|
|
422
|
-
when any case scores below threshold. CI for fixture smoke
|
|
422
|
+
when any case scores below threshold. CI for fixture smoke evals runs
|
|
423
423
|
`--strict`; users running exploratory quality evals don't.
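A rough sketch of the pass/fail policy as an exit-code function. It covers only what this plan states: failed cases always fail the process, sub-threshold scores fail it only under `--strict`, and skips fail it only under `--no-skips` (taken from the requirements rules elsewhere in the plan). The `CaseOutcome` and `ExitFlags` shapes and the `exitCodeFor` name are hypothetical, and any exit codes other than `0` and `1` are out of scope here.

```typescript
// Hypothetical shapes for illustration only.
type Verdict = "passed" | "failed" | "skipped";

interface CaseOutcome {
  readonly verdict: Verdict;
  /** True when at least one score fell below its threshold. */
  readonly belowThreshold: boolean;
}

interface ExitFlags {
  readonly strict: boolean;
  readonly noSkips: boolean;
}

function exitCodeFor(cases: readonly CaseOutcome[], flags: ExitFlags): 0 | 1 {
  for (const outcome of cases) {
    // Check failures (and other case failures) always gate the exit code.
    if (outcome.verdict === "failed") return 1;
    // Skips are neutral unless the run demands full coverage.
    if (outcome.verdict === "skipped" && flags.noSkips) return 1;
    // Scores are soft data unless --strict is passed.
    if (outcome.verdict === "passed" && outcome.belowThreshold && flags.strict) return 1;
  }
  return 0;
}
```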
|
|
424
424
|
|
|
425
425
|
### Derived facts v2 (breaking)
|
|
@@ -477,8 +477,8 @@ export interface EveEvalCaseResult {
|
|
|
477
477
|
readonly skipReason?: string; // unmet requirement, when skipped
|
|
478
478
|
}
|
|
479
479
|
|
|
480
|
-
export interface
|
|
481
|
-
readonly
|
|
480
|
+
export interface EveEvalResult {
|
|
481
|
+
readonly id: string;
|
|
482
482
|
readonly target: EveEvalTarget;
|
|
483
483
|
readonly cases: readonly EveEvalCaseResult[];
|
|
484
484
|
readonly startedAt: string;
|
|
@@ -498,9 +498,9 @@ Downstream effects:
|
|
|
498
498
|
requirement inline); failed checks print their `message` indented under the
|
|
499
499
|
case line (replacing the bespoke error prose e2e tests hand-craft today).
|
|
500
500
|
Summary adds check and skip totals.
|
|
501
|
-
- **`EvalReporter` interface**:
|
|
502
|
-
`onCaseComplete` / `
|
|
503
|
-
flows through
|
|
501
|
+
- **`EvalReporter` interface**: lifecycle hooks are `onEvalStart` /
|
|
502
|
+
`onCaseComplete` / `onEvalComplete`; the richer `EveEvalCaseResult`
|
|
503
|
+
flows through case-completion hooks.
|
|
504
504
|
- **Braintrust reporter**: checks log as binary scores under a `check:` name
|
|
505
505
|
prefix (e.g. `check:toolCalled(bash)`) so experiments diff check regressions
|
|
506
506
|
the same way they diff score regressions; `verdict` and failed-check
|
|
@@ -511,16 +511,16 @@ Downstream effects:
|
|
|
511
511
|
file per session, keyed by session id.
|
|
512
512
|
- **Reporter throughput**: `onCaseComplete` is currently awaited inline inside
|
|
513
513
|
the case pool, so a slow reporter throttles execution; v2 queues reporter
|
|
514
|
-
callbacks off the hot path (ordering preserved per
|
|
514
|
+
callbacks off the hot path (ordering preserved per eval).
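One way to take reporter callbacks off the hot path while preserving their order is a promise chain: enqueueing returns immediately, and each callback starts only after the previous one has settled. This is a sketch of that general technique under stated assumptions, not the runner's implementation. `createReporterQueue`, `enqueue`, and `drain` are hypothetical names, and swallowing reporter errors is an assumed policy.

```typescript
// Hypothetical helper: a serial queue built on a promise chain.
function createReporterQueue<T>(report: (item: T) => Promise<void>) {
  let tail: Promise<void> = Promise.resolve();
  return {
    /** Schedules the callback and returns at once; the case pool never awaits it. */
    enqueue(item: T): void {
      tail = tail
        .then(() => report(item))
        // Assumed policy: a failing reporter must not break the chain.
        .catch(() => {});
    },
    /** Resolves once every callback enqueued so far has settled. */
    drain(): Promise<void> {
      return tail;
    },
  };
}
```

A runner built this way would call `enqueue` from inside the case pool and await `drain()` once, before its final completion hook, so a slow reporter delays only the end of the run.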
|
|
515
515
|
|
|
516
516
|
### Targets: a target is a URL
|
|
517
517
|
|
|
518
|
-
There is no
|
|
519
|
-
URL, and the
|
|
518
|
+
There is no eval-owned provisioning config. A target is always just a base
|
|
519
|
+
URL, and the eval's job is to interact and assert — never to describe how the
|
|
520
520
|
agent gets built or started. This is deliberate: properties like the agent's
|
|
521
521
|
model or the mock-model adapter are **build/start-time properties of the
|
|
522
|
-
server process**.
|
|
523
|
-
|
|
522
|
+
server process**. An eval cannot control them at run time, so an API that lets
|
|
523
|
+
an eval "declare" them is a footgun — it works only when the runner happens to
|
|
524
524
|
be the thing booting the server, and silently means nothing otherwise.
|
|
525
525
|
|
|
526
526
|
There are exactly two ways to get a target:
|
|
@@ -565,9 +565,9 @@ export interface EveEvalTargetHandle {
|
|
|
565
565
|
}
|
|
566
566
|
```
|
|
567
567
|
|
|
568
|
-
This covers channel
|
|
568
|
+
This covers channel evals (POST a signed webhook via `target.fetch`, assert
|
|
569
569
|
on an externally provisioned fake provider, attach to the created session) and
|
|
570
|
-
schedule
|
|
570
|
+
schedule evals (`dispatchSchedule` + `attachSession`).
|
|
571
571
|
|
|
572
572
|
The runner always performs the readiness/identity handshake regardless of who
|
|
573
573
|
provisioned the target: `/eve/v1/health` polling, then `/eve/v1/info`
|
|
@@ -577,12 +577,12 @@ capabilities are discovered, not assumed.
|
|
|
577
577
|
|
|
578
578
|
### Requirements: `requires`, verified against the live target
|
|
579
579
|
|
|
580
|
-
|
|
580
|
+
Evals cannot control the target, but their assertions still _assume_ things
|
|
581
581
|
about it — determinism via mock models, dev-only routes, a sidecar URL in the
|
|
582
582
|
environment. v1 gives those assumptions exactly one surface:
|
|
583
583
|
|
|
584
584
|
```ts
|
|
585
|
-
// Per
|
|
585
|
+
// Per eval (applies to all cases) and per case (additive):
|
|
586
586
|
readonly requires?: readonly EveEvalRequirement[];
|
|
587
587
|
|
|
588
588
|
type EveEvalRequirement =
|
|
@@ -598,24 +598,24 @@ Rules:
|
|
|
598
598
|
`env:<NAME>` against its own process environment. It never tries to make a
|
|
599
599
|
requirement true.
|
|
600
600
|
2. **Unmet requirement → skip, visibly.** The case (or every case, for
|
|
601
|
-
|
|
601
|
+
eval-level `requires`) gets `verdict: "skipped"` with the unmet
|
|
602
602
|
requirement as the reason — reported in console, `--json`, and artifacts;
|
|
603
603
|
never silently dropped, never failed. `--no-skips` turns skips into
|
|
604
604
|
failures for legs that must prove full coverage. Skips don't otherwise
|
|
605
605
|
affect the exit code.
|
|
606
606
|
3. **One convenience, because the runner boots the dev server:** plain
|
|
607
|
-
`eve eval` with
|
|
608
|
-
`--mock-models` or sees those
|
|
607
|
+
`eve eval` with evals requiring `"mockModels"` either passes
|
|
608
|
+
`--mock-models` or sees those evals skip with a message naming the flag.
|
|
609
609
|
No auto-magic in v1 — explicit and predictable beats clever.
|
|
610
610
|
4. **Runtime guards back the declarations.** `target.dispatchSchedule` throws
|
|
611
611
|
a requirement error when called by a case that didn't declare
|
|
612
612
|
`"devRoutes"` — so undeclared dependencies surface as named failures on
|
|
613
613
|
the local leg, not as mystery flakes on the remote leg.
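The rules above can be sketched as a verification step that never tries to satisfy a requirement and only reports what is unmet. The requirement strings come from this plan; the `TargetCapabilities` shape, the `Decision` type, and the function names are hypothetical. The environment is passed in as a plain record so the sketch stays self-contained.

```typescript
// Requirement strings as listed in the plan; the union itself is assumed.
type EveEvalRequirement = "mockModels" | "devRoutes" | `env:${string}`;

// Hypothetical view of what the runner discovered about the live target.
interface TargetCapabilities {
  readonly mockModels: boolean;
  readonly devRoutes: boolean;
}

type Env = Readonly<Record<string, string | undefined>>;

interface Decision {
  readonly verdict: "run" | "skipped" | "failed";
  readonly reason?: string;
}

function findUnmet(
  requires: readonly EveEvalRequirement[],
  caps: TargetCapabilities,
  env: Env,
): EveEvalRequirement[] {
  return requires.filter((req) => {
    if (req === "mockModels") return !caps.mockModels;
    if (req === "devRoutes") return !caps.devRoutes;
    // Remaining form is env:<NAME>, checked against the given environment.
    return env[req.slice("env:".length)] === undefined;
  });
}

function decide(
  requires: readonly EveEvalRequirement[],
  caps: TargetCapabilities,
  env: Env,
  noSkips: boolean,
): Decision {
  const unmet = findUnmet(requires, caps, env);
  if (unmet.length === 0) return { verdict: "run" };
  const reason = `unmet requirement: ${unmet.join(", ")}`;
  // --no-skips turns a visible skip into a failure.
  return { verdict: noSkips ? "failed" : "skipped", reason };
}
```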
|
|
614
614
|
|
|
615
|
-
The deferred idea —
|
|
615
|
+
The deferred idea — an eval-owned `environment` block where the runner builds
|
|
616
616
|
and starts targets, injects env, and manages sidecar lifecycles — is recorded
|
|
617
617
|
under Open questions. It is not in v1: it duplicated CLI concerns, and it let
|
|
618
|
-
|
|
618
|
+
evals express build-time properties they cannot actually own.
|
|
619
619
|
|
|
620
620
|
### The external provisioner pattern
|
|
621
621
|
|
|
@@ -636,38 +636,38 @@ node e2e/provision/<group>.ts & # build, sidecars, env, eve start, health-poll
|
|
|
636
636
|
 pnpm --filter <fixture-app> exec eve eval --strict --url "http://127.0.0.1:$PORT"
 ```
 
-
-requirements): a channel
-fake Bot API captured; a subagent
+Evals consume provisioner outputs through env (declared via `env:<NAME>`
+requirements): a channel eval reads `TELEGRAM_PROBE_URL` to query what the
+fake Bot API captured; a subagent eval reads `EVE_WEATHER_AGENT_HOST` to
 assert on `subagent.called` remote URLs. The probe/stub helpers
 (`startHttpProbe`, `startMcpStub`, generalized from `e2e/lib/`) ship in
 `eve/evals/environment` for provisioners — including users' own — to use, with
-an HTTP inspection endpoint so
+an HTTP inspection endpoint so evals can assert on captured requests across
 the process boundary.
 
 ### CLI v2
 
 ```
-eve eval [
+eve eval [evalId...]
   --url <url>             run against an existing target (built local server,
                           preview deployment); without it, boots the dev server
   --mock-models           boot the dev server with the deterministic mock
                           adapter (invalid with --url; the target's mock state
                           is discovered, not set)
-  --tag <tag...>          run only cases (or
+  --tag <tag...>          run only cases (or evals) carrying a tag
   --case <id...>          run only specific case ids
   --strict                sub-threshold scores also fail the exit code
   --no-skips              requirement skips fail instead of skipping
-  --trials <n>            override
+  --trials <n>            override eval trials
   --timeout <ms>          per-case timeout (existing)
   --max-concurrency <n>   (existing)
   --json                  structured stdout (existing)
-  --skip-report           skip
-  --list                  print discovered
+  --skip-report           skip eval reporters (existing)
+  --list                  print discovered evals/cases without running
   --verbose               stream per-case ctx.log and event summaries
 ```
 
-- Positional
+- Positional eval ids select evals; the dead `--all` flag is removed
 (no filter already means all).
 - Exit codes: `0` all cases passed checks (and thresholds under `--strict`);
 `1` any case failed (check failure, run throw, execution error, or strict
@@ -702,7 +702,7 @@ into the framework, then it dies:
 3. **Turn failure surfacing**: export a typed
 `isTurnFailureEvent(event)` narrowing helper and the
 `EveEvalTurn.expectOk()` driver method instead of `TurnFailedError`-style
-throw-by-default (negative-path
+throw-by-default (negative-path evals need non-throwing sends).
 4. **Multimodal sugar** (`sendTextWithImage`): becomes
 `EvalSession.sendFile`; the data-URL/`FilePart` encoding helper is exported
 from `eve/client` for general use.
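The turn-failure item in the hunk above describes a narrowing helper plus an opt-in `expectOk()`. A minimal sketch of that pattern follows. The event shapes, the `Turn` class, and its `failure()` method are hypothetical stand-ins for illustration; only the names `isTurnFailureEvent` and `expectOk` come from the plan, and the real signatures in eve may differ.

```typescript
// Hypothetical event union for illustration; the real eve event types may differ.
type TurnEvent =
  | { type: "message.delta"; text: string }
  | { type: "turn.completed" }
  | { type: "turn.failed"; error: { message: string } };

type TurnFailureEvent = Extract<TurnEvent, { type: "turn.failed" }>;

// Type predicate: narrows a TurnEvent to the failure variant without throwing.
function isTurnFailureEvent(event: TurnEvent): event is TurnFailureEvent {
  return event.type === "turn.failed";
}

// Hypothetical turn wrapper: collecting events never throws, so negative-path
// evals can inspect the failure; expectOk() is the opt-in throwing path.
class Turn {
  readonly events: readonly TurnEvent[];

  constructor(events: readonly TurnEvent[]) {
    this.events = events;
  }

  // Returns the first failure event, or undefined when the turn succeeded.
  failure(): TurnFailureEvent | undefined {
    return this.events.find(isTurnFailureEvent);
  }

  // Throws only when the caller asks for success to be enforced.
  expectOk(): this {
    const failed = this.failure();
    if (failed) throw new Error(`turn failed: ${failed.error.message}`);
    return this;
  }
}
```

A negative-path case would call `failure()` and assert on the returned event, while a happy-path case would chain `expectOk()` before its checks.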
@@ -710,20 +710,20 @@ into the framework, then it dies:
 
 ## Replacing the e2e surface
 
-### Where
+### Where evals live
 
-Each fixture app keeps owning its coverage:
+Each fixture app keeps owning its coverage: evals move into
 `e2e/fixtures/agent-*/evals/*.eval.ts` and
 `apps/fixtures/weather-fixture/evals/*.eval.ts`. Discovery already scans
-`<appRoot>/evals/`. The area-policy module becomes unnecessary —
+`<appRoot>/evals/`. The area-policy module becomes unnecessary — an eval can
 only target its own app, by construction. Provisioning scripts live next to
 the fixtures (`e2e/provision/`), built from today's `e2e/lib/server.ts` and
 `e2e/target/` logic rather than rewriting it.
 
 ### Coverage mapping
 
-Scripted cases let one
-consolidate into roughly one
+Scripted cases let one eval absorb a whole e2e group: today's 78 script files
+consolidate into roughly one eval per behavior family (e.g.
 `agent-tools-hitl/evals/hitl.eval.ts` with `approve-then-persist`,
 `deny-regates`, `ask-question`, and `tool-auth` cases), each sharing one
 provisioned target. `--case` becomes the day-to-day tool for re-running a
@@ -734,11 +734,11 @@ single behavior while debugging.
 | `basic-runtime/*` (basic, multi-turn history, client context, output schema, image, define-state) | `task.run` + `Checks.messageIncludes` / `outputMatches`; `sendFile` for image; `send({ clientContext })`; multi-turn token recall is two `send`s and one check |
 | `tools/*` (14 dynamic-tool files, MCP, multi-step loop, narrowing, throw-recover) | `Checks.toolCalled(name, { input, output, isError })`, `Checks.toolOrder`, `Checks.event` for ordering edge cases; MCP stub started by the provisioner (`startMcpStub`), addressed via `env:` requirement |
 | `tools-hitl/*` (approval, denial, ask-question, tool auth) | `expectInputRequests` + `respond`/`respondAll`; auth flows keep the IdP emulator as a provisioner sidecar |
-| `tools-sandbox/*` | `task.run` + `Checks.toolCalled("bash", ...)`; snapshot
+| `tools-sandbox/*` | `task.run` + `Checks.toolCalled("bash", ...)`; snapshot eval tagged `requires-credentials`, excluded in CI via `--tag` |
 | `channels/*` | provisioner starts the fake provider (one shared `startHttpProbe`) and exports `*_API_BASE_URL`; case does `target.fetch` webhook ingress, asserts on probe captures via its inspection endpoint, `attachSession` for the created session |
 | `schedules/*` | `target.dispatchSchedule` + `attachSession`; stream-resume test asserts `attachSession({ startIndex })` replay |
 | `subagents/*` (incl. remote delegation, callbacks, failures) | provisioner boots both agents and wires `EVE_WEATHER_AGENT_HOST`; `Checks.subagentCalled(name, { remoteUrl })`; callback retry/bypass probes as provisioner sidecars |
-| `codemode/*` | Unchanged
+| `codemode/*` | Unchanged eval bodies; CI matrix env (`EVE_EXPERIMENTAL_CODE_MODE=1`) is set by the CI job / provisioner when starting the server |
 | `tui-client/*` | **Stays a script harness** (non-goal) |
 
 ### Worked example: porting `remote-agent-delegation`
@@ -746,15 +746,15 @@ single behavior while debugging.
 Today's `e2e/tests/subagents/remote-agent-delegation.ts` is 101 lines: manual
 port arithmetic, two `resolveTarget` calls threading `startEnv` by hand, and a
 58-line `assertRemoteDelegation` function of filter/narrow/throw prose. The v2
-
+eval, at `e2e/fixtures/agent-subagents/evals/remote-delegation.eval.ts`:
 
 ```ts
-import {
+import { defineEval } from "eve/evals";
 import { Checks } from "eve/evals/checks";
 
 const CITY = "Lisbon";
 
-export default
+export default defineEval({
   description: "Remote subagent delegation over HTTP to a second local agent.",
 
   // Assumptions about the target, verified by the runner. The provisioner
@@ -795,8 +795,8 @@ weather fixture with mocks, boot the parent with mocks +
 
 What the diff buys, beyond line count:
 
-- **Clean separation** — the
-topology lives in one provisioner shared by every subagent case. The
+- **Clean separation** — the eval holds only interaction and assertions; the
+topology lives in one provisioner shared by every subagent case. The eval
 states its assumptions (`requires`) and the runner enforces them, so running
 it against an unprovisioned target skips with a named reason instead of
 failing mysteriously.
@@ -807,14 +807,14 @@ What the diff buys, beyond line count:
 configured) Braintrust like any other eval, and `--case
 weather-result-reaches-parent` reruns it in isolation.
 - **Room to grow** — `remote-agent-callback-retry`, `-bypass`, and
-`-start-failure` become sibling cases in the same
+`-start-failure` become sibling cases in the same eval, sharing one
 provisioned topology instead of re-spawning per file.
 
 ### CI
 
 `.github/workflows/smoke.yml` discovery changes from globbing
 `e2e/tests/*/*.ts` to globbing fixture apps with `evals/` directories. Each
-matrix leg provisions, then runs the
+matrix leg provisions, then runs the evals against the resulting URL:
 
 ```sh
 node e2e/provision/<group>.ts & # build + sidecars + env + eve start
@@ -822,7 +822,7 @@ pnpm --filter <fixture-app> exec eve eval --strict --json --url "$TARGET_URL"
 ```
 
 twice (direct / `EVE_EXPERIMENTAL_CODE_MODE=1` set by the provisioner),
-`fail-fast: false`, JUnit reporter for annotations. Per-
+`fail-fast: false`, JUnit reporter for annotations. Per-eval artifacts under
 `.eve/evals/` upload on failure — strictly better debuggability than today's
 stdout scraping. A post-deploy leg is the same invocation pointed at a preview
 deployment; requirement-incompatible cases skip visibly.
@@ -871,7 +871,7 @@ Each phase ships independently, keeps `pnpm test` green, and includes docs
 8. `EveEvalTargetHandle`: `baseUrl`, `fetch`, `info`, `capabilities`,
 `dispatchSchedule`, `attachSession`; `--mock-models` for the runner-booted
 dev server; readiness/identity handshake for `--url` targets.
-9. Requirements:
+9. Requirements: eval/case `requires` (`mockModels` / `devRoutes` /
 `env:<NAME>`), `skipped` verdict + `--no-skips`, runtime guards on
 requirement-gated handle methods, and `/eve/v1/info` reporting mock-model
 state so requirements are verified against the live target.
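The requirements item in the hunk above (`requires`, the `skipped` verdict, and `--no-skips`) can be pictured with a small sketch. Everything here is an assumption made for illustration: the `TargetInfo` shape (in particular a `devRoutes` field), the function names, and the reason strings are hypothetical and are not taken from eve's actual `/eve/v1/info` payload or runner.

```typescript
// Hypothetical requirement vocabulary, mirroring the three forms named in the plan.
type Requirement = "mockModels" | "devRoutes" | `env:${string}`;

// Hypothetical view of what the live target reports about itself.
interface TargetInfo {
  mockModels: boolean;
  devRoutes: boolean;
}

// Returns one human-readable reason per unmet requirement (empty means "run it").
function unmetRequirements(
  requires: readonly Requirement[],
  info: TargetInfo,
  env: Readonly<Record<string, string | undefined>>,
): string[] {
  const reasons: string[] = [];
  for (const req of requires) {
    if (req === "mockModels") {
      if (!info.mockModels) reasons.push("target is not running mock models");
    } else if (req === "devRoutes") {
      if (!info.devRoutes) reasons.push("target does not expose dev routes");
    } else {
      // Remaining variant is `env:<NAME>`: only presence is checked.
      const name = req.slice("env:".length);
      if (!env[name]) reasons.push(`env ${name} is not set`);
    }
  }
  return reasons;
}

type RequirementVerdict = "run" | "skipped" | "failed";

// With --no-skips, an unmet requirement fails the case instead of skipping it.
function requirementVerdict(reasons: readonly string[], noSkips: boolean): RequirementVerdict {
  if (reasons.length === 0) return "run";
  return noSkips ? "failed" : "skipped";
}
```

Passing the environment in as a plain record, rather than reading a global, keeps the check easy to exercise against different provisioned targets.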
@@ -883,7 +883,7 @@ Each phase ships independently, keeps `pnpm test` green, and includes docs
 
 ### Phase 4 — Migration and deletion
 
-12. Port
+12. Port evals group by group in the order: `basic-runtime` → `tools` →
 `tools-hitl` → `subagents` → `schedules` → `channels` → `codemode` →
 `tools-sandbox`. Each ported group flips its CI matrix entry from
 `node e2e/tests/...` to provision + `eve eval --strict --url` in the same
@@ -892,7 +892,7 @@ Each phase ships independently, keeps `pnpm test` green, and includes docs
 area policy; shrink `e2e/lib`/`e2e/target` into `e2e/provision/`; update
 `e2e/README.md` and AGENTS.md smoke-test guidance to point at `eve eval`.
 14. Optional follow-on: add a post-deploy `--url` leg against preview
-deployments for the requirement-compatible subset of fixture
+deployments for the requirement-compatible subset of fixture evals.
 
 ## Breaking changes
 
@@ -903,37 +903,37 @@ changesets for each:
 typed records.
 - `EveEvalCase` becomes a data/scripted union; `input` is no longer required
 on scripted cases. Existing data cases are unaffected.
-- `model` becomes optional;
--
-- Exit-code semantics gain check failures (
+- `model` becomes optional; evals passing dummy models can drop them.
+- Eval selection uses positional ids; `--all` removed.
+- Exit-code semantics gain check failures (evals without `checks` see no
 change unless `--strict`).
 
 ## Risks and open questions
 
-- **Cost/flakiness of model-backed
+- **Cost/flakiness of model-backed evals in CI.** Mitigation: the
 `"mockModels"` requirement makes determinism a declared, verified property
 instead of a per-file accident; `trials` + `--strict` thresholds handle the
-
+evals that must use real models. The migration should explicitly decide,
 per ported smoke, mock vs real — today's split is undocumented.
 - **The stream-registration race** may not be fully fixable server-side in
 Phase 2; the client-level bounded retry is the documented fallback. Track it
 as its own issue rather than letting retry constants drift again.
-- **Channel
-sandbox snapshot
+- **Channel evals with real credentials** (`slack-thread-context`) and
+sandbox snapshot evals stay tag-gated (`requires-credentials`) and excluded
 from default CI, as today.
 - **Provisioner drift on `--url` legs.** Beyond what `/info` exposes
 (identity, mode, mock-model state) and `env:<NAME>` presence checks, the
-runner trusts the provisioner.
+runner trusts the provisioner. An eval can pass locally and skip-or-fail
 remotely because the target was provisioned differently; requirements make
 this visible but cannot make it impossible.
-- **Open: should the runner ever own provisioning?**
+- **Open: should the runner ever own provisioning?** An eval-owned
 `environment` block (runner builds/starts targets, injects env, manages
 sidecar lifecycles, boots secondary agents) was considered and deliberately
-cut from v1: it duplicated CLI concerns and let
+cut from v1: it duplicated CLI concerns and let evals declare build-time
 properties they cannot own at run time. Revisit only with concrete friction
 data from the migration — the likely v2 shape, if any, is a _project-level_
 provisioning config consumed by `eve eval` (like Playwright's `webServer`),
-not per-
+not per-eval config.
 - **User-facing naming**: `checks` vs `scores` vs `thresholds` vs `requires`
 needs a docs pass so the soft/hard/assumption distinction is obvious; the
 evals doc gets a "smoke-testing your agent" section once Phase 2 lands.
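The soft/hard/assumption distinction named in the last bullet, together with the exit-code rules quoted in the CLI hunk, might be sketched as below. This is an illustrative model only: the `CaseResult` fields, the treatment of a missing score as `0`, and the choice to let skipped cases exit `0` are assumptions, and the exit-code list in the diff is truncated, so only codes `0` and `1` are modeled.

```typescript
// Hypothetical per-case result; field names are illustrative, not eve's artifact schema.
interface CaseResult {
  id: string;
  unmetRequirements: string[];        // assumptions the target did not satisfy
  checksPassed: boolean;              // hard assertions
  scores: Record<string, number>;     // soft, graded signals
  thresholds: Record<string, number>; // minimum acceptable score per score name
}

interface Flags {
  strict: boolean;  // --strict: sub-threshold scores also fail
  noSkips: boolean; // --no-skips: requirement skips fail instead of skipping
}

type Verdict = "passed" | "failed" | "skipped";

function verdictFor(result: CaseResult, flags: Flags): Verdict {
  // Assumptions first: an unmet requirement means the case never really ran.
  if (result.unmetRequirements.length > 0) return flags.noSkips ? "failed" : "skipped";
  // Hard: a failed check always fails the case.
  if (!result.checksPassed) return "failed";
  // Soft: thresholds only fail the case under --strict.
  // Assumption: a score that was never recorded counts as 0.
  const belowThreshold = Object.entries(result.thresholds).some(
    ([name, min]) => (result.scores[name] ?? 0) < min,
  );
  return belowThreshold && flags.strict ? "failed" : "passed";
}

// Models only the two exit codes visible in the diff: 0 (nothing failed) and 1.
function exitCode(results: readonly CaseResult[], flags: Flags): 0 | 1 {
  return results.some((r) => verdictFor(r, flags) === "failed") ? 1 : 0;
}
```

Ordering the three tests this way keeps a skip from being reported as a check failure, which matches the plan's goal of skipping "with a named reason instead of failing mysteriously".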