eve 0.6.0-beta.10 → 0.6.0-beta.11

Files changed (68)
  1. package/CHANGELOG.md +13 -0
  2. package/README.md +1 -1
  3. package/dist/docs/evals-v2-plan.md +108 -108
  4. package/dist/docs/public/advanced/evals.md +51 -23
  5. package/dist/docs/public/reference/cli.md +17 -17
  6. package/dist/docs/public/reference/typescript-api.md +2 -2
  7. package/dist/src/chunks/{use-eve-agent-DCZbkLG7.js → use-eve-agent-BD79SFqg.js} +82 -29
  8. package/dist/src/chunks/{use-eve-agent-DoheC4_o.js → use-eve-agent-DT2A6VJb.js} +82 -29
  9. package/dist/src/cli/run.d.ts +1 -1
  10. package/dist/src/cli/run.js +1 -1
  11. package/dist/src/client/file-parts.d.ts +18 -0
  12. package/dist/src/client/file-parts.js +1 -0
  13. package/dist/src/client/index.d.ts +3 -2
  14. package/dist/src/client/index.js +1 -1
  15. package/dist/src/client/message-response.js +1 -1
  16. package/dist/src/client/open-stream.d.ts +6 -0
  17. package/dist/src/client/open-stream.js +1 -1
  18. package/dist/src/client/session-utils.d.ts +5 -0
  19. package/dist/src/client/session-utils.js +1 -1
  20. package/dist/src/client/session.js +1 -1
  21. package/dist/src/client/types.d.ts +5 -1
  22. package/dist/src/evals/checks/checks.d.ts +1 -1
  23. package/dist/src/evals/checks/index.d.ts +1 -1
  24. package/dist/src/evals/checks/match.d.ts +1 -1
  25. package/dist/src/evals/cli/eval.d.ts +2 -2
  26. package/dist/src/evals/cli/eval.js +1 -1
  27. package/dist/src/evals/{define-eval-suite.d.ts → define-eval.d.ts} +6 -6
  28. package/dist/src/evals/define-eval.js +1 -0
  29. package/dist/src/evals/index.d.ts +3 -2
  30. package/dist/src/evals/index.js +1 -1
  31. package/dist/src/evals/runner/artifacts.d.ts +6 -6
  32. package/dist/src/evals/runner/artifacts.js +1 -1
  33. package/dist/src/evals/runner/discover.d.ts +10 -10
  34. package/dist/src/evals/runner/discover.js +1 -1
  35. package/dist/src/evals/runner/execute-case.d.ts +4 -7
  36. package/dist/src/evals/runner/execute-case.js +1 -1
  37. package/dist/src/evals/runner/{execute-suite.d.ts → execute-eval.d.ts} +6 -6
  38. package/dist/src/evals/runner/execute-eval.js +1 -0
  39. package/dist/src/evals/runner/reporters/braintrust.d.ts +4 -4
  40. package/dist/src/evals/runner/reporters/braintrust.js +2 -2
  41. package/dist/src/evals/runner/reporters/console.d.ts +3 -3
  42. package/dist/src/evals/runner/reporters/console.js +1 -1
  43. package/dist/src/evals/runner/reporters/types.d.ts +6 -6
  44. package/dist/src/evals/scorers/autoevals.d.ts +7 -7
  45. package/dist/src/evals/scorers/autoevals.js +1 -1
  46. package/dist/src/evals/scorers/model-marker.d.ts +3 -3
  47. package/dist/src/evals/session.d.ts +46 -0
  48. package/dist/src/evals/session.js +1 -0
  49. package/dist/src/evals/types.d.ts +154 -57
  50. package/dist/src/execution/tool-auth.js +1 -1
  51. package/dist/src/internal/application/package.js +1 -1
  52. package/dist/src/protocol/message.d.ts +8 -0
  53. package/dist/src/protocol/message.js +2 -2
  54. package/dist/src/runtime/connections/mcp-client.js +1 -1
  55. package/dist/src/runtime/connections/scoped-authorization.d.ts +9 -5
  56. package/dist/src/runtime/connections/scoped-authorization.js +1 -1
  57. package/dist/src/runtime/connections/types.d.ts +24 -0
  58. package/dist/src/setup/boxes/deploy-project.js +1 -1
  59. package/dist/src/setup/scaffold/create/project.js +1 -1
  60. package/dist/src/setup/scaffold/update/channels.js +2 -2
  61. package/dist/src/setup/scaffold/update/connections.js +1 -1
  62. package/dist/src/svelte/index.js +1 -1
  63. package/dist/src/svelte/use-eve-agent.js +1 -1
  64. package/dist/src/vue/index.js +1 -1
  65. package/dist/src/vue/use-eve-agent.js +1 -1
  66. package/package.json +1 -1
  67. package/dist/src/evals/define-eval-suite.js +0 -1
  68. package/dist/src/evals/runner/execute-suite.js +0 -1
package/CHANGELOG.md CHANGED
@@ -1,5 +1,18 @@
  # eve

+ ## 0.6.0-beta.11
+
+ ### Minor Changes
+
+ - 9bb6371: Adds the evals interaction API: scripted cases, `task.run`, `EveEvalSession` helpers for HITL responses and file attachments, multi-session event capture, and client-level HITL request results with retry handling for stream registration races.
+ - 22dda94: Rename the eval authoring helper from `defineEvalSuite` to `defineEval` and update the public eval types, reporter hooks, CLI wording, and JSON result shape to use eval terminology consistently.
+
+ ### Patch Changes
+
+ - a0ca3bb: Scaffolded projects now pin `@vercel/connect@0.2.2`. The previously pinned 0.1.1 predates the `@vercel/connect/eve` entrypoint, so generated channels and connections failed to build on deploy with `Package subpath './eve' is not defined by "exports"`.
+ - a0ca3bb: Fix interactive post-setup deploys failing with "The `projectSettings` object is required for new projects". The setup deploy now passes `--yes` to `vercel deploy --prod` in interactive runs too, so the Vercel API auto-detects framework settings for projects that eve provisioned through the projects API.
+ - 380693d: Token eviction on a rejected bearer now cascades to the authorization strategy's own cache, not just Eve's per-step cache. `AuthorizationDefinition` gains an optional `evict(opts)` hook, and the shared `evictScopedToken` path (used by both authored tools and MCP connections) invokes it after dropping the per-step entry. This lets `@vercel/connect`-backed connections purge their in-process token cache so a revoked-but-unexpired grant is genuinely re-fetched instead of re-read from a lower cache layer.
+
  ## 0.6.0-beta.10

  ### Minor Changes
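The 380693d entry in the changelog hunk above describes a two-layer eviction. As a rough illustration only, the cascade might look like the sketch below. Only the names `AuthorizationDefinition`, `evict(opts)`, and `evictScopedToken` come from the changelog; the `EvictOptions` shape, the `getToken` method, the `perStepCache` map, and `createCachingStrategy` are assumptions made up for this example and are not the package's actual signatures.

```typescript
// Hypothetical shapes: only `evict(opts)` and `evictScopedToken` are names
// taken from the changelog; everything else is assumed for illustration.
interface EvictOptions {
  readonly scope: string;
}

interface AuthorizationDefinition {
  getToken(scope: string): string;
  /** Optional hook: drop the strategy's own cached token for this scope. */
  evict?(opts: EvictOptions): void;
}

// Stand-in for the per-step cache, modeled as a plain Map keyed by scope.
const perStepCache = new Map<string, string>();

function evictScopedToken(auth: AuthorizationDefinition, scope: string): void {
  // 1. Drop the per-step entry first.
  perStepCache.delete(scope);
  // 2. Then cascade to the strategy's own cache, if it exposes the hook.
  auth.evict?.({ scope });
}

// A strategy that keeps its own in-process cache below the per-step layer.
function createCachingStrategy(
  fetchToken: (scope: string) => string,
): AuthorizationDefinition {
  const cache = new Map<string, string>();
  return {
    getToken(scope: string): string {
      let token = cache.get(scope);
      if (token === undefined) {
        token = fetchToken(scope);
        cache.set(scope, token);
      }
      return token;
    },
    evict(opts: EvictOptions): void {
      cache.delete(opts.scope);
    },
  };
}

let issued = 0;
const strategy = createCachingStrategy((scope) => `${scope}-token-${++issued}`);
perStepCache.set("github", strategy.getToken("github"));

// Simulate a rejected bearer: both layers are purged, so the next read refetches.
evictScopedToken(strategy, "github");
const refetched = strategy.getToken("github");
```

Without the `evict` hook, the second `getToken` call would return the strategy's cached value again, which is the stale-read behavior the entry says was fixed.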
package/README.md CHANGED
@@ -52,7 +52,7 @@ Every authored directory has a typed helper. Import each from the matching subpa
  | `eveChannel(...)`, `slackChannel(...)`, `vercelOidc(...)` | `eve/channels/eve`, `/slack`, `/auth` | reused from `channels/<name>.ts` |
  | `defineSandbox(...)` | `eve/sandbox` | `sandbox.ts` (or `sandbox/sandbox.ts`) |
  | `defineSchedule(...)` | `eve/schedules` | `schedules/<name>.ts` (or `schedules/<name>.md`) |
- | `defineEvalSuite(...)` | `eve/evals` | `evals/<name>.eval.ts` |
+ | `defineEval(...)` | `eve/evals` | `evals/<name>.eval.ts` |

  Runtime accessors live on the subpath that owns the concern:
 
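For existing `*.eval.ts` files, the helper rename shown in the README hunk above is mechanical. The following is a minimal sketch of a text rewrite, assuming the helper name only appears as a standalone identifier. `renameEvalHelper` is a hypothetical function written for this example and is not shipped by the package. It does not touch the renamed public types or reporter hooks that the changelog also mentions.

```typescript
// Hypothetical one-off migration helper: rewrites the old helper name wherever
// it appears as a whole identifier, leaving longer identifiers untouched.
function renameEvalHelper(source: string): string {
  return source.replace(/\bdefineEvalSuite\b/g, "defineEval");
}

const before = [
  'import { defineEvalSuite } from "eve/evals";',
  "",
  "export default defineEvalSuite({ cases: [] });",
].join("\n");

const after = renameEvalHelper(before);
```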
package/dist/docs/evals-v2-plan.md CHANGED
@@ -6,7 +6,7 @@ Scope: `packages/eve` (evals, client, CLI), `e2e/`, CI

  ## Summary

- Eve's eval suites (`defineEvalSuite` + `eve eval`) and the hand-rolled e2e smoke
+ Eve's evals (`defineEval` + `eve eval`) and the hand-rolled e2e smoke
  surface (`e2e/tests/**` + `ExampleClient`) are two implementations of the same
  idea: drive a real agent over HTTP and judge what happened. The eval runner has
  the right bones — filesystem discovery, target resolution, stream capture,
@@ -23,16 +23,16 @@ missing:

  1. **An imperative interaction API** — `run(ctx)` with a typed `EvalSession`
  driver for multi-turn control flow, HITL responses, approvals, structured
- output, attachments, and multi-session scenarios. Available at suite level
+ output, attachments, and multi-session scenarios. Available at eval level
  (`task.run`, the shared default for dataset evals) and per case
- (`case.run`), so one suite groups many distinct scripted behaviors the way
+ (`case.run`), so one eval groups many distinct scripted behaviors the way
  a test file groups `it` blocks.
  2. **A hard-assertion tier** — `checks`, distinct from `scores`. Check failures
  fail the case and the process; scores remain soft, thresholded data.
  3. **A URL-shaped target model with verified requirements** — a target is
  always just a URL. `eve eval` obtains one (boots the dev server, as today)
  or `--url` brings your own (a built server in CI, a preview deployment).
- Suites never declare _how_ to provision the agent — they declare
+ Evals never declare _how_ to provision the agent — they declare
  `requires` assumptions (mock models, dev routes, sidecar env) that the
  runner verifies against the live target and skips or errors on, visibly.
  Provisioning (build, start, env injection, sidecars, secondary agents)
@@ -41,7 +41,7 @@ missing:
  `e2e/tests/basic-runtime/evals.ts` already proves out today.

  The end state: every behavior currently proven by `e2e/tests/**` (except the
- TUI tests, see Non-goals) is an eval suite in a fixture app's `evals/`
+ TUI tests, see Non-goals) is an eval in a fixture app's `evals/`
  directory, CI runs `eve eval --strict` per fixture app per matrix mode, and
  `e2e/lib/client.ts` plus the per-file harness code are deleted. Users get the
  same machinery for their own agents: smoke-test your agent the way Eve
@@ -49,7 +49,7 @@ smoke-tests itself.

  ## Why not a `describe`/`it` runner

- We considered replacing `defineEvalSuite` with a vitest/`node:test`-shaped API.
+ We considered replacing `defineEval` with a vitest/`node:test`-shaped API.
  Rejected, for these reasons:

  1. **Evals have semantics tests don't.** Scores in `[0,1]`, per-scorer
@@ -58,8 +58,8 @@ Rejected, for these reasons:
  pass/fail; we would immediately reinvent scorers and reporters _inside_ `it`
  blocks and lose the structured result model that powers `--json` and the
  Braintrust reporter.
- 2. **Suite identity is path-derived** (repo principle 5, enforced by
- `scripts/guard-invariants.mjs`). One suite per `evals/<path>.eval.ts` file
+ 2. **Eval identity is path-derived** (repo principle 5, enforced by
+ `scripts/guard-invariants.mjs`). One eval per `evals/<path>.eval.ts` file
  maps cleanly onto the filesystem-first philosophy; nested describe blocks
  fight it.
  3. **The dataset model is the right default.** "Load 200 cases from YAML, fan
@@ -67,7 +67,7 @@ Rejected, for these reasons:
  A block-based runner makes that the awkward case (loops generating `it`s).
  4. **What e2e actually needs is not blocks** — it is control flow _inside a
  case_, a typed session driver, and assertion helpers over the event stream.
- All three fit inside the existing suite shape: `run` at the task and case
+ All three fit inside the existing eval shape: `run` at the task and case
  level provides the control flow, and scripted cases give the same grouping
  a `describe` file gives `it` blocks (see "Scripted cases" below).
  5. **Wrapping vitest violates principle 3** (wrap third-party deps; don't
@@ -88,15 +88,15 @@ first-class rather than by switching paradigms.
  arbitrary event predicates.
  - Hard pass/fail semantics suitable for CI gating, coexisting with soft scores.
  - One target model: any URL — the runner-booted dev server, a locally built
- `eve start` process, or a deployed instance — with suite-declared
- requirements verified against the live target instead of suite-owned
+ `eve start` process, or a deployed instance — with eval-declared
+ requirements verified against the live target instead of eval-owned
  provisioning.
  - Drive surfaces beyond the session route: channels (webhook ingress) and
- schedules (dev dispatch), with stream consumption for sessions the suite did
+ schedules (dev dispatch), with stream consumption for sessions the eval did
  not create.
  - One HTTP client: `eve/client` absorbs everything `e2e/lib/client.ts` does;
  `ExampleClient` is deleted.
- - Replace `e2e/tests/**` (minus TUI) with eval suites; CI becomes
+ - Replace `e2e/tests/**` (minus TUI) with evals; CI becomes
  `eve eval --strict` runs.

  ## Non-goals
@@ -108,13 +108,13 @@ first-class rather than by switching paradigms.
  tier only.
  - **Braintrust/autoevals integration** is unchanged in shape (still wrapped
  per principle 3).
- - **No authored suite `id`/`name`.** Identity stays path-derived; the
+ - **No authored eval `id`/`name`.** Identity stays path-derived; the
  invariant guard keeps enforcing it.

  ## Current state (abridged; see code for detail)

- - Suite API: `defineEvalSuite({ cases | load, task, scores, model, thresholds,
- reporters, ... })` in `packages/eve/src/evals/define-eval-suite.ts` and
+ - Eval API: `defineEval({ cases | load, task, scores, model, thresholds,
+ reporters, ... })` in `packages/eve/src/evals/define-eval.ts` and
  `types.ts`. `task` is `prompt(case)` or `messages(case) => string[]` — a
  static list, no branching, no HITL (`runner/execute-case.ts:38` only ever
  sends `{ message }`).
@@ -140,17 +140,17 @@ reporters, ... })` in `packages/eve/src/evals/define-eval-suite.ts` and

  ## The v2 API

- ### Suite shape
+ ### Eval shape

  ```ts
- import { defineEvalSuite } from "eve/evals";
+ import { defineEval } from "eve/evals";
  import { Checks } from "eve/evals/checks";
  import { Run, Text } from "eve/evals/scores";

- export default defineEvalSuite({
+ export default defineEval({
  description: "HITL approval flows: park, approve, deny, persist.",

- // Suite-level checks apply to every case.
+ // Eval-level checks apply to every case.
  checks: [Checks.didNotFail()],
  scores: [Run.didNotFail()],

@@ -180,20 +180,20 @@ export default defineEvalSuite({
180
180
  });
181
181
  ```
182
182
 
183
- Changes to `EveEvalSuiteInput` (`packages/eve/src/evals/types.ts`):
183
+ Changes to `EveEvalInput` (`packages/eve/src/evals/types.ts`):
184
184
 
185
- | Field | Change |
186
- | ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
187
- | `task.run` | New third task variant, mutually exclusive with `prompt`/`messages`. `prompt` and `messages` become sugar implemented on top of `run`. |
188
- | scripted cases | `EveEvalCase` becomes a union: a **data case** (`input`/`expected`, run through the suite `task` — today's shape) or a **scripted case** (`run`, no `input` required). `case.run` overrides `suite.task`. See "Scripted cases" below. |
189
- | `checks` | New optional array of `EveEvalCheck`, at suite level and per case (case-level appends to suite-level). Hard assertions; any failure marks the case failed and flips the CLI exit code. |
190
- | `requires` | New optional requirement list (suite- and case-level): `"mockModels"`, `"devRoutes"`, `"env:<NAME>"`. Verified against the live target; cases the target cannot support are skipped with a reported verdict, so one suite runs against the dev server, a local build, and a deployed URL. See "Targets". Suites own no provisioning config — no kind/env/setup. |
191
- | `model` | **Now optional.** Required only when a model-backed scorer is present; validation moves from "always required" to "required if any scorer declares it needs a judge" (built-in autoevals scorers carry a marker; custom scorers that read `args.model` get `undefined` unless the suite provides one). Kills the `"eve-bootstrap-model"` dummy-model workaround. |
192
- | `trials` | New optional `number` (default 1). Runs each case N times; per-trial results are reported, a case passes only if every trial's checks pass, and scores aggregate as mean. For nondeterministic model-backed suites. |
193
- | `tags` filtering | Case and suite `tags` become functional (CLI `--tag`). |
185
+ | Field | Change |
186
+ | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
187
+ | `task.run` | New third task variant, mutually exclusive with `prompt`/`messages`. `prompt` and `messages` become sugar implemented on top of `run`. |
188
+ | scripted cases | `EveEvalCase` becomes a union: a **data case** (`input`/`expected`, run through the eval `task` — today's shape) or a **scripted case** (`run`, no `input` required). `case.run` overrides `eval.task`. See "Scripted cases" below. |
189
+ | `checks` | New optional array of `EveEvalCheck`, at eval level and per case (case-level appends to eval-level). Hard assertions; any failure marks the case failed and flips the CLI exit code. |
190
+ | `requires` | New optional requirement list (eval- and case-level): `"mockModels"`, `"devRoutes"`, `"env:<NAME>"`. Verified against the live target; cases the target cannot support are skipped with a reported verdict, so one eval runs against the dev server, a local build, and a deployed URL. See "Targets". Evals own no provisioning config — no kind/env/setup. |
191
+ | `model` | **Now optional.** Required only when a model-backed scorer is present; validation moves from "always required" to "required if any scorer declares it needs a judge" (built-in autoevals scorers carry a marker; custom scorers that read `args.model` get `undefined` unless the eval provides one). Kills the `"eve-bootstrap-model"` dummy-model workaround. |
192
+ | `trials` | New optional `number` (default 1). Runs each case N times; per-trial results are reported, a case passes only if every trial's checks pass, and scores aggregate as mean. For nondeterministic model-backed evals. |
193
+ | `tags` filtering | Case and eval `tags` become functional (CLI `--tag`). |
194
194
 
195
195
  Everything else (`cases`/`load`, `scores`, `thresholds`, `reporters`,
196
- `maxConcurrency`, `timeoutMs`, `metadata`) is unchanged. Existing suites keep
196
+ `maxConcurrency`, `timeoutMs`, `metadata`) is unchanged. Existing evals keep
197
197
  working except where noted in Breaking changes.
198
198
 
199
199
  ### `task.run(ctx)` and the run context
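The `trials` row in the table above states two aggregation rules: a case passes only if every trial's checks pass, and scores aggregate as a mean. A minimal sketch of that arithmetic follows. The `TrialResult` and `AggregatedCase` shapes and the `aggregateTrials` function are assumptions made for illustration; the plan does not show the runner's real data model for trials.

```typescript
// Hypothetical per-trial record: whether all checks passed, plus named scores in [0, 1].
interface TrialResult {
  readonly checksPassed: boolean;
  readonly scores: Readonly<Record<string, number>>;
}

interface AggregatedCase {
  readonly passed: boolean;
  readonly meanScores: Record<string, number>;
}

function aggregateTrials(trials: readonly TrialResult[]): AggregatedCase {
  // Accumulate a running sum and count per scorer name.
  const totals = new Map<string, { sum: number; count: number }>();
  for (const trial of trials) {
    for (const name of Object.keys(trial.scores)) {
      const value = trial.scores[name] ?? 0;
      const entry = totals.get(name) ?? { sum: 0, count: 0 };
      entry.sum += value;
      entry.count += 1;
      totals.set(name, entry);
    }
  }

  const meanScores: Record<string, number> = {};
  totals.forEach((entry, name) => {
    meanScores[name] = entry.sum / entry.count;
  });

  // A case passes only if every trial's checks passed (and at least one trial ran).
  const passed = trials.length > 0 && trials.every((trial) => trial.checksPassed);
  return { passed, meanScores };
}

const aggregated = aggregateTrials([
  { checksPassed: true, scores: { didNotFail: 1 } },
  { checksPassed: false, scores: { didNotFail: 0.5 } },
]);
```

In this example one failing trial is enough to fail the case, while the score is still reported as the average across both trials.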
@@ -214,7 +214,7 @@ export interface EveEvalRunContext {
  newSession(): EveEvalSession;
  /** Handle to the agent server under test (channels, schedules, raw routes). */
  readonly target: EveEvalTargetHandle;
- /** Case timeout signal (from suite/CLI `timeoutMs`). */
+ /** Case timeout signal (from eval/CLI `timeoutMs`). */
  readonly signal: AbortSignal;
  /** Structured logger; lines land in the case artifact, and on stdout with `--verbose`. */
  readonly log: (message: string) => void;
@@ -234,13 +234,13 @@ Semantics:
  whole interaction. `derived` facts are computed over the primary session by
  default, with per-session access on the result.

- ### Scripted cases: making the suite a suite
+ ### Scripted cases: making the eval a real grouping

- A suite-level `task` alone means "one interaction script × many data rows" —
+ An eval-level `task` alone means "one interaction script × many data rows" —
  the dataset shape. The e2e surface is the inverse: many distinct scripts, one
  execution each (`tool-approval`, `tool-denial`, and `ask-question-flow` are
  three different `run` functions, not three rows). Forcing each script into its
- own suite file with a single dummy-input case would make "suite" a misnomer
+ own eval file with a single dummy-input case would make "eval" a misnomer
  and multiply provisioning cost (a provisioned target per file instead of per
  behavior family).

@@ -249,13 +249,13 @@ So `EveEvalCase` becomes a union:
  ```ts
  export type EveEvalCase = EveEvalDataCase | EveEvalScriptedCase;

- /** Today's shape: data routed through the suite-level task. */
+ /** Today's shape: data routed through the eval-level task. */
  export interface EveEvalDataCase {
  readonly id: string;
  readonly input: string | Record<string, unknown>;
  readonly expected?: unknown;
- readonly checks?: readonly EveEvalCheck[]; // appended to suite-level
- readonly scores?: readonly EveEvalScorer[]; // appended to suite-level
+ readonly checks?: readonly EveEvalCheck[]; // appended to eval-level
+ readonly scores?: readonly EveEvalScorer[]; // appended to eval-level
  readonly tags?: readonly string[];
  readonly metadata?: Readonly<Record<string, unknown>>;
  }
@@ -274,26 +274,26 @@ export interface EveEvalScriptedCase {

  Resolution rules:

- - `case.run` wins over `suite.task`; a data case without a suite `task` falls
+ - `case.run` wins over `eval.task`; a data case without an eval `task` falls
  back to today's default (send `input` verbatim).
- - Case-level `checks`/`scores` **append** to suite-level ones; suite-level
+ - Case-level `checks`/`scores` **append** to eval-level ones; eval-level
  expresses invariants ("never fails"), case-level expresses the specific
  behavior under test.
- - A suite may freely mix data cases and scripted cases, though in practice
- quality suites are all-data and smoke suites are all-scripted.
+ - An eval may freely mix data cases and scripted cases, though in practice
+ quality evals are all-data and smoke evals are all-scripted.

- The conceptual model this lands on: **the suite file is the `describe`, cases
+ The conceptual model this lands on: **the eval file is the `describe`, cases
  are the `it`s** — grouping, one shared target, shared baseline checks and
- requirements — without a block API, and with path-derived identity intact. The suite-level `task` keeps its role as the shared default for
+ requirements — without a block API, and with path-derived identity intact. The eval-level `task` keeps its role as the shared default for
  dataset evals; it is no longer the only way to define behavior.

  Execution semantics are uniform across both case kinds:

  - **Concurrency**: scripted cases join the same bounded pool as data cases
- (each owns its sessions, so they parallelize safely). Suites whose cases
+ (each owns its sessions, so they parallelize safely). Evals whose cases
  mutate shared target state (e.g. `defineState` persistence tests) set
  `maxConcurrency: 1`.
- - **Timeout**: suite/CLI `timeoutMs` applies per case per trial; the signal is
+ - **Timeout**: eval/CLI `timeoutMs` applies per case per trial; the signal is
  `ctx.signal` inside `run` and aborts in-flight sends.
  - **Trials**: apply identically — a scripted case under `trials: 3` runs its
  `run` three times against three fresh primary sessions.
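The resolution rules listed in the hunk above (case-level `run` wins, then the eval-level `task`, then the send-input-verbatim default, with case-level checks appended to eval-level ones) can be sketched as a small pure function. All type and function names below (`RunFn`, `EvalDefinition`, `CaseDefinition`, `resolveCase`) are hypothetical simplifications: real `run` functions take a context and are asynchronous, and real checks are objects rather than strings.

```typescript
// Simplified stand-ins: a "run" here just maps an input to the message it would send.
type RunFn = (input: unknown) => string;

interface EvalDefinition {
  readonly task?: RunFn;
  readonly checks?: readonly string[];
}

interface CaseDefinition {
  readonly id: string;
  readonly input?: unknown;
  readonly run?: RunFn;
  readonly checks?: readonly string[];
}

interface ResolvedCase {
  readonly source: "case.run" | "eval.task" | "default";
  readonly run: RunFn;
  readonly checks: readonly string[];
}

// The fallback for a data case with no eval-level task: send the input verbatim.
const sendVerbatim: RunFn = (input) => String(input);

function resolveCase(evalDef: EvalDefinition, caseDef: CaseDefinition): ResolvedCase {
  // Case-level checks append to eval-level ones; they never replace them.
  const checks = [...(evalDef.checks ?? []), ...(caseDef.checks ?? [])];
  if (caseDef.run) return { source: "case.run", run: caseDef.run, checks };
  if (evalDef.task) return { source: "eval.task", run: evalDef.task, checks };
  return { source: "default", run: sendVerbatim, checks };
}

const evalDef: EvalDefinition = {
  task: (input) => `prompt:${String(input)}`,
  checks: ["didNotFail"],
};

const scripted = resolveCase(evalDef, {
  id: "tool-approval",
  run: () => "scripted",
  checks: ["completed"],
});
const dataCase = resolveCase(evalDef, { id: "row-1", input: "hello" });
const bareCase = resolveCase({}, { id: "row-2", input: "hello" });
```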
@@ -358,7 +358,7 @@ Notes:

  - `send` does **not** throw on `turn.failed`/`session.failed` by default —
  failure handling belongs to checks (`Checks.completed()`) or explicit
- `turn.expectOk()`. This keeps negative-path suites (e.g. today's
+ `turn.expectOk()`. This keeps negative-path evals (e.g. today's
  `remote-agent-start-failure.ts`, `tool-throw-recover.ts`) natural to write.
  - HITL coverage: `needsApproval` approvals (`approve`/`deny` option ids),
  framework `ask_question` selects (`optionId`), freeform answers
@@ -401,7 +401,7 @@ Built-ins, exported from `eve/evals/checks`:
  | Check | Asserts |
  | ------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
  | `Checks.completed()` | Final status is `"completed"` (not `"waiting"`, not `"failed"`) |
- | `Checks.waiting()` | Final status is `"waiting"` (for park-shaped suites) |
+ | `Checks.waiting()` | Final status is `"waiting"` (for park-shaped evals) |
  | `Checks.didNotFail()` | Status is not `"failed"` and no `turn.failed`/`step.failed` events |
  | `Checks.messageIncludes(token)` | Joined `message.completed` text contains `token` (string or RegExp) |
  | `Checks.outputEquals(value)` / `Checks.outputMatches(schema)` | Deep-equal / Standard Schema validation of `result.output` |
@@ -419,7 +419,7 @@ Pass/fail policy:
  - Failed cases always produce a non-zero `eve eval` exit code.
  - Scores keep today's semantics: thresholded, reported, never gate the exit
  code — **unless** `--strict` is passed, which additionally fails the process
- when any case scores below threshold. CI for fixture smoke suites runs
+ when any case scores below threshold. CI for fixture smoke evals runs
  `--strict`; users running exploratory quality evals don't.

  ### Derived facts v2 (breaking)
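The pass/fail policy in the hunk above, together with the `--no-skips` behavior described under the plan's requirement rules, reduces to a small decision over case outcomes. The sketch below is one possible reading of that policy. The `CaseOutcome` shape, the `"passed"` and `"failed"` verdict strings, and `exitCodeFor` are assumptions for illustration; only `verdict: "skipped"`, `--strict`, and `--no-skips` appear in the plan text.

```typescript
// Hypothetical outcome record; "skipped" is from the plan, the other verdicts are assumed.
interface CaseOutcome {
  readonly verdict: "passed" | "failed" | "skipped";
  /** True when any score for the case fell below its threshold. */
  readonly belowThreshold: boolean;
}

interface ExitOptions {
  readonly strict: boolean; // models --strict
  readonly noSkips: boolean; // models --no-skips
}

function exitCodeFor(cases: readonly CaseOutcome[], options: ExitOptions): 0 | 1 {
  for (const outcome of cases) {
    // Failed checks always fail the process.
    if (outcome.verdict === "failed") return 1;
    // Scores gate the exit code only under --strict.
    if (options.strict && outcome.belowThreshold) return 1;
    // Skips are neutral unless --no-skips turns them into failures.
    if (options.noSkips && outcome.verdict === "skipped") return 1;
  }
  return 0;
}

const lowScore: CaseOutcome = { verdict: "passed", belowThreshold: true };
const skipped: CaseOutcome = { verdict: "skipped", belowThreshold: false };

const exploratory = exitCodeFor([lowScore, skipped], { strict: false, noSkips: false });
const ciStrict = exitCodeFor([lowScore], { strict: true, noSkips: false });
const fullCoverage = exitCodeFor([skipped], { strict: false, noSkips: true });
```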
@@ -477,8 +477,8 @@ export interface EveEvalCaseResult {
  readonly skipReason?: string; // unmet requirement, when skipped
  }

- export interface EveEvalSuiteResult {
- readonly suite: string;
+ export interface EveEvalResult {
+ readonly id: string;
  readonly target: EveEvalTarget;
  readonly cases: readonly EveEvalCaseResult[];
  readonly startedAt: string;
@@ -498,9 +498,9 @@ Downstream effects:
  requirement inline); failed checks print their `message` indented under the
  case line (replacing the bespoke error prose e2e tests hand-craft today).
  Summary adds check and skip totals.
- - **`EvalReporter` interface**: unchanged shape (`onSuiteStart` /
- `onCaseComplete` / `onSuiteComplete`) the richer `EveEvalCaseResult`
- flows through existing hooks, so custom reporters keep compiling.
+ - **`EvalReporter` interface**: lifecycle hooks are `onEvalStart` /
+ `onCaseComplete` / `onEvalComplete`; the richer `EveEvalCaseResult`
+ flows through case-completion hooks.
  - **Braintrust reporter**: checks log as binary scores under a `check:` name
  prefix (e.g. `check:toolCalled(bash)`) so experiments diff check regressions
  the same way they diff score regressions; `verdict` and failed-check
@@ -511,16 +511,16 @@ Downstream effects:
  file per session, keyed by session id.
  - **Reporter throughput**: `onCaseComplete` is currently awaited inline inside
  the case pool, so a slow reporter throttles execution; v2 queues reporter
- callbacks off the hot path (ordering preserved per suite).
+ callbacks off the hot path (ordering preserved per eval).

  ### Targets: a target is a URL

- There is no suite-owned provisioning config. A target is always just a base
- URL, and the suite's job is to interact and assert — never to describe how the
+ There is no eval-owned provisioning config. A target is always just a base
+ URL, and the eval's job is to interact and assert — never to describe how the
  agent gets built or started. This is deliberate: properties like the agent's
  model or the mock-model adapter are **build/start-time properties of the
- server process**. A suite cannot control them at run time, so an API that lets
- a suite "declare" them is a footgun — it works only when the runner happens to
+ server process**. An eval cannot control them at run time, so an API that lets
+ an eval "declare" them is a footgun — it works only when the runner happens to
  be the thing booting the server, and silently means nothing otherwise.

  There are exactly two ways to get a target:
@@ -565,9 +565,9 @@ export interface EveEvalTargetHandle {
  }
  ```

- This covers channel suites (POST a signed webhook via `target.fetch`, assert
+ This covers channel evals (POST a signed webhook via `target.fetch`, assert
  on an externally provisioned fake provider, attach to the created session) and
- schedule suites (`dispatchSchedule` + `attachSession`).
+ schedule evals (`dispatchSchedule` + `attachSession`).

  The runner always performs the readiness/identity handshake regardless of who
  provisioned the target: `/eve/v1/health` polling, then `/eve/v1/info`
@@ -577,12 +577,12 @@ capabilities are discovered, not assumed.

  ### Requirements: `requires`, verified against the live target

- Suites cannot control the target, but their assertions still _assume_ things
+ Evals cannot control the target, but their assertions still _assume_ things
  about it — determinism via mock models, dev-only routes, a sidecar URL in the
  environment. v1 gives those assumptions exactly one surface:

  ```ts
- // Per suite (applies to all cases) and per case (additive):
+ // Per eval (applies to all cases) and per case (additive):
  readonly requires?: readonly EveEvalRequirement[];

  type EveEvalRequirement =
@@ -598,24 +598,24 @@ Rules:
  `env:<NAME>` against its own process environment. It never tries to make a
  requirement true.
  2. **Unmet requirement → skip, visibly.** The case (or every case, for
- suite-level `requires`) gets `verdict: "skipped"` with the unmet
+ eval-level `requires`) gets `verdict: "skipped"` with the unmet
  requirement as the reason — reported in console, `--json`, and artifacts;
  never silently dropped, never failed. `--no-skips` turns skips into
  failures for legs that must prove full coverage. Skips don't otherwise
  affect the exit code.
  3. **One convenience, because the runner boots the dev server:** plain
- `eve eval` with suites requiring `"mockModels"` either passes
- `--mock-models` or sees those suites skip with a message naming the flag.
+ `eve eval` with evals requiring `"mockModels"` either passes
+ `--mock-models` or sees those evals skip with a message naming the flag.
  No auto-magic in v1 — explicit and predictable beats clever.
  4. **Runtime guards back the declarations.** `target.dispatchSchedule` throws
  a requirement error when called by a case that didn't declare
  `"devRoutes"` — so undeclared dependencies surface as named failures on
  the local leg, not as mystery flakes on the remote leg.

- The deferred idea — a suite-owned `environment` block where the runner builds
+ The deferred idea — an eval-owned `environment` block where the runner builds
  and starts targets, injects env, and manages sidecar lifecycles — is recorded
  under Open questions. It is not in v1: it duplicated CLI concerns, and it let
- suites express build-time properties they cannot actually own.
+ evals express build-time properties they cannot actually own.

  ### The external provisioner pattern
 
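The requirement rules in the preceding hunk (verify, never provision; an unmet requirement becomes a visible skip with a reason) might be modeled roughly as below. This is a sketch under assumptions: the `TargetCapabilities` shape, `firstUnmetRequirement`, and `planCase` are invented for the example, and the environment is passed in as a plain object rather than read from the process so the snippet stays self-contained. Only the requirement strings, `verdict: "skipped"`, and `skipReason` come from the plan text.

```typescript
// Requirement strings as listed in the plan: two capabilities plus env:<NAME>.
type Requirement = "mockModels" | "devRoutes" | `env:${string}`;

// Hypothetical summary of what the runner discovered about the live target.
interface TargetCapabilities {
  readonly mockModels: boolean;
  readonly devRoutes: boolean;
}

type Env = Readonly<Record<string, string | undefined>>;

function firstUnmetRequirement(
  requires: readonly Requirement[],
  capabilities: TargetCapabilities,
  env: Env,
): Requirement | undefined {
  for (const requirement of requires) {
    if (requirement === "mockModels" || requirement === "devRoutes") {
      // Capability requirements are checked against the target, never set on it.
      if (!capabilities[requirement]) return requirement;
      continue;
    }
    // env:<NAME> is checked against the runner's own environment.
    const name = requirement.slice("env:".length);
    if (!env[name]) return requirement;
  }
  return undefined;
}

type CasePlan =
  | { readonly verdict: "skipped"; readonly skipReason: string }
  | { readonly verdict: "run" };

function planCase(
  evalRequires: readonly Requirement[],
  caseRequires: readonly Requirement[],
  capabilities: TargetCapabilities,
  env: Env,
): CasePlan {
  // Case-level requirements are additive on top of eval-level ones.
  const unmet = firstUnmetRequirement([...evalRequires, ...caseRequires], capabilities, env);
  return unmet === undefined
    ? { verdict: "run" }
    : { verdict: "skipped", skipReason: `unmet requirement: ${unmet}` };
}

const deployedTarget: TargetCapabilities = { mockModels: false, devRoutes: false };
const devTarget: TargetCapabilities = { mockModels: true, devRoutes: true };

const onDeployed = planCase(["mockModels"], [], deployedTarget, {});
const onDev = planCase(["mockModels"], ["env:TELEGRAM_PROBE_URL"], devTarget, {
  TELEGRAM_PROBE_URL: "http://127.0.0.1:9000",
});
```

The same eval definition yields a skip against the deployed target and a run against the dev target, which is the "one eval, several targets" property the plan is after.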
@@ -636,38 +636,38 @@ node e2e/provision/<group>.ts & # build, sidecars, env, eve start, health-poll
  pnpm --filter <fixture-app> exec eve eval --strict --url "http://127.0.0.1:$PORT"
  ```

- Suites consume provisioner outputs through env (declared via `env:<NAME>`
- requirements): a channel suite reads `TELEGRAM_PROBE_URL` to query what the
- fake Bot API captured; a subagent suite reads `EVE_WEATHER_AGENT_HOST` to
+ Evals consume provisioner outputs through env (declared via `env:<NAME>`
+ requirements): a channel eval reads `TELEGRAM_PROBE_URL` to query what the
+ fake Bot API captured; a subagent eval reads `EVE_WEATHER_AGENT_HOST` to
  assert on `subagent.called` remote URLs. The probe/stub helpers
  (`startHttpProbe`, `startMcpStub`, generalized from `e2e/lib/`) ship in
  `eve/evals/environment` for provisioners — including users' own — to use, with
- an HTTP inspection endpoint so suites can assert on captured requests across
+ an HTTP inspection endpoint so evals can assert on captured requests across
  the process boundary.

  ### CLI v2

  ```
- eve eval [suiteId...]
+ eve eval [evalId...]
  --url <url> run against an existing target (built local server,
  preview deployment); without it, boots the dev server
  --mock-models boot the dev server with the deterministic mock
  adapter (invalid with --url; the target's mock state
  is discovered, not set)
- --tag <tag...> run only cases (or suites) carrying a tag
+ --tag <tag...> run only cases (or evals) carrying a tag
  --case <id...> run only specific case ids
  --strict sub-threshold scores also fail the exit code
  --no-skips requirement skips fail instead of skipping
- --trials <n> override suite trials
+ --trials <n> override eval trials
  --timeout <ms> per-case timeout (existing)
  --max-concurrency <n> (existing)
  --json structured stdout (existing)
- --skip-report skip suite reporters (existing)
- --list print discovered suites/cases without running
665
+ --skip-report skip eval reporters (existing)
666
+ --list print discovered evals/cases without running
667
667
  --verbose stream per-case ctx.log and event summaries
668
668
  ```
669
669
 
670
- - Positional suite ids replace `--suite`; the dead `--all` flag is removed
670
+ - Positional eval ids select evals; the dead `--all` flag is removed
671
671
  (no filter already means all).
672
672
  - Exit codes: `0` all cases passed checks (and thresholds under `--strict`);
673
673
  `1` any case failed (check failure, run throw, execution error, or strict
@@ -702,7 +702,7 @@ into the framework, then it dies:
702
702
  3. **Turn failure surfacing**: export a typed
703
703
  `isTurnFailureEvent(event)` narrowing helper and the
704
704
  `EveEvalTurn.expectOk()` driver method instead of `TurnFailedError`-style
705
- throw-by-default (negative-path suites need non-throwing sends).
705
+ throw-by-default (negative-path evals need non-throwing sends).
706
706
  4. **Multimodal sugar** (`sendTextWithImage`): becomes
707
707
  `EvalSession.sendFile`; the data-URL/`FilePart` encoding helper is exported
708
708
  from `eve/client` for general use.
@@ -710,20 +710,20 @@ into the framework, then it dies:
710
710
 
711
711
  ## Replacing the e2e surface
712
712
 
713
- ### Where suites live
713
+ ### Where evals live
714
714
 
715
- Each fixture app keeps owning its coverage: suites move into
715
+ Each fixture app keeps owning its coverage: evals move into
716
716
  `e2e/fixtures/agent-*/evals/*.eval.ts` and
717
717
  `apps/fixtures/weather-fixture/evals/*.eval.ts`. Discovery already scans
718
- `<appRoot>/evals/`. The area-policy module becomes unnecessary — a suite can
718
+ `<appRoot>/evals/`. The area-policy module becomes unnecessary — an eval can
719
719
  only target its own app, by construction. Provisioning scripts live next to
720
720
  the fixtures (`e2e/provision/`), built from today's `e2e/lib/server.ts` and
721
721
  `e2e/target/` logic rather than rewriting it.
722
722
 
723
723
  ### Coverage mapping
724
724
 
725
- Scripted cases let one suite absorb a whole e2e group: today's 78 script files
726
- consolidate into roughly one suite per behavior family (e.g.
725
+ Scripted cases let one eval absorb a whole e2e group: today's 78 script files
726
+ consolidate into roughly one eval per behavior family (e.g.
727
727
  `agent-tools-hitl/evals/hitl.eval.ts` with `approve-then-persist`,
728
728
  `deny-regates`, `ask-question`, and `tool-auth` cases), each sharing one
729
729
  provisioned target. `--case` becomes the day-to-day tool for re-running a
@@ -734,11 +734,11 @@ single behavior while debugging.
734
734
  | `basic-runtime/*` (basic, multi-turn history, client context, output schema, image, define-state) | `task.run` + `Checks.messageIncludes` / `outputMatches`; `sendFile` for image; `send({ clientContext })`; multi-turn token recall is two `send`s and one check |
735
735
  | `tools/*` (14 dynamic-tool files, MCP, multi-step loop, narrowing, throw-recover) | `Checks.toolCalled(name, { input, output, isError })`, `Checks.toolOrder`, `Checks.event` for ordering edge cases; MCP stub started by the provisioner (`startMcpStub`), addressed via `env:` requirement |
736
736
  | `tools-hitl/*` (approval, denial, ask-question, tool auth) | `expectInputRequests` + `respond`/`respondAll`; auth flows keep the IdP emulator as a provisioner sidecar |
737
- | `tools-sandbox/*` | `task.run` + `Checks.toolCalled("bash", ...)`; snapshot suite tagged `requires-credentials`, excluded in CI via `--tag` |
737
+ | `tools-sandbox/*` | `task.run` + `Checks.toolCalled("bash", ...)`; snapshot eval tagged `requires-credentials`, excluded in CI via `--tag` |
738
738
  | `channels/*` | provisioner starts the fake provider (one shared `startHttpProbe`) and exports `*_API_BASE_URL`; case does `target.fetch` webhook ingress, asserts on probe captures via its inspection endpoint, `attachSession` for the created session |
739
739
  | `schedules/*` | `target.dispatchSchedule` + `attachSession`; stream-resume test asserts `attachSession({ startIndex })` replay |
740
740
  | `subagents/*` (incl. remote delegation, callbacks, failures) | provisioner boots both agents and wires `EVE_WEATHER_AGENT_HOST`; `Checks.subagentCalled(name, { remoteUrl })`; callback retry/bypass probes as provisioner sidecars |
741
- | `codemode/*` | Unchanged suite bodies; CI matrix env (`EVE_EXPERIMENTAL_CODE_MODE=1`) is set by the CI job / provisioner when starting the server |
741
+ | `codemode/*` | Unchanged eval bodies; CI matrix env (`EVE_EXPERIMENTAL_CODE_MODE=1`) is set by the CI job / provisioner when starting the server |
742
742
  | `tui-client/*` | **Stays a script harness** (non-goal) |
743
743
 
744
744
  ### Worked example: porting `remote-agent-delegation`
@@ -746,15 +746,15 @@ single behavior while debugging.
746
746
  Today's `e2e/tests/subagents/remote-agent-delegation.ts` is 101 lines: manual
747
747
  port arithmetic, two `resolveTarget` calls threading `startEnv` by hand, and a
748
748
  58-line `assertRemoteDelegation` function of filter/narrow/throw prose. The v2
749
- suite, at `e2e/fixtures/agent-subagents/evals/remote-delegation.eval.ts`:
749
+ eval, at `e2e/fixtures/agent-subagents/evals/remote-delegation.eval.ts`:
750
750
 
751
751
  ```ts
752
- import { defineEvalSuite } from "eve/evals";
752
+ import { defineEval } from "eve/evals";
753
753
  import { Checks } from "eve/evals/checks";
754
754
 
755
755
  const CITY = "Lisbon";
756
756
 
757
- export default defineEvalSuite({
757
+ export default defineEval({
758
758
  description: "Remote subagent delegation over HTTP to a second local agent.",
759
759
 
760
760
  // Assumptions about the target, verified by the runner. The provisioner
@@ -795,8 +795,8 @@ weather fixture with mocks, boot the parent with mocks +
795
795
 
796
796
  What the diff buys, beyond line count:
797
797
 
798
- - **Clean separation** — the suite holds only interaction and assertions; the
799
- topology lives in one provisioner shared by every subagent case. The suite
798
+ - **Clean separation** — the eval holds only interaction and assertions; the
799
+ topology lives in one provisioner shared by every subagent case. The eval
800
800
  states its assumptions (`requires`) and the runner enforces them, so running
801
801
  it against an unprovisioned target skips with a named reason instead of
802
802
  failing mysteriously.
@@ -807,14 +807,14 @@ What the diff buys, beyond line count:
807
807
  configured) Braintrust like any other eval, and `--case
808
808
  weather-result-reaches-parent` reruns it in isolation.
809
809
  - **Room to grow** — `remote-agent-callback-retry`, `-bypass`, and
810
- `-start-failure` become sibling cases in the same suite, sharing one
810
+ `-start-failure` become sibling cases in the same eval, sharing one
811
811
  provisioned topology instead of re-spawning per file.
812
812
 
813
813
  ### CI
814
814
 
815
815
  `.github/workflows/smoke.yml` discovery changes from globbing
816
816
  `e2e/tests/*/*.ts` to globbing fixture apps with `evals/` directories. Each
817
- matrix leg provisions, then runs the suites against the resulting URL:
817
+ matrix leg provisions, then runs the evals against the resulting URL:
818
818
 
819
819
  ```sh
820
820
  node e2e/provision/<group>.ts & # build + sidecars + env + eve start
@@ -822,7 +822,7 @@ pnpm --filter <fixture-app> exec eve eval --strict --json --url "$TARGET_URL"
822
822
  ```
823
823
 
824
824
  twice (direct / `EVE_EXPERIMENTAL_CODE_MODE=1` set by the provisioner),
825
- `fail-fast: false`, JUnit reporter for annotations. Per-suite artifacts under
825
+ `fail-fast: false`, JUnit reporter for annotations. Per-eval artifacts under
826
826
  `.eve/evals/` upload on failure — strictly better debuggability than today's
827
827
  stdout scraping. A post-deploy leg is the same invocation pointed at a preview
828
828
  deployment; requirement-incompatible cases skip visibly.
@@ -871,7 +871,7 @@ Each phase ships independently, keeps `pnpm test` green, and includes docs
871
871
  8. `EveEvalTargetHandle`: `baseUrl`, `fetch`, `info`, `capabilities`,
872
872
  `dispatchSchedule`, `attachSession`; `--mock-models` for the runner-booted
873
873
  dev server; readiness/identity handshake for `--url` targets.
874
- 9. Requirements: suite/case `requires` (`mockModels` / `devRoutes` /
874
+ 9. Requirements: eval/case `requires` (`mockModels` / `devRoutes` /
875
875
  `env:<NAME>`), `skipped` verdict + `--no-skips`, runtime guards on
876
876
  requirement-gated handle methods, and `/eve/v1/info` reporting mock-model
877
877
  state so requirements are verified against the live target.
@@ -883,7 +883,7 @@ Each phase ships independently, keeps `pnpm test` green, and includes docs
883
883
 
884
884
  ### Phase 4 — Migration and deletion
885
885
 
886
- 12. Port suites group by group in the order: `basic-runtime` → `tools` →
886
+ 12. Port evals group by group in the order: `basic-runtime` → `tools` →
887
887
  `tools-hitl` → `subagents` → `schedules` → `channels` → `codemode` →
888
888
  `tools-sandbox`. Each ported group flips its CI matrix entry from
889
889
  `node e2e/tests/...` to provision + `eve eval --strict --url` in the same
@@ -892,7 +892,7 @@ Each phase ships independently, keeps `pnpm test` green, and includes docs
892
892
  area policy; shrink `e2e/lib`/`e2e/target` into `e2e/provision/`; update
893
893
  `e2e/README.md` and AGENTS.md smoke-test guidance to point at `eve eval`.
894
894
  14. Optional follow-on: add a post-deploy `--url` leg against preview
895
- deployments for the requirement-compatible subset of fixture suites.
895
+ deployments for the requirement-compatible subset of fixture evals.
896
896
 
897
897
  ## Breaking changes
898
898
 
@@ -903,37 +903,37 @@ changesets for each:
903
903
  typed records.
904
904
  - `EveEvalCase` becomes a data/scripted union; `input` is no longer required
905
905
  on scripted cases. Existing data cases are unaffected.
906
- - `model` becomes optional; suites passing dummy models can drop them.
907
- - `--suite` replaced by positional ids; `--all` removed.
908
- - Exit-code semantics gain check failures (suites without `checks` see no
906
+ - `model` becomes optional; evals passing dummy models can drop them.
907
+ - Eval selection uses positional ids; `--all` removed.
908
+ - Exit-code semantics gain check failures (evals without `checks` see no
909
909
  change unless `--strict`).
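
As a rough sketch of the exit-code rule described in this plan, the function below folds per-case outcomes into `0` or `1`. The `CaseOutcome` and `ExitFlags` shapes and the `exitCodeFor` name are hypothetical, chosen for illustration. Only the `0`/`1` distinction is modeled; any other exit codes the CLI may define are out of scope here.

```ts
// Hypothetical per-case summary; field names are illustrative, not eve's artifact schema.
interface CaseOutcome {
  checksPassed: boolean; // all hard checks passed (true when none are declared)
  belowThreshold: boolean; // at least one score fell below its threshold
  skipped: boolean; // an unmet requirement skipped the case
}

interface ExitFlags {
  strict: boolean; // mirrors --strict
  noSkips: boolean; // mirrors --no-skips
}

function exitCodeFor(outcomes: CaseOutcome[], flags: ExitFlags): 0 | 1 {
  const anyFailed = outcomes.some((outcome) => {
    // Skips only count as failures when --no-skips is set.
    if (outcome.skipped) return flags.noSkips;
    // A failed check always fails the run.
    if (!outcome.checksPassed) return true;
    // Sub-threshold scores fail the run only under --strict.
    return flags.strict && outcome.belowThreshold;
  });
  return anyFailed ? 1 : 0;
}
```

Under this reading, an eval with no `checks` exits `0` regardless of scores unless `--strict` is passed.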
910
910
 
911
911
  ## Risks and open questions
912
912
 
913
- - **Cost/flakiness of model-backed suites in CI.** Mitigation: the
913
+ - **Cost/flakiness of model-backed evals in CI.** Mitigation: the
914
914
  `"mockModels"` requirement makes determinism a declared, verified property
915
915
  instead of a per-file accident; `trials` + `--strict` thresholds handle the
916
- suites that must use real models. The migration should explicitly decide,
916
+ evals that must use real models. The migration should explicitly decide,
917
917
  per ported smoke, mock vs real — today's split is undocumented.
918
918
  - **The stream-registration race** may not be fully fixable server-side in
919
919
  Phase 2; the client-level bounded retry is the documented fallback. Track it
920
920
  as its own issue rather than letting retry constants drift again.
921
- - **Channel suites with real credentials** (`slack-thread-context`) and
922
- sandbox snapshot suites stay tag-gated (`requires-credentials`) and excluded
921
+ - **Channel evals with real credentials** (`slack-thread-context`) and
922
+ sandbox snapshot evals stay tag-gated (`requires-credentials`) and excluded
923
923
  from default CI, as today.
924
924
  - **Provisioner drift on `--url` legs.** Beyond what `/info` exposes
925
925
  (identity, mode, mock-model state) and `env:<NAME>` presence checks, the
926
- runner trusts the provisioner. A suite can pass locally and skip-or-fail
926
+ runner trusts the provisioner. An eval can pass locally and skip-or-fail
927
927
  remotely because the target was provisioned differently; requirements make
928
928
  this visible but cannot make it impossible.
929
- - **Open: should the runner ever own provisioning?** A suite-owned
929
+ - **Open: should the runner ever own provisioning?** An eval-owned
930
930
  `environment` block (runner builds/starts targets, injects env, manages
931
931
  sidecar lifecycles, boots secondary agents) was considered and deliberately
932
- cut from v1: it duplicated CLI concerns and let suites declare build-time
932
+ cut from v1: it duplicated CLI concerns and let evals declare build-time
933
933
  properties they cannot own at run time. Revisit only with concrete friction
934
934
  data from the migration — the likely v2 shape, if any, is a _project-level_
935
935
  provisioning config consumed by `eve eval` (like Playwright's `webServer`),
936
- not per-suite config.
936
+ not per-eval config.
937
937
  - **User-facing naming**: `checks` vs `scores` vs `thresholds` vs `requires`
938
938
  needs a docs pass so the soft/hard/assumption distinction is obvious; the
939
939
  evals doc gets a "smoke-testing your agent" section once Phase 2 lands.