npm - eve - Versions diffs - 0.6.0-beta.10 → 0.6.0-beta.11 - Mend

eve 0.6.0-beta.10 → 0.6.0-beta.11


Files changed (68)

package/CHANGELOG.md +13 -0
package/README.md +1 -1
package/dist/docs/evals-v2-plan.md +108 -108
package/dist/docs/public/advanced/evals.md +51 -23
package/dist/docs/public/reference/cli.md +17 -17
package/dist/docs/public/reference/typescript-api.md +2 -2
package/dist/src/chunks/{use-eve-agent-DCZbkLG7.js → use-eve-agent-BD79SFqg.js} +82 -29
package/dist/src/chunks/{use-eve-agent-DoheC4_o.js → use-eve-agent-DT2A6VJb.js} +82 -29
package/dist/src/cli/run.d.ts +1 -1
package/dist/src/cli/run.js +1 -1
package/dist/src/client/file-parts.d.ts +18 -0
package/dist/src/client/file-parts.js +1 -0
package/dist/src/client/index.d.ts +3 -2
package/dist/src/client/index.js +1 -1
package/dist/src/client/message-response.js +1 -1
package/dist/src/client/open-stream.d.ts +6 -0
package/dist/src/client/open-stream.js +1 -1
package/dist/src/client/session-utils.d.ts +5 -0
package/dist/src/client/session-utils.js +1 -1
package/dist/src/client/session.js +1 -1
package/dist/src/client/types.d.ts +5 -1
package/dist/src/evals/checks/checks.d.ts +1 -1
package/dist/src/evals/checks/index.d.ts +1 -1
package/dist/src/evals/checks/match.d.ts +1 -1
package/dist/src/evals/cli/eval.d.ts +2 -2
package/dist/src/evals/cli/eval.js +1 -1
package/dist/src/evals/{define-eval-suite.d.ts → define-eval.d.ts} +6 -6
package/dist/src/evals/define-eval.js +1 -0
package/dist/src/evals/index.d.ts +3 -2
package/dist/src/evals/index.js +1 -1
package/dist/src/evals/runner/artifacts.d.ts +6 -6
package/dist/src/evals/runner/artifacts.js +1 -1
package/dist/src/evals/runner/discover.d.ts +10 -10
package/dist/src/evals/runner/discover.js +1 -1
package/dist/src/evals/runner/execute-case.d.ts +4 -7
package/dist/src/evals/runner/execute-case.js +1 -1
package/dist/src/evals/runner/{execute-suite.d.ts → execute-eval.d.ts} +6 -6
package/dist/src/evals/runner/execute-eval.js +1 -0
package/dist/src/evals/runner/reporters/braintrust.d.ts +4 -4
package/dist/src/evals/runner/reporters/braintrust.js +2 -2
package/dist/src/evals/runner/reporters/console.d.ts +3 -3
package/dist/src/evals/runner/reporters/console.js +1 -1
package/dist/src/evals/runner/reporters/types.d.ts +6 -6
package/dist/src/evals/scorers/autoevals.d.ts +7 -7
package/dist/src/evals/scorers/autoevals.js +1 -1
package/dist/src/evals/scorers/model-marker.d.ts +3 -3
package/dist/src/evals/session.d.ts +46 -0
package/dist/src/evals/session.js +1 -0
package/dist/src/evals/types.d.ts +154 -57
package/dist/src/execution/tool-auth.js +1 -1
package/dist/src/internal/application/package.js +1 -1
package/dist/src/protocol/message.d.ts +8 -0
package/dist/src/protocol/message.js +2 -2
package/dist/src/runtime/connections/mcp-client.js +1 -1
package/dist/src/runtime/connections/scoped-authorization.d.ts +9 -5
package/dist/src/runtime/connections/scoped-authorization.js +1 -1
package/dist/src/runtime/connections/types.d.ts +24 -0
package/dist/src/setup/boxes/deploy-project.js +1 -1
package/dist/src/setup/scaffold/create/project.js +1 -1
package/dist/src/setup/scaffold/update/channels.js +2 -2
package/dist/src/setup/scaffold/update/connections.js +1 -1
package/dist/src/svelte/index.js +1 -1
package/dist/src/svelte/use-eve-agent.js +1 -1
package/dist/src/vue/index.js +1 -1
package/dist/src/vue/use-eve-agent.js +1 -1
package/package.json +1 -1
package/dist/src/evals/define-eval-suite.js +0 -1
package/dist/src/evals/runner/execute-suite.js +0 -1

package/CHANGELOG.md CHANGED

@@ -1,5 +1,18 @@
 # eve
+## 0.6.0-beta.11
+### Minor Changes
+- 9bb6371: Adds the evals interaction API: scripted cases, `task.run`, `EveEvalSession` helpers for HITL responses and file attachments, multi-session event capture, and client-level HITL request results with retry handling for stream registration races.
+- 22dda94: Rename the eval authoring helper from `defineEvalSuite` to `defineEval` and update the public eval types, reporter hooks, CLI wording, and JSON result shape to use eval terminology consistently.
+### Patch Changes
+- a0ca3bb: Scaffolded projects now pin `@vercel/connect@0.2.2`. The previously pinned 0.1.1 predates the `@vercel/connect/eve` entrypoint, so generated channels and connections failed to build on deploy with `Package subpath './eve' is not defined by "exports"`.
+- a0ca3bb: Fix interactive post-setup deploys failing with "The `projectSettings` object is required for new projects". The setup deploy now passes `--yes` to `vercel deploy --prod` in interactive runs too, so the Vercel API auto-detects framework settings for projects that eve provisioned through the projects API.
+- 380693d: Token eviction on a rejected bearer now cascades to the authorization strategy's own cache, not just Eve's per-step cache. `AuthorizationDefinition` gains an optional `evict(opts)` hook, and the shared `evictScopedToken` path (used by both authored tools and MCP connections) invokes it after dropping the per-step entry. This lets `@vercel/connect`-backed connections purge their in-process token cache so a revoked-but-unexpired grant is genuinely re-fetched instead of re-read from a lower cache layer.
 ## 0.6.0-beta.10
 ### Minor Changes
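
Reviewer sketch for changelog entry 380693d: a minimal illustration of why eviction has to cascade through cache layers. Every name here (`AuthorizationSketch`, `createStepTokenCache`, `createCachingStrategy`) is hypothetical and is not the package's API; only the optional `evict(opts)` hook idea comes from the changelog. Token fetching is synchronous here purely for brevity.

```typescript
// Hypothetical shapes for illustration only; the real declarations in the
// package may differ in names, arguments, and async behavior.
interface EvictOptions {
  readonly scope: string;
}

interface AuthorizationSketch {
  fetchToken(scope: string): string;
  /** Optional hook: purge the strategy's own cache for this scope. */
  evict?(opts: EvictOptions): void;
}

interface StepTokenCache {
  get(scope: string): string;
  evict(scope: string): void;
}

/** Per-step cache layered on top of whatever the strategy caches itself. */
function createStepTokenCache(auth: AuthorizationSketch): StepTokenCache {
  const tokens = new Map<string, string>();
  return {
    get(scope: string): string {
      const cached = tokens.get(scope);
      if (cached !== undefined) return cached;
      const fresh = auth.fetchToken(scope);
      tokens.set(scope, fresh);
      return fresh;
    },
    evict(scope: string): void {
      // Drop the per-step entry first, then cascade to the strategy's hook.
      tokens.delete(scope);
      auth.evict?.({ scope });
    },
  };
}

/** A strategy with its own in-process cache, standing in for a lower layer. */
function createCachingStrategy(issue: () => string): AuthorizationSketch {
  const cache = new Map<string, string>();
  return {
    fetchToken(scope: string): string {
      const cached = cache.get(scope);
      if (cached !== undefined) return cached;
      const fresh = issue();
      cache.set(scope, fresh);
      return fresh;
    },
    evict(opts: EvictOptions): void {
      cache.delete(opts.scope);
    },
  };
}
```

Without the hook, evicting the per-step entry alone would simply re-read the same token from the strategy's cache, which is the failure mode the entry describes.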

package/README.md CHANGED

@@ -52,7 +52,7 @@ Every authored directory has a typed helper. Import each from the matching subpa
 | `eveChannel(...)`, `slackChannel(...)`, `vercelOidc(...)`                                                           | `eve/channels/eve`, `/slack`, `/auth` | reused from `channels/<name>.ts`                 |
 | `defineSandbox(...)`                                                                                                | `eve/sandbox`                         | `sandbox.ts` (or `sandbox/sandbox.ts`)           |
 | `defineSchedule(...)`                                                                                               | `eve/schedules`                       | `schedules/<name>.ts` (or `schedules/<name>.md`) |
-| `defineEvalSuite(...)`                                                                                              | `eve/evals`                           | `evals/<name>.eval.ts`                           |
+| `defineEval(...)`                                                                                                   | `eve/evals`                           | `evals/<name>.eval.ts`                           |
 Runtime accessors live on the subpath that owns the concern:

package/dist/docs/evals-v2-plan.md CHANGED

@@ -6,7 +6,7 @@ Scope: `packages/eve` (evals, client, CLI), `e2e/`, CI
 ## Summary
-Eve's eval suites (`defineEvalSuite` + `eve eval`) and the hand-rolled e2e smoke
+Eve's evals (`defineEval` + `eve eval`) and the hand-rolled e2e smoke
 surface (`e2e/tests/**` + `ExampleClient`) are two implementations of the same
 idea: drive a real agent over HTTP and judge what happened. The eval runner has
 the right bones — filesystem discovery, target resolution, stream capture,
@@ -23,16 +23,16 @@ missing:
 1. **An imperative interaction API** — `run(ctx)` with a typed `EvalSession`
    driver for multi-turn control flow, HITL responses, approvals, structured
-   output, attachments, and multi-session scenarios. Available at suite level
+   output, attachments, and multi-session scenarios. Available at eval level
    (`task.run`, the shared default for dataset evals) and per case
-   (`case.run`), so one suite groups many distinct scripted behaviors the way
+   (`case.run`), so one eval groups many distinct scripted behaviors the way
    a test file groups `it` blocks.
 2. **A hard-assertion tier** — `checks`, distinct from `scores`. Check failures
    fail the case and the process; scores remain soft, thresholded data.
 3. **A URL-shaped target model with verified requirements** — a target is
    always just a URL. `eve eval` obtains one (boots the dev server, as today)
    or `--url` brings your own (a built server in CI, a preview deployment).
-   Suites never declare _how_ to provision the agent — they declare
+   Evals never declare _how_ to provision the agent — they declare
    `requires` assumptions (mock models, dev routes, sidecar env) that the
    runner verifies against the live target and skips or errors on, visibly.
    Provisioning (build, start, env injection, sidecars, secondary agents)
@@ -41,7 +41,7 @@ missing:
    `e2e/tests/basic-runtime/evals.ts` already proves out today.
 The end state: every behavior currently proven by `e2e/tests/**` (except the
-TUI tests, see Non-goals) is an eval suite in a fixture app's `evals/`
+TUI tests, see Non-goals) is an eval in a fixture app's `evals/`
 directory, CI runs `eve eval --strict` per fixture app per matrix mode, and
 `e2e/lib/client.ts` plus the per-file harness code are deleted. Users get the
 same machinery for their own agents: smoke-test your agent the way Eve
@@ -49,7 +49,7 @@ smoke-tests itself.
 ## Why not a `describe`/`it` runner
-We considered replacing `defineEvalSuite` with a vitest/`node:test`-shaped API.
+We considered replacing `defineEval` with a vitest/`node:test`-shaped API.
 Rejected, for these reasons:
 1. **Evals have semantics tests don't.** Scores in `[0,1]`, per-scorer
@@ -58,8 +58,8 @@ Rejected, for these reasons:
    pass/fail; we would immediately reinvent scorers and reporters _inside_ `it`
    blocks and lose the structured result model that powers `--json` and the
    Braintrust reporter.
-2. **Suite identity is path-derived** (repo principle 5, enforced by
-   `scripts/guard-invariants.mjs`). One suite per `evals/<path>.eval.ts` file
+2. **Eval identity is path-derived** (repo principle 5, enforced by
+   `scripts/guard-invariants.mjs`). One eval per `evals/<path>.eval.ts` file
    maps cleanly onto the filesystem-first philosophy; nested describe blocks
    fight it.
 3. **The dataset model is the right default.** "Load 200 cases from YAML, fan
@@ -67,7 +67,7 @@ Rejected, for these reasons:
    A block-based runner makes that the awkward case (loops generating `it`s).
 4. **What e2e actually needs is not blocks** — it is control flow _inside a
    case_, a typed session driver, and assertion helpers over the event stream.
-   All three fit inside the existing suite shape: `run` at the task and case
+   All three fit inside the existing eval shape: `run` at the task and case
    level provides the control flow, and scripted cases give the same grouping
    a `describe` file gives `it` blocks (see "Scripted cases" below).
 5. **Wrapping vitest violates principle 3** (wrap third-party deps; don't
@@ -88,15 +88,15 @@ first-class rather than by switching paradigms.
   arbitrary event predicates.
 - Hard pass/fail semantics suitable for CI gating, coexisting with soft scores.
 - One target model: any URL — the runner-booted dev server, a locally built
-  `eve start` process, or a deployed instance — with suite-declared
-  requirements verified against the live target instead of suite-owned
+  `eve start` process, or a deployed instance — with eval-declared
+  requirements verified against the live target instead of eval-owned
   provisioning.
 - Drive surfaces beyond the session route: channels (webhook ingress) and
-  schedules (dev dispatch), with stream consumption for sessions the suite did
+  schedules (dev dispatch), with stream consumption for sessions the eval did
   not create.
 - One HTTP client: `eve/client` absorbs everything `e2e/lib/client.ts` does;
   `ExampleClient` is deleted.
-- Replace `e2e/tests/**` (minus TUI) with eval suites; CI becomes
+- Replace `e2e/tests/**` (minus TUI) with evals; CI becomes
   `eve eval --strict` runs.
 ## Non-goals
@@ -108,13 +108,13 @@ first-class rather than by switching paradigms.
   tier only.
 - **Braintrust/autoevals integration** is unchanged in shape (still wrapped
   per principle 3).
-- **No authored suite `id`/`name`.** Identity stays path-derived; the
+- **No authored eval `id`/`name`.** Identity stays path-derived; the
   invariant guard keeps enforcing it.
 ## Current state (abridged; see code for detail)
-- Suite API: `defineEvalSuite({ cases | load, task, scores, model, thresholds,
-reporters, ... })` in `packages/eve/src/evals/define-eval-suite.ts` and
+- Eval API: `defineEval({ cases | load, task, scores, model, thresholds,
+reporters, ... })` in `packages/eve/src/evals/define-eval.ts` and
   `types.ts`. `task` is `prompt(case)` or `messages(case) => string[]` — a
   static list, no branching, no HITL (`runner/execute-case.ts:38` only ever
   sends `{ message }`).
@@ -140,17 +140,17 @@ reporters, ... })` in `packages/eve/src/evals/define-eval-suite.ts` and
 ## The v2 API
-### Suite shape
+### Eval shape
 ```ts
-import { defineEvalSuite } from "eve/evals";
+import { defineEval } from "eve/evals";
 import { Checks } from "eve/evals/checks";
 import { Run, Text } from "eve/evals/scores";
-export default defineEvalSuite({
+export default defineEval({
   description: "HITL approval flows: park, approve, deny, persist.",
-  // Suite-level checks apply to every case.
+  // Eval-level checks apply to every case.
   checks: [Checks.didNotFail()],
   scores: [Run.didNotFail()],
@@ -180,20 +180,20 @@ export default defineEvalSuite({
 });
 ```
-Changes to `EveEvalSuiteInput` (`packages/eve/src/evals/types.ts`):
+Changes to `EveEvalInput` (`packages/eve/src/evals/types.ts`):
-| Field            | Change                                                                                                                                                                                                                                                                                                                                                           |
-| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `task.run`       | New third task variant, mutually exclusive with `prompt`/`messages`. `prompt` and `messages` become sugar implemented on top of `run`.                                                                                                                                                                                                                           |
-| scripted cases   | `EveEvalCase` becomes a union: a **data case** (`input`/`expected`, run through the suite `task` — today's shape) or a **scripted case** (`run`, no `input` required). `case.run` overrides `suite.task`. See "Scripted cases" below.                                                                                                                            |
-| `checks`         | New optional array of `EveEvalCheck`, at suite level and per case (case-level appends to suite-level). Hard assertions; any failure marks the case failed and flips the CLI exit code.                                                                                                                                                                           |
-| `requires`       | New optional requirement list (suite- and case-level): `"mockModels"`, `"devRoutes"`, `"env:<NAME>"`. Verified against the live target; cases the target cannot support are skipped with a reported verdict, so one suite runs against the dev server, a local build, and a deployed URL. See "Targets". Suites own no provisioning config — no kind/env/setup.  |
-| `model`          | **Now optional.** Required only when a model-backed scorer is present; validation moves from "always required" to "required if any scorer declares it needs a judge" (built-in autoevals scorers carry a marker; custom scorers that read `args.model` get `undefined` unless the suite provides one). Kills the `"eve-bootstrap-model"` dummy-model workaround. |
-| `trials`         | New optional `number` (default 1). Runs each case N times; per-trial results are reported, a case passes only if every trial's checks pass, and scores aggregate as mean. For nondeterministic model-backed suites.                                                                                                                                              |
-| `tags` filtering | Case and suite `tags` become functional (CLI `--tag`).                                                                                                                                                                                                                                                                                                           |
+| Field            | Change                                                                                                                                                                                                                                                                                                                                                          |
+| ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `task.run`       | New third task variant, mutually exclusive with `prompt`/`messages`. `prompt` and `messages` become sugar implemented on top of `run`.                                                                                                                                                                                                                          |
+| scripted cases   | `EveEvalCase` becomes a union: a **data case** (`input`/`expected`, run through the eval `task` — today's shape) or a **scripted case** (`run`, no `input` required). `case.run` overrides `eval.task`. See "Scripted cases" below.                                                                                                                             |
+| `checks`         | New optional array of `EveEvalCheck`, at eval level and per case (case-level appends to eval-level). Hard assertions; any failure marks the case failed and flips the CLI exit code.                                                                                                                                                                            |
+| `requires`       | New optional requirement list (eval- and case-level): `"mockModels"`, `"devRoutes"`, `"env:<NAME>"`. Verified against the live target; cases the target cannot support are skipped with a reported verdict, so one eval runs against the dev server, a local build, and a deployed URL. See "Targets". Evals own no provisioning config — no kind/env/setup.    |
+| `model`          | **Now optional.** Required only when a model-backed scorer is present; validation moves from "always required" to "required if any scorer declares it needs a judge" (built-in autoevals scorers carry a marker; custom scorers that read `args.model` get `undefined` unless the eval provides one). Kills the `"eve-bootstrap-model"` dummy-model workaround. |
+| `trials`         | New optional `number` (default 1). Runs each case N times; per-trial results are reported, a case passes only if every trial's checks pass, and scores aggregate as mean. For nondeterministic model-backed evals.                                                                                                                                              |
+| `tags` filtering | Case and eval `tags` become functional (CLI `--tag`).                                                                                                                                                                                                                                                                                                           |
 Everything else (`cases`/`load`, `scores`, `thresholds`, `reporters`,
-`maxConcurrency`, `timeoutMs`, `metadata`) is unchanged. Existing suites keep
+`maxConcurrency`, `timeoutMs`, `metadata`) is unchanged. Existing evals keep
 working except where noted in Breaking changes.
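
Reviewer sketch of the `trials` row above: a case passes only if every trial's checks pass, and scores aggregate as the mean. The `TrialResult` and `AggregatedCase` shapes and the `aggregateTrials` name are assumptions made for illustration, not the package's types. Treating an empty trial list as not passed is also an assumption.

```typescript
// Hypothetical shapes; the runner's real per-trial result type is not shown in this diff.
interface TrialResult {
  readonly checksPassed: boolean;
  readonly scores: Readonly<Record<string, number>>;
}

interface AggregatedCase {
  readonly passed: boolean;
  readonly meanScores: Record<string, number>;
}

function aggregateTrials(trials: readonly TrialResult[]): AggregatedCase {
  // Accumulate a running sum and count per scorer name.
  const totals = new Map<string, { sum: number; count: number }>();
  for (const trial of trials) {
    for (const [name, value] of Object.entries(trial.scores)) {
      const total = totals.get(name) ?? { sum: 0, count: 0 };
      totals.set(name, { sum: total.sum + value, count: total.count + 1 });
    }
  }

  const meanScores: Record<string, number> = {};
  for (const [name, total] of totals) {
    meanScores[name] = total.sum / total.count;
  }

  // Assumption: zero trials counts as "not passed" rather than vacuously passing.
  const passed = trials.length > 0 && trials.every((trial) => trial.checksPassed);
  return { passed, meanScores };
}
```

Under this reading, one flaky trial fails the case outright while still contributing to the mean score.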
 ### `task.run(ctx)` and the run context
@@ -214,7 +214,7 @@ export interface EveEvalRunContext {
   newSession(): EveEvalSession;
   /** Handle to the agent server under test (channels, schedules, raw routes). */
   readonly target: EveEvalTargetHandle;
-  /** Case timeout signal (from suite/CLI `timeoutMs`). */
+  /** Case timeout signal (from eval/CLI `timeoutMs`). */
   readonly signal: AbortSignal;
   /** Structured logger; lines land in the case artifact, and on stdout with `--verbose`. */
   readonly log: (message: string) => void;
@@ -234,13 +234,13 @@ Semantics:
   whole interaction. `derived` facts are computed over the primary session by
   default, with per-session access on the result.
-### Scripted cases: making the suite a suite
+### Scripted cases: making the eval a real grouping
-A suite-level `task` alone means "one interaction script × many data rows" —
+An eval-level `task` alone means "one interaction script × many data rows" —
 the dataset shape. The e2e surface is the inverse: many distinct scripts, one
 execution each (`tool-approval`, `tool-denial`, and `ask-question-flow` are
 three different `run` functions, not three rows). Forcing each script into its
-own suite file with a single dummy-input case would make "suite" a misnomer
+own eval file with a single dummy-input case would make "eval" a misnomer
 and multiply provisioning cost (a provisioned target per file instead of per
 behavior family).
@@ -249,13 +249,13 @@ So `EveEvalCase` becomes a union:
 ```ts
 export type EveEvalCase = EveEvalDataCase | EveEvalScriptedCase;
-/** Today's shape: data routed through the suite-level task. */
+/** Today's shape: data routed through the eval-level task. */
 export interface EveEvalDataCase {
   readonly id: string;
   readonly input: string | Record<string, unknown>;
   readonly expected?: unknown;
-  readonly checks?: readonly EveEvalCheck[]; // appended to suite-level
-  readonly scores?: readonly EveEvalScorer[]; // appended to suite-level
+  readonly checks?: readonly EveEvalCheck[]; // appended to eval-level
+  readonly scores?: readonly EveEvalScorer[]; // appended to eval-level
   readonly tags?: readonly string[];
   readonly metadata?: Readonly<Record<string, unknown>>;
 }
@@ -274,26 +274,26 @@ export interface EveEvalScriptedCase {
 Resolution rules:
-- `case.run` wins over `suite.task`; a data case without a suite `task` falls
+- `case.run` wins over `eval.task`; a data case without an eval `task` falls
   back to today's default (send `input` verbatim).
-- Case-level `checks`/`scores` **append** to suite-level ones; suite-level
+- Case-level `checks`/`scores` **append** to eval-level ones; eval-level
   expresses invariants ("never fails"), case-level expresses the specific
   behavior under test.
-- A suite may freely mix data cases and scripted cases, though in practice
-  quality suites are all-data and smoke suites are all-scripted.
+- An eval may freely mix data cases and scripted cases, though in practice
+  quality evals are all-data and smoke evals are all-scripted.
-The conceptual model this lands on: **the suite file is the `describe`, cases
+The conceptual model this lands on: **the eval file is the `describe`, cases
 are the `it`s** — grouping, one shared target, shared baseline checks and
-requirements — without a block API, and with path-derived identity intact. The suite-level `task` keeps its role as the shared default for
+requirements — without a block API, and with path-derived identity intact. The eval-level `task` keeps its role as the shared default for
 dataset evals; it is no longer the only way to define behavior.
 Execution semantics are uniform across both case kinds:
 - **Concurrency**: scripted cases join the same bounded pool as data cases
-  (each owns its sessions, so they parallelize safely). Suites whose cases
+  (each owns its sessions, so they parallelize safely). Evals whose cases
   mutate shared target state (e.g. `defineState` persistence tests) set
   `maxConcurrency: 1`.
-- **Timeout**: suite/CLI `timeoutMs` applies per case per trial; the signal is
+- **Timeout**: eval/CLI `timeoutMs` applies per case per trial; the signal is
   `ctx.signal` inside `run` and aborts in-flight sends.
 - **Trials**: apply identically — a scripted case under `trials: 3` runs its
   `run` three times against three fresh primary sessions.
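
Reviewer sketch of the resolution rules in this hunk: `case.run` wins over `eval.task`, a data case with no `task` sends its `input` verbatim, and case-level checks append to eval-level ones. The shapes below are deliberately simplified and hypothetical: a "script" just returns the messages it would send, a check is only a name, and `input` is restricted to a string.

```typescript
// Hypothetical, simplified shapes; the real run functions take a context with sessions.
type Script = (input: string | undefined) => string[];

interface Check {
  readonly name: string;
}

interface SketchCase {
  readonly id: string;
  readonly input?: string;
  readonly run?: Script;
  readonly checks?: readonly Check[];
}

interface SketchEval {
  readonly task?: Script;
  readonly checks?: readonly Check[];
}

/** case.run first, then the eval-level task, then "send input verbatim". */
function resolveScript(evalDef: SketchEval, evalCase: SketchCase): Script {
  if (evalCase.run) return evalCase.run;
  if (evalDef.task) return evalDef.task;
  return (input: string | undefined): string[] => (input === undefined ? [] : [input]);
}

/** Eval-level checks come first; case-level checks are appended, never replace. */
function resolveChecks(evalDef: SketchEval, evalCase: SketchCase): Check[] {
  return [...(evalDef.checks ?? []), ...(evalCase.checks ?? [])];
}
```

Appending rather than overriding is what lets the eval-level list carry invariants while each case adds only the behavior it is testing.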
@@ -358,7 +358,7 @@ Notes:
 - `send` does **not** throw on `turn.failed`/`session.failed` by default —
   failure handling belongs to checks (`Checks.completed()`) or explicit
-  `turn.expectOk()`. This keeps negative-path suites (e.g. today's
+  `turn.expectOk()`. This keeps negative-path evals (e.g. today's
   `remote-agent-start-failure.ts`, `tool-throw-recover.ts`) natural to write.
 - HITL coverage: `needsApproval` approvals (`approve`/`deny` option ids),
   framework `ask_question` selects (`optionId`), freeform answers
@@ -401,7 +401,7 @@ Built-ins, exported from `eve/evals/checks`:
 | Check                                                         | Asserts                                                                                                                                                                                                                            |
 | ------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `Checks.completed()`                                          | Final status is `"completed"` (not `"waiting"`, not `"failed"`)                                                                                                                                                                    |
-| `Checks.waiting()`                                            | Final status is `"waiting"` (for park-shaped suites)                                                                                                                                                                               |
+| `Checks.waiting()`                                            | Final status is `"waiting"` (for park-shaped evals)                                                                                                                                                                                |
 | `Checks.didNotFail()`                                         | Status is not `"failed"` and no `turn.failed`/`step.failed` events                                                                                                                                                                 |
 | `Checks.messageIncludes(token)`                               | Joined `message.completed` text contains `token` (string or RegExp)                                                                                                                                                                |
 | `Checks.outputEquals(value)` / `Checks.outputMatches(schema)` | Deep-equal / Standard Schema validation of `result.output`                                                                                                                                                                         |
@@ -419,7 +419,7 @@ Pass/fail policy:
 - Failed cases always produce a non-zero `eve eval` exit code.
 - Scores keep today's semantics: thresholded, reported, never gate the exit
   code — **unless** `--strict` is passed, which additionally fails the process
-  when any case scores below threshold. CI for fixture smoke suites runs
+  when any case scores below threshold. CI for fixture smoke evals runs
   `--strict`; users running exploratory quality evals don't.
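
Reviewer sketch of the pass/fail policy above: failed cases always yield a non-zero exit code, while scores below threshold affect the exit code only under `--strict`. The `CaseOutcome` shape and the `evalExitCode` name are assumptions for illustration, and `1` stands in for whatever non-zero code the CLI actually uses.

```typescript
// Hypothetical shape; the real case result carries far more detail.
interface CaseOutcome {
  readonly verdict: "passed" | "failed" | "skipped";
  /** True when at least one score for this case fell below its threshold. */
  readonly belowThreshold: boolean;
}

function evalExitCode(cases: readonly CaseOutcome[], strict: boolean): 0 | 1 {
  // Hard tier: any failed case gates the process unconditionally.
  const anyFailed = cases.some((outcome) => outcome.verdict === "failed");
  // Soft tier: thresholds are reported, and gate only when strict is set.
  const anyBelowThreshold = cases.some((outcome) => outcome.belowThreshold);
  return anyFailed || (strict && anyBelowThreshold) ? 1 : 0;
}
```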
 ### Derived facts v2 (breaking)
@@ -477,8 +477,8 @@ export interface EveEvalCaseResult {
   readonly skipReason?: string; // unmet requirement, when skipped
 }
-export interface EveEvalSuiteResult {
-  readonly suite: string;
+export interface EveEvalResult {
+  readonly id: string;
   readonly target: EveEvalTarget;
   readonly cases: readonly EveEvalCaseResult[];
   readonly startedAt: string;
@@ -498,9 +498,9 @@ Downstream effects:
   requirement inline); failed checks print their `message` indented under the
   case line (replacing the bespoke error prose e2e tests hand-craft today).
   Summary adds check and skip totals.
-- **`EvalReporter` interface**: unchanged shape (`onSuiteStart` /
-  `onCaseComplete` / `onSuiteComplete`) — the richer `EveEvalCaseResult`
-  flows through existing hooks, so custom reporters keep compiling.
+- **`EvalReporter` interface**: lifecycle hooks are `onEvalStart` /
+  `onCaseComplete` / `onEvalComplete`; the richer `EveEvalCaseResult`
+  flows through case-completion hooks.
 - **Braintrust reporter**: checks log as binary scores under a `check:` name
   prefix (e.g. `check:toolCalled(bash)`) so experiments diff check regressions
   the same way they diff score regressions; `verdict` and failed-check
@@ -511,16 +511,16 @@ Downstream effects:
   file per session, keyed by session id.
 - **Reporter throughput**: `onCaseComplete` is currently awaited inline inside
   the case pool, so a slow reporter throttles execution; v2 queues reporter
-  callbacks off the hot path (ordering preserved per suite).
+  callbacks off the hot path (ordering preserved per eval).
 ### Targets: a target is a URL
-There is no suite-owned provisioning config. A target is always just a base
-URL, and the suite's job is to interact and assert — never to describe how the
+There is no eval-owned provisioning config. A target is always just a base
+URL, and the eval's job is to interact and assert — never to describe how the
 agent gets built or started. This is deliberate: properties like the agent's
 model or the mock-model adapter are **build/start-time properties of the
-server process**. A suite cannot control them at run time, so an API that lets
-a suite "declare" them is a footgun — it works only when the runner happens to
+server process**. An eval cannot control them at run time, so an API that lets
+an eval "declare" them is a footgun — it works only when the runner happens to
 be the thing booting the server, and silently means nothing otherwise.
 There are exactly two ways to get a target:
@@ -565,9 +565,9 @@ export interface EveEvalTargetHandle {
 }
 ```
-This covers channel suites (POST a signed webhook via `target.fetch`, assert
+This covers channel evals (POST a signed webhook via `target.fetch`, assert
 on an externally provisioned fake provider, attach to the created session) and
-schedule suites (`dispatchSchedule` + `attachSession`).
+schedule evals (`dispatchSchedule` + `attachSession`).
 The runner always performs the readiness/identity handshake regardless of who
 provisioned the target: `/eve/v1/health` polling, then `/eve/v1/info`
@@ -577,12 +577,12 @@ capabilities are discovered, not assumed.
 ### Requirements: `requires`, verified against the live target
-Suites cannot control the target, but their assertions still _assume_ things
+Evals cannot control the target, but their assertions still _assume_ things
 about it — determinism via mock models, dev-only routes, a sidecar URL in the
 environment. v1 gives those assumptions exactly one surface:
 ```ts
-// Per suite (applies to all cases) and per case (additive):
+// Per eval (applies to all cases) and per case (additive):
 readonly requires?: readonly EveEvalRequirement[];
 type EveEvalRequirement =
@@ -598,24 +598,24 @@ Rules:
    `env:<NAME>` against its own process environment. It never tries to make a
    requirement true.
 2. **Unmet requirement → skip, visibly.** The case (or every case, for
-   suite-level `requires`) gets `verdict: "skipped"` with the unmet
+   eval-level `requires`) gets `verdict: "skipped"` with the unmet
    requirement as the reason — reported in console, `--json`, and artifacts;
    never silently dropped, never failed. `--no-skips` turns skips into
    failures for legs that must prove full coverage. Skips don't otherwise
    affect the exit code.
 3. **One convenience, because the runner boots the dev server:** plain
-   `eve eval` with suites requiring `"mockModels"` either passes
-   `--mock-models` or sees those suites skip with a message naming the flag.
+   `eve eval` with evals requiring `"mockModels"` either passes
+   `--mock-models` or sees those evals skip with a message naming the flag.
    No auto-magic in v1 — explicit and predictable beats clever.
 4. **Runtime guards back the declarations.** `target.dispatchSchedule` throws
    a requirement error when called by a case that didn't declare
    `"devRoutes"` — so undeclared dependencies surface as named failures on
    the local leg, not as mystery flakes on the remote leg.
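Rules 1, 2, and 4 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the runner's implementation: `Requirement`, `TargetFacts`, `gateCase`, and `assertDeclared` are hypothetical names, and the sketch assumes the target's mock-model and dev-route state has already been discovered from the live target.

```typescript
// Hypothetical stand-in for EveEvalRequirement, built from the three kinds
// this plan names: "mockModels", "devRoutes", and "env:<NAME>".
type Requirement = "mockModels" | "devRoutes" | `env:${string}`;

// Assumed shape: facts already discovered from the live target.
interface TargetFacts {
  readonly mockModels: boolean;
  readonly devRoutes: boolean;
}

type Env = Readonly<Record<string, string | undefined>>;

type Gate =
  | { readonly verdict: "run" }
  | { readonly verdict: "skipped" | "failed"; readonly reason: string };

// Rule 1: verify only; never try to make a requirement true.
function isMet(req: Requirement, target: TargetFacts, env: Env): boolean {
  if (req === "mockModels") return target.mockModels;
  if (req === "devRoutes") return target.devRoutes;
  // Remaining kind is "env:<NAME>": a presence check on the runner's env.
  return env[req.slice("env:".length)] !== undefined;
}

// Rule 2: eval-level and case-level requirements are additive. The first
// unmet one becomes the visible reason; --no-skips turns the skip into a failure.
function gateCase(
  evalRequires: readonly Requirement[],
  caseRequires: readonly Requirement[],
  target: TargetFacts,
  env: Env,
  noSkips: boolean,
): Gate {
  for (const req of [...evalRequires, ...caseRequires]) {
    if (!isMet(req, target, env)) {
      return {
        verdict: noSkips ? "failed" : "skipped",
        reason: `unmet requirement: ${req}`,
      };
    }
  }
  return { verdict: "run" };
}

// Rule 4: a requirement-gated handle method refuses to run when the case
// did not declare the requirement it depends on.
function assertDeclared(
  declared: readonly Requirement[],
  needed: Requirement,
  method: string,
): void {
  if (!declared.some((r) => r === needed)) {
    throw new Error(`${method} needs "${needed}" declared in requires`);
  }
}
```

In this sketch a `dispatchSchedule` implementation would call `assertDeclared(caseRequires, "devRoutes", "dispatchSchedule")` before doing any work, so a missing declaration fails by name on the local leg.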
-The deferred idea — a suite-owned `environment` block where the runner builds
+The deferred idea — an eval-owned `environment` block where the runner builds
 and starts targets, injects env, and manages sidecar lifecycles — is recorded
 under Open questions. It is not in v1: it duplicated CLI concerns, and it let
-suites express build-time properties they cannot actually own.
+evals express build-time properties they cannot actually own.
 ### The external provisioner pattern
@@ -636,38 +636,38 @@ node e2e/provision/<group>.ts &   # build, sidecars, env, eve start, health-poll
 pnpm --filter <fixture-app> exec eve eval --strict --url "http://127.0.0.1:$PORT"
 ```
-Suites consume provisioner outputs through env (declared via `env:<NAME>`
-requirements): a channel suite reads `TELEGRAM_PROBE_URL` to query what the
-fake Bot API captured; a subagent suite reads `EVE_WEATHER_AGENT_HOST` to
+Evals consume provisioner outputs through env (declared via `env:<NAME>`
+requirements): a channel eval reads `TELEGRAM_PROBE_URL` to query what the
+fake Bot API captured; a subagent eval reads `EVE_WEATHER_AGENT_HOST` to
 assert on `subagent.called` remote URLs. The probe/stub helpers
 (`startHttpProbe`, `startMcpStub`, generalized from `e2e/lib/`) ship in
 `eve/evals/environment` for provisioners — including users' own — to use, with
-an HTTP inspection endpoint so suites can assert on captured requests across
+an HTTP inspection endpoint so evals can assert on captured requests across
 the process boundary.
 ### CLI v2
 ```
-eve eval [suiteId...]
+eve eval [evalId...]
   --url <url>              run against an existing target (built local server,
                            preview deployment); without it, boots the dev server
   --mock-models            boot the dev server with the deterministic mock
                            adapter (invalid with --url; the target's mock state
                            is discovered, not set)
-  --tag <tag...>           run only cases (or suites) carrying a tag
+  --tag <tag...>           run only cases (or evals) carrying a tag
   --case <id...>           run only specific case ids
   --strict                 sub-threshold scores also fail the exit code
   --no-skips               requirement skips fail instead of skipping
-  --trials <n>             override suite trials
+  --trials <n>             override eval trials
   --timeout <ms>           per-case timeout (existing)
   --max-concurrency <n>    (existing)
   --json                   structured stdout (existing)
-  --skip-report            skip suite reporters (existing)
-  --list                   print discovered suites/cases without running
+  --skip-report            skip eval reporters (existing)
+  --list                   print discovered evals/cases without running
   --verbose                stream per-case ctx.log and event summaries
 ```
-- Positional suite ids replace `--suite`; the dead `--all` flag is removed
+- Positional eval ids select evals; the dead `--all` flag is removed
   (no filter already means all).
 - Exit codes: `0` all cases passed checks (and thresholds under `--strict`);
   `1` any case failed (check failure, run throw, execution error, or strict
@@ -702,7 +702,7 @@ into the framework, then it dies:
 3. **Turn failure surfacing**: export a typed
    `isTurnFailureEvent(event)` narrowing helper and the
    `EveEvalTurn.expectOk()` driver method instead of `TurnFailedError`-style
-   throw-by-default (negative-path suites need non-throwing sends).
+   throw-by-default (negative-path evals need non-throwing sends).
 4. **Multimodal sugar** (`sendTextWithImage`): becomes
    `EvalSession.sendFile`; the data-URL/`FilePart` encoding helper is exported
    from `eve/client` for general use.
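For item 4, a data-URL encoding helper could look roughly like the sketch below. The names `toDataUrl` and `SketchFilePart` are hypothetical and the part shape is an assumption for illustration; the actual helper and `FilePart` type exported from `eve/client` may differ.

```typescript
// Hypothetical part shape, for illustration only.
interface SketchFilePart {
  readonly type: "file";
  readonly mediaType: string;
  readonly url: string;
}

// Encode raw bytes as a base64 data URL (RFC 2397 form:
// data:<mediatype>;base64,<data>).
function toDataUrl(bytes: Uint8Array, mediaType: string): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  // btoa is a global in browsers and in current Node.js releases.
  return `data:${mediaType};base64,${btoa(binary)}`;
}

// Wrap the encoded bytes in the assumed part shape.
function toFilePart(bytes: Uint8Array, mediaType: string): SketchFilePart {
  return { type: "file", mediaType, url: toDataUrl(bytes, mediaType) };
}
```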
@@ -710,20 +710,20 @@ into the framework, then it dies:
 ## Replacing the e2e surface
-### Where suites live
+### Where evals live
-Each fixture app keeps owning its coverage: suites move into
+Each fixture app keeps owning its coverage: evals move into
 `e2e/fixtures/agent-*/evals/*.eval.ts` and
 `apps/fixtures/weather-fixture/evals/*.eval.ts`. Discovery already scans
-`<appRoot>/evals/`. The area-policy module becomes unnecessary — a suite can
+`<appRoot>/evals/`. The area-policy module becomes unnecessary — an eval can
 only target its own app, by construction. Provisioning scripts live next to
 the fixtures (`e2e/provision/`), built from today's `e2e/lib/server.ts` and
 `e2e/target/` logic rather than rewriting it.
 ### Coverage mapping
-Scripted cases let one suite absorb a whole e2e group: today's 78 script files
-consolidate into roughly one suite per behavior family (e.g.
+Scripted cases let one eval absorb a whole e2e group: today's 78 script files
+consolidate into roughly one eval per behavior family (e.g.
 `agent-tools-hitl/evals/hitl.eval.ts` with `approve-then-persist`,
 `deny-regates`, `ask-question`, and `tool-auth` cases), each sharing one
 provisioned target. `--case` becomes the day-to-day tool for re-running a
@@ -734,11 +734,11 @@ single behavior while debugging.
 | `basic-runtime/*` (basic, multi-turn history, client context, output schema, image, define-state) | `task.run` + `Checks.messageIncludes` / `outputMatches`; `sendFile` for image; `send({ clientContext })`; multi-turn token recall is two `send`s and one check                                                                            |
 | `tools/*` (14 dynamic-tool files, MCP, multi-step loop, narrowing, throw-recover)                 | `Checks.toolCalled(name, { input, output, isError })`, `Checks.toolOrder`, `Checks.event` for ordering edge cases; MCP stub started by the provisioner (`startMcpStub`), addressed via `env:` requirement                                 |
 | `tools-hitl/*` (approval, denial, ask-question, tool auth)                                        | `expectInputRequests` + `respond`/`respondAll`; auth flows keep the IdP emulator as a provisioner sidecar                                                                                                                                 |
-| `tools-sandbox/*`                                                                                 | `task.run` + `Checks.toolCalled("bash", ...)`; snapshot suite tagged `requires-credentials`, excluded in CI via `--tag`                                                                                                                   |
+| `tools-sandbox/*`                                                                                 | `task.run` + `Checks.toolCalled("bash", ...)`; snapshot eval tagged `requires-credentials`, excluded in CI via `--tag`                                                                                                                    |
 | `channels/*`                                                                                      | provisioner starts the fake provider (one shared `startHttpProbe`) and exports `*_API_BASE_URL`; case does `target.fetch` webhook ingress, asserts on probe captures via its inspection endpoint, `attachSession` for the created session |
 | `schedules/*`                                                                                     | `target.dispatchSchedule` + `attachSession`; stream-resume test asserts `attachSession({ startIndex })` replay                                                                                                                            |
 | `subagents/*` (incl. remote delegation, callbacks, failures)                                      | provisioner boots both agents and wires `EVE_WEATHER_AGENT_HOST`; `Checks.subagentCalled(name, { remoteUrl })`; callback retry/bypass probes as provisioner sidecars                                                                      |
-| `codemode/*`                                                                                      | Unchanged suite bodies; CI matrix env (`EVE_EXPERIMENTAL_CODE_MODE=1`) is set by the CI job / provisioner when starting the server                                                                                                        |
+| `codemode/*`                                                                                      | Unchanged eval bodies; CI matrix env (`EVE_EXPERIMENTAL_CODE_MODE=1`) is set by the CI job / provisioner when starting the server                                                                                                         |
 | `tui-client/*`                                                                                    | **Stays a script harness** (non-goal)                                                                                                                                                                                                     |
 ### Worked example: porting `remote-agent-delegation`
@@ -746,15 +746,15 @@ single behavior while debugging.
 Today's `e2e/tests/subagents/remote-agent-delegation.ts` is 101 lines: manual
 port arithmetic, two `resolveTarget` calls threading `startEnv` by hand, and a
 58-line `assertRemoteDelegation` function of filter/narrow/throw prose. The v2
-suite, at `e2e/fixtures/agent-subagents/evals/remote-delegation.eval.ts`:
+eval, at `e2e/fixtures/agent-subagents/evals/remote-delegation.eval.ts`:
 ```ts
-import { defineEvalSuite } from "eve/evals";
+import { defineEval } from "eve/evals";
 import { Checks } from "eve/evals/checks";
 const CITY = "Lisbon";
-export default defineEvalSuite({
+export default defineEval({
   description: "Remote subagent delegation over HTTP to a second local agent.",
   // Assumptions about the target, verified by the runner. The provisioner
@@ -795,8 +795,8 @@ weather fixture with mocks, boot the parent with mocks +
 What the diff buys, beyond line count:
-- **Clean separation** — the suite holds only interaction and assertions; the
-  topology lives in one provisioner shared by every subagent case. The suite
+- **Clean separation** — the eval holds only interaction and assertions; the
+  topology lives in one provisioner shared by every subagent case. The eval
   states its assumptions (`requires`) and the runner enforces them, so running
   it against an unprovisioned target skips with a named reason instead of
   failing mysteriously.
@@ -807,14 +807,14 @@ What the diff buys, beyond line count:
   configured) Braintrust like any other eval, and
   `--case weather-result-reaches-parent` reruns it in isolation.
 - **Room to grow** — `remote-agent-callback-retry`, `-bypass`, and
-  `-start-failure` become sibling cases in the same suite, sharing one
+  `-start-failure` become sibling cases in the same eval, sharing one
   provisioned topology instead of re-spawning per file.
 ### CI
 `.github/workflows/smoke.yml` discovery changes from globbing
 `e2e/tests/*/*.ts` to globbing fixture apps with `evals/` directories. Each
-matrix leg provisions, then runs the suites against the resulting URL:
+matrix leg provisions, then runs the evals against the resulting URL:
 ```sh
 node e2e/provision/<group>.ts &        # build + sidecars + env + eve start
@@ -822,7 +822,7 @@ pnpm --filter <fixture-app> exec eve eval --strict --json --url "$TARGET_URL"
 ```
 twice (direct / `EVE_EXPERIMENTAL_CODE_MODE=1` set by the provisioner),
-`fail-fast: false`, JUnit reporter for annotations. Per-suite artifacts under
+`fail-fast: false`, JUnit reporter for annotations. Per-eval artifacts under
 `.eve/evals/` upload on failure — strictly better debuggability than today's
 stdout scraping. A post-deploy leg is the same invocation pointed at a preview
 deployment; requirement-incompatible cases skip visibly.
@@ -871,7 +871,7 @@ Each phase ships independently, keeps `pnpm test` green, and includes docs
 8. `EveEvalTargetHandle`: `baseUrl`, `fetch`, `info`, `capabilities`,
    `dispatchSchedule`, `attachSession`; `--mock-models` for the runner-booted
    dev server; readiness/identity handshake for `--url` targets.
-9. Requirements: suite/case `requires` (`mockModels` / `devRoutes` /
+9. Requirements: eval/case `requires` (`mockModels` / `devRoutes` /
    `env:<NAME>`), `skipped` verdict + `--no-skips`, runtime guards on
    requirement-gated handle methods, and `/eve/v1/info` reporting mock-model
    state so requirements are verified against the live target.
@@ -883,7 +883,7 @@ Each phase ships independently, keeps `pnpm test` green, and includes docs
 ### Phase 4 — Migration and deletion
-12. Port suites group by group in the order: `basic-runtime` → `tools` →
+12. Port evals group by group in the order: `basic-runtime` → `tools` →
     `tools-hitl` → `subagents` → `schedules` → `channels` → `codemode` →
     `tools-sandbox`. Each ported group flips its CI matrix entry from
     `node e2e/tests/...` to provision + `eve eval --strict --url` in the same
@@ -892,7 +892,7 @@ Each phase ships independently, keeps `pnpm test` green, and includes docs
     area policy; shrink `e2e/lib`/`e2e/target` into `e2e/provision/`; update
     `e2e/README.md` and AGENTS.md smoke-test guidance to point at `eve eval`.
 14. Optional follow-on: add a post-deploy `--url` leg against preview
-    deployments for the requirement-compatible subset of fixture suites.
+    deployments for the requirement-compatible subset of fixture evals.
 ## Breaking changes
@@ -903,37 +903,37 @@ changesets for each:
   typed records.
 - `EveEvalCase` becomes a data/scripted union; `input` is no longer required
   on scripted cases. Existing data cases are unaffected.
-- `model` becomes optional; suites passing dummy models can drop them.
-- `--suite` replaced by positional ids; `--all` removed.
-- Exit-code semantics gain check failures (suites without `checks` see no
+- `model` becomes optional; evals passing dummy models can drop them.
+- Eval selection uses positional ids; `--all` removed.
+- Exit-code semantics gain check failures (evals without `checks` see no
   change unless `--strict`).
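The exit-code behavior described in this plan (`--strict`, `--no-skips`, and check failures) can be summarized with a small sketch. It covers only the `0` and `1` outcomes stated here; `CaseOutcome`, `ExitFlags`, and `exitCode` are hypothetical names, and the `"passed"`/`"failed"` verdict strings are assumed alongside the documented `"skipped"`.

```typescript
// Hypothetical per-case summary, for illustration only.
interface CaseOutcome {
  // "failed" stands for any hard failure: check failure, run throw,
  // or execution error.
  readonly verdict: "passed" | "failed" | "skipped";
  // True when a score fell below its threshold.
  readonly belowThreshold: boolean;
}

interface ExitFlags {
  readonly strict: boolean; // --strict
  readonly noSkips: boolean; // --no-skips
}

function exitCode(outcomes: readonly CaseOutcome[], flags: ExitFlags): 0 | 1 {
  const anyFailure = outcomes.some(
    (o) =>
      o.verdict === "failed" ||
      // Skips affect the exit code only under --no-skips.
      (flags.noSkips && o.verdict === "skipped") ||
      // Sub-threshold scores affect it only under --strict.
      (flags.strict && o.belowThreshold),
  );
  return anyFailure ? 1 : 0;
}
```

Under this reading, an eval without `checks` and without `--strict` keeps exiting `0` even when scores are low, which matches the "no change unless `--strict`" note above.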
 ## Risks and open questions
-- **Cost/flakiness of model-backed suites in CI.** Mitigation: the
+- **Cost/flakiness of model-backed evals in CI.** Mitigation: the
   `"mockModels"` requirement makes determinism a declared, verified property
   instead of a per-file accident; `trials` + `--strict` thresholds handle the
-  suites that must use real models. The migration should explicitly decide,
+  evals that must use real models. The migration should explicitly decide,
   per ported smoke, mock vs real — today's split is undocumented.
 - **The stream-registration race** may not be fully fixable server-side in
   Phase 2; the client-level bounded retry is the documented fallback. Track it
   as its own issue rather than letting retry constants drift again.
-- **Channel suites with real credentials** (`slack-thread-context`) and
-  sandbox snapshot suites stay tag-gated (`requires-credentials`) and excluded
+- **Channel evals with real credentials** (`slack-thread-context`) and
+  sandbox snapshot evals stay tag-gated (`requires-credentials`) and excluded
   from default CI, as today.
 - **Provisioner drift on `--url` legs.** Beyond what `/info` exposes
   (identity, mode, mock-model state) and `env:<NAME>` presence checks, the
-  runner trusts the provisioner. A suite can pass locally and skip-or-fail
+  runner trusts the provisioner. An eval can pass locally and skip-or-fail
   remotely because the target was provisioned differently; requirements make
   this visible but cannot make it impossible.
-- **Open: should the runner ever own provisioning?** A suite-owned
+- **Open: should the runner ever own provisioning?** An eval-owned
   `environment` block (runner builds/starts targets, injects env, manages
   sidecar lifecycles, boots secondary agents) was considered and deliberately
-  cut from v1: it duplicated CLI concerns and let suites declare build-time
+  cut from v1: it duplicated CLI concerns and let evals declare build-time
   properties they cannot own at run time. Revisit only with concrete friction
   data from the migration — the likely v2 shape, if any, is a _project-level_
   provisioning config consumed by `eve eval` (like Playwright's `webServer`),
-  not per-suite config.
+  not per-eval config.
 - **User-facing naming**: `checks` vs `scores` vs `thresholds` vs `requires`
   needs a docs pass so the soft/hard/assumption distinction is obvious; the
   evals doc gets a "smoke-testing your agent" section once Phase 2 lands.