npm - eve - Versions diffs - 0.6.0-beta.9 → 0.7.2 - Mend

eve 0.6.0-beta.9 → 0.7.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (650)

package/dist/docs/evals-v2-plan.md DELETED

@@ -1,939 +0,0 @@
-# Evals v2: One Runner for Quality Evals and End-to-End Verification
-Status: proposal
-Owner: framework
-Scope: `packages/eve` (evals, client, CLI), `e2e/`, CI
-## Summary
-Eve's eval suites (`defineEvalSuite` + `eve eval`) and the hand-rolled e2e smoke
-surface (`e2e/tests/**` + `ExampleClient`) are two implementations of the same
-idea: drive a real agent over HTTP and judge what happened. The eval runner has
-the right bones — filesystem discovery, target resolution, stream capture,
-concurrency, scoring, reporting, artifacts — but cannot express scripted
-multi-turn interactions, HITL approvals, tool-call assertions, channel ingress,
-or hard pass/fail. The e2e surface can express all of that, but as 78
-framework-less scripts with a duplicated client, triplicated stream readers, and
-bespoke `throw new Error(...)` assertions.
-This plan extends the eval API into the single runner for both jobs and then
-deletes the hand-rolled e2e harness. We keep the declarative dataset-and-scorer
-core (it is the right shape for quality evals) and add three things it is
-missing:
-1. **An imperative interaction API** — `run(ctx)` with a typed `EvalSession`
-   driver for multi-turn control flow, HITL responses, approvals, structured
-   output, attachments, and multi-session scenarios. Available at suite level
-   (`task.run`, the shared default for dataset evals) and per case
-   (`case.run`), so one suite groups many distinct scripted behaviors the way
-   a test file groups `it` blocks.
-2. **A hard-assertion tier** — `checks`, distinct from `scores`. Check failures
-   fail the case and the process; scores remain soft, thresholded data.
-3. **A URL-shaped target model with verified requirements** — a target is
-   always just a URL. `eve eval` obtains one (boots the dev server, as today)
-   or `--url` brings your own (a built server in CI, a preview deployment).
-   Suites never declare _how_ to provision the agent — they declare
-   `requires` assumptions (mock models, dev routes, sidecar env) that the
-   runner verifies against the live target and skips or errors on, visibly.
-   Provisioning (build, start, env injection, sidecars, secondary agents)
-   stays external in v1: a thin script or CI step boots things and points
-   `eve eval --url` at them — the composition that
-   `e2e/tests/basic-runtime/evals.ts` already proves out today.
-The end state: every behavior currently proven by `e2e/tests/**` (except the
-TUI tests, see Non-goals) is an eval suite in a fixture app's `evals/`
-directory, CI runs `eve eval --strict` per fixture app per matrix mode, and
-`e2e/lib/client.ts` plus the per-file harness code are deleted. Users get the
-same machinery for their own agents: smoke-test your agent the way Eve
-smoke-tests itself.
-## Why not a `describe`/`it` runner
-We considered replacing `defineEvalSuite` with a vitest/`node:test`-shaped API.
-Rejected, for these reasons:
-1. **Evals have semantics tests don't.** Scores in `[0,1]`, per-scorer
-   thresholds, LLM judges, datasets/loaders, Braintrust experiments, per-scorer
-   averages, JSON artifacts. `describe`/`it` collapses everything to boolean
-   pass/fail; we would immediately reinvent scorers and reporters _inside_ `it`
-   blocks and lose the structured result model that powers `--json` and the
-   Braintrust reporter.
-2. **Suite identity is path-derived** (repo principle 5, enforced by
-   `scripts/guard-invariants.mjs`). One suite per `evals/<path>.eval.ts` file
-   maps cleanly onto the filesystem-first philosophy; nested describe blocks
-   fight it.
-3. **The dataset model is the right default.** "Load 200 cases from YAML, fan
-   out with bounded concurrency, score with a judge" is the 80% case for users.
-   A block-based runner makes that the awkward case (loops generating `it`s).
-4. **What e2e actually needs is not blocks** — it is control flow _inside a
-   case_, a typed session driver, and assertion helpers over the event stream.
-   All three fit inside the existing suite shape: `run` at the task and case
-   level provides the control flow, and scripted cases give the same grouping
-   a `describe` file gives `it` blocks (see "Scripted cases" below).
-5. **Wrapping vitest violates principle 3** (wrap third-party deps; don't
-   expose them), and writing a bespoke block runner re-solves problems the
-   eval runner already solves (discovery, concurrency, timeouts, reporting).
-The one philosophical gap — "low scores are data, not failures" vs. "smoke
-tests must fail the build" — is resolved by making the checks/scores split
-first-class rather than by switching paradigms.
-## Goals
-- Scripted multi-turn evals with branching on prior responses.
-- HITL: assert an agent parks on `input.requested`, respond
-  (approve/deny/select/freeform), assert resumption and approval persistence.
-- First-class assertions on tool calls: name, input, output, error state,
-  order, and count — plus subagent calls, messages, structured output, and
-  arbitrary event predicates.
-- Hard pass/fail semantics suitable for CI gating, coexisting with soft scores.
-- One target model: any URL — the runner-booted dev server, a locally built
-  `eve start` process, or a deployed instance — with suite-declared
-  requirements verified against the live target instead of suite-owned
-  provisioning.
-- Drive surfaces beyond the session route: channels (webhook ingress) and
-  schedules (dev dispatch), with stream consumption for sessions the suite did
-  not create.
-- One HTTP client: `eve/client` absorbs everything `e2e/lib/client.ts` does;
-  `ExampleClient` is deleted.
-- Replace `e2e/tests/**` (minus TUI) with eval suites; CI becomes
-  `eve eval --strict` runs.
-## Non-goals
-- **TUI tests** (`e2e/tests/tui-client/`) stay as scripts. They test the TUI
-  client and renderer, not agent behavior; an agent-eval runner is the wrong
-  tool. They already use the public `Client` and the built TUI test harness.
-- **Unit/integration/scenario tiers** are unchanged. Evals replace the _e2e_
-  tier only.
-- **Braintrust/autoevals integration** is unchanged in shape (still wrapped
-  per principle 3).
-- **No authored suite `id`/`name`.** Identity stays path-derived; the
-  invariant guard keeps enforcing it.
-## Current state (abridged; see code for detail)
-- Suite API: `defineEvalSuite({ cases | load, task, scores, model, thresholds,
-  reporters, ... })` in `packages/eve/src/evals/define-eval-suite.ts` and
-  `types.ts`. `task` is `prompt(case)` or `messages(case) => string[]` — a
-  static list, no branching, no HITL (`runner/execute-case.ts:38` only ever
-  sends `{ message }`).
-- Facts: `EveEvalDerivedFacts` exposes tool call **names only**
-  (`runner/derive-run-facts.ts`); inputs/outputs exist in the captured events
-  but have no typed surface.
-- Exit code: execution errors only; sub-threshold scores never fail the
-  process (`evals/cli/eval.ts`). A session parked on an approval ends
-  `"waiting"`, which `Run.didNotFail()` counts as success.
-- Target: `eve eval` boots a dev server in-process or hits `--url`. No built
-  (`eve start`) target, no env injection, no mock-model switch (only the
-  server-side `EVE_MOCK_AUTHORED_MODELS` env var), no sidecars, no secondary
-  targets.
-- Dead surface: case `tags` are documented but never read; `--all` is parsed
-  but unused.
-- e2e: `e2e/lib/run.ts` + `e2e/target/local-environment.ts` own
-  build/spawn/health/teardown; `e2e/lib/client.ts` (`ExampleClient`)
-  duplicates `ClientSession` and adds pending-input tracking, turn-failure
-  errors, and two retry loops papering over the POST→GET stream-registration
-  race; assertions are ad-hoc per file.
----
-## The v2 API
-### Suite shape
-```ts
-import { defineEvalSuite } from "eve/evals";
-import { Checks } from "eve/evals/checks";
-import { Run, Text } from "eve/evals/scores";
-export default defineEvalSuite({
-  description: "HITL approval flows: park, approve, deny, persist.",
-  // Suite-level checks apply to every case.
-  checks: [Checks.didNotFail()],
-  scores: [Run.didNotFail()],
-  cases: [
-    {
-      id: "approve-then-persist",
-      async run({ session }) {
-        await session.send("run `pwd`");
-        const [request] = session.expectInputRequests();
-        await session.respond({ requestId: request.requestId, optionId: "approve" });
-        const turn = await session.send("read the `weather-codes.md` file");
-        return turn.message;
-      },
-      checks: [Checks.toolCalled("bash", { input: { command: /pwd/ } }), Checks.completed()],
-    },
-    {
-      id: "deny-regates",
-      async run({ session }) {
-        await session.send("run `pwd`");
-        await session.respondAll("deny");
-        await session.send("run `pwd` again");
-        session.expectInputRequests({ toolName: "bash" }); // re-gated after denial
-      },
-      checks: [Checks.waiting()],
-    },
-  ],
-});
-```
-Changes to `EveEvalSuiteInput` (`packages/eve/src/evals/types.ts`):
-| Field            | Change                                                                                                                                                                                                                                                                                                                                                           |
-| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `task.run`       | New third task variant, mutually exclusive with `prompt`/`messages`. `prompt` and `messages` become sugar implemented on top of `run`.                                                                                                                                                                                                                           |
-| scripted cases   | `EveEvalCase` becomes a union: a **data case** (`input`/`expected`, run through the suite `task` — today's shape) or a **scripted case** (`run`, no `input` required). `case.run` overrides `suite.task`. See "Scripted cases" below.                                                                                                                            |
-| `checks`         | New optional array of `EveEvalCheck`, at suite level and per case (case-level appends to suite-level). Hard assertions; any failure marks the case failed and flips the CLI exit code.                                                                                                                                                                           |
-| `requires`       | New optional requirement list (suite- and case-level): `"mockModels"`, `"devRoutes"`, `"env:<NAME>"`. Verified against the live target; cases the target cannot support are skipped with a reported verdict, so one suite runs against the dev server, a local build, and a deployed URL. See "Targets". Suites own no provisioning config — no kind/env/setup.  |
-| `model`          | **Now optional.** Required only when a model-backed scorer is present; validation moves from "always required" to "required if any scorer declares it needs a judge" (built-in autoevals scorers carry a marker; custom scorers that read `args.model` get `undefined` unless the suite provides one). Kills the `"eve-bootstrap-model"` dummy-model workaround. |
-| `trials`         | New optional `number` (default 1). Runs each case N times; per-trial results are reported, a case passes only if every trial's checks pass, and scores aggregate as mean. For nondeterministic model-backed suites.                                                                                                                                              |
-| `tags` filtering | Case and suite `tags` become functional (CLI `--tag`).                                                                                                                                                                                                                                                                                                           |
-Everything else (`cases`/`load`, `scores`, `thresholds`, `reporters`,
-`maxConcurrency`, `timeoutMs`, `metadata`) is unchanged. Existing suites keep
-working except where noted in Breaking changes.
-### `task.run(ctx)` and the run context
-```ts
-export type EveEvalTask =
-  | { run(ctx: EveEvalRunContext): Promise<unknown | void>; prompt?: never; messages?: never; parseOutput?: ... }
-  | { messages(testCase: EveEvalCase): string[]; ... }   // unchanged, sugar over run
-  | { prompt(testCase: EveEvalCase): string; ... }       // unchanged, sugar over run
-  | { parseOutput?(result: EveEvalTaskResult): unknown }; // unchanged default
-export interface EveEvalRunContext {
-  /** The case under execution. */
-  readonly case: EveEvalCase;
-  /** Primary session, created lazily on first send. Fresh per case. */
-  readonly session: EveEvalSession;
-  /** Create an additional independent session against the same target. */
-  newSession(): EveEvalSession;
-  /** Handle to the agent server under test (channels, schedules, raw routes). */
-  readonly target: EveEvalTargetHandle;
-  /** Case timeout signal (from suite/CLI `timeoutMs`). */
-  readonly signal: AbortSignal;
-  /** Structured logger; lines land in the case artifact, and on stdout with `--verbose`. */
-  readonly log: (message: string) => void;
-}
-```
-Semantics:
-- The runner still owns session lifecycle, full-stream capture, derived facts,
-  timeout, and post-run checks/scores. `run` only drives the interaction.
-- `run`'s return value becomes `result.output` (then `parseOutput` applies as
-  today; default remains `finalMessage` when `run` returns `undefined`).
-- A `throw` inside `run` marks the case **failed** with the error recorded —
-  imperative assertions inside `run` are first-class, same as a failing check.
-- Events from every turn of every session created in the case are accumulated
-  into `result.events` (tagged with session id) so checks and scorers see the
-  whole interaction. `derived` facts are computed over the primary session by
-  default, with per-session access on the result.
-### Scripted cases: making the suite a suite
-A suite-level `task` alone means "one interaction script × many data rows" —
-the dataset shape. The e2e surface is the inverse: many distinct scripts, one
-execution each (`tool-approval`, `tool-denial`, and `ask-question-flow` are
-three different `run` functions, not three rows). Forcing each script into its
-own suite file with a single dummy-input case would make "suite" a misnomer
-and multiply provisioning cost (a provisioned target per file instead of per
-behavior family).
-So `EveEvalCase` becomes a union:
-```ts
-export type EveEvalCase = EveEvalDataCase | EveEvalScriptedCase;
-/** Today's shape: data routed through the suite-level task. */
-export interface EveEvalDataCase {
-  readonly id: string;
-  readonly input: string | Record<string, unknown>;
-  readonly expected?: unknown;
-  readonly checks?: readonly EveEvalCheck[]; // appended to suite-level
-  readonly scores?: readonly EveEvalScorer[]; // appended to suite-level
-  readonly tags?: readonly string[];
-  readonly metadata?: Readonly<Record<string, unknown>>;
-}
-/** A self-contained interaction script. No input required. */
-export interface EveEvalScriptedCase {
-  readonly id: string;
-  run(ctx: EveEvalRunContext): Promise<unknown | void>;
-  readonly expected?: unknown;
-  readonly checks?: readonly EveEvalCheck[];
-  readonly scores?: readonly EveEvalScorer[];
-  readonly tags?: readonly string[];
-  readonly metadata?: Readonly<Record<string, unknown>>;
-}
-```
-Resolution rules:
-- `case.run` wins over `suite.task`; a data case without a suite `task` falls
-  back to today's default (send `input` verbatim).
-- Case-level `checks`/`scores` **append** to suite-level ones; suite-level
-  expresses invariants ("never fails"), case-level expresses the specific
-  behavior under test.
-- A suite may freely mix data cases and scripted cases, though in practice
-  quality suites are all-data and smoke suites are all-scripted.
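The resolution rules above can be sketched as one small function. This is only an illustration under simplifying assumptions: `SketchCase`, `SketchSuite`, `ResolvedPlan`, and `resolvePlan` are hypothetical names, not Eve exports, and checks are reduced to plain strings.

```typescript
// Simplified stand-ins for the plan's types; not Eve's actual definitions.
type RunFn = (input: unknown) => string;

interface SketchCase {
  id: string;
  input?: string;
  run?: RunFn; // present on scripted cases only
  checks?: string[]; // check names only, for illustration
}

interface SketchSuite {
  task?: RunFn; // suite-level shared default
  checks?: string[];
  cases: SketchCase[];
}

interface ResolvedPlan {
  source: "case.run" | "suite.task" | "default";
  run: RunFn;
  checks: string[];
}

function resolvePlan(suite: SketchSuite, testCase: SketchCase): ResolvedPlan {
  // Suite-level checks come first; case-level checks are appended.
  const checks = [...(suite.checks ?? []), ...(testCase.checks ?? [])];
  if (testCase.run) return { source: "case.run", run: testCase.run, checks };
  if (suite.task) return { source: "suite.task", run: suite.task, checks };
  // A data case without a suite task falls back to sending `input` verbatim.
  return { source: "default", run: (input) => String(input), checks };
}
```

A scripted case with its own `run` resolves to `"case.run"` even when the suite defines a `task`, and it is evaluated against the suite-level checks followed by its own.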
-The conceptual model this lands on: **the suite file is the `describe`, cases
-are the `it`s**. That gives grouping, one shared target, and shared baseline
-checks and requirements, without a block API and with path-derived identity
-intact. The suite-level `task` keeps its role as the shared default for
-dataset evals; it is no longer the only way to define behavior.
-Execution semantics are uniform across both case kinds:
-- **Concurrency**: scripted cases join the same bounded pool as data cases
-  (each owns its sessions, so they parallelize safely). Suites whose cases
-  mutate shared target state (e.g. `defineState` persistence tests) set
-  `maxConcurrency: 1`.
-- **Timeout**: suite/CLI `timeoutMs` applies per case per trial; the signal is
-  `ctx.signal` inside `run` and aborts in-flight sends.
-- **Trials**: apply identically — a scripted case under `trials: 3` runs its
-  `run` three times against three fresh primary sessions.
-- **Loaders**: `load()` may return scripted cases too (the union is the return
-  type), though in practice loaded datasets are data cases.
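The trial semantics described in this plan (a case passes only if every trial's checks pass, and scores aggregate as a mean) can be sketched as follows. `TrialOutcome`, `AggregatedCase`, and `aggregateTrials` are hypothetical names used for illustration; the real runner's result types are richer.

```typescript
interface TrialOutcome {
  checksPassed: boolean;
  scores: Record<string, number>; // scorer name -> score in [0, 1]
}

interface AggregatedCase {
  passed: boolean;
  meanScores: Record<string, number>;
}

function aggregateTrials(trials: readonly TrialOutcome[]): AggregatedCase {
  if (trials.length === 0) throw new Error("at least one trial is required");

  // Hard tier: a single failing trial fails the whole case.
  const passed = trials.every((trial) => trial.checksPassed);

  // Soft tier: accumulate per-scorer sums, then divide by the sample count.
  const totals = new Map<string, { sum: number; count: number }>();
  for (const trial of trials) {
    for (const [name, score] of Object.entries(trial.scores)) {
      const entry = totals.get(name) ?? { sum: 0, count: 0 };
      totals.set(name, { sum: entry.sum + score, count: entry.count + 1 });
    }
  }

  const meanScores: Record<string, number> = {};
  totals.forEach(({ sum, count }, name) => {
    meanScores[name] = sum / count;
  });
  return { passed, meanScores };
}
```

The sketch averages over the trials in which a scorer actually reported a value; whether a missing score should instead count as zero is a design choice this section does not settle.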
-### `EvalSession`: the interaction driver
-A thin wrapper over the public `ClientSession` (`packages/eve/src/client/`),
-not a parallel implementation. Everything here is also useful to end users, so
-the driver lives in `eve/evals` but delegates transport entirely to
-`eve/client`.
-```ts
-export interface EveEvalSession {
-  /** Send one turn. Accepts the same SendTurnInput as ClientSession.send. */
-  send(input: SendTurnInput): Promise<EveEvalTurn>;
-  /** Sugar: text + file attachment inlined as a data: URL (multimodal turns). */
-  sendFile(text: string, filePath: string, mediaType?: string): Promise<EveEvalTurn>;
-  /**
-   * Input requests left pending by the last turn (from `input.requested`).
-   * Empty unless the last turn parked.
-   */
-  readonly pendingInputRequests: readonly InputRequest[];
-  /**
-   * Assert the last turn parked on HITL input. Throws (failing the case) when
-   * nothing is pending or the filter matches nothing. Returns the requests.
-   */
-  expectInputRequests(filter?: {
-    toolName?: string;
-    display?: InputRequest["display"];
-  }): readonly InputRequest[];
-  /** Resolve specific pending requests and run the resumed turn. */
-  respond(...responses: InputResponse[]): Promise<EveEvalTurn>;
-  /** Resolve every pending request with one optionId ("approve" / "deny" / ...). */
-  respondAll(optionId: string): Promise<EveEvalTurn>;
-  /** All events observed on this session so far. */
-  readonly events: readonly HandleMessageStreamEvent[];
-  readonly sessionId: string | undefined;
-  /** Serializable cursor (continuationToken / streamIndex), as ClientSession.state. */
-  readonly state: SessionState;
-}
-export interface EveEvalTurn {
-  readonly status: "completed" | "waiting" | "failed";
-  readonly message: string | undefined;
-  /** Structured output when the turn requested an outputSchema. */
-  readonly data: unknown;
-  readonly events: readonly HandleMessageStreamEvent[];
-  /** Input requests raised by this turn (parked HITL). */
-  readonly inputRequests: readonly InputRequest[];
-  /** Typed tool calls completed during this turn (see facts v2). */
-  readonly toolCalls: readonly EveEvalToolCall[];
-  /** Throw EveEvalTurnFailedError unless status is "completed" or "waiting". */
-  expectOk(): this;
-}
-```
-Notes:
-- `send` does **not** throw on `turn.failed`/`session.failed` by default —
-  failure handling belongs to checks (`Checks.completed()`) or explicit
-  `turn.expectOk()`. This keeps negative-path suites (e.g. today's
-  `remote-agent-start-failure.ts`, `tool-throw-recover.ts`) natural to write.
-- HITL coverage: `needsApproval` approvals (`approve`/`deny` option ids),
-  framework `ask_question` selects (`optionId`), freeform answers
-  (`InputResponse` with text), and tool/connection auth parks — all are just
-  `expectInputRequests()` + `respond(...)`. Subagent approval proxying needs
-  nothing extra: requests surface on the parent stream.
-- The static `messages` task compiles to
-  `for (const m of messages(case)) await ctx.session.send(m)` — identical
-  behavior to today.
-### Checks: hard assertions
-```ts
-export interface EveEvalCheckResult {
-  readonly name: string;
-  readonly passed: boolean;
-  /** Human-readable failure detail, shown in console + artifacts. */
-  readonly message?: string;
-  readonly metadata?: Readonly<Record<string, unknown>>;
-}
-export interface EveEvalCheckArgs {
-  readonly case: EveEvalCase;
-  readonly result: EveEvalTaskResult; // same data as scorers, no judge model
-  /** Target handle, so checks can reference the live target (URL, info). */
-  readonly target: EveEvalTargetHandle;
-}
-export type EveEvalCheck = (
-  args: EveEvalCheckArgs,
-) => EveEvalCheckResult | Promise<EveEvalCheckResult>;
-```
-Matcher options on built-in checks accept a literal, RegExp, predicate, or a
-resolver `(args: EveEvalCheckArgs) => value` — the last form is what lets a
-check compare against runner-assigned values like a secondary target's URL.
-Built-ins, exported from `eve/evals/checks`:
-| Check                                                         | Asserts                                                                                                                                                                                                                            |
-| ------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `Checks.completed()`                                          | Final status is `"completed"` (not `"waiting"`, not `"failed"`)                                                                                                                                                                    |
-| `Checks.waiting()`                                            | Final status is `"waiting"` (for park-shaped suites)                                                                                                                                                                               |
-| `Checks.didNotFail()`                                         | Status is not `"failed"` and no `turn.failed`/`step.failed` events                                                                                                                                                                 |
-| `Checks.messageIncludes(token)`                               | Joined `message.completed` text contains `token` (string or RegExp)                                                                                                                                                                |
-| `Checks.outputEquals(value)` / `Checks.outputMatches(schema)` | Deep-equal / Standard Schema validation of `result.output`                                                                                                                                                                         |
-| `Checks.toolCalled(name, opts?)`                              | A tool call with `name` happened; `opts.input` partial-deep-matches the call input (values: literal, RegExp, or predicate); `opts.output` matches the result; `opts.times` constrains count; `opts.isError` constrains error state |
-| `Checks.toolNotCalled(name)`                                  | No call to `name`                                                                                                                                                                                                                  |
-| `Checks.toolOrder([...names])`                                | Names appear in order (subsequence match)                                                                                                                                                                                          |
-| `Checks.noFailedActions()`                                    | No `action.result` with `isError: true`                                                                                                                                                                                            |
-| `Checks.subagentCalled(name, opts?)`                          | Subagent delegation occurred; `opts.remoteUrl` matches `subagent.called` remote metadata; `opts.output` matches the `subagent.completed` output                                                                                    |
-| `Checks.event(predicate, label)`                              | Escape hatch: any predicate over the typed event stream                                                                                                                                                                            |
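Two of the matching behaviors in the table, partial deep matching for `opts.input` and the subsequence match behind `Checks.toolOrder`, are small enough to sketch. This is an assumption about how they could work, not the shipped implementation: `Matcher`, `matches`, and `isSubsequence` are hypothetical names, and arrays and the resolver form are left out.

```typescript
type Matcher =
  | string
  | number
  | boolean
  | null
  | RegExp
  | ((value: unknown) => boolean)
  | { [key: string]: Matcher };

// Partial deep match: every key named in `expected` must match the same key
// in `actual`; keys that only exist in `actual` are ignored.
function matches(actual: unknown, expected: Matcher): boolean {
  if (expected instanceof RegExp) {
    return typeof actual === "string" && expected.test(actual);
  }
  if (typeof expected === "function") return expected(actual);
  if (typeof expected === "object" && expected !== null) {
    if (typeof actual !== "object" || actual === null) return false;
    const record = actual as Record<string, unknown>;
    return Object.entries(expected).every(([key, sub]) => matches(record[key], sub));
  }
  return actual === expected; // literal comparison
}

// Subsequence match: `expected` names must appear in `actual` in the same
// relative order, with any number of other calls in between.
function isSubsequence(expected: readonly string[], actual: readonly string[]): boolean {
  let next = 0;
  for (const name of actual) {
    if (next < expected.length && name === expected[next]) next += 1;
  }
  return next === expected.length;
}
```

With these, a call whose input is `{ command: "pwd && ls", cwd: "/" }` matches `{ command: /pwd/ }`, and `["read", "write"]` is in order within `["read", "bash", "write"]`.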
-Pass/fail policy:
-- Any check returning `passed: false`, any throw inside `run`, and any
-  execution error (timeout, transport) marks the case **failed**.
-- Failed cases always produce a non-zero `eve eval` exit code.
-- Scores keep today's semantics: thresholded, reported, never gate the exit
-  code — **unless** `--strict` is passed, which additionally fails the process
-  when any case scores below threshold. CI for fixture smoke suites runs
-  `--strict`; users running exploratory quality evals don't.
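The policy above can be sketched as two pure functions. The verdict labels mirror the ones this plan uses in its result model; `CaseOutcome`, `verdictFor`, and `exitCode` are hypothetical names, the concrete exit value `1` is an assumption (the plan only says "non-zero"), and skipped cases are left out because this policy does not say how they affect the exit code.

```typescript
type Verdict = "passed" | "failed" | "scored";

interface CaseOutcome {
  checkResults: readonly boolean[]; // `passed` flag of each check
  runThrew: boolean; // a throw inside run()
  executionError: boolean; // timeout, transport
  scores: readonly { value: number; threshold?: number }[];
}

function verdictFor(outcome: CaseOutcome): Verdict {
  const anyCheckFailed = outcome.checkResults.some((passed) => !passed);
  if (anyCheckFailed || outcome.runThrew || outcome.executionError) return "failed";
  const belowThreshold = outcome.scores.some(
    (score) => score.threshold !== undefined && score.value < score.threshold,
  );
  return belowThreshold ? "scored" : "passed";
}

function exitCode(verdicts: readonly Verdict[], strict: boolean): number {
  // Hard failures always gate the process.
  if (verdicts.some((verdict) => verdict === "failed")) return 1;
  // Sub-threshold scores gate only under --strict.
  if (strict && verdicts.some((verdict) => verdict === "scored")) return 1;
  return 0;
}
```

Under this sketch, a run whose only problem is a score below its threshold exits `0` by default and non-zero under `--strict`.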
-### Derived facts v2 (breaking)
-`EveEvalDerivedFacts` gains typed records; counts stay for reporters.
-```ts
-export interface EveEvalToolCall {
-  readonly name: string;
-  readonly input: JsonObject; // from actions.requested
-  readonly output: unknown; // from the matching action.result
-  readonly isError: boolean;
-  readonly turnIndex: number;
-  readonly sessionId: string;
-}
-export interface EveEvalDerivedFacts {
-  readonly toolCalls: readonly EveEvalToolCall[]; // was readonly string[]
-  readonly toolCallCount: number;
-  readonly subagentCalls: readonly EveEvalSubagentCall[]; // { name, remoteUrl?, ... }
-  readonly subagentCallCount: number;
-  readonly inputRequests: readonly InputRequest[]; // NEW: all HITL requests raised
-  readonly parked: boolean; // NEW: ended waiting on input
-  readonly messageCount: number;
-  readonly reasoningBlockCount: number;
-  readonly failureCode?: string;
-}
-```
-`Run.usedTool(name)` keeps working; it gains an optional second argument with
-the same matcher options as `Checks.toolCalled` so scorers can grade tool-input
-quality fractionally where checks assert it absolutely.
-### Result model, reporters, and artifacts
-Checks and trials need a home in the result types
-(`packages/eve/src/evals/types.ts`):
-```ts
-export interface EveEvalCaseResult {
-  readonly case: EveEvalCase;
-  readonly result: EveEvalTaskResult; // aggregated over sessions
-  readonly checks: readonly EveEvalCheckResult[]; // NEW
-  readonly scores: readonly EveEvalScorerResult[];
-  /**
-   * NEW: per-case verdict, computed by the runner:
-   * "passed"  — no error, all checks passed
-   * "failed"  — a check failed, run() threw, or execution errored
-   * "scored"  — passed checks but at least one score below threshold
-   * "skipped" — an unmet `requires` entry
-   */
-  readonly verdict: "passed" | "failed" | "scored" | "skipped";
-  readonly trials?: readonly EveEvalTrialResult[]; // present when trials > 1
-  readonly error?: string;
-  readonly skipReason?: string; // unmet requirement, when skipped
-}
-export interface EveEvalSuiteResult {
-  readonly suite: string;
-  readonly target: EveEvalTarget;
-  readonly cases: readonly EveEvalCaseResult[];
-  readonly startedAt: string;
-  readonly completedAt: string;
-  readonly passed: number;
-  readonly failed: number; // NEW: check failures + run throws + exec errors
-  readonly scored: number; // NEW: below-threshold-only cases
-  readonly skipped: number; // NEW: requirement-skipped cases
-  readonly errored: number; // retained: the execution-error subset of failed
-}
-```
-Downstream effects:
-- **Console reporter**: today's `✓ / ○ / ✗` icons map onto
-  `passed / scored / failed`, plus `-` for `skipped` (with the unmet
-  requirement inline); failed checks print their `message` indented under the
-  case line (replacing the bespoke error prose e2e tests hand-craft today).
-  Summary adds check and skip totals.
-- **`EvalReporter` interface**: unchanged shape (`onSuiteStart` /
-  `onCaseComplete` / `onSuiteComplete`) — the richer `EveEvalCaseResult`
-  flows through existing hooks, so custom reporters keep compiling.
-- **Braintrust reporter**: checks log as binary scores under a `check:` name
-  prefix (e.g. `check:toolCalled(bash)`) so experiments diff check regressions
-  the same way they diff score regressions; `verdict` and failed-check
-  messages land in span metadata.
-- **Artifacts**: `cases/<id>.json` gains `checks` and `verdict`;
-  `summary.json` gains the new counters; per-trial event streams write as
-  `cases/<id>.trial-<n>.events.ndjson`. Multi-session cases write one events
-  file per session, keyed by session id.
-- **Reporter throughput**: `onCaseComplete` is currently awaited inline inside
-  the case pool, so a slow reporter throttles execution; v2 queues reporter
-  callbacks off the hot path (ordering preserved per suite).
-### Targets: a target is a URL
-There is no suite-owned provisioning config. A target is always just a base
-URL, and the suite's job is to interact and assert — never to describe how the
-agent gets built or started. This is deliberate: properties like the agent's
-model or the mock-model adapter are **build/start-time properties of the
-server process**. A suite cannot control them at run time, so an API that lets
-a suite "declare" them is a footgun — it works only when the runner happens to
-be the thing booting the server, and silently means nothing otherwise.
-There are exactly two ways to get a target:
-| Invocation             | Target                                                                                                                                                          | Use                                     |
-| ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------- |
-| `eve eval`             | Runner boots the local dev server in-process (today's behavior), with the standard `.env` cascade; `--mock-models` boots it with the deterministic mock adapter | Local iteration                         |
-| `eve eval --url <url>` | Bring your own: a locally built `eve start` process, a preview deployment, production                                                                           | CI smoke legs, post-deploy verification |
-"Build mode" is not a runner concept — a built server is just a URL you
-obtained by running `eve build && eve start` yourself. Provisioning (build,
-start, env injection, sidecar fixtures, secondary agents for multi-agent
-topologies) lives in a script or CI step that boots things and then points
-`eve eval --url` at them. This composition already exists and works:
-`e2e/tests/basic-runtime/evals.ts` boots a fixture with
-`EVE_MOCK_AUTHORED_MODELS=1` and shells out to `eve eval --url` today. v1
-blesses that pattern instead of absorbing it.
-`EveEvalTargetHandle` (available on the run context) is how cases reach
-non-session surfaces:
-```ts
-export interface EveEvalTargetHandle {
-  readonly baseUrl: string;
-  /** Discovered from the live target (/eve/v1/info), never declared. */
-  readonly capabilities: { readonly devRoutes: boolean; readonly mockModels: boolean };
-  /** Raw fetch against the target — webhook/channel ingress, health, info. */
-  fetch(path: string, init?: RequestInit): Promise<Response>;
-  /** Typed agent info (GET /eve/v1/info). */
-  info(): Promise<AgentInfoResult>;
-  /**
-   * Dispatch a schedule via the dev-only route. Guarded by the devRoutes
-   * capability (see Requirements below).
-   */
-  dispatchSchedule(scheduleId: string): Promise<{ sessionIds: readonly string[] }>;
-  /**
-   * Attach to a session this case did not create (channel- or
-   * schedule-initiated). Consumes the durable stream from startIndex,
-   * resolving at the turn boundary.
-   */
-  attachSession(sessionId: string, opts?: { startIndex?: number }): EveEvalSession;
-}
-```
-This covers channel suites (POST a signed webhook via `target.fetch`, assert
-on an externally provisioned fake provider, attach to the created session) and
-schedule suites (`dispatchSchedule` + `attachSession`).
-The runner always performs the readiness/identity handshake regardless of who
-provisioned the target: `/eve/v1/health` polling, then `/eve/v1/info`
-verification that this is the expected agent (the same stale-server guard the
-e2e harness uses today). `/info` is extended to report mock-model state so
-capabilities are discovered, not assumed.
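The handshake described above can be sketched roughly as follows. This is an illustration under stated assumptions: the `agentId` field in the `/eve/v1/info` payload, the `handshake` function, and the injected `get`/`sleep` parameters are all hypothetical, and the real payload shape may differ.

```typescript
// Minimal response shape so the sketch does not depend on a specific fetch.
interface HttpResult {
  readonly ok: boolean;
  json(): Promise<unknown>;
}
type HttpGet = (url: string) => Promise<HttpResult>;

interface HandshakeOptions {
  readonly baseUrl: string;
  readonly expectedAgentId: string; // assumed identity field, see lead-in
  readonly get: HttpGet;
  readonly sleep: (ms: number) => Promise<void>;
  readonly maxAttempts?: number;
  readonly intervalMs?: number;
}

async function handshake(options: HandshakeOptions): Promise<void> {
  const { baseUrl, expectedAgentId, get, sleep } = options;
  const maxAttempts = options.maxAttempts ?? 30;
  const intervalMs = options.intervalMs ?? 500;

  // Step 1: poll the health route until it answers or attempts run out.
  let ready = false;
  for (let attempt = 1; attempt <= maxAttempts && !ready; attempt += 1) {
    try {
      ready = (await get(`${baseUrl}/eve/v1/health`)).ok;
    } catch {
      ready = false; // e.g. connection refused while the server is booting
    }
    if (!ready && attempt < maxAttempts) await sleep(intervalMs);
  }
  if (!ready) throw new Error(`Target ${baseUrl} never became healthy`);

  // Step 2: stale-server guard, confirming this is the agent we expect.
  const response = await get(`${baseUrl}/eve/v1/info`);
  const info = (await response.json()) as { agentId?: unknown } | null;
  if (info?.agentId !== expectedAgentId) {
    throw new Error(
      `Expected agent "${expectedAgentId}" but target reports "${String(info?.agentId)}"`,
    );
  }
}
```

Injecting `get` and `sleep` keeps the sketch testable without a live server; a real runner would presumably pass its own HTTP client and timer.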
-### Requirements: `requires`, verified against the live target
-Suites cannot control the target, but their assertions still _assume_ things
-about it — determinism via mock models, dev-only routes, a sidecar URL in the
-environment. v1 gives those assumptions exactly one surface:
-```ts
-// Per suite (applies to all cases) and per case (additive):
-readonly requires?: readonly EveEvalRequirement[];
-type EveEvalRequirement =
-  | "mockModels"      // target runs the deterministic mock adapter (via /info)
-  | "devRoutes"       // dev-only routes are mounted (via /info)
-  | `env:${string}`;  // process env var is set in the eval process (sidecar URLs)
-```
-Rules:
-1. **Requirements are verified, never fulfilled.** The runner checks
-   `mockModels`/`devRoutes` against the discovered capabilities and
-   `env:<NAME>` against its own process environment. It never tries to make a
-   requirement true.
-2. **Unmet requirement → skip, visibly.** The case (or every case, for
-   suite-level `requires`) gets `verdict: "skipped"` with the unmet
-   requirement as the reason — reported in console, `--json`, and artifacts;
-   never silently dropped, never failed. `--no-skips` turns skips into
-   failures for legs that must prove full coverage. Skips don't otherwise
-   affect the exit code.
-3. **One convenience, because the runner boots the dev server:** when plain
-   `eve eval` runs suites requiring `"mockModels"`, the caller either passes
-   `--mock-models` or sees those suites skip with a message naming the flag.
-   No auto-magic in v1: explicit and predictable beats clever.
-4. **Runtime guards back the declarations.** `target.dispatchSchedule` throws
-   a requirement error when called by a case that didn't declare
-   `"devRoutes"` — so undeclared dependencies surface as named failures on
-   the local leg, not as mystery flakes on the remote leg.
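Rules 1 and 2 above amount to a pure check with no side effects. The sketch below illustrates that idea; `findUnmetRequirements`, `resolveVerdict`, and the `Verdict` shape are hypothetical names, and treating an environment variable as "set" when it is not `undefined` is an assumption.

```typescript
type EveEvalRequirement = "mockModels" | "devRoutes" | `env:${string}`;

interface Capabilities {
  readonly devRoutes: boolean;
  readonly mockModels: boolean;
}

/** Returns the unmet requirements; an empty array means the case may run. */
function findUnmetRequirements(
  requires: readonly EveEvalRequirement[],
  capabilities: Capabilities,
  env: Readonly<Record<string, string | undefined>>,
): EveEvalRequirement[] {
  return requires.filter((requirement) => {
    if (requirement === "mockModels") return !capabilities.mockModels;
    if (requirement === "devRoutes") return !capabilities.devRoutes;
    // Remaining variant is `env:<NAME>`: checked against the eval process env.
    const name = requirement.slice("env:".length);
    return env[name] === undefined;
  });
}

type Verdict = { verdict: "run" } | { verdict: "skipped"; reason: string };

/** Verified, never fulfilled: unmet requirements only ever produce a skip. */
function resolveVerdict(unmet: readonly EveEvalRequirement[]): Verdict {
  if (unmet.length === 0) return { verdict: "run" };
  return { verdict: "skipped", reason: `unmet requirement(s): ${unmet.join(", ")}` };
}
```

Because the function only reads capabilities and environment, the same check gives the same answer on the local leg and on a `--url` leg, which is the property the rules depend on.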
-The deferred idea — a suite-owned `environment` block where the runner builds
-and starts targets, injects env, and manages sidecar lifecycles — is recorded
-under Open questions. It is not in v1: it duplicated CLI concerns, and it let
-suites express build-time properties they cannot actually own.
-### The external provisioner pattern
-What v1 deliberately does not own — and where it lives instead:
-| Concern                                                    | v1 home                                                                                       |
-| ---------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
-| Building/starting the agent (`eve build` + `eve start`)    | CI step or a thin script (the existing `e2e/lib/server.ts` logic, kept)                       |
-| Agent env (mock models, fake-provider URLs, feature flags) | The provisioner's environment for the server process                                          |
-| Sidecars (fake Telegram/Discord APIs, MCP stubs, probes)   | Started by the provisioner; their URLs exported as env to both the agent and the eval process |
-| Secondary agents (remote-subagent topologies)              | Provisioner boots both servers and wires `EVE_*_HOST` env                                     |
-| Data-dir isolation (`WORKFLOW_LOCAL_DATA_DIR`)             | Provisioner-owned temp dirs                                                                   |
-A provisioned smoke leg is two steps:
-```sh
-node e2e/provision/<group>.ts &   # build, sidecars, env, eve start, health-poll
-pnpm --filter <fixture-app> exec eve eval --strict --url "http://127.0.0.1:$PORT"
-```
-Suites consume provisioner outputs through env (declared via `env:<NAME>`
-requirements): a channel suite reads `TELEGRAM_PROBE_URL` to query what the
-fake Bot API captured; a subagent suite reads `EVE_WEATHER_AGENT_HOST` to
-assert on `subagent.called` remote URLs. The probe/stub helpers
-(`startHttpProbe`, `startMcpStub`, generalized from `e2e/lib/`) ship in
-`eve/evals/environment` so that provisioners, including users' own, can use
-them. The probe exposes an HTTP inspection endpoint so suites can assert on
-captured requests across the process boundary.
-### CLI v2
-```
-eve eval [suiteId...]
-  --url <url>              run against an existing target (built local server,
-                           preview deployment); without it, boots the dev server
-  --mock-models            boot the dev server with the deterministic mock
-                           adapter (invalid with --url; the target's mock state
-                           is discovered, not set)
-  --tag <tag...>           run only cases (or suites) carrying a tag
-  --case <id...>           run only specific case ids
-  --strict                 sub-threshold scores also fail the exit code
-  --no-skips               requirement skips fail instead of skipping
-  --trials <n>             override suite trials
-  --timeout <ms>           per-case timeout (existing)
-  --max-concurrency <n>    (existing)
-  --json                   structured stdout (existing)
-  --skip-report            skip suite reporters (existing)
-  --list                   print discovered suites/cases without running
-  --verbose                stream per-case ctx.log and event summaries
-```
-- Positional suite ids replace `--suite`; the dead `--all` flag is removed
-  (no filter already means all).
-- Exit codes: `0` all cases passed checks (and thresholds under `--strict`);
-  `1` any case failed (check failure, run throw, execution error, or strict
-  threshold miss); `2` runner/configuration error. Unmet requirements skip
-  (reported, exit-code-neutral); pass `--no-skips` to turn any skip into a
-  failure when a leg must prove full coverage.
-- Console reporter renders check failures with their `message` inline (the
-  bespoke error strings e2e tests craft today become structured output).
-- Artifacts gain `checks` per case in `cases/<id>.json` and `summary.json`.
-- New optional JUnit reporter (`eve/evals/reporters`) for CI annotation.
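The exit-code policy in the list above can be read as a small pure function. The following is a sketch only: `CaseOutcome`, `belowThreshold`, and `computeExitCode` are hypothetical names, and exit code `2` (runner/configuration error) is assumed to be raised elsewhere, before any case outcome exists.

```typescript
type CaseVerdict = "passed" | "failed" | "skipped";

interface CaseOutcome {
  readonly id: string;
  readonly verdict: CaseVerdict;
  /** True when a score fell under its threshold (only matters with --strict). */
  readonly belowThreshold: boolean;
}

interface ExitOptions {
  readonly strict: boolean;
  readonly noSkips: boolean;
}

/** Maps case outcomes to exit code 0 or 1 following the policy described above. */
function computeExitCode(outcomes: readonly CaseOutcome[], options: ExitOptions): 0 | 1 {
  for (const outcome of outcomes) {
    // Check failure, run throw, or execution error.
    if (outcome.verdict === "failed") return 1;
    if (outcome.verdict === "skipped") {
      // Skips are exit-code-neutral unless --no-skips is set.
      if (options.noSkips) return 1;
      continue;
    }
    // Passed checks, but a strict run also enforces score thresholds.
    if (options.strict && outcome.belowThreshold) return 1;
  }
  return 0;
}
```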
-## Client consolidation: delete `ExampleClient`
-`ClientSession` already supports everything `ExampleClient` posts
-(`UserContent` messages, `clientContext`, `inputResponses`, `outputSchema` —
-see `packages/eve/src/client/types.ts`). What `ExampleClient` adds must move
-into the framework, then it dies:
-1. **Pending input tracking**: `MessageResult` gains
-   `inputRequests: readonly InputRequest[]` (collected from `input.requested`
-   events for the consumed turn). The eval driver and the TUI both stop
-   re-deriving it.
-2. **Stream-registration race**: `fetchStreamWithRetry` /
-   `postWithDeliverRetry` exist because the GET stream and input-response
-   delivery race workflow registration after session-creating POSTs. Fix at
-   the source where possible (the server should not 500 on
-   `startIndex`-cursor GETs for a session it just acknowledged); where a
-   genuine propagation window remains, put one bounded retry policy inside
-   `eve/client` (`openStreamIterable` / `ClientSession.#postTurn`) so every
-   consumer — TUI, evals, users — inherits it. Delete both e2e copies and the
-   `tui-questions.ts` sleep.
-3. **Turn failure surfacing**: export a typed
-   `isTurnFailureEvent(event)` narrowing helper and the
-   `EveEvalTurn.expectOk()` driver method instead of `TurnFailedError`-style
-   throw-by-default (negative-path suites need non-throwing sends).
-4. **Multimodal sugar** (`sendTextWithImage`): becomes
-   `EvalSession.sendFile`; the data-URL/`FilePart` encoding helper is exported
-   from `eve/client` for general use.
-5. `e2e/lib/session-stream.ts` is subsumed by `target.attachSession`.
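For item 2, a single bounded retry policy might look roughly like the sketch below. `withBoundedRetry` and `RetryPolicy` are hypothetical names, and the exponential backoff with a cap is an assumption; the plan only requires that the policy be bounded and live in one place.

```typescript
interface RetryPolicy {
  readonly maxAttempts: number;
  readonly baseDelayMs: number;
  readonly maxDelayMs: number;
}

/**
 * Runs `operation`, retrying only errors that `isRetryable` accepts, and
 * never more than `policy.maxAttempts` times in total.
 */
async function withBoundedRetry<T>(
  operation: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  policy: RetryPolicy,
  sleep: (ms: number) => Promise<void>,
): Promise<T> {
  let attempt = 0;
  for (;;) {
    attempt += 1;
    try {
      return await operation();
    } catch (error) {
      // Give up on the last attempt or on errors outside the race window.
      if (attempt >= policy.maxAttempts || !isRetryable(error)) throw error;
      // Assumed backoff: double the delay each attempt, capped at maxDelayMs.
      const delay = Math.min(policy.baseDelayMs * 2 ** (attempt - 1), policy.maxDelayMs);
      await sleep(delay);
    }
  }
}
```

Keeping `isRetryable` narrow (for example, only the failure a just-acknowledged session produces) would stop the policy from masking genuine server errors.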
-## Replacing the e2e surface
-### Where suites live
-Each fixture app keeps owning its coverage: suites move into
-`e2e/fixtures/agent-*/evals/*.eval.ts` and
-`apps/fixtures/weather-fixture/evals/*.eval.ts`. Discovery already scans
-`<appRoot>/evals/`. The area-policy module becomes unnecessary — a suite can
-only target its own app, by construction. Provisioning scripts live next to
-the fixtures (`e2e/provision/`), built from today's `e2e/lib/server.ts` and
-`e2e/target/` logic rather than rewriting it.
-### Coverage mapping
-Scripted cases let one suite absorb a whole e2e group: today's 78 script files
-consolidate into roughly one suite per behavior family (e.g.
-`agent-tools-hitl/evals/hitl.eval.ts` with `approve-then-persist`,
-`deny-regates`, `ask-question`, and `tool-auth` cases), each sharing one
-provisioned target. `--case` becomes the day-to-day tool for re-running a
-single behavior while debugging.
-| e2e group                                                                                         | v2 expression                                                                                                                                                                                                                             |
-| ------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `basic-runtime/*` (basic, multi-turn history, client context, output schema, image, define-state) | `task.run` + `Checks.messageIncludes` / `outputMatches`; `sendFile` for image; `send({ clientContext })`; multi-turn token recall is two `send`s and one check                                                                            |
-| `tools/*` (14 dynamic-tool files, MCP, multi-step loop, narrowing, throw-recover)                 | `Checks.toolCalled(name, { input, output, isError })`, `Checks.toolOrder`, `Checks.event` for ordering edge cases; MCP stub started by the provisioner (`startMcpStub`), addressed via `env:` requirement                                 |
-| `tools-hitl/*` (approval, denial, ask-question, tool auth)                                        | `expectInputRequests` + `respond`/`respondAll`; auth flows keep the IdP emulator as a provisioner sidecar                                                                                                                                 |
-| `tools-sandbox/*`                                                                                 | `task.run` + `Checks.toolCalled("bash", ...)`; snapshot suite tagged `requires-credentials`, excluded in CI via `--tag`                                                                                                                   |
-| `channels/*`                                                                                      | provisioner starts the fake provider (one shared `startHttpProbe`) and exports `*_API_BASE_URL`; case does `target.fetch` webhook ingress, asserts on probe captures via its inspection endpoint, `attachSession` for the created session |
-| `schedules/*`                                                                                     | `target.dispatchSchedule` + `attachSession`; stream-resume test asserts `attachSession({ startIndex })` replay                                                                                                                            |
-| `subagents/*` (incl. remote delegation, callbacks, failures)                                      | provisioner boots both agents and wires `EVE_WEATHER_AGENT_HOST`; `Checks.subagentCalled(name, { remoteUrl })`; callback retry/bypass probes as provisioner sidecars                                                                      |
-| `codemode/*`                                                                                      | Unchanged suite bodies; CI matrix env (`EVE_EXPERIMENTAL_CODE_MODE=1`) is set by the CI job / provisioner when starting the server                                                                                                        |
-| `tui-client/*`                                                                                    | **Stays a script harness** (non-goal)                                                                                                                                                                                                     |
-### Worked example: porting `remote-agent-delegation`
-Today's `e2e/tests/subagents/remote-agent-delegation.ts` is 101 lines: manual
-port arithmetic, two `resolveTarget` calls threading `startEnv` by hand, and a
-58-line `assertRemoteDelegation` function of filter/narrow/throw prose. The v2
-suite, at `e2e/fixtures/agent-subagents/evals/remote-delegation.eval.ts`:
-```ts
-import { defineEvalSuite } from "eve/evals";
-import { Checks } from "eve/evals/checks";
-const CITY = "Lisbon";
-export default defineEvalSuite({
-  description: "Remote subagent delegation over HTTP to a second local agent.",
-  // Assumptions about the target, verified by the runner. The provisioner
-  // boots both agents with mock models and wires EVE_WEATHER_AGENT_HOST.
-  requires: ["mockModels", "env:EVE_WEATHER_AGENT_HOST"],
-  checks: [Checks.didNotFail()],
-  scores: [],
-  cases: [
-    {
-      id: "weather-result-reaches-parent",
-      async run({ session }) {
-        await session.send(
-          `Use the weather remote agent to get the weather for ${CITY}. ` +
-            "Include its result in the final reply.",
-        );
-      },
-      checks: [
-        Checks.subagentCalled("weather", {
-          remoteUrl: () => process.env.EVE_WEATHER_AGENT_HOST!,
-          output: /Sunny[\s\S]*72F/,
-        }),
-        Checks.messageIncludes(CITY),
-        Checks.messageIncludes("Sunny"),
-        Checks.messageIncludes("72F"),
-      ],
-    },
-  ],
-});
-```
-The two-server topology moves to a ~20-line provisioner
-(`e2e/provision/subagents.ts`, reusing today's `startAgentServer`): boot the
-weather fixture with mocks, boot the parent with mocks +
-`EVE_WEATHER_AGENT_HOST`, then run
-`eve eval --strict --url "$PARENT_URL" remote-delegation`.
-What the diff buys, beyond line count:
-- **Clean separation** — the suite holds only interaction and assertions; the
-  topology lives in one provisioner shared by every subagent case. The suite
-  states its assumptions (`requires`) and the runner enforces them, so running
-  it against an unprovisioned target skips with a named reason instead of
-  failing mysteriously.
-- **Structured failures** — `Checks.subagentCalled` failing reports the
-  observed `subagent.called` events in its result metadata; today that detail
-  exists only because someone hand-built the error string.
-- **Free reporting** — the case lands in `--json`, artifacts, and (if
-  configured) Braintrust like any other eval, and
-  `--case weather-result-reaches-parent` reruns it in isolation.
-- **Room to grow** — `remote-agent-callback-retry`, `-bypass`, and
-  `-start-failure` become sibling cases in the same suite, sharing one
-  provisioned topology instead of re-spawning per file.
-### CI
-`.github/workflows/smoke.yml` discovery changes from globbing
-`e2e/tests/*/*.ts` to globbing fixture apps with `evals/` directories. Each
-matrix leg provisions, then runs the suites against the resulting URL:
-```sh
-node e2e/provision/<group>.ts &        # build + sidecars + env + eve start
-pnpm --filter <fixture-app> exec eve eval --strict --json --url "$TARGET_URL"
-```
-Each group runs twice (once direct, once with `EVE_EXPERIMENTAL_CODE_MODE=1`
-set by the provisioner), with `fail-fast: false` and the JUnit reporter for
-annotations. Per-suite artifacts under `.eve/evals/` upload on failure, which
-is strictly better for debugging than today's stdout scraping. A post-deploy
-leg is the same invocation pointed at a preview deployment;
-requirement-incompatible cases skip visibly.
-### What gets deleted (end state)
-- `e2e/lib/client.ts`, `e2e/lib/session-stream.ts`,
-  `e2e/lib/schedule-dispatch.ts`, the duplicated retry helpers, the per-file
-  fake provider servers, `e2e/lib/area-policy.ts`.
-- `e2e/lib/run.ts`'s assertion-side surface; `e2e/lib/server.ts` and
-  `e2e/target/*` shrink into thin provisioners under `e2e/provision/` (their
-  self-tests come along) instead of being absorbed into the framework.
-- All of `e2e/tests/**` except `tui-client/`.
-## Implementation phases
-Each phase ships independently, keeps `pnpm test` green, and includes docs
-(`docs/public/advanced/evals.md`) + changesets per repo policy.
-### Phase 1 — Assertions and pass/fail (no breaking interaction changes)
-1. Derived facts v2: typed `toolCalls`/`subagentCalls`/`inputRequests`/`parked`
-   (breaking type change; update `Run` scorers and Braintrust reporter).
-2. `checks` field, `EveEvalCheck` types, `eve/evals/checks` built-ins with the
-   matcher mini-language (literal / RegExp / predicate, partial deep match).
-3. Exit-code policy + `--strict`; console/JSON/artifact rendering of checks.
-4. Hygiene: `--tag`/`--case` filtering, `--list`, remove `--all`, make `model`
-   conditionally required, surface `parked` so `Run.didNotFail` stops silently
-   passing parked sessions.
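The matcher mini-language from item 2 (literal / RegExp / predicate, partial deep match) could be implemented along these lines. This is a sketch, not the planned implementation: the function name `matches` is hypothetical, and two semantics are assumptions, namely that arrays are compared element by element at equal length and that object patterns ignore keys they do not mention.

```typescript
/**
 * Returns true when `actual` satisfies `expected`, where `expected` is a
 * literal, a RegExp, a predicate, or a nested object/array of those.
 */
function matches(actual: unknown, expected: unknown): boolean {
  if (expected instanceof RegExp) {
    expected.lastIndex = 0; // keep /g and /y patterns stateless across calls
    return typeof actual === "string" && expected.test(actual);
  }
  if (typeof expected === "function") {
    return Boolean((expected as (value: unknown) => unknown)(actual));
  }
  if (Array.isArray(expected)) {
    // Assumption: arrays match element by element at the same length.
    return (
      Array.isArray(actual) &&
      actual.length === expected.length &&
      expected.every((item, index) => matches(actual[index], item))
    );
  }
  if (expected !== null && typeof expected === "object") {
    if (actual === null || typeof actual !== "object") return false;
    const actualRecord = actual as Record<string, unknown>;
    const expectedRecord = expected as Record<string, unknown>;
    // Partial deep match: only keys named in the pattern are compared.
    return Object.keys(expectedRecord).every((key) =>
      matches(actualRecord[key], expectedRecord[key]),
    );
  }
  return Object.is(actual, expected);
}
```

A check such as `Checks.toolCalled(name, { input, output, isError })` would then presumably pass each observed tool call and the user-supplied pattern through a function of this kind.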
-### Phase 2 — Interaction API
-5. Client groundwork: `MessageResult.inputRequests`, retry policy moved into
-   `eve/client`, server-side fix for the stream-registration race, multimodal
-   helper export.
-6. `EvalSession` driver + `EveEvalTurn`; `task.run` variant and the
-   scripted-case union (`case.run`, optional `input`, per-case
-   `checks`/`scores`); reimplement `prompt`/`messages` as sugar over `run`;
-   multi-session support (`newSession`), accumulated multi-session event
-   capture.
-7. HITL surface: `expectInputRequests`, `respond`, `respondAll`; checks for
-   parked/resumed flows. Port `tools-hitl` smokes as the proving ground.
-### Phase 3 — Targets, requirements, and non-session surfaces
-8. `EveEvalTargetHandle`: `baseUrl`, `fetch`, `info`, `capabilities`,
-   `dispatchSchedule`, `attachSession`; `--mock-models` for the runner-booted
-   dev server; readiness/identity handshake for `--url` targets.
-9. Requirements: suite/case `requires` (`mockModels` / `devRoutes` /
-   `env:<NAME>`), `skipped` verdict + `--no-skips`, runtime guards on
-   requirement-gated handle methods, and `/eve/v1/info` reporting mock-model
-   state so requirements are verified against the live target.
-10. Provisioner helpers in `eve/evals/environment`: `startHttpProbe` (with
-    HTTP inspection endpoint) and `startMcpStub`, generalized from
-    `e2e/lib/`; restructure `e2e/lib/server.ts` + `e2e/target/*` into
-    `e2e/provision/` scripts.
-11. `trials` + JUnit reporter.
-### Phase 4 — Migration and deletion
-12. Port suites group by group in the order: `basic-runtime` → `tools` →
-    `tools-hitl` → `subagents` → `schedules` → `channels` → `codemode` →
-    `tools-sandbox`. Each ported group flips its CI matrix entry from
-    `node e2e/tests/...` to provision + `eve eval --strict --url` in the same
-    PR; both runners coexist until the last group lands.
-13. Delete `e2e/tests/**` (minus `tui-client/`) and `ExampleClient`; retire
-    area policy; shrink `e2e/lib`/`e2e/target` into `e2e/provision/`; update
-    `e2e/README.md` and AGENTS.md smoke-test guidance to point at `eve eval`.
-14. Optional follow-on: add a post-deploy `--url` leg against preview
-    deployments for the requirement-compatible subset of fixture suites.
-## Breaking changes
-Pre-1.0, breaking is preferred over compatibility shims (principle 4). Minor
-changesets for each:
-- `EveEvalDerivedFacts.toolCalls` / `subagentCalls` change from `string[]` to
-  typed records.
-- `EveEvalCase` becomes a data/scripted union; `input` is no longer required
-  on scripted cases. Existing data cases are unaffected.
-- `model` becomes optional; suites passing dummy models can drop them.
-- `--suite` replaced by positional ids; `--all` removed.
-- Exit-code semantics gain check failures (suites without `checks` see no
-  change unless `--strict`).
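To illustrate the data/scripted union in the second bullet, here is a rough type-level sketch. All names (`DataCase`, `ScriptedCase`, `EvalCase`, `RunContext`, `isScriptedCase`) are hypothetical stand-ins; the real `EveEvalCase` carries more fields (checks, scores, tags) that are omitted here.

```typescript
// Hypothetical, heavily reduced run context for illustration only.
interface RunContext {
  readonly session: { send(message: string): Promise<void> };
}

/** Dataset-style case: `input` stays required, no script. */
interface DataCase<Input> {
  readonly id: string;
  readonly input: Input;
  readonly run?: undefined;
}

/** Scripted case: drives the session itself, so `input` becomes optional. */
interface ScriptedCase<Input> {
  readonly id: string;
  readonly input?: Input;
  run(ctx: RunContext): Promise<void>;
}

type EvalCase<Input> = DataCase<Input> | ScriptedCase<Input>;

/** Narrowing helper a runner could use to pick the execution path. */
function isScriptedCase<Input>(evalCase: EvalCase<Input>): evalCase is ScriptedCase<Input> {
  return typeof evalCase.run === "function";
}

const cases: EvalCase<string>[] = [
  { id: "capital-of-portugal", input: "What is the capital of Portugal?" },
  {
    id: "weather-result-reaches-parent",
    async run({ session }) {
      await session.send("Use the weather remote agent to get the weather for Lisbon.");
    },
  },
];
```

Under this shape an existing data case type-checks unchanged, which matches the claim that existing data cases are unaffected.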
-## Risks and open questions
-- **Cost/flakiness of model-backed suites in CI.** Mitigation: the
-  `"mockModels"` requirement makes determinism a declared, verified property
-  instead of a per-file accident; `trials` + `--strict` thresholds handle the
-  suites that must use real models. The migration should explicitly decide,
-  per ported smoke, mock vs real — today's split is undocumented.
-- **The stream-registration race** may not be fully fixable server-side in
-  Phase 2; the client-level bounded retry is the documented fallback. Track it
-  as its own issue rather than letting retry constants drift again.
-- **Channel suites with real credentials** (`slack-thread-context`) and
-  sandbox snapshot suites stay tag-gated (`requires-credentials`) and excluded
-  from default CI, as today.
-- **Provisioner drift on `--url` legs.** Beyond what `/info` exposes
-  (identity, mode, mock-model state) and `env:<NAME>` presence checks, the
-  runner trusts the provisioner. A suite can pass locally and skip-or-fail
-  remotely because the target was provisioned differently; requirements make
-  this visible but cannot make it impossible.
-- **Open: should the runner ever own provisioning?** A suite-owned
-  `environment` block (runner builds/starts targets, injects env, manages
-  sidecar lifecycles, boots secondary agents) was considered and deliberately
-  cut from v1: it duplicated CLI concerns and let suites declare build-time
-  properties they cannot own at run time. Revisit only with concrete friction
-  data from the migration — the likely v2 shape, if any, is a _project-level_
-  provisioning config consumed by `eve eval` (like Playwright's `webServer`),
-  not per-suite config.
-- **User-facing naming**: `checks` vs `scores` vs `thresholds` vs `requires`
-  needs a docs pass so the soft/hard/assumption distinction is obvious; the
-  evals doc gets a "smoke-testing your agent" section once Phase 2 lands.