npm - eve - Versions diffs - 0.6.0-beta.9 → 0.7.2 - Mend

eve 0.6.0-beta.9 → 0.7.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (650)

package/dist/docs/public/channels/overview.mdx CHANGED

@@ -5,7 +5,7 @@ description: "How users reach your agent: the channel contract, the base Eve HTT
 A channel is the edge adapter between a platform and your agent, and its job is deliberately small. It does exactly three things: normalizes platform input into a user message, owns the `continuationToken` (the resume handle for a conversation on that surface), and decides delivery (how, where, and whether a response goes back).
-Eve ships a base HTTP channel plus first-class platform channels (Slack, Discord, Teams, Telegram, Twilio, GitHub, Linear), and you can author your own.
+Eve ships a base HTTP channel plus first-class platform channels, and you can author your own. Browse the full set in the [Integrations](/integrations) gallery.
 ## Where channels live
@@ -24,30 +24,36 @@ Scaffold a channel file with `eve channels add` (interactive), or pass a kind: `
 ## The Eve HTTP channel (default)
-Eve's canonical HTTP session API: the routes the terminal UI, [`useEveAgent`](../frontend/overview), and `curl` all talk to. It is enabled by default, even with no `agent/channels/eve.ts` file. Add that file only to override the defaults, most often the route auth policy. See [Eve channel](./eve) for the routes, auth, and customization.
+Eve's canonical HTTP session API: the routes the terminal UI, [`useEveAgent`](../guides/frontend/overview), and `curl` all talk to. It is enabled by default, even with no `agent/channels/eve.ts` file. Add that file only to override the defaults, most often the route auth policy. See [Eve channel](./eve) for the routes, auth, and customization.
 ## Custom channels
 When Eve doesn't ship a channel for your surface, build one with `defineChannel` from `eve/channels`: route handlers (`GET`, `POST`, `PUT`, `PATCH`, `DELETE`, `WS`), an `events` map, and a `send` call inside a handler to start or resume a session. See [Custom channels](./custom) for the full walkthrough, including WebSocket routes, cross-channel hand-off, channel metadata, continuation tokens, and file uploads.
+## Relationship to the Chat SDK
+Eve uses the Chat SDK's **card-builder components** (Cards, Buttons, Actions, etc.) for composing rich Slack messages. When you build a card with the [Slack channel](./slack), the underlying primitives come from the Chat SDK and get converted to Slack Block Kit at post time.
+Eve does **not** use the Chat SDK's runtime. The `Chat`, `Adapter`, and `Thread` primitives are never imported or reachable through Eve's public API. Eve's channel layer (webhook handling, signature verification, event parsing, and thread management) is implemented inside Eve itself. In short: building Slack messages feels just like Chat SDK cards, but wiring a channel means authoring against Eve's `defineChannel(...)` API, not a Chat SDK adapter.
 ## Which channel?
-| You want…                                   | Use                                                 |
-| ------------------------------------------- | --------------------------------------------------- |
-| A web app / browser chat UI                 | Eve channel + [`useEveAgent`](../frontend/overview) |
-| Local tooling, SDK clients, `curl`          | Eve channel (default)                               |
-| Slack mentions, DMs, buttons                | [Slack](./slack)                                    |
-| Discord slash commands, components          | [Discord](./discord)                                |
-| Microsoft Teams messages + Adaptive Cards   | [Teams](./teams)                                    |
-| Telegram bot messages                       | [Telegram](./telegram)                              |
-| SMS or speech-transcribed phone calls       | [Twilio](./twilio)                                  |
-| GitHub @mentions, PR review with checkout   | [GitHub](./github)                                  |
-| Linear issue delegation and Agent Sessions  | [Linear](./linear)                                  |
-| Anything else (internal webhook, WebSocket) | Custom channel (`defineChannel`, above)             |
+| You want…                                   | Use                                                        |
+| ------------------------------------------- | ---------------------------------------------------------- |
+| A web app / browser chat UI                 | Eve channel + [`useEveAgent`](../guides/frontend/overview) |
+| Local tooling, SDK clients, `curl`          | Eve channel (default)                                      |
+| Slack mentions, DMs, buttons                | [Slack](./slack)                                           |
+| Discord slash commands, components          | [Discord](./discord)                                       |
+| Microsoft Teams messages + Adaptive Cards   | [Teams](./teams)                                           |
+| Telegram bot messages                       | [Telegram](./telegram)                                     |
+| SMS or speech-transcribed phone calls       | [Twilio](./twilio)                                         |
+| GitHub @mentions, PR review with checkout   | [GitHub](./github)                                         |
+| Linear issue delegation and Agent Sessions  | [Linear](./linear)                                         |
+| Anything else (internal webhook, WebSocket) | Custom channel (`defineChannel`, above)                    |
 ## What to read next
 - [Slack](./slack): the most common platform channel, end to end
 - [Custom channels](./custom): build a channel for any surface with `defineChannel`
-- [Frontend](../frontend/overview): browser chat on the Eve channel with `useEveAgent`
+- [Frontend](../guides/frontend/overview): browser chat on the Eve channel with `useEveAgent`
 - [Integrations](/integrations): browse every built-in channel and connection in one gallery

package/dist/docs/public/channels/slack.mdx CHANGED

@@ -4,14 +4,14 @@ description: "Reach your agent from Slack app mentions and DMs, with thread anch
 type: integration
 ---
-The Slack channel puts your agent inside a workspace: it answers `@mentions` and DMs, replies in threads, shows typing indicators, and turns HITL prompts into buttons. Use it when the conversation should happen where your team already works. Credentials run through [Vercel Connect](../advanced/auth-and-route-protection), which handles both the outbound bot token and inbound webhook verification, so there's no `SLACK_BOT_TOKEN` or `SLACK_SIGNING_SECRET` for you to manage. See [Channels](./overview) for the contract this builds on.
+The Slack channel puts your agent inside a workspace: it answers `@mentions` and DMs, replies in threads, shows typing indicators, and turns HITL prompts into buttons. Use it when the conversation should happen where your team already works. Credentials run through [Vercel Connect](../guides/auth-and-route-protection), which handles both the outbound bot token and inbound webhook verification, so there's no `SLACK_BOT_TOKEN` or `SLACK_SIGNING_SECRET` for you to manage. See [Channels](./overview) for the contract this builds on.
 ## Set up Connect
 Create a Slack Connect client and copy its UID (e.g. `slack/my-agent`), then attach this project as the trigger destination at Eve's Slack route:
 ```bash
-pnpm i -g vercel@latest && export FF_CONNECT_ENABLED=1
+npm install -g vercel@latest && export FF_CONNECT_ENABLED=1
 vercel connect create slack --triggers
 vercel connect detach <uid> --yes
 vercel connect attach <uid> --triggers --trigger-path /eve/v1/slack --yes
@@ -24,7 +24,7 @@ vercel connect attach <uid> --triggers --trigger-path /eve/v1/slack --yes
 Scaffold the channel and its dependency with `eve channels add slack`, or set it up by hand:
 ```bash
-pnpm add @vercel/connect
+npm install @vercel/connect
 ```
 ```ts title="agent/channels/slack.ts"
@@ -68,6 +68,18 @@ async onAppMention(ctx, message) {
 **HITL** renders as Slack buttons/selects; submissions resume the parked session.
+**Authorization prompts are private.** A sign-in challenge (OAuth URL, device code) is a credential. Anyone who completes it binds their identity to the session's connection. The default `authorization.required` handler delivers the challenge ephemerally to the triggering user, device code included, and posts a public link-free status only when it has no user to target. The handler receives a private-delivery context with `postEphemeral`, `postDirectMessage` (needs the `im:write` scope), and `state`. There is, intentionally, no public `post` and no raw API access.
+```ts
+events: {
+  "authorization.required"(event, ctx) {
+    const userId = ctx.state.triggeringUserId;
+    if (!userId || !event.authorization?.url) return;
+    return ctx.postDirectMessage(userId, `Sign in to continue: ${event.authorization.url}`);
+  },
+},
+```
 ```ts
 import { defaultSlackAuth, slackChannel } from "eve/channels/slack";
@@ -89,4 +101,4 @@ Event handlers receive `(eventData, ctx)` with platform handles on `ctx.thread`
 ## What to read next
 - [Channels overview](./overview): the channel contract and every built-in channel
-- [Auth & route protection](../advanced/auth-and-route-protection): authenticating inbound traffic
+- [Auth & route protection](../guides/auth-and-route-protection): authenticating inbound traffic

package/dist/docs/public/channels/teams.mdx CHANGED

@@ -52,4 +52,4 @@ export default teamsChannel({
 ## What to read next
 - [Channels overview](./overview): the channel contract and every built-in channel
-- [Auth & route protection](../advanced/auth-and-route-protection): authenticating inbound traffic
+- [Auth & route protection](../guides/auth-and-route-protection): authenticating inbound traffic

package/dist/docs/public/channels/telegram.mdx CHANGED

@@ -53,4 +53,4 @@ export default telegramChannel({
 ## What to read next
 - [Channels overview](./overview): the channel contract and every built-in channel
-- [Auth & route protection](../advanced/auth-and-route-protection): authenticating inbound traffic
+- [Auth & route protection](../guides/auth-and-route-protection): authenticating inbound traffic

package/dist/docs/public/channels/twilio.mdx CHANGED

@@ -59,4 +59,4 @@ export default twilioChannel({
 ## What to read next
 - [Channels overview](./overview): the channel contract and every built-in channel
-- [Auth & route protection](../advanced/auth-and-route-protection): authenticating inbound traffic
+- [Auth & route protection](../guides/auth-and-route-protection): authenticating inbound traffic

package/dist/docs/public/{advanced → concepts}/context-control.md RENAMED

@@ -1,5 +1,5 @@
 ---
-title: "Context control"
+title: "Context Control"
 description: "Control what the model sees and when: instructions, skills, the workspace, and subagents."
 ---
@@ -80,7 +80,7 @@ See [Subagents](../subagents).
 ## Dynamic context with `defineDynamic`
-The levers above are static: authored once, the same on every session. When the right context depends on who is calling (their team, tenant, plan, or feature flags), resolve it at runtime instead. `defineDynamic` in `agent/instructions/` returns the per-session system prompt, and `defineDynamic` in `agent/skills/` returns the set of skills a caller can load. Both read `ctx.session.auth` or channel metadata, so a caller on the billing team gets the billing instructions and playbook while nobody else sees them. See [Dynamic capabilities](./dynamic-capabilities) for the resolver API and when each event fires.
+The levers above are static: authored once, the same on every session. When the right context depends on who is calling (their team, tenant, plan, or feature flags), resolve it at runtime instead. `defineDynamic` in `agent/instructions/` returns the per-session system prompt, and `defineDynamic` in `agent/skills/` returns the set of skills a caller can load. Both read `ctx.session.auth` or channel metadata, so a caller on the billing team gets the billing instructions and playbook while nobody else sees them. See [Dynamic capabilities](../guides/dynamic-capabilities) for the resolver API and when each event fires.
 ## Choosing the right lever
@@ -106,4 +106,4 @@ For most agents:
 - [Tools](../tools)
 - [Skills](../skills)
 - [Subagents](../subagents)
-- [Hooks](./hooks)
+- [Hooks](../guides/hooks)

package/dist/docs/public/{advanced → concepts}/default-harness.md RENAMED

@@ -18,7 +18,7 @@ export default defineAgent({
 });
 ```
-Compaction is also a hook point for tools. When the harness compacts history, it calls each tool's `onCompact(input, ctx)` in registration order, so a tool can re-inject the facts it needs to survive the summary: appending a short message, or patching session state. See [`onCompact`](./dynamic-capabilities) for the hook contract.
+Compaction also preserves the framework's own tool state automatically. When the harness compacts history, it resets read-before-write tracking (so a write afterward re-reads the file whose read evidence was summarized away) and re-injects the active todo list, so the model keeps its task list across the summary. There is no per-tool hook to configure.
 ## Built-in tools
@@ -41,7 +41,7 @@ These ship with every agent, no imports. Discovery never runs them: the harness
 Notes:
-- **`agent`** runs a copy of the current agent on a focused task. It inherits the same tools, connections, and instructions, but starts with fresh conversation history and fresh [state](./state). The child shares the parent's sandbox filesystem, so anything it writes is visible to the parent. See [Subagents](../subagents).
+- **`agent`** runs a copy of the current agent on a focused task. It inherits the same tools, connections, and instructions, but starts with fresh conversation history and fresh [state](../guides/state). The child shares the parent's sandbox filesystem, so anything it writes is visible to the parent. See [Subagents](../subagents).
 - **`load_skill`** only pulls instructions into context. It adds no new execution surface, because behavior still comes from the tools the agent already has.
 - **`connection_search`** is the model-facing `connection__search` tool. A search surfaces a connection's tools by their qualified name (e.g. `connection__linear__list_issues`), and the model can then call them directly. It's registered only when the agent has connections.
 - **`web_search`** has no local executor; the provider runs it. To supply your own implementation, override it with `defineTool()`.
@@ -85,10 +85,10 @@ There's also an experimental `Workflow` tool, shipped but off by default. To tur
 export { ExperimentalWorkflow as default } from "eve/tools";
 ```
-With it on, the model can orchestrate the agent's own subagents from model-authored JavaScript, all as one durable step. See [Dynamic workflows](./dynamic-workflows).
+With it on, the model can orchestrate the agent's own subagents from model-authored JavaScript, all as one durable step. See [Dynamic workflows](../guides/dynamic-workflows).
 ## What to read next
-- [Tools](../tools): define your own tools, gate them on approval, and shape their output (`toModelOutput` / `onCompact`)
-- [Dynamic capabilities](./dynamic-capabilities): generate the tool set per session with `defineDynamic`
+- [Tools](../tools): define your own tools, gate them on approval, and shape their output with `toModelOutput`
+- [Dynamic capabilities](../guides/dynamic-capabilities): generate the tool set per session with `defineDynamic`
 - [Sandbox](../sandbox): the sandbox the shell and file tools run in

package/dist/docs/public/{advanced → concepts}/execution-model-and-durability.md RENAMED

@@ -21,6 +21,8 @@ Crash the process, hit a timeout, or redeploy mid-turn, and the run picks up fro
 There's nothing to configure here. Eve owns the workflow lifecycle, and sessions are durable by default.
+You don't write workflow code directly. Workflow primitives (`start()`, `resumeHook()`, etc.) are an implementation detail of Eve's runtime layer; channels, tools, and hooks never touch them. When you do need session data from your own code, there are two supported surfaces: tools read the current session's metadata (id, turn, auth, parent lineage) via `ctx.session`, and [`defineState`](../guides/session-context) reads or writes session-scoped durable state. See [State](../guides/state) for the read/write model.
 ## Parked work
 Some work has to wait: a human approving a [tool](../tools), an interactive OAuth sign-in for a [connection](../connections), or a long-running [subagent](../subagents). At those points the turn parks durably. The workflow suspends and holds no compute until the input it's waiting on shows up (a click, a callback, a child completing), even if that's much later. When it does, the conversation picks up exactly where it left off.
@@ -45,4 +47,4 @@ Conversation history within a session is append-only. Turns land in order, and t
 - [Sessions, runs & streaming](./sessions-runs-and-streaming): the handles you hold and the event stream you watch.
 - [Security model](./security-model): the trust boundaries the runtime enforces.
-- [State](../advanced/state): durable per-session memory that persists across step boundaries.
+- [State](../guides/state): durable per-session memory that persists across step boundaries.

package/dist/docs/public/concepts/meta.json ADDED

@@ -0,0 +1,10 @@
+{
+  "title": "Concepts",
+  "pages": [
+    "execution-model-and-durability",
+    "sessions-runs-and-streaming",
+    "default-harness",
+    "context-control",
+    "security-model"
+  ]
+}

package/dist/docs/public/{advanced → concepts}/security-model.md RENAMED

@@ -42,7 +42,7 @@ A [channel](../channels/overview) is your agent's front door, which makes authen
   claims. A body field is attacker-controlled; treating it as identity is
   cross-user impersonation.
-The `support-fixture` dashboard channel is a concrete custom-channel fixture for these rules: it authenticates the raw body with an HMAC, compares signatures in constant time, and trusts the body-supplied principal only after the signature verifies.
+A custom channel that accepts dashboard-style webhooks should follow the same shape: authenticate the raw body with an HMAC, compare signatures in constant time, and trust any body-supplied principal only after the signature verifies.
 ## Authored markdown is data
@@ -50,7 +50,7 @@ The `support-fixture` dashboard channel is a concrete custom-channel fixture for
 ## Auth fails closed
-Routes reject unauthenticated traffic by default: if no `AuthFn` in the walk accepts the request, it gets a `401`, and admitting anonymous callers takes an explicit `none()`. The scaffold's `placeholderAuth()` keeps a half-configured app closed in production until you replace it. See [Auth & route protection](../advanced/auth-and-route-protection) for the full walk and verifiers.
+Routes reject unauthenticated traffic by default: if no `AuthFn` in the walk accepts the request, it gets a `401`, and admitting anonymous callers takes an explicit `none()`. The scaffold's `placeholderAuth()` keeps a half-configured app closed in production until you replace it. See [Auth & route protection](../guides/auth-and-route-protection) for the full walk and verifiers.
 ## Pre-production checklist
@@ -73,7 +73,7 @@ Before exposing an agent to real traffic:
 ## What to read next
-- [Auth & route protection](./auth-and-route-protection): the full auth walk and verifier helpers
+- [Auth & route protection](../guides/auth-and-route-protection): the full auth walk and verifier helpers
 - [Sandbox](../sandbox): backends, network policy, and brokering config
 - [Execution model & durability](./execution-model-and-durability): how durable sessions run
 - [Connections](../connections): static-token and OAuth connections
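
Reviewer note on the reworded custom-channel rule above (authenticate the raw body with an HMAC, compare in constant time, trust the body-supplied principal only after verification). The following is a minimal sketch of that shape using only Node's `node:crypto` module. The helper names, the hex-encoded SHA-256 signature format, and the `principal` body field are assumptions made for illustration; they are not taken from Eve's API or from the removed `support-fixture` channel.

```ts
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper (not part of Eve): verifies a hex-encoded HMAC-SHA256
// signature computed over the exact raw request body.
function verifyWebhookSignature(rawBody: string, signatureHex: string, secret: string): boolean {
  // Reject anything that is not exactly a 32-byte digest in hex.
  if (!/^[0-9a-f]{64}$/i.test(signatureHex)) return false;

  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHex, "hex");

  // timingSafeEqual throws on length mismatch, so guard first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}

// Hypothetical helper: the body is parsed, and its `principal` field trusted,
// only after the signature over the raw bytes has verified.
function principalFromVerifiedBody(
  rawBody: string,
  signatureHex: string,
  secret: string,
): string | null {
  if (!verifyWebhookSignature(rawBody, signatureHex, secret)) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return null;
  }
  const principal = (parsed as { principal?: unknown } | null)?.principal;
  return typeof principal === "string" ? principal : null;
}
```

The signature is checked against the raw string rather than a re-serialized object, since re-serialization can change byte order or whitespace and silently break verification.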

package/dist/docs/public/{advanced → concepts}/sessions-runs-and-streaming.md RENAMED

@@ -14,7 +14,7 @@ Two handles do two jobs here, and mixing them up is the most common mistake. One
 A session has one active continuation at a time: each follow-up uses the current `continuationToken`, and a stale one is rejected.
-React, Vue, and Svelte apps reach for [`useEveAgent()`](../frontend/overview) instead of calling these routes by hand. Next.js and Nuxt apps can proxy them to the Eve runtime from the same origin.
+React, Vue, and Svelte apps reach for [`useEveAgent()`](../guides/frontend/overview) instead of calling these routes by hand. Next.js and Nuxt apps can proxy them to the Eve runtime from the same origin.
 ## Start a session
@@ -96,7 +96,7 @@ curl "http://127.0.0.1:3000/eve/v1/session/<sessionId>/stream?startIndex=<count>
 For scripts, server-to-server calls, tests, evals, and custom UIs, `eve/client` wraps these routes in a typed client so you don't hand-roll the POST and NDJSON stream loop.
-Start with the [TypeScript Client](../client/overview) guide. It covers basic usage, sending messages, continuations, streaming, and per-turn `outputSchema` results.
+Start with the [TypeScript SDK](../guides/client/overview) guide. It covers basic usage, sending messages, continuations, streaming, and per-turn `outputSchema` results.
 ## Inspect the agent over HTTP
@@ -106,7 +106,7 @@ Start with the [TypeScript Client](../client/overview) guide. It covers basic us
 curl http://127.0.0.1:3000/eve/v1/info
 ```
-The route uses the same default auth chain as the eve channel (`[localDev(), vercelOidc()]`). Locally it answers anonymously; a deployed Vercel target requires a valid OIDC bearer, with a same-project bypass for in-deployment callers. See [auth & route protection](../advanced/auth-and-route-protection).
+The route uses the same default auth chain as the eve channel (`[localDev(), vercelOidc()]`). Locally it answers anonymously; a deployed Vercel target requires a valid OIDC bearer, with a same-project bypass for in-deployment callers. See [auth & route protection](../guides/auth-and-route-protection).
 ## Dispatch order
@@ -114,8 +114,8 @@ Every stream event runs four steps, in this order:
 1. **Channel handler**: the channel's event handler runs and can mutate adapter state.
 2. **Metadata projection**: the framework re-evaluates the channel's `metadata(state)` and stores the result.
-3. **Hooks**: authored [hooks](../advanced/hooks) subscribed to the event fire.
-4. **Dynamic resolvers**: [dynamic](../advanced/dynamic-capabilities) tool, skill, and instruction resolvers fire, and `ctx.channel.metadata` already holds the freshly projected metadata from step 2.
+3. **Hooks**: authored [hooks](../guides/hooks) subscribed to the event fire.
+4. **Dynamic resolvers**: [dynamic](../guides/dynamic-capabilities) tool, skill, and instruction resolvers fire, and `ctx.channel.metadata` already holds the freshly projected metadata from step 2.
 The order isn't incidental, it's structural. By the time a resolver or hook reads channel metadata, the channel has already updated its state and the projection is current.
@@ -123,5 +123,5 @@ The order isn't incidental, it's structural. By the time a resolver or hook read
 - [Execution model & durability](./execution-model-and-durability): what makes a session durable and how parked work resumes.
 - [Channels](../channels/overview): what owns the continuation token and delivery.
-- [TypeScript Client](../client/overview): call these routes from scripts and server-side code.
-- [Frontend](../frontend/overview): `useEveAgent` instead of raw routes.
+- [TypeScript SDK](../guides/client/overview): call these routes from scripts and server-side code.
+- [Frontend](../guides/frontend/overview): `useEveAgent` instead of raw routes.

package/dist/docs/public/connections.mdx CHANGED

@@ -124,7 +124,7 @@ export default defineMcpClientConnection({
 });
 ```
-`"linear"` is the UID you chose when registering the Connect client. Connect-managed OAuth is user-scoped by default, so the runtime resolves the per-user token before each tool call. The full setup (Connect client provisioning, project linking, the runtime consent flow) lives in [Auth & route protection](./advanced/auth-and-route-protection).
+`"linear"` is the UID you chose when registering the Connect client. Connect-managed OAuth is user-scoped by default, so the runtime resolves the per-user token before each tool call. The full setup (Connect client provisioning, project linking, the runtime consent flow) lives in [Auth & route protection](./guides/auth-and-route-protection).
 ## Self-hosted interactive OAuth
@@ -168,7 +168,9 @@ export default defineMcpClientConnection({
 });
 ```
-`getToken` runs before every tool call. `startAuthorization` and `completeAuthorization` are both-or-neither: provide one without the other and you get a definition error. The `challenge` rides along verbatim on the `authorization.required` event. Set `url` for redirect/device flows, `userCode` for a device code, and `instructions` as the call to action when there's no URL. Drop `resume` when the provider keeps flow state server-side, so nothing has to cross the step boundary.
+`getToken` runs before every tool call. `startAuthorization` and `completeAuthorization` are both-or-neither: provide one without the other and you get a definition error. The `challenge` rides along verbatim on the `authorization.required` event. Set `url` for redirect/device flows, `userCode` for a device code, `instructions` as the call to action when there's no URL, and `displayName` for the human-readable provider name channels show on the sign-in affordance (e.g. "Salesforce"). Drop `resume` when the provider keeps flow state server-side, so nothing has to cross the step boundary.
+`displayName` is presentation-only — the connection's path-derived name still keys the authorization scope, token cache, and callback URL. You can also set `displayName` on the `auth` definition itself (e.g. `auth: { ...connect("sfdc"), displayName: "Salesforce" }`); that definition-level value wins over one the strategy stamps on the challenge, and channels fall back to title-casing the connection name when neither is set.
 ### Signaling authorization state
@@ -213,5 +215,5 @@ A tool can require both sign-in (`auth`) and a human approval. Today the model's
 - [Integrations](/integrations): browse every channel and connection Eve ships, in one gallery.
 - [Tools](./tools): authored tools live alongside connection-provided tools; the same approval helpers apply.
-- [Auth & route protection](./advanced/auth-and-route-protection): the full interactive-OAuth flow with Vercel Connect.
-- [Security model](./advanced/security-model): how connection credentials stay out of the model's reach.
+- [Auth & route protection](./guides/auth-and-route-protection): the full interactive-OAuth flow with Vercel Connect.
+- [Security model](./concepts/security-model): how connection credentials stay out of the model's reach.
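
Reviewer note on the new `displayName` paragraph above: it describes a three-step precedence (definition-level value, then the value stamped on the challenge, then a title-cased connection name). A minimal sketch of that fallback order follows. The function names, the option shape, and the exact title-casing rule (splitting on hyphens, underscores, and whitespace) are assumptions for illustration only; the diff does not show how Eve implements this.

```ts
// Hypothetical title-casing rule: "my-crm_conn" -> "My Crm Conn".
function titleCase(name: string): string {
  return name
    .split(/[-_\s]+/)
    .filter((word) => word.length > 0)
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join(" ");
}

// Hypothetical resolver mirroring the documented precedence:
// 1. displayName on the `auth` definition,
// 2. displayName stamped on the challenge by the strategy,
// 3. the title-cased connection name.
function resolveDisplayName(opts: {
  definitionDisplayName?: string;
  challengeDisplayName?: string;
  connectionName: string;
}): string {
  return (
    opts.definitionDisplayName ??
    opts.challengeDisplayName ??
    titleCase(opts.connectionName)
  );
}
```

In this sketch the connection name is used for display only as a last resort; as the paragraph above states, it remains the key for the authorization scope, token cache, and callback URL regardless of what is shown.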

package/dist/docs/public/evals/assertions.mdx ADDED

@@ -0,0 +1,108 @@
+---
+title: "Assertions"
+description: "Run-level methods, t.check value assertions, the matcher mini-language, and gate vs soft severity."
+---
+Assertions are how an eval grades what its `test(t)` function produced. Each one **records** a result onto `t` and returns a chainable handle — the runner reads the recorded results to compute the verdict, so a single run reports every failing assertion rather than dying on the first. There are two deterministic surfaces: run-level methods on `t`, and `t.check` for grading a specific value. For model-graded assertions, see [Judge](./judge).
+## Run-level assertions
+Run-level assertions read the whole run, so they take no value. They are methods on `t` and gate by default.
+| Assertion                                           | Asserts                                                                           |
+| --------------------------------------------------- | --------------------------------------------------------------------------------- |
+| `t.completed()`                                     | The run did not fail and did not park on unanswered HITL input                    |
+| `t.didNotFail()`                                    | No terminal failure and no `turn.failed`/`step.failed` events (parked runs pass)  |
+| `t.waiting()`                                       | The run parked on HITL input (for approval-shaped evals)                          |
+| `t.messageIncludes(token)`                          | Joined assistant text contains `token` (string or RegExp)                         |
+| `t.outputEquals(value)` / `t.outputMatches(schema)` | Deep equality / Standard Schema (e.g. Zod) validation of the parsed output        |
+| `t.calledTool(name, opts?)`                         | A matching tool call happened (`input`, `output`, `isError`, `times` constraints) |
+| `t.notCalledTool(name)`                             | No call to `name`                                                                 |
+| `t.toolOrder([...names])`                           | Tool names appear in order (other calls may interleave)                           |
+| `t.usedNoTools()`                                   | No tool calls at all                                                              |
+| `t.maxToolCalls(n)`                                 | At most `n` tool calls                                                            |
+| `t.noFailedActions()`                               | No tool, subagent, or skill action reported a failure                             |
+| `t.calledSubagent(name, opts?)`                     | A subagent delegation happened (`remoteUrl`, `output` constraints)                |
+| `t.event(predicate, label)`                         | Escape hatch: any predicate over the typed event stream                           |
+`t.completed()` subsumes `t.didNotFail()` — reach for `completed` unless you specifically want to allow a parked run.
+```ts
+await t.send("What is the weather in Brooklyn?");
+t.completed();
+t.calledTool("get_weather");
+t.usedNoTools(); // mutually exclusive with the line above — pick the one you mean
+```
+## Value assertions with `t.check`
+`t.check(value, assertion)` grades an explicit value against a builder from `eve/evals/expect`. The value can be `t.reply`, a turn's `.message`, parsed JSON, or any local you computed:
+```ts
+import { includes, equals, matches, similarity } from "eve/evals/expect";
+t.check(t.reply, includes("sunny")); // substring (gate)
+t.check(parsed, equals({ city: "Brooklyn" })); // deep structural equality (gate)
+t.check(parsed, matches(WeatherSchema)); // Standard Schema, e.g. Zod (gate)
+t.check(t.reply, similarity("Sunny, 72F")); // fuzzy 0–1 Levenshtein (soft)
+```
+| Builder                | Scores                                           | Default |
+| ---------------------- | ------------------------------------------------ | ------- |
+| `includes(substring)`  | value (coerced to string) contains `substring`   | gate    |
+| `equals(value)`        | deep structural equality                         | gate    |
+| `matches(schema)`      | validates against a Standard Schema              | gate    |
+| `similarity(expected)` | normalized Levenshtein similarity, 1 = identical | soft    |
+Pick the cheapest builder that captures what "correct" means. When exact match is too strict but a judge model is overkill, `similarity` is the middle ground; for nuanced grading, reach for the [judge](./judge).
+## The matcher mini-language
+`t.calledTool` and `t.calledSubagent` take a matcher object — `{ input, output, isError, times }` for tools, `{ remoteUrl, output }` for subagents. Each field accepts a literal (objects partial-deep-match), a RegExp, or a function. A matcher function receives the value and returns either a boolean (acts as a predicate) or an expected value to compare against (handy for runner-assigned values like environment-provided URLs):
+```ts
+t.calledTool("bash", { input: { command: /^pwd/ }, isError: false, times: 1 });
+t.calledTool("echo", { output: (value) => String(value).includes(marker) });
+t.calledSubagent("weather", {
+  remoteUrl: () => process.env.WEATHER_AGENT_URL!,
+  output: /72F/,
+});
+```
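The three field forms described above (literal, RegExp, function) can be modeled with a small recursive check. The sketch below is an assumption about the semantics, not Eve's matcher: `fieldMatches` and `isPlainObject` are hypothetical names, arrays are compared element by element, and primitives use `Object.is`.

```typescript
// Illustrative model of the matcher semantics; not Eve's implementation.
function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v) && !(v instanceof RegExp);
}

function fieldMatches(matcher: unknown, actual: unknown): boolean {
  // RegExp: test against the stringified value.
  if (matcher instanceof RegExp) return matcher.test(String(actual));
  // Function: a boolean result is a predicate verdict; any other result is
  // treated as the expected value and compared recursively.
  if (typeof matcher === "function") {
    const result = (matcher as (value: unknown) => unknown)(actual);
    return typeof result === "boolean" ? result : fieldMatches(result, actual);
  }
  // Object literal: partial deep match. Every key in the matcher must match;
  // extra keys on the actual value are ignored.
  if (isPlainObject(matcher)) {
    if (!isPlainObject(actual)) return false;
    return Object.entries(matcher).every(([key, sub]) => fieldMatches(sub, actual[key]));
  }
  // Array (assumed): same length, element-wise match.
  if (Array.isArray(matcher)) {
    return (
      Array.isArray(actual) &&
      matcher.length === actual.length &&
      matcher.every((item, i) => fieldMatches(item, actual[i]))
    );
  }
  // Primitive literal: strict identity.
  return Object.is(matcher, actual);
}
```

One consequence of this model: a function that returns a boolean is always read as a predicate, so it cannot double as an "expected value" for a boolean field. A literal such as `isError: false` covers that case.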
+## Run state and derived facts
+A turn that leaves the session open for a next message is the normal end state of a successful turn. Parking on unanswered HITL input is tracked separately — that is what `t.completed()` and `t.waiting()` key off.
+Beyond the raw `t.events` stream, the runner derives typed facts the assertions read: tool calls (name, input, output, error state), subagent calls, and HITL input requests. The built-in assertions cover almost everything; when you need to read the stream directly, `t.event(predicate, label)` is the escape hatch:
+```ts
+t.event(
+  (events) =>
+    events.some((e) => e.type === "message.completed" && e.data.message?.includes(marker)),
+  "assistant reply includes the marker",
+);
+```
+## Severity
+Every assertion returns a chainable handle. Severity rides on the assertion — there is no separate thresholds map to keep in sync.
+- `.gate(threshold?)` — hard. A miss marks the eval `failed` and `eve eval` exits non-zero.
+- `.soft(threshold?)` — tracked data. A below-threshold miss marks the eval `scored`, fatal only under `--strict`. With no threshold, it is tracked-only and never fails.
+- `.atLeast(threshold)` — soft with a bar (equivalent to `.soft(threshold)`).
+The defaults are chosen so you rarely set severity. Run-level methods and `includes`/`equals`/`matches` are gates; `similarity` and every `t.judge.*` assertion are soft. Annotate only when you deviate:
+```ts
+t.calledTool("get_weather").soft(); // record the tool call as a metric, don't gate
+t.check(t.reply, similarity("Sunny")).atLeast(0.8); // soft with a 0.8 bar: a miss is fatal only under --strict
+t.check(t.reply, includes("error")).soft(); // track without failing the build
+```
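The severity rules above can be read as a small fold from assertion outcomes to a verdict. The sketch below is a hedged model, not the runner's code: the `Outcome` shape, the `passed` verdict name, the default gate bar of 1, and the exit code value 1 are all assumptions (the text only names `failed`, `scored`, and a non-zero exit).

```typescript
// Illustrative model only; types and defaults here are assumptions.
type Severity = "gate" | "soft";
type Verdict = "passed" | "scored" | "failed";

interface Outcome {
  severity: Severity;
  score: number; // 0..1
  threshold?: number; // optional bar
}

function verdictFor(outcomes: readonly Outcome[]): Verdict {
  let verdict: Verdict = "passed";
  for (const o of outcomes) {
    if (o.severity === "gate") {
      // Assumed: a gate with no explicit threshold requires a full score.
      if (o.score < (o.threshold ?? 1)) return "failed";
    } else if (o.threshold !== undefined && o.score < o.threshold) {
      // A soft assertion with no threshold is tracked only and never demotes.
      verdict = "scored";
    }
  }
  return verdict;
}

function exitCode(verdicts: readonly Verdict[], strict: boolean): number {
  const fatal = (v: Verdict): boolean => v === "failed" || (strict && v === "scored");
  return verdicts.some(fatal) ? 1 : 0;
}
```

Read this way, `.atLeast(0.8)` on a `similarity` check only changes the build result when `--strict` is set, while `.soft()` on `includes` removes that check from the failure path entirely.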
+## What to read next
+- [Judge](./judge): LLM-graded assertions with thresholds
+- [Cases](./cases): where assertions attach
+- [Running evals](./running): how verdicts map to exit codes

package/dist/docs/public/evals/cases.mdx ADDED

@@ -0,0 +1,143 @@
+---
+title: "Cases"
+description: "Author single-turn and multi-turn evals with test(t), and fan one file out over a dataset."
+---
+Each eval file is one graded case. The runner executes its `test(t)` function against the target, captures every event, and computes a verdict from the [assertions](./assertions) you recorded. Every eval — single-turn, multi-turn, HITL, or dataset-driven — is the same shape: one `async test(t)` function that drives the agent and asserts inline.
+## Single-turn evals
+The common case sends one turn and asserts on the reply. `t.send(input)` resolves once the turn settles; `t.reply` is the last assistant message:
+```ts title="evals/weather/brooklyn-forecast.eval.ts"
+import { defineEval } from "eve/evals";
+import { includes } from "eve/evals/expect";
+export default defineEval({
+  async test(t) {
+    await t.send("What is the weather in Brooklyn?");
+    t.completed();
+    t.check(t.reply, includes("Sunny"));
+  },
+});
+```
+Some evals only care about behavior, not text — assert on the run and skip the content check entirely:
+```ts title="evals/weather/no-tools-for-greetings.eval.ts"
+import { defineEval } from "eve/evals";
+export default defineEval({
+  async test(t) {
+    await t.send("Hello!");
+    t.completed();
+    t.notCalledTool("get_weather");
+  },
+});
+```
+## Organizing with directories
+Identity is the file path, so directories are the grouping mechanism. `evals/weather/brooklyn-forecast.eval.ts` gets the id `weather/brooklyn-forecast`, and `eve eval weather` runs everything under `evals/weather/`. Shared constants and helpers live in sibling non-eval files (any name that doesn't end in `.eval.ts`):
+```text
+evals/
+├── weather/
+│   ├── shared.ts                    # helpers — not an eval
+│   ├── brooklyn-forecast.eval.ts
+│   └── no-tools-for-greetings.eval.ts
+└── smoke.eval.ts
+```
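The path-to-id rule can be sketched as a pure function. This is a guess at the mapping from the single example given (`evals/weather/brooklyn-forecast.eval.ts` to `weather/brooklyn-forecast`), not the runner's resolver. `evalIdFromPath` and `underDirectory` are hypothetical names, and treating `eve eval weather` as a directory-prefix filter is an assumption.

```typescript
// Illustrative only: derives an eval id from a path relative to the app root.
function evalIdFromPath(filePath: string): string | null {
  // Only files under evals/ that end in .eval.ts count as evals.
  const match = /^evals\/(.+)\.eval\.ts$/.exec(filePath);
  return match?.[1] ?? null;
}

// Assumed selector semantics: a directory name selects every id beneath it.
function underDirectory(directory: string, id: string): boolean {
  return id.startsWith(directory + "/");
}
```

With the tree above, `shared.ts` yields `null` and is never collected, while both files in `evals/weather/` yield ids that start with `weather/`.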
+## Multi-turn evals
+Drive several turns in sequence — branching, HITL approvals, structured output, attachments, multiple sessions. Because assertions live in the function, an intermediate value is just a local variable: judge a draft before the next turn overwrites it, then keep going.
+```ts title="evals/draft-then-send.eval.ts"
+import { defineEval } from "eve/evals";
+import { includes } from "eve/evals/expect";
+export default defineEval({
+  async test(t) {
+    const draft = await t.send("Draft the follow-up email.");
+    t.check(draft.message, includes("Best regards"));
+    t.judge.autoevals.closedQA("professional tone", { on: draft.message }).atLeast(0.6);
+    await t.send("Now send it.");
+    t.calledTool("send_email");
+  },
+});
+```
+Bespoke preconditions that no built-in assertion expresses are plain `throw`s — a thrown error marks the eval `failed` with the message in the result:
+```ts title="evals/session-continuity.eval.ts"
+import { defineEval } from "eve/evals";
+import { includes } from "eve/evals/expect";
+export default defineEval({
+  requires: ["mockModels"],
+  async test(t) {
+    await t.send("My favorite word is marigold.");
+    const firstSessionId = t.sessionId;
+    const second = await t.send("Thanks for remembering.");
+    second.expectOk();
+    if (t.sessionId !== firstSessionId) {
+      throw new Error(`Expected one session; got ${firstSessionId} then ${t.sessionId}.`);
+    }
+    t.completed();
+    t.check(second.message, includes("Thanks for remembering."));
+  },
+});
+```
+## The drive API
+`t` drives the primary session; `t.newSession()` returns an independent `EveEvalSession` against the same target, whose events feed the same run-level assertions.
+- `t.send(input)` sends a turn and waits for it to settle. It accepts the same input as `ClientSession.send()` — a string or a structured message — and resolves to a turn carrying `.message` and `.expectOk()`.
+- `t.sendFile(text, path, mediaType?)` attaches a local file as a data URL.
+- `t.expectInputRequests(filter?)` asserts the previous turn parked on HITL input and returns the pending requests.
+- `t.respond(...responses)` answers specific pending input requests and sends them as the next turn.
+- `t.respondAll(optionId)` answers every pending input request with the same option and sends the responses as the next turn.
+- `t.reply` is the last assistant message (or `null`); `t.sessionId` is the current session id; `t.events` is the full typed event stream captured so far.
+Each `send` (and `respond`/`respondAll`) resolves to a turn whose `expectOk()` throws only when the turn ended in a failed state. A session left open for a next message is the normal end state of a successful turn, so `expectOk()` does not throw for it.
+Events from every session are captured in the result and artifacts. `t.log(message)` records debug lines into the eval artifact; `--verbose` also streams them to stdout as evals run. `t.signal` is an `AbortSignal` that fires on timeout.
+For driving sessions created outside the eval — by a channel webhook or a schedule — see [Targets and requirements](./targets).
+## Datasets: exporting an array
+To fan one file out over a dataset, default-export an array of `defineEval(...)` values. Eval modules are ESM, so top-level `await` can load anything. Ids derive from the file name plus a zero-padded index (`sql/0000`, `sql/0001`, …, in array order). The loaders (`loadJson`, `loadYaml` from `eve/evals/loaders`) parse fixture files relative to the app root:
+```ts title="evals/sql.eval.ts"
+import { defineEval } from "eve/evals";
+import { loadYaml } from "eve/evals/loaders";
+import { equals } from "eve/evals/expect";
+const doc = await loadYaml("evals/data/cases.yaml");
+const rows = doc.evals as readonly { task: string; prompt: string; sql: string }[];
+export default rows.map((row) =>
+  defineEval({
+    description: row.task,
+    async test(t) {
+      await t.send(row.prompt);
+      t.completed();
+      t.check(t.reply, equals(row.sql));
+    },
+  }),
+);
+```
+The loaders are meant for fixtures, not runtime agent code.
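For reference, the id scheme described for array exports (`sql/0000`, `sql/0001`, and so on) can be reproduced with a one-line helper. `datasetIds` is a hypothetical name, and the fixed width of 4 is inferred from the examples; how ids look past index 9999 is not stated in the source.

```typescript
// Illustrative only: reproduces the documented id pattern for a dataset file.
function datasetIds(fileId: string, count: number, width = 4): string[] {
  // One id per array element, in array order, with a zero-padded index.
  return Array.from({ length: count }, (_, i) => `${fileId}/${String(i).padStart(width, "0")}`);
}
```

Because ids follow array order, reordering rows in the fixture file changes which id maps to which case.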
+## What to read next
+- [Assertions](./assertions): assert on what the eval did
+- [Judge](./judge): grade quality with an LLM judge
+- [TypeScript client](../guides/client/messages): the send/turn protocol eval sessions build on