npm - eve - Versions diffs - 0.6.0-beta.9 → 0.7.2 - Mend

eve 0.6.0-beta.9 → 0.7.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (650)

package/dist/docs/public/evals/judge.mdx ADDED

@@ -0,0 +1,94 @@
+---
+title: "Judge"
+description: "Grade evals with an LLM judge via t.judge.autoevals, set thresholds on the assertion, and configure the judge model."
+---
+When no deterministic [assertion](./assertions) captures what "good" means — factual correctness, summary quality, free-form criteria — grade the run with an LLM judge. The `t.judge.*` assertions are the only model-backed ones, and they use a judge model that is resolved separately from the agent under test: Eve only uses it for scoring, never to swap out the agent.
+```ts
+import { defineEval } from "eve/evals";
+export default defineEval({
+  async test(t) {
+    await t.send("Explain quantum tunneling to a 10-year-old.");
+    t.completed();
+    t.judge.autoevals.closedQA("uses no math beyond arithmetic").atLeast(0.8);
+  },
+});
+```
+## The graders
+The judges live under `t.judge.autoevals` — the namespace names the [Braintrust autoevals](https://github.com/braintrustdata/autoevals) grader family, so the factuality and closedQA semantics are autoevals', not Eve-invented. Each grades `t.reply` by default and is soft by default (tracked, no gate):
+| Grader                                   | Grades                                                                                 |
+| ---------------------------------------- | -------------------------------------------------------------------------------------- |
+| `t.judge.autoevals.factuality(expected)` | Factual consistency of the reply against an expected answer (A–E buckets)              |
+| `t.judge.autoevals.summarizes(expected)` | How well the reply summarizes the expected text                                        |
+| `t.judge.autoevals.closedQA(criteria)`   | Whether the reply satisfies a free-form yes/no criterion (no expected answer to match) |
+| `t.judge.autoevals.sql(expected)`        | Semantic equivalence of two SQL statements                                             |
+The reference or criterion is the first positional argument. An optional options object follows:
+- `on` — the value to grade, defaulting to `t.reply`. Pass an intermediate draft or parsed value to grade it instead.
+- `model` / `modelOptions` — a per-call judge override (see below).
+```ts
+const draft = await t.send("Draft the welcome email.");
+t.judge.autoevals.closedQA("professional tone", { on: draft.message }).atLeast(0.6);
+```
+## Soft scoring and thresholds
+Judge assertions are soft, so the threshold rides on the assertion handle — there is no separate thresholds map:
+- **No threshold** — tracked-only. The score lands in reports and artifacts and never fails the eval. Use it to watch a metric without gating on it.
+- `.atLeast(threshold)` — a soft bar. A below-threshold score marks the eval `scored`, fatal only under `eve eval --strict`.
+- `.gate(threshold)` — promote a judge to a hard gate that fails the eval outright.
+```ts
+t.judge.autoevals.closedQA("cites a source"); // tracked, never fails
+t.judge.autoevals.closedQA("cites a source").atLeast(0.6); // soft, fails under --strict below 0.6
+t.judge.autoevals.factuality(reference).gate(0.8); // hard gate at 0.8
+```
+A judge runs once per assertion and burns tokens, so reach for one only when nothing deterministic will do. Several slow judge calls in one eval can fan out with `await Promise.all([...])`.
+## Configuring the judge model
+The judge model is resolved once when the runner builds `t`. It is **never** the model under test. Three levels are consulted, and the innermost one that is set wins:
+1. **Per-call** — `t.judge.autoevals.closedQA("…", { model, modelOptions })`.
+2. **Per-eval** — `defineEval({ judge: { model, modelOptions }, test })`.
+3. **Project default** — `defineEvalConfig({ judge: { model, modelOptions } })` in `evals.config.ts`.
+```ts title="evals/evals.config.ts"
+import { defineEvalConfig } from "eve/evals";
+export default defineEvalConfig({
+  judge: { model: "openai/gpt-5.4-mini" }, // the default judge for every eval in this tree
+});
+```
+```ts title="evals/quantum.eval.ts"
+import { defineEval } from "eve/evals";
+export default defineEval({
+  judge: { model: "anthropic/claude-opus-4.8" }, // a stronger judge for this eval
+  async test(t) {
+    await t.send("Explain quantum tunneling to a 10-year-old.");
+    t.judge.autoevals.factuality(reference).atLeast(0.7);
+    t.judge.autoevals.closedQA("is concise", { model: "anthropic/claude-haiku-4-5" }); // cheaper, per-call
+  },
+});
+```
+`judge` in `evals.config.ts` is optional — a tree of fully deterministic evals can omit it. But calling `t.judge.*` with no judge model resolved is a fail-fast error at eval definition time.
+A **string model id** (e.g. `"anthropic/claude-opus-4.8"`) routes through the Vercel AI Gateway and needs `AI_GATEWAY_API_KEY` or `VERCEL_OIDC_TOKEN` in the environment; an **AI SDK `LanguageModel` instance** is used directly. With a model configured but no credentials, a judge-backed eval **skips visibly** like other real-model legs, so mock-model fixture runs stay green. For provider-specific judge settings, use `modelOptions.providerOptions`.
+## What to read next
+- [Assertions](./assertions): deterministic run-level and value assertions
+- [Reporters](./reporters): ship judged scores to Braintrust experiments
+- [Targets and requirements](./targets): gating judge-backed evals on credentials
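
The innermost-wins judge resolution described in `judge.mdx` above can be sketched in a few lines. This is a minimal illustration and not Eve's implementation: the `JudgeConfig` type and the `resolveJudge` helper are hypothetical names, and the sketch assumes each level's config is taken as a whole rather than merged field by field.

```typescript
// Hypothetical shape for illustration; not Eve's actual type.
type JudgeConfig = { model: string; modelOptions?: Record<string, unknown> };

// Pick the innermost level that is set: per-call, then per-eval, then project default.
// Assumption: a level's config is used as a unit, not merged with outer levels.
function resolveJudge(
  perCall: JudgeConfig | undefined,
  perEval: JudgeConfig | undefined,
  projectDefault: JudgeConfig | undefined,
): JudgeConfig {
  const resolved = perCall ?? perEval ?? projectDefault;
  if (resolved === undefined) {
    // Models the documented fail-fast error when no judge model is resolved.
    throw new Error("t.judge.* was called but no judge model is configured");
  }
  return resolved;
}

const projectDefault: JudgeConfig = { model: "openai/gpt-5.4-mini" };
const perEval: JudgeConfig = { model: "anthropic/claude-opus-4.8" };
const perCall: JudgeConfig = { model: "anthropic/claude-haiku-4-5" };

console.log(resolveJudge(perCall, perEval, projectDefault).model); // prints: anthropic/claude-haiku-4-5
console.log(resolveJudge(undefined, perEval, projectDefault).model); // prints: anthropic/claude-opus-4.8
console.log(resolveJudge(undefined, undefined, projectDefault).model); // prints: openai/gpt-5.4-mini
```

With all three levels unset the helper throws, which is how the sketch models a judge-backed eval in a tree whose `evals.config.ts` omits `judge`.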

package/dist/docs/public/evals/meta.json ADDED

@@ -0,0 +1,4 @@
+{
+  "title": "Evals",
+  "pages": ["overview", "cases", "assertions", "judge", "targets", "reporters", "running"]
+}

package/dist/docs/public/evals/overview.mdx ADDED

@@ -0,0 +1,118 @@
+---
+title: "Overview"
+description: "Define repeatable scored checks for an Eve agent with defineEval and run them with eve eval."
+---
+An eval is a scored check that runs your agent against real sessions and grades the result. Use it to catch regressions when you change a prompt or a tool: drive the agent through one or more turns, then assert on what it did — the run completed, the right tool ran, the reply contains the right text — and optionally ship the results to Braintrust.
+Evals exercise the same HTTP surface your users hit. The runner boots (or targets) a real agent server, drives sessions through the [TypeScript client](../guides/client/overview) protocol, and grades what comes back — so a passing eval means the agent actually booted, accepted a request, and produced the result you asserted.
+## `defineEval`
+Eve discovers evals under the app-root `evals/` directory, in `.eval.ts` files. Each file is exactly one eval — one graded case. The file path is the eval's identity, so you don't author an `id` or `name`; directories group related evals (`evals/weather/brooklyn-forecast.eval.ts` → id `weather/brooklyn-forecast`).
+```text
+my-agent/
+├── agent/
+├── evals/
+│   ├── evals.config.ts
+│   ├── smoke.eval.ts
+│   └── weather/
+│       ├── brooklyn-forecast.eval.ts
+│       └── no-tools-for-greetings.eval.ts
+└── package.json
+```
+An eval is a single `async test(t)` function. You drive the agent with `t` and assert on the run with the same `t`:
+```ts title="evals/weather/brooklyn-forecast.eval.ts"
+import { defineEval } from "eve/evals";
+import { includes } from "eve/evals/expect";
+export default defineEval({
+  description: "Basic message and tool-usage coverage for the weather agent.",
+  async test(t) {
+    await t.send("What is the weather in Brooklyn?");
+    t.completed();
+    t.calledTool("get_weather");
+    t.check(t.reply, includes("Sunny"));
+  },
+});
+```
+`test` is the only required field. The rest are optional: `description`, `requires`, `judge`, `tags`, `metadata`, `timeoutMs`, `reporters`. The init template adds `evals/**/*.ts` to `tsconfig.json`, so your eval code type-checks alongside the app.
+## `evals.config.ts`
+Every `evals/` directory needs exactly one `evals.config.ts` at its root. It declares the defaults every eval shares:
+```ts title="evals/evals.config.ts"
+import { defineEvalConfig } from "eve/evals";
+import { Braintrust } from "eve/evals/reporters";
+export default defineEvalConfig({
+  judge: { model: "openai/gpt-5.4-mini" },
+  reporters: [Braintrust({ projectName: "my-agent" })],
+});
+```
+Every field is optional. `judge` sets the default model for [LLM-as-judge](./judge) assertions (`t.judge.*`). Only evals that use them need it, so a tree of fully deterministic evals can omit it entirely. `reporters`, `maxConcurrency`, and `timeoutMs` round out the defaults. Config `reporters` observe every eval in the run, so set one `Braintrust()` here instead of adding it to each eval. CLI flags (`--max-concurrency`, `--timeout`) and per-eval values take precedence over the config defaults.
+## The `t` context
+`t` is both the driver and the assertion surface. There are no separate `input`, `run`, `checks`, or `scores` fields — you write ordinary control flow, sending turns and asserting inline.
+- **Drive** the agent: `t.send(...)`, `t.respond(...)`, `t.respondAll(...)`, `t.sendFile(...)`, `t.expectInputRequests(...)`, `t.newSession()`. Read what came back with `t.reply` (the last assistant message), `t.sessionId`, and `t.events`. See [Cases](./cases).
+- **Assert** with three surfaces, covered next.
+## Three assertion surfaces
+Each surface matches a genuinely different kind of judgment:
+- **Run-level methods** read the whole run — `t.completed()`, `t.calledTool("get_weather")`, `t.usedNoTools()`, `t.toolOrder([...])`. They take no value because they observe the run itself. See [Assertions](./assertions).
+- **`t.check(value, assertion)`** grades an explicit value with a deterministic builder from `eve/evals/expect` — `t.check(t.reply, includes("sunny"))`. Grade `t.reply`, an intermediate draft, parsed JSON — anything. See [Assertions](./assertions).
+- **`t.judge.autoevals.*`** is the LLM-as-judge surface — `t.judge.autoevals.closedQA("cites a source")`. It grades `t.reply` by default and uses the configured judge model, never the agent under test. See [Judge](./judge).
+## Gate vs soft
+Every assertion returns a chainable handle, so severity rides on the assertion itself — there is no separate thresholds map:
+- **Gates** are hard. A failed gate marks the eval `failed` and `eve eval` exits non-zero. Run-level methods, `includes`, `equals`, and `matches` are gates by default.
+- **Soft** assertions are tracked data. They land in reports and artifacts, and a below-threshold soft assertion marks the eval `scored` — visible but not fatal, unless you pass `--strict`. `similarity` and every `t.judge.*` assertion are soft by default. A soft assertion with no threshold is tracked-only and never fails.
+Override per assertion: `.gate(threshold?)` promotes to a hard gate, `.soft(threshold?)` demotes to tracked, and `.atLeast(threshold)` is a soft assertion with a bar.
+```ts
+t.completed(); // gate
+t.calledTool("get_weather").soft(); // record as a metric, don't gate
+t.judge.autoevals.closedQA("cites a source"); // soft, tracked (no threshold)
+t.judge.autoevals.factuality(reference).atLeast(0.7); // soft, fails under --strict below 0.7
+```
+## Run it
+```bash
+eve eval                       # run all discovered evals against a local dev server
+eve eval weather               # run one eval, or every eval under evals/weather/
+eve eval --url https://<app>   # target an existing server or deployment
+```
+Exit code `0` means every eval passed its gates. See [Running evals](./running) for the full flag list, exit codes, and CI guidance.
+## A good baseline
+Most apps do fine with a few small smoke evals. Assert behavior with `t.completed()` plus one or two content checks, keep dataset fixtures in `evals/data/`, and only reach for a judge or Braintrust once you actually need fuzzy grading or shared result review. In CI, run `eve eval --strict` so soft threshold misses fail the build too.
+The rest of this section covers each piece:
+- [Cases](./cases): single-turn evals, scripted multi-turn evals, and dataset fan-out
+- [Assertions](./assertions): run-level methods and `t.check` value assertions, with matchers and severity
+- [Judge](./judge): LLM-as-judge grading and the judge model
+- [Targets and requirements](./targets): local vs remote targets, and gating evals on capabilities
+- [Reporters](./reporters): Braintrust experiments and JUnit XML
+- [Running evals](./running): the `eve eval` CLI, exit codes, and artifacts
+## What to read next
+- [Cases](./cases): author your first evals
+- [Tools](../tools): the surface most evals assert on
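
The gate-versus-soft rules in `overview.mdx` above reduce to a small classification over assertion results. The sketch below is an illustration under stated assumptions, not Eve's code: the `AssertionResult` shape and the `classify` function are hypothetical, and a gate with no explicit threshold is assumed to pass only on a score of 1.

```typescript
// Hypothetical shapes for illustration; not Eve's actual types.
type Severity = "gate" | "soft";
type AssertionResult = { name: string; severity: Severity; score: number; threshold?: number };
type Verdict = "passed" | "scored" | "failed";

function classify(results: AssertionResult[]): Verdict {
  let verdict: Verdict = "passed";
  for (const r of results) {
    if (r.severity === "gate") {
      // Assumption: a gate without a threshold is binary and needs a full score.
      if (r.score < (r.threshold ?? 1)) return "failed";
    } else if (r.threshold !== undefined && r.score < r.threshold) {
      // A soft assertion below its bar is visible but not fatal by itself.
      verdict = "scored";
    }
    // A soft assertion with no threshold is tracked-only and never changes the verdict.
  }
  return verdict;
}

const softMiss: AssertionResult[] = [
  { name: "completed", severity: "gate", score: 1 },
  { name: "factuality", severity: "soft", score: 0.5, threshold: 0.7 },
];
const gateMiss: AssertionResult[] = [{ name: "calledTool", severity: "gate", score: 0 }];
const trackedOnly: AssertionResult[] = [{ name: "closedQA", severity: "soft", score: 0.2 }];

console.log(classify(softMiss)); // prints: scored
console.log(classify(gateMiss)); // prints: failed
console.log(classify(trackedOnly)); // prints: passed
```

Whether a `scored` verdict fails the run is a separate decision made by the `--strict` flag, which is why the sketch keeps it distinct from `failed`.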

package/dist/docs/public/evals/reporters.mdx ADDED

@@ -0,0 +1,62 @@
+---
+title: "Reporters"
+description: "Ship eval results to Braintrust experiments or JUnit XML — Eve runs and scores everything itself."
+---
+Eve runs and grades everything itself; reporters just ship the results out. The CLI prints a console summary by default — one line per eval, failed assertions with their messages — and reporters from `eve/evals/reporters` add destinations on top.
+Reporters attach in two places. Declare them in `evals.config.ts` to observe **every** eval in the run — the usual choice for a shared destination like one Braintrust experiment, so you don't repeat the reporter in each file. Or list them on an individual eval's `reporters` to scope a destination to that eval (or to a group of evals that share one instance).
+## Braintrust
+`Braintrust(...)` uploads eval results to Braintrust experiments. Put one instance in the config so it covers the whole run:
+```ts title="evals/evals.config.ts"
+import { defineEvalConfig } from "eve/evals";
+import { Braintrust } from "eve/evals/reporters";
+export default defineEvalConfig({
+  judge: { model: "openai/gpt-5.4-mini" },
+  reporters: [Braintrust({ projectName: "weather-agent" })],
+});
+```
+Need a destination for only some evals? Attach it per eval instead:
+```ts title="evals/brooklyn-forecast.eval.ts"
+import { defineEval } from "eve/evals";
+import { Braintrust } from "eve/evals/reporters";
+export default defineEval({
+  reporters: [Braintrust({ projectName: "weather-agent" })],
+  async test(t) {
+    await t.send("What is the weather in Brooklyn?");
+    t.completed();
+  },
+});
+```
+The reporter config takes an optional `projectName` and `experimentName`, plus a base experiment (by name or id) to diff against. Gate assertions log as binary scores under a `gate:` prefix so experiments diff gate regressions the same way they diff soft-score regressions. Eval `metadata` rides along to reporters.
+A reporter instance observes the evals that reference it: share one instance across several evals — the config, a `shared.ts` export, or every entry of a dataset array — and their results land in a single experiment. Listing the same config reporter on an eval too does not double-report it.
+Braintrust needs its SDK installed in the app and credentials in the environment. Pass `--skip-report` to run the eval without shipping results (this also suppresses config reporters) — useful locally when iterating.
+## JUnit
+`JUnit({ filePath })` writes JUnit XML for CI annotations. The `--junit <path>` CLI flag does the same thing without touching the eval file, which is usually the better fit — CI owns the output path, not the eval:
+```bash
+eve eval --strict --junit .eve/junit.xml
+```
+Each eval becomes one `<testcase>` named by its path-derived id; failed gates and execution errors land as failure messages on the matching test case, so CI surfaces them inline.
+## Custom reporters
+A reporter implements the `EvalReporter` interface from `eve/evals/reporters` and receives the same structured results the built-ins do. Reach for one only when a destination isn't covered — the per-run artifacts under `.eve/evals/` already capture everything for ad-hoc inspection.
+## What to read next
+- [Running evals](./running): console output, `--json`, and artifacts
+- [Judge](./judge): what the reported numbers mean
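
The JUnit mapping described in `reporters.mdx` above (one `<testcase>` per eval, named by its id, with failures attached as messages) might look roughly like this. It is a sketch and not the built-in reporter: `EvalResult`, `toJUnitXml`, and the suite name `eve-evals` are hypothetical, and all failure messages for an eval are joined into a single `<failure>` element.

```typescript
// Hypothetical result shape for illustration; not Eve's EvalReporter payload.
type EvalResult = { id: string; failures: string[] };

// Escape the characters that are unsafe inside XML attribute values.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function toJUnitXml(results: EvalResult[]): string {
  const failed = results.filter((r) => r.failures.length > 0).length;
  const cases = results.map((r) => {
    const name = escapeXml(r.id);
    if (r.failures.length === 0) return `  <testcase name="${name}"/>`;
    const message = escapeXml(r.failures.join("; "));
    return `  <testcase name="${name}">\n    <failure message="${message}"/>\n  </testcase>`;
  });
  return [
    `<?xml version="1.0" encoding="UTF-8"?>`,
    `<testsuite name="eve-evals" tests="${results.length}" failures="${failed}">`,
    ...cases,
    `</testsuite>`,
  ].join("\n");
}

const xml = toJUnitXml([
  { id: "smoke", failures: [] },
  { id: "weather/brooklyn-forecast", failures: ['expected reply to include "Sunny"'] },
]);
console.log(xml);
```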

package/dist/docs/public/evals/running.mdx ADDED

@@ -0,0 +1,63 @@
+---
+title: "Running Evals"
+description: "The eve eval CLI: flags, filters, exit codes, artifacts, and how to wire evals into CI."
+---
+`eve eval` discovers every `.eval.ts` file under `evals/`, boots a local dev server (or targets a remote one), runs the evals concurrently, and prints a per-eval summary.
+```bash
+eve eval                       # run all discovered evals locally
+eve eval weather smoke         # run selected evals (an id, or a directory prefix)
+eve eval --url https://<app>   # target a remote app instead of a local host
+eve eval --mock-models         # local dev target uses deterministic mock models
+eve eval --tag fast            # only evals carrying a tag
+eve eval --strict              # soft below-threshold assertions also fail the exit code
+eve eval --no-skips            # unmet requirements fail instead of skipping
+eve eval --timeout 60000       # per-eval timeout in milliseconds
+eve eval --max-concurrency 4   # cap concurrent eval executions (default 8)
+eve eval --junit .eve/junit.xml  # write JUnit XML
+eve eval --list                # print discovered evals without running
+eve eval --verbose             # stream per-eval t.log lines to stdout
+eve eval --json                # machine-readable output
+eve eval --skip-report         # skip config and eval-defined reporters (e.g. Braintrust)
+```
+Positional ids match exactly or by directory prefix: `eve eval weather` runs `evals/weather.eval.ts`, every eval under `evals/weather/`, and every entry of an array-exported `weather.eval.ts`.
+## Exit codes
+| Code | Means                                                                           |
+| ---- | ------------------------------------------------------------------------------- |
+| `0`  | Every eval passed its gates (and soft thresholds, under `--strict`)             |
+| `1`  | Any eval failed — a failed gate, an execution error, or a strict threshold miss |
+| `2`  | Configuration error                                                             |
+Unmet [requirements](./targets) skip visibly without affecting the exit code unless you pass `--no-skips`.
+## Artifacts
+Each run drops artifacts under `.eve/evals/<timestamp>/`: a run `summary.json`, a `results.jsonl` index, and per-eval assertion results, verdicts, captured event streams, and `t.log` lines under `evals/`. The console output stays tight on purpose; when an eval fails, the artifact has the full story.
+## CI
+A solid CI invocation is strict, deterministic, and machine-reportable:
+```bash
+eve eval --strict --mock-models --junit .eve/junit.xml
+```
+- `--strict` turns soft threshold misses into failures, so score regressions block the merge.
+- `--mock-models` keeps the default leg deterministic and credential-free. Put real-model evals in their own files gated on `requires: ["env:..."]`, and add `--no-skips` on legs that must prove those ran.
+- `--junit` gives the CI provider per-eval annotations; upload the `.eve/evals/` directory as a failure artifact for the full event streams.
+Against a deployed app, swap `--mock-models` for `--url`:
+```bash
+eve eval --strict --url "$DEPLOY_URL" --junit .eve/junit.xml
+```
+## What to read next
+- [Targets and requirements](./targets): what `--url`, `--mock-models`, and `--no-skips` interact with
+- [Reporters](./reporters): Braintrust and JUnit output
+- [CLI reference](../reference/cli): the rest of the `eve` CLI
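
The id, selector, and exit-code rules in `running.mdx` above can be modeled with three small functions. This is a sketch under assumptions, not the CLI's logic: all function names are hypothetical, array-exported eval files are not modeled, exit code `2` (configuration error) is left out, and a skip under `--no-skips` is assumed to count as a failure with exit code `1`.

```typescript
type Verdict = "passed" | "scored" | "failed" | "skipped";
type Flags = { strict: boolean; noSkips: boolean };

// Derive an eval id from its path, relative to the evals/ directory.
function evalIdFromPath(path: string): string {
  return path.replace(/^evals\//, "").replace(/\.eval\.ts$/, "");
}

// A positional selector matches an id exactly or as a directory prefix.
function matchesSelector(id: string, selector: string): boolean {
  return id === selector || id.startsWith(selector + "/");
}

// Assumption: a skip under --no-skips is treated like a failed eval (exit code 1).
function exitCode(verdicts: Verdict[], flags: Flags): 0 | 1 {
  const failing = verdicts.some(
    (v) =>
      v === "failed" ||
      (v === "scored" && flags.strict) ||
      (v === "skipped" && flags.noSkips),
  );
  return failing ? 1 : 0;
}

const id = evalIdFromPath("evals/weather/brooklyn-forecast.eval.ts");
console.log(id); // prints: weather/brooklyn-forecast
console.log(matchesSelector(id, "weather")); // prints: true
console.log(matchesSelector("weather-alerts", "weather")); // prints: false
console.log(exitCode(["passed", "scored"], { strict: false, noSkips: false })); // prints: 0
console.log(exitCode(["passed", "scored"], { strict: true, noSkips: false })); // prints: 1
```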

package/dist/docs/public/evals/targets.mdx ADDED

@@ -0,0 +1,54 @@
+---
+title: "Targets and Requirements"
+description: "Point evals at a local dev server or a deployment, and gate evals on target capabilities with requires."
+---
+An eval target is always an HTTP URL. `eve eval` starts a local dev server, while `eve eval --url <url>` runs against an existing server or deployment — the same eval files work for both, which is what makes evals usable as end-to-end tests in CI.
+The runner polls `/eve/v1/health`, verifies `/eve/v1/info`, and exposes the live target as `t.target` inside the `test` function.
+## Target helpers
+```ts title="evals/heartbeat.eval.ts"
+import { defineEval } from "eve/evals";
+export default defineEval({
+  requires: ["mockModels", "devRoutes"],
+  async test(t) {
+    const { sessionIds } = await t.target.dispatchSchedule("heartbeat");
+    await t.target.attachSession(sessionIds[0]!);
+    t.completed();
+    t.calledTool("send_report");
+  },
+});
+```
+- `t.target.fetch(path, init)` performs an authenticated fetch against the target — useful for channel and webhook ingress.
+- `t.target.dispatchSchedule(id)` triggers a [schedule](../schedules) through the dev-only schedule route and returns the session ids it created. It requires the `devRoutes` capability.
+- `t.target.attachSession(sessionId, { startIndex? })` consumes one turn from a session created outside the eval — by a channel or a schedule — so its events feed the run-level assertions.
+Sessions attached this way are full `EveEvalSession`s: you can keep driving them with `send` and read their event streams. The run-level assertions on `t` (`t.completed()`, `t.calledTool(...)`) read the whole run, attached sessions included.
+## Requirements
+Use `requires` to declare assumptions the runner verifies before executing an eval:
+| Requirement    | Means                                                                 |
+| -------------- | --------------------------------------------------------------------- |
+| `"mockModels"` | `/eve/v1/info` reports the deterministic mock model adapter is active |
+| `"devRoutes"`  | `/eve/v1/info` reports dev-only routes are mounted                    |
+| `"env:NAME"`   | The eval process has environment variable `NAME` set                  |
+Unmet requirements produce a visible `skipped` verdict and do not affect the exit code. Pass `--no-skips` when a CI leg must prove full coverage.
+## Mock models
+Deterministic evals — the kind you want in CI — should not depend on a live model. `eve eval --mock-models` starts the local dev server with deterministic authored models, and `requires: ["mockModels"]` makes the dependency explicit so the eval skips instead of flaking anywhere else.
+`--mock-models` is invalid with `--url` because remote target capabilities are discovered, not set by the runner. For evals that genuinely need a real model — judging nuanced behavior, exercising provider-side tools — gate them on credentials instead (`requires: ["env:AI_GATEWAY_API_KEY"]`) and keep them in their own eval files so a tag filter can select or exclude them.
+## What to read next
+- [Running evals](./running): `--url`, `--mock-models`, and `--no-skips` in practice
+- [Schedules](../schedules): the surface `dispatchSchedule` drives
+- [Channels](../channels/overview): ingress you can exercise with `target.fetch`
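
The requirements table in `targets.mdx` above amounts to a check against the target's reported capabilities and the eval process environment. The sketch below is illustrative only: the `TargetInfo` shape stands in for whatever `/eve/v1/info` actually returns, `unmetRequirements` is a hypothetical helper, and unknown requirement strings are treated as unmet.

```typescript
// Hypothetical stand-in for the capability fields reported by /eve/v1/info.
type TargetInfo = { mockModels: boolean; devRoutes: boolean };
type Env = Record<string, string | undefined>;

// Return the requirements that are not satisfied; an empty list means the eval can run.
function unmetRequirements(requires: string[], info: TargetInfo, env: Env): string[] {
  return requires.filter((req) => {
    if (req === "mockModels") return !info.mockModels;
    if (req === "devRoutes") return !info.devRoutes;
    if (req.startsWith("env:")) return env[req.slice("env:".length)] === undefined;
    return true; // Assumption: an unrecognized requirement counts as unmet.
  });
}

const localMock: TargetInfo = { mockModels: true, devRoutes: true };
const remote: TargetInfo = { mockModels: false, devRoutes: false };

console.log(unmetRequirements(["mockModels", "devRoutes"], localMock, {})); // prints: []
console.log(unmetRequirements(["mockModels", "devRoutes"], remote, {})); // prints: [ 'mockModels', 'devRoutes' ]
console.log(unmetRequirements(["env:AI_GATEWAY_API_KEY"], remote, { AI_GATEWAY_API_KEY: "key" })); // prints: []
```

A non-empty result corresponds to the `skipped` verdict described above, or to a failure when `--no-skips` is passed.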

package/dist/docs/public/getting-started.mdx CHANGED

@@ -5,52 +5,56 @@ description: "Install Eve, scaffold your first agent, give it a tool, and run it
 Eve is a filesystem-first framework for durable agents: you write capabilities under `agent/`, and Eve runs the model loop, persists every session, and serves the agent over HTTP and platform channels. This guide gets a small app running locally and walks the current request loop end to end: build, run, message, stream, and follow up.
+## Quick start
+Run `eve init` with `npx` before Eve is installed locally:
+```bash
+npx eve@latest init my-agent
+```
+The command creates an npm-managed child directory, uses Eve's default model, installs dependencies, initializes Git, and starts the development server — the interactive [terminal UI](./guides/dev-tui) opens; type a message and watch the model loop run. Pass `--channel-web-nextjs` to add the Web Chat application; every app ships the built-in HTTP channel (`agent/channels/eve.ts`) regardless. Stop the server before editing the generated agent. The command does not create a Vercel project or deploy.
 ## Prerequisites
 - Node `24.x`
-- `pnpm`
+- npm (bundled with Node)
 You also need a model credential. Set the provider or gateway key your model string requires (gateway ids like `anthropic/claude-opus-4.8` route through the Vercel AI Gateway), or link a Vercel project that supplies one.
-## Create Your Agent
-The fastest path is the `create` CLI. It scaffolds the project, prompts for a model, and wires up an optional channel:
-```bash
-pnpm create eve@beta
-```
+## Manual installation
-The wizard asks for a model and which channel you want (Web Chat or Slack). You can skip both: every app ships the built-in HTTP channel (`agent/channels/eve.ts`) regardless. For a local chat it installs dependencies and starts the dev server for you.
+The quick start uses `eve init` for a guided scaffold. The target can also be an existing project directory (`eve init .`): the project must have a `package.json`, the `agent/` files must not exist yet, and the missing `eve`, `ai`, and `zod` dependencies are added without touching anything else the project owns. Either way the final handoff runs the `eve dev` binary through the project's package manager, never the project's own `dev` script.
-To add Eve to an existing app instead:
+To wire Eve into an existing app yourself instead, add only the dependency and author the two files the runtime needs:
 ```bash
-pnpm add eve@beta
+npm install eve@latest
 ```
-## What's In Your Project
+### Project files
-The scaffold writes two files; you add tools as you need them.
+A minimal agent is two files; you add tools as you need them.
-`agent/instructions.md` (generated) is the always-on system prompt:
+`agent/instructions.md` is the always-on system prompt:
 ```md
 You are a concise assistant. Use tools when they are available.
 ```
-`agent/agent.ts` (generated; model comes from the wizard) holds runtime config:
+`agent/agent.ts` holds runtime config:
 ```ts
 import { defineAgent } from "eve";
 export default defineAgent({
-  model: "anthropic/claude-opus-4.8",
+  model: "anthropic/claude-sonnet-4.6",
 });
 ```
-Even at this size the agent can already do real work. The default harness gives it file, shell, web, and delegation tools out of the box. See [Default harness](./advanced/default-harness) for the full set and how to override or disable any of them.
+Even at this size the agent can already do real work. The default harness gives it file, shell, web, and delegation tools out of the box. See [Default harness](./concepts/default-harness) for the full set and how to override or disable any of them.
-### Add Your First Tool
+### Add your first tool
 Whatever you name the file becomes the tool name the model sees. Create `agent/tools/get_weather.ts`:
@@ -71,12 +75,12 @@ export default defineTool({
 Tools run in your app runtime with full `process.env`, not inside the [sandbox](./sandbox). More in [Tools](./tools).
-## Run The App
+## Run the app
 From the app root:
 ```bash
-pnpm dev
+npm run dev
 ```
 Useful commands:
@@ -84,13 +88,13 @@ Useful commands:
 - `eve info`: show the active routes and compiled artifacts
 - `eve build`: compile the agent into `.eve/` and build the host output
 - `eve start`: serve the built output
-- `eve dev`: start the local runtime and open the interactive [terminal UI](./advanced/dev-tui)
+- `eve dev`: start the local runtime and open the interactive [terminal UI](./guides/dev-tui)
 In the dev TUI, type a message and watch it happen in order: the `get_weather` call, its result, then the reply.
-The same CLI can point at a deployment. `eve dev https://your-app.vercel.app` drives a deployed app, which is handy for preview and production smoke tests. See [Deployment](./advanced/deployment).
+The same CLI can point at a deployment. `eve dev https://your-app.vercel.app` drives a deployed app, which is handy for preview and production smoke tests. See [Deployment](./guides/deployment).
-## Send A Message
+## Send a message
 Every Eve app exposes the same stable HTTP API. Start a durable session:
@@ -105,7 +109,7 @@ The response comes back with two things you'll reuse:
 - a `continuationToken` in the JSON body, to resume this conversation
 - an `x-eve-session-id` header that identifies the run to stream
-## Stream The Session
+## Stream the session
 Attach to the session stream:
@@ -137,7 +141,7 @@ The stream is NDJSON. Expect lifecycle events such as:
 `message.completed.data.finishReason` tells you whether assistant text is interim tool-call narration or a terminal reply, and `step.completed.data.usage` carries token usage. When a parent delegates to a subagent, `subagent.called.data.childSessionId` gives you the child session id, so you can subscribe to that child stream and watch the delegated work.
-## Send A Follow-Up Message
+## Send a follow-up message
 When the session is waiting for the next user message, post a follow-up with the token:
@@ -147,16 +151,16 @@ curl -X POST http://127.0.0.1:3000/eve/v1/session/<sessionId> \
   -d '{"continuationToken":"<token>","message":"Now do Queens."}'
 ```
-See [Sessions, runs & streaming](./advanced/sessions-runs-and-streaming) for the full contract.
+See [Sessions, runs & streaming](./concepts/sessions-runs-and-streaming) for the full contract.
 ## Setting up with a coding agent
 If a coding agent (Claude Code, Cursor, and the like) is doing the setup, hand it this prompt:
-<CopyPrompt text="Set up an Eve agent for the user. Eve is a filesystem-first TypeScript framework for durable agents, published as the npm package eve. Read its docs: once eve is installed they are bundled in the package at node_modules/eve/dist/docs/public; before eve is installed, read the published Introduction and Getting Started pages. If the project has no Eve app, scaffold one with `pnpm create eve@beta`; to add Eve to an existing app, run `pnpm add eve@beta`. Make sure agent/agent.ts (which sets the model) and agent/instructions.md exist, then add a first typed tool at agent/tools/get_weather.ts using defineTool from eve/tools with a Zod inputSchema and an inline execute. Run it locally with `pnpm dev` (or `eve dev`), then exercise the HTTP API: create a session with POST /eve/v1/session, attach to GET /eve/v1/session/:id/stream, and send a follow-up with the returned continuationToken. Verify with the project's typecheck, adapt model and provider choices to the project, and do not commit unless the user asks.">
+<CopyPrompt text="Set up an Eve agent for the user. Eve is a filesystem-first TypeScript framework for durable agents, published as the npm package eve. Read its docs: once eve is installed they are bundled in the package at node_modules/eve/dist/docs/public; before eve is installed, read the published Introduction and Getting Started pages. If the project has no Eve app, scaffold one with `npx eve@latest init <name>`; add `--channel-web-nextjs` only when the user wants Web Chat. The init command installs dependencies, initializes Git, and starts the dev server, so run it in a controllable process and stop it before editing. To add Eve to an existing app, run `npm install eve@latest`. Make sure agent/agent.ts and agent/instructions.md exist, then add a first typed tool at agent/tools/get_weather.ts using defineTool from eve/tools with a Zod inputSchema and an inline execute. Start the dev server again, then exercise the HTTP API: create a session with POST /eve/v1/session, attach to GET /eve/v1/session/:id/stream, and send a follow-up with the returned continuationToken. Verify with the project's typecheck, adapt model and provider choices to the project, and do not commit unless the user asks.">
   Set up an Eve agent: read the Eve docs (bundled at node_modules/eve/dist/docs/public once eve is
-  installed), scaffold with `pnpm create eve@beta` (or `pnpm add eve@beta` in an existing app), add
-  a typed tool at agent/tools/get_weather.ts, run it with `pnpm dev`, then create a session, stream
+  installed), scaffold with `npx eve@latest init <name>` (or `npm install eve@latest` in an existing app), add
+  a typed tool at agent/tools/get_weather.ts, run it with `npm run dev`, then create a session, stream
   it, and send a follow-up.
 </CopyPrompt>
@@ -164,13 +168,14 @@ Once `eve` is a dependency, the full docs are bundled in the package, so the age
 - Docs: `node_modules/eve/dist/docs/public/`
-To scaffold a project end to end, `pnpm create eve@beta` collects the decisions (name, model, channels), runs setup, adds Slack interactively with `eve channels add slack`, and verifies the result with `eve info --json`.
+`eve init <name>` creates the base agent; `eve init .` adds one to an existing project. Add `--channel-web-nextjs` for Web Chat, or run
+`eve channels add slack` later from an interactive terminal.
 ## What to read next
 - [Instructions](./instructions) and [Tools](./tools): the core building blocks
 - [Channels](./channels/overview): reach the agent from Slack, Discord, or a web UI
-- [Frontend](./frontend/overview): browser chat with `useEveAgent`
-- [TypeScript Client](./client/overview): call the agent from scripts or server-side code
-- [Sessions, runs & streaming](./advanced/sessions-runs-and-streaming): the durable session model
+- [Frontend](./guides/frontend/overview): browser chat with `useEveAgent`
+- [TypeScript SDK](./guides/client/overview): call the agent from scripts or server-side code
+- [Sessions, runs & streaming](./concepts/sessions-runs-and-streaming): the durable session model
 - [Build an agent](./tutorial/first-agent): the full end-to-end walkthrough

package/dist/docs/public/{advanced → guides}/auth-and-route-protection.md RENAMED

@@ -1,5 +1,5 @@
 ---
-title: "Auth & route protection"
+title: "Auth & Route Protection"
 description: "Secure your agent's HTTP routes with an ordered auth walk, verifier helpers, and connection OAuth via Vercel Connect."
 ---
@@ -184,7 +184,7 @@ export default defineChannel({
 ## Replace `placeholderAuth` before production
-`pnpm create eve@beta` sometimes scaffolds `agent/channels/eve.ts` with a `placeholderAuth()` guardrail:
+`eve init` scaffolds `agent/channels/eve.ts` with a `placeholderAuth()` guardrail:
 ```ts
 import { eveChannel } from "eve/channels/eve";
@@ -263,8 +263,10 @@ Declaring `auth` adds two accessors to the tool's `ctx`:
 Throw `ConnectionAuthorizationRequiredError` anywhere in `execute` (directly, via `requireAuth()`, or implicitly from `getToken()`) and you trigger the consent flow, keyed by the tool's name. Calling either accessor on a tool that does not declare `auth` throws.
+By default the sign-in affordance title-cases the tool's path-derived name — a tool file named `sfdc_lookup.ts` renders "Sign in with Sfdc_lookup". Set `displayName` on the `auth` definition to control what users see instead: `auth: { ...connect("sfdc"), displayName: "Salesforce" }`. It is presentation-only; the tool's name still keys the authorization scope, token cache, and callback URL, and a definition-level `displayName` wins over one the strategy stamps on the challenge.
 ## What to read next
-- [Security model](./security-model): trust boundaries and the pre-production checklist
+- [Security model](../concepts/security-model): trust boundaries and the pre-production checklist
 - [Connections](../connections): connection auth shapes (`connect()` vs static token)
 - [Deployment](./deployment): where route-auth secrets live in production
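
The `displayName` behavior described in the paragraph added to this file can be sketched as follows. This is a minimal illustration, not Eve's implementation: `signInLabel` and its parameters are hypothetical names, and falling back to a strategy-stamped name before the title-cased tool name is an assumption drawn from the sentence about precedence.

```typescript
// Hypothetical helper illustrating the label resolution described above.
// None of these names come from the eve package.
function signInLabel(
  toolName: string,
  definitionDisplayName?: string, // `displayName` set on the tool's `auth` definition
  strategyDisplayName?: string, // name the strategy stamps on the challenge (assumed fallback)
): string {
  // Default: upper-case only the first character of the path-derived tool name.
  const titleCased = toolName.charAt(0).toUpperCase() + toolName.slice(1);
  // A definition-level displayName wins over a strategy-stamped one.
  const shown = definitionDisplayName ?? strategyDisplayName ?? titleCased;
  return `Sign in with ${shown}`;
}

console.log(signInLabel("sfdc_lookup"));
console.log(signInLabel("sfdc_lookup", "Salesforce"));
```

In this sketch only the label changes; the tool name passed as `toolName` would still be the value that keys the authorization scope, token cache, and callback URL.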

package/dist/docs/public/{client → guides/client}/continuations.mdx RENAMED

@@ -122,5 +122,5 @@ for await (const event of session.stream({ startIndex: 0 })) {
 ## What to read next
 - [Streaming](./streaming): stream events and reconnect by index
-- [Sessions, runs & streaming](../advanced/sessions-runs-and-streaming): the raw HTTP contract
-- [Eve channel](../channels/eve): where continuation tokens come from
+- [Sessions, runs & streaming](../../concepts/sessions-runs-and-streaming): the raw HTTP contract
+- [Eve channel](../../channels/eve): where continuation tokens come from

package/dist/docs/public/{client → guides/client}/messages.mdx RENAMED

@@ -149,4 +149,4 @@ Don't do both on the same response. Once the stream is consumed, the `ClientSess
 - [Continuations](./continuations): how the session cursor advances
 - [Streaming](./streaming): handle events live instead of using `result()`
-- [Tools](../tools): configure approvals and question prompts
+- [Tools](../../tools): configure approvals and question prompts
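
The link changes in the renamed `client` pages follow from the extra directory level: a page that moves from `client/` to `guides/client/` needs one more `..` to reach a target that stayed in place. A minimal sketch of that recomputation, assuming paths are relative to the docs root and that the link target itself did not move (`rebaseLink` and `normalize` are hypothetical helpers, not part of eve):

```typescript
// Collapse "." and ".." segments of a path given as a list of segments.
function normalize(segments: string[]): string[] {
  const out: string[] = [];
  for (const s of segments) {
    if (s === "" || s === ".") continue;
    if (s === "..") out.pop();
    else out.push(s);
  }
  return out;
}

// Rewrite a relative link so it still points at the same target after the
// page containing it moves from oldDir to newDir (both relative to the docs root).
function rebaseLink(link: string, oldDir: string, newDir: string): string {
  const target = normalize([...oldDir.split("/"), ...link.split("/")]);
  const from = normalize(newDir.split("/"));
  // Count the leading segments the new directory and the target share.
  let common = 0;
  while (
    common < from.length &&
    common < target.length &&
    from[common] === target[common]
  ) {
    common++;
  }
  const ups = from.slice(common).map(() => "..");
  const rel = [...ups, ...target.slice(common)].join("/");
  return ups.length > 0 ? rel : `./${rel}`;
}

console.log(rebaseLink("../tools", "client", "guides/client"));
```

Links whose targets were also relocated, such as the `advanced/` pages that now live under `concepts/`, need the target path updated as well, which this sketch does not attempt.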

package/dist/docs/public/{client → guides/client}/meta.json RENAMED

@@ -1,4 +1,4 @@
 {
-  "title": "TypeScript Client",
+  "title": "TypeScript SDK",
   "pages": ["overview", "messages", "continuations", "streaming", "output-schema"]
 }

package/dist/docs/public/{client → guides/client}/output-schema.mdx RENAMED

@@ -126,10 +126,10 @@ const followUp = await followUpResponse.result();
 console.log(followUp.data); // undefined unless this turn also requested a schema
 ```
-For task-mode output that belongs to the agent or subagent definition itself, see [`agent.ts`](../agent-config#outputschema) and [Subagents](../subagents).
+For task-mode output that belongs to the agent or subagent definition itself, see [`agent.ts`](../../agent-config#outputschema) and [Subagents](../../subagents).
 ## What to read next
 - [Messages](./messages): send turns with `send()`
 - [Streaming](./streaming): handle `result.completed` live
-- [`agent.ts`](../agent-config#outputschema): configured task-mode output
+- [`agent.ts`](../../agent-config#outputschema): configured task-mode output