eve 0.6.0-beta.15 → 0.6.0-beta.16

Files changed (62)
  1. package/CHANGELOG.md +12 -0
  2. package/README.md +1 -1
  3. package/dist/docs/public/evals/cases.mdx +58 -59
  4. package/dist/docs/public/evals/checks.mdx +5 -7
  5. package/dist/docs/public/evals/overview.mdx +34 -18
  6. package/dist/docs/public/evals/reporters.mdx +23 -7
  7. package/dist/docs/public/evals/running.mdx +15 -14
  8. package/dist/docs/public/evals/scores.mdx +10 -9
  9. package/dist/docs/public/evals/targets.mdx +12 -17
  10. package/dist/docs/public/reference/cli.md +12 -13
  11. package/dist/docs/public/reference/typescript-api.md +2 -1
  12. package/dist/src/cli/run.d.ts +0 -1
  13. package/dist/src/cli/run.js +1 -1
  14. package/dist/src/evals/cli/eval.d.ts +3 -4
  15. package/dist/src/evals/cli/eval.js +1 -1
  16. package/dist/src/evals/define-eval-config.d.ts +16 -0
  17. package/dist/src/evals/define-eval-config.js +1 -0
  18. package/dist/src/evals/define-eval.d.ts +16 -14
  19. package/dist/src/evals/define-eval.js +1 -1
  20. package/dist/src/evals/index.d.ts +2 -1
  21. package/dist/src/evals/index.js +1 -1
  22. package/dist/src/evals/requirements.d.ts +1 -2
  23. package/dist/src/evals/requirements.js +1 -1
  24. package/dist/src/evals/runner/artifacts.d.ts +7 -6
  25. package/dist/src/evals/runner/artifacts.js +3 -3
  26. package/dist/src/evals/runner/discover.d.ts +28 -7
  27. package/dist/src/evals/runner/discover.js +1 -1
  28. package/dist/src/evals/runner/execute-eval.d.ts +8 -10
  29. package/dist/src/evals/runner/execute-eval.js +1 -1
  30. package/dist/src/evals/runner/execute-task.d.ts +22 -0
  31. package/dist/src/evals/runner/execute-task.js +1 -0
  32. package/dist/src/evals/runner/reporters/braintrust.d.ts +6 -4
  33. package/dist/src/evals/runner/reporters/braintrust.js +2 -2
  34. package/dist/src/evals/runner/reporters/console.d.ts +4 -4
  35. package/dist/src/evals/runner/reporters/console.js +1 -1
  36. package/dist/src/evals/runner/reporters/junit.d.ts +1 -0
  37. package/dist/src/evals/runner/reporters/junit.js +3 -7
  38. package/dist/src/evals/runner/reporters/types.d.ts +14 -8
  39. package/dist/src/evals/runner/run-evals.d.ts +38 -0
  40. package/dist/src/evals/runner/run-evals.js +1 -0
  41. package/dist/src/evals/runner/verdict.d.ts +5 -5
  42. package/dist/src/evals/runner/verdict.js +1 -1
  43. package/dist/src/evals/scorers/autoevals.js +1 -1
  44. package/dist/src/evals/scorers/json.d.ts +3 -3
  45. package/dist/src/evals/scorers/json.js +1 -1
  46. package/dist/src/evals/types.d.ts +134 -176
  47. package/dist/src/harness/action-result-helpers.js +1 -1
  48. package/dist/src/harness/authorization.d.ts +26 -0
  49. package/dist/src/harness/authorization.js +1 -1
  50. package/dist/src/harness/emission.d.ts +12 -5
  51. package/dist/src/harness/emission.js +1 -1
  52. package/dist/src/harness/step-hooks.d.ts +4 -4
  53. package/dist/src/harness/step-hooks.js +1 -1
  54. package/dist/src/harness/tool-loop.js +1 -1
  55. package/dist/src/harness/tools.d.ts +4 -6
  56. package/dist/src/harness/tools.js +1 -1
  57. package/dist/src/internal/application/package.js +1 -1
  58. package/dist/src/setup/scaffold/create/project.js +1 -1
  59. package/dist/src/setup/scaffold/update/channels.js +1 -1
  60. package/package.json +1 -1
  61. package/dist/src/evals/runner/execute-case.d.ts +0 -23
  62. package/dist/src/evals/runner/execute-case.js +0 -1
package/CHANGELOG.md CHANGED
@@ -1,5 +1,17 @@
  # eve

+ ## 0.6.0-beta.16
+
+ ### Minor Changes
+
+ - 5a6ac17: Add a required `evals/evals.config.ts` (authored with `defineEvalConfig`) that declares run-wide eval defaults: a mandatory scorer `model`, plus optional run-level `reporters`, `maxConcurrency`, and `timeoutMs`. Model-backed scorers now fall back to the config `model`, so `model` is optional on `defineEval` and a shared reporter (e.g. one `Braintrust()`) no longer needs to be repeated in every eval. CLI flags and per-eval values still take precedence over the config defaults.
+ - 5a6ac17: `defineEval` is now always a single case, with identity fully derived from the file path — `cases`, `load`, `task`, per-case `id`, and `maxConcurrency` are removed. Declare `input` or `run` (plus `expected`, `checks`, `scores`, `parseOutput`, …) at the top level, organize related evals with directory nesting (`evals/runtime/multi-turn.eval.ts` → `runtime/multi-turn`), and default-export an array of `defineEval(...)` values for dataset fan-out (ids get a zero-padded index suffix, e.g. `weather/0000`). The runner now executes eval files concurrently (default 8, `--max-concurrency`), positional `eve eval` ids match by directory prefix, `--case` is removed, reporters use a run-level lifecycle (`onRunStart`/`onEvalComplete`/`onRunComplete`), check/scorer args expose `evaluation` instead of `case`, and artifacts land under one `.eve/evals/<timestamp>/` directory per run.
+
+ ### Patch Changes
+
+ - a8363e6: Fix `authorization.required` not being emitted when a tool combines `needsApproval` with interactive auth. Approval-resume auth signals are now routed through the authorization park path instead of being replayed to the model as a plain tool result.
+ - a8363e6: Authorization-pending tool results no longer expose OAuth URLs, user codes, or hook URLs to the model. Channels still receive full `authorization.required` events.
+
  ## 0.6.0-beta.15

  ### Minor Changes
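Reviewer note on the id rules in the 0.6.0-beta.16 changelog entry above: the following is a minimal sketch of how a path-derived id could be computed. `deriveEvalIds` is a hypothetical helper written for illustration, not part of eve's API. It assumes the file path is given relative to the app root and that the index suffix is four digits wide, which is inferred only from the `weather/0000` example.

```typescript
// Hypothetical helper (not part of eve): mirrors the id rules described in the changelog.
// Assumes `filePath` is relative to the app root, e.g. "evals/runtime/multi-turn.eval.ts".
function deriveEvalIds(filePath: string, arrayLength?: number): string[] {
  // Drop the leading "evals/" directory and the ".eval.ts" suffix.
  const base = filePath.replace(/^evals\//, "").replace(/\.eval\.ts$/, "");

  // A single default-exported defineEval(...) keeps the bare path-derived id.
  if (arrayLength === undefined) {
    return [base];
  }

  // An array export fans out to one id per entry, in array order.
  const ids: string[] = [];
  for (let i = 0; i < arrayLength; i++) {
    const digits = String(i);
    // Assumption: four-digit zero padding, inferred from the `weather/0000` example.
    const padded = digits.length >= 4 ? digits : "0000".slice(digits.length) + digits;
    ids.push(`${base}/${padded}`);
  }
  return ids;
}
```

Under these assumptions, `deriveEvalIds("evals/runtime/multi-turn.eval.ts")` yields `["runtime/multi-turn"]`, and `deriveEvalIds("evals/weather.eval.ts", 2)` yields `["weather/0000", "weather/0001"]`.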
package/README.md CHANGED
@@ -52,7 +52,7 @@ Every authored directory has a typed helper. Import each from the matching subpa
  | `eveChannel(...)`, `slackChannel(...)`, `vercelOidc(...)` | `eve/channels/eve`, `/slack`, `/auth` | reused from `channels/<name>.ts` |
  | `defineSandbox(...)` | `eve/sandbox` | `sandbox.ts` (or `sandbox/sandbox.ts`) |
  | `defineSchedule(...)` | `eve/schedules` | `schedules/<name>.ts` (or `schedules/<name>.md`) |
- | `defineEval(...)` | `eve/evals` | `evals/<name>.eval.ts` |
+ | `defineEval(...)`, `defineEvalConfig(...)` | `eve/evals` | `evals/<name>.eval.ts`, `evals/evals.config.ts` |

  Runtime accessors live on the subpath that owns the concern:

package/dist/docs/public/evals/cases.mdx CHANGED
@@ -1,96 +1,95 @@
1
1
  ---
2
- title: "Cases and tasks"
3
- description: "Author data cases, load fixtures with loaders, and script multi-turn cases with run(ctx)."
2
+ title: "Cases"
3
+ description: "Author prompt evals, script multi-turn evals with run(ctx), and fan one file out over a dataset."
4
4
  ---
5
5
 
6
- A case is one graded unit of work: the runner executes it against the target, captures every event, and applies [checks](./checks) and [scores](./scores) to the result. A case is either a data case or a scripted case.
6
+ Each eval file is one graded case: the runner executes it against the target, captures every event, and applies [checks](./checks) and [scores](./scores) to the result. An eval is either a prompt eval (`input`) or a scripted eval (`run`).
7
7
 
8
- ## Data cases
8
+ ## Prompt evals
9
9
 
10
- Data cases pair an `input` with an `expected`, plus optional `checks`, `scores`, `requires`, `tags`, and `metadata`. `input` can be a string or an object. `expected` is optional, which is handy when you only care about behavior:
10
+ Prompt evals pair an `input` with an optional `expected`. `input` can be a string or an object (objects are `JSON.stringify`d); the runner sends it as a single user turn. `expected` is optional, which is handy when you only care about behavior:
11
11
 
12
- ```ts title="evals/weather.eval.ts"
12
+ ```ts title="evals/weather/brooklyn-forecast.eval.ts"
13
13
  import { defineEval } from "eve/evals";
14
14
  import { Checks } from "eve/evals/checks";
15
15
 
16
16
  export default defineEval({
17
+ input: "What is the weather in Brooklyn?",
18
+ expected: "Sunny",
17
19
  checks: [Checks.didNotFail()],
18
20
  scores: [],
19
- cases: [
20
- { id: "brooklyn-forecast", input: "What is the weather in Brooklyn?", expected: "Sunny" },
21
- {
22
- id: "no-tools-for-greetings",
23
- input: "Hello!",
24
- checks: [Checks.toolNotCalled("get_weather")],
25
- },
26
- ],
27
21
  });
28
22
  ```
29
23
 
30
- Eval-level `checks` and `scores` apply to every case; case-level entries append to them.
31
-
32
- ## Loading cases from fixtures
33
-
34
- List cases inline, or load them dynamically with `load`. The loaders (`loadJson`, `loadYaml` from `eve/evals/loaders`) resolve paths relative to the app root:
35
-
36
- ```ts title="evals/sql.eval.ts"
24
+ ```ts title="evals/weather/no-tools-for-greetings.eval.ts"
37
25
  import { defineEval } from "eve/evals";
38
- import { loadYaml } from "eve/evals/loaders";
39
- import { Text } from "eve/evals/scores";
26
+ import { Checks } from "eve/evals/checks";
40
27
 
41
28
  export default defineEval({
42
- async load() {
43
- const doc = await loadYaml("evals/data/cases.yaml");
44
- return (doc.evals as readonly { task: string; prompt: string; sql: string }[]).map((row) => ({
45
- id: row.task,
46
- input: row.prompt,
47
- expected: row.sql,
48
- }));
49
- },
50
- scores: [Text.exact()],
29
+ input: "Hello!",
30
+ checks: [Checks.didNotFail(), Checks.toolNotCalled("get_weather")],
31
+ scores: [],
51
32
  });
52
33
  ```
53
34
 
54
- Pass either `cases` or `load`, never both. The loaders are meant for fixtures, not runtime agent code.
35
+ ## Organizing with directories
36
+
37
+ Identity is the file path, so directories are the grouping mechanism. `evals/weather/brooklyn-forecast.eval.ts` gets the id `weather/brooklyn-forecast`, and `eve eval weather` runs everything under `evals/weather/`. Shared constants and helpers live in sibling non-eval files (any name that doesn't end in `.eval.ts`):
38
+
39
+ ```text
40
+ evals/
41
+ ├── weather/
42
+ │ ├── shared.ts # helpers — not an eval
43
+ │ ├── brooklyn-forecast.eval.ts
44
+ │ └── no-tools-for-greetings.eval.ts
45
+ └── smoke.eval.ts
46
+ ```
47
+
48
+ ## Datasets: exporting an array
55
49
 
56
- ## Tasks
50
+ To fan one file out over a dataset, default-export an array of `defineEval(...)` values. Eval modules are ESM, so top-level `await` can load anything. Ids derive from the file name plus a zero-padded index (`sql/0000`, `sql/0001`, …, in array order). The loaders (`loadJson`, `loadYaml` from `eve/evals/loaders`) parse fixture files relative to the app root:
57
51
 
58
- The eval-level `task` controls how the runner turns a data case into agent work:
52
+ ```ts title="evals/sql.eval.ts"
53
+ import { defineEval } from "eve/evals";
54
+ import { loadYaml } from "eve/evals/loaders";
55
+ import { Text } from "eve/evals/scores";
59
56
 
60
- - `task.prompt` covers the single-string case — a template that interpolates the case input.
61
- - `task.messages` sets up a static multi-turn conversation.
62
- - `task.run` gives you imperative control flow for every case in the eval.
63
- - `task.parseOutput` compares a transformed result rather than the raw final message.
57
+ const doc = await loadYaml("evals/data/cases.yaml");
58
+ const rows = doc.evals as readonly { task: string; prompt: string; sql: string }[];
59
+
60
+ export default rows.map((row) =>
61
+ defineEval({
62
+ description: row.task,
63
+ input: row.prompt,
64
+ expected: row.sql,
65
+ scores: [Text.exact()],
66
+ }),
67
+ );
68
+ ```
64
69
 
65
- A task can specify at most one of `prompt`, `messages`, and `run`. When none is set, the case `input` is sent as a single user message.
70
+ The loaders are meant for fixtures, not runtime agent code.
66
71
 
67
- ## Scripted cases
72
+ ## Scripted evals
68
73
 
69
- Scripted cases define their own `run(ctx)` and do not need `input`. Use them for smoke-style behavior: multi-turn branching, HITL approvals, structured output, attachments, and multiple independent sessions.
74
+ Scripted evals define `run(ctx)` instead of `input`. Use them for smoke-style behavior: multi-turn branching, HITL approvals, structured output, attachments, and multiple independent sessions.
70
75
 
71
- ```ts title="evals/approvals.eval.ts"
76
+ ```ts title="evals/approve-tool.eval.ts"
72
77
  import { defineEval } from "eve/evals";
73
78
  import { Checks } from "eve/evals/checks";
74
79
 
75
80
  export default defineEval({
76
- checks: [Checks.didNotFail()],
81
+ async run({ session }) {
82
+ await session.send("run `pwd`");
83
+ session.expectInputRequests({ toolName: "bash" });
84
+ const turn = await session.respondAll("approve");
85
+ return turn.message;
86
+ },
87
+ checks: [Checks.didNotFail(), Checks.toolCalled("bash", { input: { command: /pwd/ } })],
77
88
  scores: [],
78
- cases: [
79
- {
80
- id: "approve-tool",
81
- async run({ session }) {
82
- await session.send("run `pwd`");
83
- session.expectInputRequests({ toolName: "bash" });
84
- const turn = await session.respondAll("approve");
85
- return turn.message;
86
- },
87
- checks: [Checks.toolCalled("bash", { input: { command: /pwd/ } })],
88
- },
89
- ],
90
89
  });
91
90
  ```
92
91
 
93
- The return value of `run` becomes the case output that scorers grade. Throwing marks the case failed with the error message in the result.
92
+ The return value of `run` becomes the output that scorers grade (set `parseOutput` to transform the raw result instead). Throwing marks the eval failed with the error message in the result.
94
93
 
95
94
  ## The session API
96
95
 
@@ -104,12 +103,12 @@ The return value of `run` becomes the case output that scorers grade. Throwing m
104
103
 
105
104
  Each `send` resolves to an `EveEvalTurn` carrying the turn's `message`, `events`, and status. `turn.expectOk()` throws only when the turn ended failed — a session left open for a next message is the normal end state of a successful turn.
106
105
 
107
- Events from every eval session are captured in the case result and artifacts. `ctx.log(message)` records debug lines into the case artifact; `--verbose` also streams them to stdout as cases run.
106
+ Events from every eval session are captured in the result and artifacts. `ctx.log(message)` records debug lines into the eval artifact; `--verbose` also streams them to stdout as evals run.
108
107
 
109
108
  For driving sessions created outside the eval — by a channel webhook or a schedule — see [Targets and requirements](./targets).
110
109
 
111
110
  ## What to read next
112
111
 
113
- - [Checks](./checks): assert on what the case did
112
+ - [Checks](./checks): assert on what the eval did
114
113
  - [Scores](./scores): grade how well it did it
115
114
  - [TypeScript client](../client/messages): the send/turn protocol eval sessions build on
package/dist/docs/public/evals/checks.mdx CHANGED
@@ -1,11 +1,9 @@
  ---
  title: "Checks"
- description: "Hard assertions over runs, tool calls, and output — any failed check fails the case and the exit code."
+ description: "Hard assertions over runs, tool calls, and output — any failed check fails the eval and the exit code."
  ---

- Checks are hard assertions. Any failed check marks the case failed and `eve eval` exits non-zero. Use them for the things that must hold — the run completed, the right tool ran, the output parses. For graded, non-fatal signals, use [scores](./scores) instead.
-
- Checks exist at the eval level (applied to every case) and the case level (appended to the eval's).
+ Checks are hard assertions. Any failed check marks the eval failed and `eve eval` exits non-zero. Use them for the things that must hold — the run completed, the right tool ran, the output parses. For graded, non-fatal signals, use [scores](./scores) instead.

  ## Built-in checks

@@ -39,7 +37,7 @@ Checks.subagentCalled("weather", {

  ## Custom checks

- A custom check is a plain function receiving `{ case, result, target }` and returning `{ name, passed, message? }`:
+ A custom check is a plain function receiving `{ evaluation, result, target }` and returning `{ name, passed, message? }`:

  ```ts
  import type { EveEvalCheck } from "eve/evals/checks";
@@ -51,7 +49,7 @@ const repliedFast: EveEvalCheck = ({ result }) => ({
  });
  ```

- Write a `message` for the failing path — it is what the console reporter prints under the case line and what lands in JUnit output.
+ Write a `message` for the failing path — it is what the console reporter prints under the eval line and what lands in JUnit output.

  ## Run status and parking

@@ -62,4 +60,4 @@ Write a `message` for the failing path — it is what the console reporter print
  ## What to read next

  - [Scores](./scores): graded signals with thresholds
- - [Cases and tasks](./cases): where checks attach
+ - [Cases](./cases): where checks attach
package/dist/docs/public/evals/overview.mdx CHANGED
@@ -9,67 +9,83 @@ Evals exercise the same HTTP surface your users hit. The runner boots (or target
9
9
 
10
10
  ## `defineEval`
11
11
 
12
- Eve discovers evals under the app-root `evals/` directory, in `.eval.ts` files. The file path is the eval's identity, so you don't author an `id` or `name`.
12
+ Eve discovers evals under the app-root `evals/` directory, in `.eval.ts` files. Each file is exactly one eval — one graded case. The file path is the eval's identity, so you don't author an `id` or `name`; directories group related evals (`evals/weather/brooklyn-forecast.eval.ts` → id `weather/brooklyn-forecast`).
13
13
 
14
14
  ```text
15
15
  my-agent/
16
16
  ├── agent/
17
17
  ├── evals/
18
+ │ ├── evals.config.ts
18
19
  │ ├── smoke.eval.ts
19
- │ └── weather.eval.ts
20
+ │ └── weather/
21
+ │ ├── brooklyn-forecast.eval.ts
22
+ │ └── no-tools-for-greetings.eval.ts
20
23
  └── package.json
21
24
  ```
22
25
 
23
- ```ts title="evals/weather.eval.ts"
26
+ ```ts title="evals/weather/brooklyn-forecast.eval.ts"
24
27
  import { defineEval } from "eve/evals";
25
28
  import { Checks } from "eve/evals/checks";
26
29
  import { Run } from "eve/evals/scores";
27
30
 
28
31
  export default defineEval({
29
32
  description: "Basic message and tool-usage coverage for the weather agent.",
30
- cases: [
31
- { id: "brooklyn-forecast", input: "What is the weather in Brooklyn?", expected: "Sunny" },
32
- ],
33
+ input: "What is the weather in Brooklyn?",
34
+ expected: "Sunny",
33
35
  checks: [Checks.didNotFail(), Checks.toolCalled("get_weather")],
34
36
  scores: [Run.didNotFail()],
35
37
  });
36
38
  ```
37
39
 
38
- Every eval needs `scores` (an empty array is fine) and either `cases` or `load`. The rest are optional: `description`, `task`, `checks`, `requires`, `model`, `thresholds`, `modelOptions`, `tags`, `metadata`, `maxConcurrency`, `timeoutMs`, `reporters`. The init template adds `evals/**/*.ts` to `tsconfig.json`, so your eval code type-checks alongside the app.
40
+ Every eval needs `scores` (an empty array is fine) and either `input` (a prompt sent as a single turn) or `run` (an imperative script). The rest are optional: `description`, `expected`, `checks`, `requires`, `parseOutput`, `model`, `thresholds`, `modelOptions`, `tags`, `metadata`, `timeoutMs`, `reporters`. The init template adds `evals/**/*.ts` to `tsconfig.json`, so your eval code type-checks alongside the app.
39
41
 
40
- ## Two grading tiers
42
+ ## `evals.config.ts`
43
+
44
+ Every `evals/` directory needs exactly one `evals.config.ts` at its root. It declares the defaults every eval shares — most importantly the `model` used by model-backed scorers, so you don't repeat it in each file:
45
+
46
+ ```ts title="evals/evals.config.ts"
47
+ import { defineEvalConfig } from "eve/evals";
48
+ import { Braintrust } from "eve/evals/reporters";
49
+
50
+ export default defineEvalConfig({
51
+ model: "openai/gpt-5.4-mini",
52
+ reporters: [Braintrust({ projectName: "my-agent" })],
53
+ });
54
+ ```
41
55
 
42
- Evals grade a case on two distinct tiers:
56
+ `model` is required; `reporters`, `maxConcurrency`, and `timeoutMs` are optional. Config `reporters` observe every eval in the run — set one `Braintrust()` here instead of adding it to each eval. CLI flags (`--max-concurrency`, `--timeout`) and per-eval values take precedence over the config defaults. An eval that needs a different judge model overrides it with its own `model`; otherwise the config `model` applies.
57
+
58
+ ## Two grading tiers
43
59
 
44
- - **[Checks](./checks) are hard assertions.** Any failed check marks the case failed and `eve eval` exits non-zero. Use them for the things that must hold — the run completed, the right tool ran, the output parses.
45
- - **[Scores](./scores) are soft data.** They land in reports and artifacts, and a below-threshold score marks the case `scored` — visible but not fatal, unless you pass `--strict`.
60
+ Evals are graded on two distinct tiers:
46
61
 
47
- Both exist at the eval level (applied to every case) and the case level (appended to the eval's).
62
+ - **[Checks](./checks) are hard assertions.** Any failed check marks the eval failed and `eve eval` exits non-zero. Use them for the things that must hold: the run completed, the right tool ran, the output parses.
63
+ - **[Scores](./scores) are soft data.** They land in reports and artifacts, and a below-threshold score marks the eval `scored` — visible but not fatal, unless you pass `--strict`.
48
64
 
49
65
  ## Run it
50
66
 
51
67
  ```bash
52
68
  eve eval # run all discovered evals against a local dev server
53
- eve eval weather # run selected evals
69
+ eve eval weather # run one eval, or every eval under evals/weather/
54
70
  eve eval --url https://<app> # target an existing server or deployment
55
71
  ```
56
72
 
57
- Exit code `0` means every case passed its checks. See [Running evals](./running) for the full flag list, exit codes, and CI guidance.
73
+ Exit code `0` means every eval passed its checks. See [Running evals](./running) for the full flag list, exit codes, and CI guidance.
58
74
 
59
75
  ## A good baseline
60
76
 
61
- Most apps do fine with a small smoke eval. Assert behavior with `Checks.didNotFail()` plus one or two content checks, keep case fixtures in `evals/data/`, and only reach for Braintrust once you actually need shared result review or experiment history. In CI, run `eve eval --strict` so threshold misses fail the build too.
77
+ Most apps do fine with a few small smoke evals. Assert behavior with `Checks.didNotFail()` plus one or two content checks, keep dataset fixtures in `evals/data/`, and only reach for Braintrust once you actually need shared result review or experiment history. In CI, run `eve eval --strict` so threshold misses fail the build too.
62
78
 
63
79
  The rest of this section covers each piece:
64
80
 
65
- - [Cases and tasks](./cases): data cases, loaders, and scripted multi-turn cases
81
+ - [Cases](./cases): prompt evals, scripted multi-turn evals, and dataset fan-out
66
82
  - [Checks](./checks): hard assertions over runs, tools, and output
67
83
  - [Scores](./scores): deterministic and LLM-judged scorers, with thresholds
68
- - [Targets and requirements](./targets): local vs remote targets, and gating cases on capabilities
84
+ - [Targets and requirements](./targets): local vs remote targets, and gating evals on capabilities
69
85
  - [Reporters](./reporters): Braintrust experiments and JUnit XML
70
86
  - [Running evals](./running): the `eve eval` CLI, exit codes, and artifacts
71
87
 
72
88
  ## What to read next
73
89
 
74
- - [Cases and tasks](./cases): author your first cases
90
+ - [Cases](./cases): author your first evals
75
91
  - [Tools](../tools): the surface most evals assert on
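Reviewer note on the `evals.config.ts` defaults described in the hunk above: the following is a minimal sketch of how the fallback could resolve. The type and function names are hypothetical and are not eve's own types. The docs state that CLI flags and per-eval values both outrank config defaults, but they do not say how those two rank against each other; the sketch assumes the CLI flag wins, and that choice is a guess.

```typescript
// Hypothetical shapes for illustration only; eve's real types may differ.
interface ConfigDefaults {
  model: string; // required in evals.config.ts
  timeoutMs?: number;
}

interface EvalOverrides {
  model?: string;
  timeoutMs?: number;
}

interface CliFlags {
  timeoutMs?: number; // e.g. from --timeout
}

interface ResolvedSettings {
  model: string;
  timeoutMs: number | undefined;
}

function resolveSettings(
  config: ConfigDefaults,
  evaluation: EvalOverrides,
  cli: CliFlags,
): ResolvedSettings {
  return {
    // A per-eval `model` overrides the config `model`; otherwise the config value applies.
    model: evaluation.model ?? config.model,
    // ASSUMPTION: the CLI flag beats the per-eval value. The docs only say both beat the config.
    timeoutMs: cli.timeoutMs ?? evaluation.timeoutMs ?? config.timeoutMs,
  };
}
```

With a config of `{ model: "openai/gpt-5.4-mini", timeoutMs: 60000 }`, an eval that sets nothing inherits both values, while an eval that sets its own `model` changes only the judge model for that file.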
package/dist/docs/public/evals/reporters.mdx CHANGED
@@ -3,27 +3,43 @@ title: "Reporters"
3
3
  description: "Ship eval results to Braintrust experiments or JUnit XML — Eve runs and scores everything itself."
4
4
  ---
5
5
 
6
- Eve runs and scores everything itself; reporters just ship the results out. The CLI prints a console summary by default — one line per case, failed checks with their messages — and eval-level reporters from `eve/evals/reporters` add destinations on top.
6
+ Eve runs and scores everything itself; reporters just ship the results out. The CLI prints a console summary by default — one line per eval, failed checks with their messages — and reporters from `eve/evals/reporters` add destinations on top.
7
+
8
+ Reporters attach in two places. Declare them in `evals.config.ts` to observe **every** eval in the run — the usual choice for a shared destination like one Braintrust experiment, so you don't repeat the reporter in each file. Or list them on an individual eval's `reporters` to scope a destination to that eval (or to a group of evals that share one instance).
7
9
 
8
10
  ## Braintrust
9
11
 
10
- `Braintrust(...)` uploads eval results to Braintrust experiments:
12
+ `Braintrust(...)` uploads eval results to Braintrust experiments. Put one instance in the config so it covers the whole run:
13
+
14
+ ```ts title="evals/evals.config.ts"
15
+ import { defineEvalConfig } from "eve/evals";
16
+ import { Braintrust } from "eve/evals/reporters";
11
17
 
12
- ```ts title="evals/weather.eval.ts"
18
+ export default defineEvalConfig({
19
+ model: "openai/gpt-5.4-mini",
20
+ reporters: [Braintrust({ projectName: "weather-agent" })],
21
+ });
22
+ ```
23
+
24
+ Need a destination for only some evals? Attach it per eval instead:
25
+
26
+ ```ts title="evals/brooklyn-forecast.eval.ts"
13
27
  import { defineEval } from "eve/evals";
14
28
  import { Braintrust } from "eve/evals/reporters";
15
29
  import { Run } from "eve/evals/scores";
16
30
 
17
31
  export default defineEval({
18
- cases: [{ id: "brooklyn-forecast", input: "What is the weather in Brooklyn?" }],
32
+ input: "What is the weather in Brooklyn?",
19
33
  scores: [Run.didNotFail()],
20
34
  reporters: [Braintrust({ projectName: "weather-agent" })],
21
35
  });
22
36
  ```
23
37
 
24
- The config takes an optional `projectName` and `experimentName`, plus a base experiment (by name or id) to diff against. Checks log as binary scores under a `check:` prefix so experiments diff check regressions the same way they diff score regressions. Eval and case `metadata` ride along to reporters.
38
+ The reporter config takes an optional `projectName` and `experimentName`, plus a base experiment (by name or id) to diff against. Checks log as binary scores under a `check:` prefix so experiments diff check regressions the same way they diff score regressions. Eval `metadata` rides along to reporters.
39
+
40
+ A reporter instance observes the evals that reference it: share one instance across several evals — the config, a `shared.ts` export, or every entry of a dataset array — and their results land in a single experiment. Listing the same config reporter on an eval too does not double-report it.
25
41
 
26
- Braintrust needs its SDK installed in the app and credentials in the environment. Pass `--skip-report` to run the eval without shipping results — useful locally when iterating.
42
+ Braintrust needs its SDK installed in the app and credentials in the environment. Pass `--skip-report` to run the eval without shipping results (this also suppresses config reporters) — useful locally when iterating.
27
43
 
28
44
  ## JUnit
29
45
 
@@ -33,7 +49,7 @@ Braintrust needs its SDK installed in the app and credentials in the environment
33
49
  eve eval --strict --junit .eve/junit.xml
34
50
  ```
35
51
 
36
- Failed checks and execution errors land as failure messages on the matching test case, so CI surfaces them inline.
52
+ Each eval becomes one `<testcase>` named by its path-derived id; failed checks and execution errors land as failure messages on the matching test case, so CI surfaces them inline.
37
53
 
38
54
  ## Custom reporters
39
55
 
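Reviewer note on the run-level reporter lifecycle: the changelog names three hooks (`onRunStart`, `onEvalComplete`, `onRunComplete`), but this diff does not show their signatures. The sketch below therefore uses stand-in types. `RunReporterLike`, `EvalResultLike`, and `createFailureTally` are hypothetical and only illustrate the shape of a reporter that accumulates state across a run; consult `reporters/types.d.ts` in the package for the real contract.

```typescript
// Hypothetical shapes: only the three hook names come from the changelog.
interface EvalResultLike {
  id: string;
  passed: boolean;
}

interface RunReporterLike {
  onRunStart?(): void;
  onEvalComplete?(result: EvalResultLike): void;
  onRunComplete?(): void;
}

// Collects failing eval ids during the run and writes one summary line at the end.
function createFailureTally(lines: string[]): RunReporterLike {
  const failed: string[] = [];
  return {
    onRunStart(): void {
      // Reset state so the same instance can observe more than one run.
      failed.length = 0;
    },
    onEvalComplete(result: EvalResultLike): void {
      if (!result.passed) {
        failed.push(result.id);
      }
    },
    onRunComplete(): void {
      lines.push(failed.length === 0 ? "all evals passed" : `failed: ${failed.join(", ")}`);
    },
  };
}
```

A run-level lifecycle suits this kind of aggregation because a single instance sees every eval it is attached to, which matches the description of one shared reporter covering the whole run.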
package/dist/docs/public/evals/running.mdx CHANGED
@@ -3,39 +3,40 @@ title: "Running evals"
3
3
  description: "The eve eval CLI: flags, filters, exit codes, artifacts, and how to wire evals into CI."
4
4
  ---
5
5
 
6
- `eve eval` discovers every `.eval.ts` file under `evals/`, boots a local dev server (or targets a remote one), runs the cases, and prints a per-case summary.
6
+ `eve eval` discovers every `.eval.ts` file under `evals/`, boots a local dev server (or targets a remote one), runs the evals concurrently, and prints a per-eval summary.
7
7
 
8
8
  ```bash
9
9
  eve eval # run all discovered evals locally
10
- eve eval weather smoke # run selected evals
10
+ eve eval weather smoke # run selected evals (an id, or a directory prefix)
11
11
  eve eval --url https://<app> # target a remote app instead of a local host
12
12
  eve eval --mock-models # local dev target uses deterministic mock models
13
- eve eval --tag fast # only cases (or evals) carrying a tag
14
- eve eval --case brooklyn-forecast # only specific case ids
13
+ eve eval --tag fast # only evals carrying a tag
15
14
  eve eval --strict # below-threshold scores also fail the exit code
16
15
  eve eval --no-skips # unmet requirements fail instead of skipping
17
- eve eval --timeout 60000 # per-case timeout in milliseconds
18
- eve eval --max-concurrency 4 # cap concurrent cases per eval
16
+ eve eval --timeout 60000 # per-eval timeout in milliseconds
17
+ eve eval --max-concurrency 4 # cap concurrent eval executions (default 8)
19
18
  eve eval --junit .eve/junit.xml # write JUnit XML
20
- eve eval --list # print discovered evals and cases without running
21
- eve eval --verbose # stream per-case ctx.log lines to stdout
19
+ eve eval --list # print discovered evals without running
20
+ eve eval --verbose # stream per-eval ctx.log lines to stdout
22
21
  eve eval --json # machine-readable output
23
- eve eval --skip-report # skip eval-defined reporters (e.g. Braintrust)
22
+ eve eval --skip-report # skip config and eval-defined reporters (e.g. Braintrust)
24
23
  ```
25
24
 
25
+ Positional ids match exactly or by directory prefix: `eve eval weather` runs `evals/weather.eval.ts`, every eval under `evals/weather/`, and every entry of an array-exported `weather.eval.ts`.
26
+
26
27
  ## Exit codes
27
28
 
28
29
  | Code | Means |
29
30
  | ---- | -------------------------------------------------------------------------------- |
30
- | `0` | Every case passed its checks (and thresholds, under `--strict`) |
31
- | `1` | Any case failed — a failed check, an execution error, or a strict threshold miss |
31
+ | `0` | Every eval passed its checks (and thresholds, under `--strict`) |
32
+ | `1` | Any eval failed — a failed check, an execution error, or a strict threshold miss |
32
33
  | `2` | Configuration error |
33
34
 
34
35
  Unmet [requirements](./targets) skip visibly without affecting the exit code unless you pass `--no-skips`.
35
36
 
36
37
  ## Artifacts
37
38
 
38
- Each run drops per-eval artifacts under `.eve/evals/<timestamp>-<eval-id>/`, including per-case check results, verdicts, captured event streams, and `ctx.log` lines. The console output stays tight on purpose; when a case fails, the artifact has the full story.
39
+ Each run drops artifacts under `.eve/evals/<timestamp>/`: a run `summary.json`, a `results.jsonl` index, and per-eval check results, verdicts, captured event streams, and `ctx.log` lines under `evals/`. The console output stays tight on purpose; when an eval fails, the artifact has the full story.
39
40
 
40
41
  ## CI
41
42
 
@@ -46,8 +47,8 @@ eve eval --strict --mock-models --junit .eve/junit.xml
46
47
  ```
47
48
 
48
49
  - `--strict` turns threshold misses into failures, so score regressions block the merge.
49
- - `--mock-models` keeps the default leg deterministic and credential-free. Put real-model cases in their own eval files gated on `requires: ["env:..."]`, and add `--no-skips` on legs that must prove those ran.
50
- - `--junit` gives the CI provider per-case annotations; upload the `.eve/evals/` directory as a failure artifact for the full event streams.
50
+ - `--mock-models` keeps the default leg deterministic and credential-free. Put real-model evals in their own files gated on `requires: ["env:..."]`, and add `--no-skips` on legs that must prove those ran.
51
+ - `--junit` gives the CI provider per-eval annotations; upload the `.eve/evals/` directory as a failure artifact for the full event streams.
51
52
 
52
53
  Against a deployed app, swap `--mock-models` for `--url`:
53
54
 
@@ -1,13 +1,13 @@
1
1
  ---
2
2
  title: "Scores"
3
- description: "Grade eval cases with deterministic scorers or LLM judges, and gate them with thresholds."
3
+ description: "Grade evals with deterministic scorers or LLM judges, and gate them with thresholds."
4
4
  ---
5
5
 
6
- Scores are soft data. They land in reports and artifacts, and a below-threshold score marks the case `scored` — visible but not fatal, unless you pass `--strict`. Use them to grade quality fractionally where a [check](./checks) would assert it absolutely.
6
+ Scores are soft data. They land in reports and artifacts, and a below-threshold score marks the eval `scored` — visible but not fatal, unless you pass `--strict`. Use them to grade quality fractionally where a [check](./checks) would assert it absolutely.
7
7
 
8
8
  ## Choosing a scorer
9
9
 
10
- Scorers live in namespaces on `eve/evals/scores`. Pick the cheapest one that captures what "correct" means here. The deterministic scorers run instantly for free; an LLM judge runs once per case and burns tokens, so save it for when nothing simpler will do.
10
+ Scorers live in namespaces on `eve/evals/scores`. Pick the cheapest one that captures what "correct" means here. The deterministic scorers run instantly for free; an LLM judge runs once per eval and burns tokens, so save it for when nothing simpler will do.
11
11
 
12
12
  | Need | Use |
13
13
  | ---------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
@@ -21,7 +21,7 @@ Scorers live in namespaces on `eve/evals/scores`. Pick the cheapest one that cap
21
21
  | LLM-judged SQL semantic equivalence | `Autoevals.sql()` |
22
22
  | LLM-judged free-form criteria (no `expected` to match) | `Autoevals.closedQA({ criteria: "..." })` |
23
23
 
24
- Each scorer gets the flattened `input`, `output`, and `expected` strings along with the full case and task result — including derived facts: typed tool calls (name, input, output, error state), subagent calls, HITL input requests, and whether the run parked. `Run.usedTool` accepts the same matcher options as `Checks.toolCalled`. Return `null` from a scorer to skip a case.
24
+ Each scorer gets the flattened `input`, `output`, and `expected` strings along with the full eval and task result — including derived facts: typed tool calls (name, input, output, error state), subagent calls, HITL input requests, and whether the run parked. `Run.usedTool` accepts the same matcher options as `Checks.toolCalled`. Return `null` from a scorer to skip it.
25
25
 
26
26
  ## Thresholds
27
27
 
@@ -32,23 +32,24 @@ import { defineEval } from "eve/evals";
32
32
  import { Run, Text } from "eve/evals/scores";
33
33
 
34
34
  export default defineEval({
35
- cases: [{ id: "hello", input: "Hello", expected: "Hello" }],
35
+ input: "Hello",
36
+ expected: "Hello",
36
37
  scores: [Run.didNotFail(), Text.includes()],
37
38
  thresholds: { "run.didNotFail": 1, "text.includes": 0.5 },
38
39
  });
39
40
  ```
40
41
 
41
- A case below a threshold gets the `scored` verdict — reported, but only fatal under `eve eval --strict`.
42
+ An eval below a threshold gets the `scored` verdict — reported, but only fatal under `eve eval --strict`.
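To make the threshold comparison concrete, here is a minimal sketch of the check, assuming scores are keyed by the same names used in `thresholds`. `thresholdMisses` is a hypothetical helper, not eve's API. Treating a score equal to its threshold as passing, and ignoring `null` or missing scores, are both assumptions.

```typescript
type Scores = Record<string, number | null>;
type Thresholds = Record<string, number>;

// Returns the names of scores that fall below their configured threshold.
function thresholdMisses(scores: Scores, thresholds: Thresholds): string[] {
  return Object.entries(thresholds)
    .filter(([name, minimum]) => {
      const value = scores[name];
      // Assumption: a null or missing score is skipped, not counted as a miss.
      return typeof value === "number" && value < minimum;
    })
    .map(([name]) => name);
}

// Thresholds taken from the example above; the score values are made up.
const thresholds: Thresholds = { "run.didNotFail": 1, "text.includes": 0.5 };
const misses = thresholdMisses({ "run.didNotFail": 1, "text.includes": 0.25 }, thresholds);
console.log(misses);
```

Here `text.includes` is the only miss, which is the situation that would yield the `scored` verdict.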
42
43
 
43
44
  ## The scorer model
44
45
 
45
- `model` is only required when a model-backed scorer (one of the `Autoevals` wrappers) is present without its own per-scorer model override. It is the scorer model, not the agent's: Eve only uses it for model-backed scoring, never to swap out the agent under test. Pass a string id (e.g. `"anthropic/claude-opus-4.8"`) to route through the Vercel AI Gateway, or hand it an AI SDK model instance to use that directly.
46
+ Model-backed scorers (the `Autoevals` wrappers) need a judge model — the scorer model, not the agent's. Eve only uses it for scoring, never to swap out the agent under test. The default lives in [`evals.config.ts`](./overview#evalsconfigts) as the required `model`, so most evals inherit it without setting anything. Pass a string id (e.g. `"anthropic/claude-opus-4.8"`) to route through the Vercel AI Gateway, or an AI SDK model instance to use it directly.
46
47
 
47
- For provider-specific scorer-model settings, use `modelOptions.providerOptions`. Individual Autoevals scorers can also take their own `model` / `modelOptions`, which win over the eval default.
48
+ Override the default on a single eval by setting that eval's own `model`. For provider-specific scorer-model settings, use `modelOptions.providerOptions`. Individual Autoevals scorers can also take their own `model` / `modelOptions`, which win over both the eval and config defaults.
48
49
 
49
50
  ## Concurrency and timeouts
50
51
 
51
- `maxConcurrency` caps parallelism and `timeoutMs` bounds each case. Leave them off and `eve eval` runs up to 8 cases per eval at once. Lower `maxConcurrency` when cases contend for a shared resource: a rate-limited connection, or a sandbox-heavy fixture.
52
+ `timeoutMs` bounds one eval's execution: the eval's own value takes precedence over the `evals.config.ts` default, and `eve eval --timeout <ms>` overrides both for a run. The runner executes up to 8 evals at once; set a default `maxConcurrency` in `evals.config.ts` or pass `--max-concurrency <n>` (which wins) to change that, and lower it when evals contend for a shared resource such as a rate-limited connection or a sandbox-heavy fixture.
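The precedence described here can be sketched with nullish coalescing. The function and field names below are hypothetical, not eve's API. The sketch assumes an unset layer is `undefined`, and it does not guess what happens when no timeout is set anywhere.

```typescript
interface TimeoutLayers {
  cli?: number;     // value of --timeout <ms>, if passed
  evalOwn?: number; // timeoutMs on the eval itself
  config?: number;  // default from evals.config.ts
}

// CLI flag first, then the eval's own value, then the config default.
function resolveTimeoutMs({ cli, evalOwn, config }: TimeoutLayers): number | undefined {
  return cli ?? evalOwn ?? config;
}

// The documented default when nothing else is set.
const DEFAULT_MAX_CONCURRENCY = 8;

// CLI flag first, then the config default, then 8.
function resolveMaxConcurrency(cli?: number, config?: number): number {
  return cli ?? config ?? DEFAULT_MAX_CONCURRENCY;
}

console.log(resolveTimeoutMs({ evalOwn: 30_000, config: 60_000 }));
console.log(resolveMaxConcurrency(undefined, 2));
```

Using `??` rather than `||` keeps an explicit `0` from silently falling through to the next layer.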
52
53
 
53
54
  ## What to read next
54
55
 
@@ -1,30 +1,25 @@
1
1
  ---
2
2
  title: "Targets and requirements"
3
- description: "Point evals at a local dev server or a deployment, and gate cases on target capabilities with requires."
3
+ description: "Point evals at a local dev server or a deployment, and gate evals on target capabilities with requires."
4
4
  ---
5
5
 
6
6
  An eval target is always an HTTP URL. `eve eval` starts a local dev server, while `eve eval --url <url>` runs against an existing server or deployment — the same eval files work for both, which is what makes evals usable as end-to-end tests in CI.
7
7
 
8
- The runner polls `/eve/v1/health`, verifies `/eve/v1/info`, and exposes the live target as `ctx.target` inside scripted cases.
8
+ The runner polls `/eve/v1/health`, verifies `/eve/v1/info`, and exposes the live target as `ctx.target` inside scripted evals.
9
9
 
10
10
  ## Target helpers
11
11
 
12
- ```ts title="evals/schedules.eval.ts"
12
+ ```ts title="evals/heartbeat.eval.ts"
13
13
  import { defineEval } from "eve/evals";
14
14
 
15
15
  export default defineEval({
16
16
  requires: ["mockModels", "devRoutes"],
17
17
  scores: [],
18
- cases: [
19
- {
20
- id: "heartbeat",
21
- async run({ target }) {
22
- const { sessionIds } = await target.dispatchSchedule("heartbeat");
23
- const session = await target.attachSession(sessionIds[0]!);
24
- return session.events;
25
- },
26
- },
27
- ],
18
+ async run({ target }) {
19
+ const { sessionIds } = await target.dispatchSchedule("heartbeat");
20
+ const session = await target.attachSession(sessionIds[0]!);
21
+ return session.events;
22
+ },
28
23
  });
29
24
  ```
30
25
 
@@ -36,7 +31,7 @@ Sessions attached this way are full `EveEvalSession`s: you can keep driving them
36
31
 
37
32
  ## Requirements
38
33
 
39
- Use `requires` to declare assumptions the runner verifies before executing a case:
34
+ Use `requires` to declare assumptions the runner verifies before executing an eval:
40
35
 
41
36
  | Requirement | Means |
42
37
  | -------------- | --------------------------------------------------------------------- |
@@ -44,13 +39,13 @@ Use `requires` to declare assumptions the runner verifies before executing a cas
44
39
  | `"devRoutes"` | `/eve/v1/info` reports dev-only routes are mounted |
45
40
  | `"env:NAME"` | The eval process has environment variable `NAME` set |
46
41
 
47
- Eval-level requirements apply to every case; case-level requirements append. Unmet requirements produce a visible `skipped` verdict and do not affect the exit code. Pass `--no-skips` when a CI leg must prove full coverage.
42
+ Unmet requirements produce a visible `skipped` verdict and do not affect the exit code. Pass `--no-skips` when a CI leg must prove full coverage.
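A minimal sketch of the gate, under stated assumptions: the `TargetInfo` shape is hypothetical (the real `/eve/v1/info` payload is not shown here), the environment is passed in as a plain object, and `unmetRequirements` is not an eve function. It only illustrates how the three requirement forms could be checked before an eval runs.

```typescript
// Hypothetical shape standing in for what /eve/v1/info reports.
interface TargetInfo {
  mockModels: boolean;
  devRoutes: boolean;
}

type Env = Record<string, string | undefined>;

// Returns the requirements that are not satisfied; an empty array means "run".
function unmetRequirements(requires: string[], info: TargetInfo, env: Env): string[] {
  return requires.filter((requirement) => {
    if (requirement === "mockModels") return !info.mockModels;
    if (requirement === "devRoutes") return !info.devRoutes;
    if (requirement.startsWith("env:")) {
      const value = env[requirement.slice("env:".length)];
      // Assumption: an empty string counts as unset.
      return value === undefined || value === "";
    }
    // Assumption: unknown requirement names are treated as unmet.
    return true;
  });
}

const unmet = unmetRequirements(
  ["mockModels", "env:AI_GATEWAY_API_KEY"],
  { mockModels: true, devRoutes: false },
  {},
);
console.log(unmet.length === 0 ? "run" : `skipped: ${unmet.join(", ")}`);
```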
48
43
 
49
44
  ## Mock models
50
45
 
51
- Deterministic evals — the kind you want in CI — should not depend on a live model. `eve eval --mock-models` starts the local dev server with deterministic authored models, and `requires: ["mockModels"]` makes the dependency explicit so the case skips instead of flaking anywhere else.
46
+ Deterministic evals — the kind you want in CI — should not depend on a live model. `eve eval --mock-models` starts the local dev server with deterministic authored models, and `requires: ["mockModels"]` makes the dependency explicit so the eval skips instead of flaking anywhere else.
52
47
 
53
- `--mock-models` is invalid with `--url` because remote target capabilities are discovered, not set by the runner. For cases that genuinely need a real model — judging nuanced behavior, exercising provider-side tools — gate them on credentials instead (`requires: ["env:AI_GATEWAY_API_KEY"]`) and keep them in their own eval file so a tag filter can select or exclude them.
48
+ `--mock-models` is invalid with `--url` because remote target capabilities are discovered, not set by the runner. For evals that genuinely need a real model — judging nuanced behavior, exercising provider-side tools — gate them on credentials instead (`requires: ["env:AI_GATEWAY_API_KEY"]`) and keep them in their own eval files so a tag filter can select or exclude them.
54
49
 
55
50
  ## What to read next
56
51
 
@@ -96,19 +96,18 @@ Local dev keeps immutable runtime source snapshots under `.eve/dev-runtime/snaps
96
96
  eve eval [evalId...] [--url <url>] [options]
97
97
  ```
98
98
 
99
- Runs all discovered evals when no eval ids are given. Exits `0` when every case passed its checks, `1` when any case failed (a failed check, an execution error, or a `--strict` threshold miss), `2` on configuration errors.
100
-
101
- | Flag | Effect |
102
- | ----------------------- | ----------------------------------------------------- |
103
- | `--url <url>` | Remote agent URL (skip local host startup) |
104
- | `--tag <tag...>` | Run only cases (or evals) carrying a tag |
105
- | `--case <id...>` | Run only specific case ids |
106
- | `--strict` | Below-threshold scores also fail the exit code |
107
- | `--list` | Print discovered evals and cases without running them |
108
- | `--timeout <ms>` | Per-case timeout in milliseconds |
109
- | `--max-concurrency <n>` | Max concurrent case executions per eval |
110
- | `--json` | Output results as JSON |
111
- | `--skip-report` | Skip eval-defined reporters (e.g. Braintrust) |
99
+ Runs all discovered evals when no eval ids are given; ids match exactly or by directory prefix (`eve eval weather` runs everything under `evals/weather/`). Exits `0` when every eval passed its checks, `1` when any eval failed (a failed check, an execution error, or a `--strict` threshold miss), `2` on configuration errors.
100
+
101
+ | Flag | Effect |
102
+ | ----------------------- | ---------------------------------------------- |
103
+ | `--url <url>` | Remote agent URL (skip local host startup) |
104
+ | `--tag <tag...>` | Run only evals carrying a tag |
105
+ | `--strict` | Below-threshold scores also fail the exit code |
106
+ | `--list` | Print discovered evals without running them |
107
+ | `--timeout <ms>` | Per-eval timeout in milliseconds |
108
+ | `--max-concurrency <n>` | Max concurrent eval executions (default 8) |
109
+ | `--json` | Output results as JSON |
110
+ | `--skip-report` | Skip eval-defined reporters (e.g. Braintrust) |
112
111
 
113
112
  See [Evals](../evals/overview) for authoring evals.
114
113