eve 0.6.0-beta.9 → 0.7.2

This diff shows the content of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their respective public registries.
Files changed (650)
  1. package/CHANGELOG.md +281 -0
  2. package/README.md +9 -6
  3. package/dist/docs/public/README.md +17 -12
  4. package/dist/docs/public/agent-config.md +10 -10
  5. package/dist/docs/public/channels/custom.mdx +4 -4
  6. package/dist/docs/public/channels/discord.mdx +1 -1
  7. package/dist/docs/public/channels/eve.mdx +10 -10
  8. package/dist/docs/public/channels/github.mdx +1 -1
  9. package/dist/docs/public/channels/overview.mdx +21 -15
  10. package/dist/docs/public/channels/slack.mdx +16 -4
  11. package/dist/docs/public/channels/teams.mdx +1 -1
  12. package/dist/docs/public/channels/telegram.mdx +1 -1
  13. package/dist/docs/public/channels/twilio.mdx +1 -1
  14. package/dist/docs/public/{advanced → concepts}/context-control.md +3 -3
  15. package/dist/docs/public/{advanced → concepts}/default-harness.md +5 -5
  16. package/dist/docs/public/{advanced → concepts}/execution-model-and-durability.md +3 -1
  17. package/dist/docs/public/concepts/meta.json +10 -0
  18. package/dist/docs/public/{advanced → concepts}/security-model.md +3 -3
  19. package/dist/docs/public/{advanced → concepts}/sessions-runs-and-streaming.md +7 -7
  20. package/dist/docs/public/connections.mdx +6 -4
  21. package/dist/docs/public/evals/assertions.mdx +108 -0
  22. package/dist/docs/public/evals/cases.mdx +143 -0
  23. package/dist/docs/public/evals/judge.mdx +94 -0
  24. package/dist/docs/public/evals/meta.json +4 -0
  25. package/dist/docs/public/evals/overview.mdx +118 -0
  26. package/dist/docs/public/evals/reporters.mdx +62 -0
  27. package/dist/docs/public/evals/running.mdx +63 -0
  28. package/dist/docs/public/evals/targets.mdx +54 -0
  29. package/dist/docs/public/getting-started.mdx +38 -33
  30. package/dist/docs/public/{advanced → guides}/auth-and-route-protection.md +5 -3
  31. package/dist/docs/public/{client → guides/client}/continuations.mdx +2 -2
  32. package/dist/docs/public/{client → guides/client}/messages.mdx +1 -1
  33. package/dist/docs/public/{client → guides/client}/meta.json +1 -1
  34. package/dist/docs/public/{client → guides/client}/output-schema.mdx +2 -2
  35. package/dist/docs/public/{client → guides/client}/overview.mdx +5 -5
  36. package/dist/docs/public/{client → guides/client}/streaming.mdx +1 -1
  37. package/dist/docs/public/{advanced → guides}/deployment.md +9 -1
  38. package/dist/docs/public/guides/dev-tui.md +50 -0
  39. package/dist/docs/public/{advanced → guides}/dynamic-capabilities.md +1 -1
  40. package/dist/docs/public/{advanced → guides}/dynamic-workflows.md +1 -1
  41. package/dist/docs/public/{frontend → guides/frontend}/nextjs.mdx +16 -7
  42. package/dist/docs/public/{frontend → guides/frontend}/nuxt.mdx +7 -7
  43. package/dist/docs/public/{frontend → guides/frontend}/overview.mdx +6 -6
  44. package/dist/docs/public/{frontend → guides/frontend}/sveltekit.mdx +5 -5
  45. package/dist/docs/public/{frontend → guides/frontend}/use-eve-agent-svelte.mdx +2 -2
  46. package/dist/docs/public/{frontend → guides/frontend}/use-eve-agent-vue.mdx +2 -2
  47. package/dist/docs/public/{advanced → guides}/hooks.md +2 -2
  48. package/dist/docs/public/{advanced → guides}/instrumentation.md +3 -1
  49. package/dist/docs/public/{advanced → guides}/meta.json +8 -12
  50. package/dist/docs/public/{advanced → guides}/session-context.md +3 -3
  51. package/dist/docs/public/{advanced → guides}/state.md +1 -1
  52. package/dist/docs/public/instructions.mdx +2 -2
  53. package/dist/docs/public/introduction.md +5 -2
  54. package/dist/docs/public/meta.json +4 -3
  55. package/dist/docs/public/reference/cli.md +35 -19
  56. package/dist/docs/public/reference/meta.json +1 -1
  57. package/dist/docs/public/reference/project-layout.md +5 -1
  58. package/dist/docs/public/reference/typescript-api.md +27 -23
  59. package/dist/docs/public/sandbox.mdx +1 -1
  60. package/dist/docs/public/schedules.mdx +2 -2
  61. package/dist/docs/public/skills.mdx +3 -3
  62. package/dist/docs/public/subagents.mdx +3 -3
  63. package/dist/docs/public/tools.mdx +4 -8
  64. package/dist/docs/public/tutorial/connect-a-warehouse.mdx +3 -3
  65. package/dist/docs/public/tutorial/first-agent.mdx +6 -3
  66. package/dist/docs/public/tutorial/guard-the-spend.mdx +1 -1
  67. package/dist/docs/public/tutorial/how-it-runs.mdx +2 -2
  68. package/dist/docs/public/tutorial/meta.json +1 -1
  69. package/dist/docs/public/tutorial/query-sample-data.mdx +1 -1
  70. package/dist/docs/public/tutorial/remember-definitions.mdx +3 -3
  71. package/dist/docs/public/tutorial/run-analysis.mdx +1 -1
  72. package/dist/docs/public/tutorial/ship-it.mdx +4 -4
  73. package/dist/docs/public/tutorial/team-playbooks.mdx +3 -3
  74. package/dist/src/chunks/{use-eve-agent-DCZbkLG7.js → use-eve-agent-DErQj5hs.js} +125 -37
  75. package/dist/src/chunks/{use-eve-agent-DoheC4_o.js → use-eve-agent-DoR8C4i6.js} +125 -37
  76. package/dist/src/cli/banner.d.ts +7 -0
  77. package/dist/src/cli/banner.js +1 -0
  78. package/dist/src/cli/commands/channel-add-conflicts.d.ts +1 -1
  79. package/dist/src/cli/commands/channels.d.ts +9 -6
  80. package/dist/src/cli/commands/channels.js +1 -1
  81. package/dist/src/cli/commands/deploy.d.ts +21 -0
  82. package/dist/src/cli/commands/deploy.js +1 -0
  83. package/dist/src/cli/commands/init-git.d.ts +15 -0
  84. package/dist/src/cli/commands/init-git.js +1 -0
  85. package/dist/src/cli/commands/init.d.ts +29 -0
  86. package/dist/src/cli/commands/init.js +1 -0
  87. package/dist/src/cli/commands/link.d.ts +21 -0
  88. package/dist/src/cli/commands/link.js +1 -0
  89. package/dist/src/cli/commands/preconditions.d.ts +7 -0
  90. package/dist/src/cli/commands/preconditions.js +1 -0
  91. package/dist/src/cli/commands/register-project-commands.d.ts +12 -0
  92. package/dist/src/cli/commands/register-project-commands.js +1 -0
  93. package/dist/src/cli/dev/tui/agent-header.d.ts +15 -9
  94. package/dist/src/cli/dev/tui/agent-header.js +1 -1
  95. package/dist/src/cli/dev/tui/blocks.d.ts +1 -1
  96. package/dist/src/cli/dev/tui/blocks.js +3 -2
  97. package/dist/src/cli/dev/tui/command-typeahead.d.ts +47 -0
  98. package/dist/src/cli/dev/tui/command-typeahead.js +1 -0
  99. package/dist/src/cli/dev/tui/dev-rebuild-status.d.ts +21 -0
  100. package/dist/src/cli/dev/tui/dev-rebuild-status.js +1 -0
  101. package/dist/src/cli/dev/tui/errors.d.ts +18 -0
  102. package/dist/src/cli/dev/tui/errors.js +1 -1
  103. package/dist/src/cli/dev/tui/prompt-command-handler.d.ts +14 -0
  104. package/dist/src/cli/dev/tui/prompt-command-handler.js +1 -0
  105. package/dist/src/cli/dev/tui/prompt-commands.d.ts +54 -0
  106. package/dist/src/cli/dev/tui/prompt-commands.js +2 -0
  107. package/dist/src/cli/dev/tui/runner.d.ts +64 -7
  108. package/dist/src/cli/dev/tui/runner.js +1 -1
  109. package/dist/src/cli/dev/tui/setup-commands.d.ts +48 -0
  110. package/dist/src/cli/dev/tui/setup-commands.js +2 -0
  111. package/dist/src/cli/dev/tui/setup-flow.d.ts +35 -0
  112. package/dist/src/cli/dev/tui/setup-issues.d.ts +40 -0
  113. package/dist/src/cli/dev/tui/setup-issues.js +1 -0
  114. package/dist/src/cli/dev/tui/setup-panel.d.ts +103 -0
  115. package/dist/src/cli/dev/tui/setup-panel.js +1 -0
  116. package/dist/src/cli/dev/tui/status-line.d.ts +25 -0
  117. package/dist/src/cli/dev/tui/status-line.js +1 -0
  118. package/dist/src/cli/dev/tui/stream-format.d.ts +16 -1
  119. package/dist/src/cli/dev/tui/stream-format.js +1 -1
  120. package/dist/src/cli/dev/tui/terminal-renderer.d.ts +32 -3
  121. package/dist/src/cli/dev/tui/terminal-renderer.js +5 -2
  122. package/dist/src/cli/dev/tui/test/index.d.ts +3 -1
  123. package/dist/src/cli/dev/tui/test/index.js +1 -1
  124. package/dist/src/cli/dev/tui/test/mock-terminal.d.ts +1 -0
  125. package/dist/src/cli/dev/tui/test/mock-terminal.js +1 -1
  126. package/dist/src/cli/dev/tui/theme.d.ts +10 -0
  127. package/dist/src/cli/dev/tui/theme.js +1 -1
  128. package/dist/src/cli/dev/tui/tui-prompter.d.ts +20 -0
  129. package/dist/src/cli/dev/tui/tui-prompter.js +1 -0
  130. package/dist/src/cli/dev/tui/tui.d.ts +6 -8
  131. package/dist/src/cli/dev/tui/tui.js +1 -1
  132. package/dist/src/cli/dev/tui/types.d.ts +4 -3
  133. package/dist/src/cli/dev/tui/vercel-status.d.ts +47 -0
  134. package/dist/src/cli/dev/tui/vercel-status.js +1 -0
  135. package/dist/src/cli/run.d.ts +9 -18
  136. package/dist/src/cli/run.js +2 -2
  137. package/dist/src/client/client.d.ts +8 -0
  138. package/dist/src/client/client.js +1 -1
  139. package/dist/src/client/file-parts.d.ts +18 -0
  140. package/dist/src/client/file-parts.js +1 -0
  141. package/dist/src/client/index.d.ts +3 -2
  142. package/dist/src/client/index.js +1 -1
  143. package/dist/src/client/message-response.js +1 -1
  144. package/dist/src/client/open-stream.d.ts +6 -0
  145. package/dist/src/client/open-stream.js +1 -1
  146. package/dist/src/client/session-utils.d.ts +5 -0
  147. package/dist/src/client/session-utils.js +1 -1
  148. package/dist/src/client/session.js +1 -1
  149. package/dist/src/client/types.d.ts +9 -2
  150. package/dist/src/compiled/.vendor-stamp.json +8 -8
  151. package/dist/src/compiled/@ai-sdk/anthropic/index.d.ts +56 -31
  152. package/dist/src/compiled/@ai-sdk/anthropic/index.js +2 -2
  153. package/dist/src/compiled/@ai-sdk/google/index.js +1 -1
  154. package/dist/src/compiled/@ai-sdk/mcp/index.js +1 -1
  155. package/dist/src/compiled/@ai-sdk/openai/index.d.ts +16 -9
  156. package/dist/src/compiled/@ai-sdk/openai/index.js +2 -2
  157. package/dist/src/compiled/@ai-sdk/otel/index.js +2 -2
  158. package/dist/src/compiled/@vercel/sandbox/index.js +1 -1
  159. package/dist/src/compiled/@workflow/core/capabilities.d.ts +19 -1
  160. package/dist/src/compiled/@workflow/core/class-serialization.d.ts +32 -0
  161. package/dist/src/compiled/@workflow/core/create-hook.d.ts +37 -0
  162. package/dist/src/compiled/@workflow/core/global.d.ts +11 -1
  163. package/dist/src/compiled/@workflow/core/index.js +2 -2
  164. package/dist/src/compiled/@workflow/core/runtime/helpers.d.ts +4 -2
  165. package/dist/src/compiled/@workflow/core/runtime/start.d.ts +6 -0
  166. package/dist/src/compiled/@workflow/core/runtime/suspension-handler.d.ts +15 -2
  167. package/dist/src/compiled/@workflow/core/runtime/wait-continuation.d.ts +84 -0
  168. package/dist/src/compiled/@workflow/core/runtime/wait-until.d.ts +18 -0
  169. package/dist/src/compiled/@workflow/core/runtime.d.ts +3 -1
  170. package/dist/src/compiled/@workflow/core/runtime.js +28 -28
  171. package/dist/src/compiled/@workflow/core/serialization/types.d.ts +21 -0
  172. package/dist/src/compiled/@workflow/core/serialization.d.ts +113 -6
  173. package/dist/src/compiled/@workflow/core/symbols.d.ts +2 -0
  174. package/dist/src/compiled/@workflow/core/util.d.ts +0 -5
  175. package/dist/src/compiled/@workflow/core/version.d.ts +1 -1
  176. package/dist/src/compiled/@workflow/core/workflow/attribute-dispatcher.d.ts +6 -0
  177. package/dist/src/compiled/@workflow/core/workflow/set-attributes.d.ts +3 -4
  178. package/dist/src/compiled/@workflow/core/workflow.js +1 -1
  179. package/dist/src/compiled/@workflow/world/events.d.ts +48 -0
  180. package/dist/src/compiled/@workflow/world/index.d.ts +3 -3
  181. package/dist/src/compiled/@workflow/world/queue.d.ts +31 -2
  182. package/dist/src/compiled/@workflow/world/runs.d.ts +2 -0
  183. package/dist/src/compiled/@workflow/world/spec-version.d.ts +2 -1
  184. package/dist/src/compiled/_chunks/workflow/attribute-changes-DGVGRGfw.js +59 -0
  185. package/dist/src/compiled/_chunks/workflow/{dist-gEXVSMPU.js → dist-CkMRLaRV.js} +1 -1
  186. package/dist/src/compiled/_chunks/workflow/functions-DuPjIvMH.js +1 -0
  187. package/dist/src/compiled/_chunks/workflow/resume-hook-DMSadN9o.js +1 -0
  188. package/dist/src/compiled/_chunks/workflow/run-BRdn7zy_.js +1 -0
  189. package/dist/src/compiled/_chunks/workflow/sleep-CpXfoXLF.js +1 -0
  190. package/dist/src/compiled/just-bash/index.d.ts +4 -4
  191. package/dist/src/compiler/artifacts.js +1 -1
  192. package/dist/src/compiler/manifest.d.ts +8 -8
  193. package/dist/src/compiler/normalize-agent-config.js +1 -1
  194. package/dist/src/compiler/normalize-channel.d.ts +2 -1
  195. package/dist/src/compiler/normalize-channel.js +1 -1
  196. package/dist/src/compiler/normalize-connection.d.ts +2 -1
  197. package/dist/src/compiler/normalize-connection.js +1 -1
  198. package/dist/src/compiler/normalize-helpers.d.ts +5 -0
  199. package/dist/src/compiler/normalize-helpers.js +1 -1
  200. package/dist/src/compiler/normalize-instructions.d.ts +3 -2
  201. package/dist/src/compiler/normalize-instructions.js +1 -1
  202. package/dist/src/compiler/normalize-manifest.js +2 -2
  203. package/dist/src/compiler/normalize-sandbox.d.ts +2 -1
  204. package/dist/src/compiler/normalize-sandbox.js +1 -1
  205. package/dist/src/compiler/normalize-schedule.d.ts +2 -1
  206. package/dist/src/compiler/normalize-schedule.js +1 -1
  207. package/dist/src/compiler/normalize-skill.d.ts +2 -1
  208. package/dist/src/compiler/normalize-skill.js +1 -1
  209. package/dist/src/compiler/normalize-subagent.d.ts +4 -1
  210. package/dist/src/compiler/normalize-subagent.js +1 -1
  211. package/dist/src/compiler/normalize-tool.d.ts +2 -1
  212. package/dist/src/compiler/normalize-tool.js +1 -1
  213. package/dist/src/compiler/workspace-resources.js +1 -1
  214. package/dist/src/context/node.d.ts +1 -1
  215. package/dist/src/evals/assertions/collector.d.ts +43 -0
  216. package/dist/src/evals/assertions/collector.js +1 -0
  217. package/dist/src/evals/assertions/run.d.ts +72 -0
  218. package/dist/src/evals/assertions/run.js +2 -0
  219. package/dist/src/evals/autoevals-client.js +2 -0
  220. package/dist/src/evals/cli/eval-client.d.ts +22 -0
  221. package/dist/src/evals/cli/eval-client.js +1 -0
  222. package/dist/src/evals/cli/eval.d.ts +8 -5
  223. package/dist/src/evals/cli/eval.js +1 -1
  224. package/dist/src/evals/context.d.ts +19 -0
  225. package/dist/src/evals/context.js +1 -0
  226. package/dist/src/evals/define-eval-config.d.ts +16 -0
  227. package/dist/src/evals/define-eval-config.js +1 -0
  228. package/dist/src/evals/define-eval.d.ts +20 -0
  229. package/dist/src/evals/define-eval.js +1 -0
  230. package/dist/src/evals/expect/index.d.ts +25 -0
  231. package/dist/src/evals/expect/index.js +1 -0
  232. package/dist/src/evals/index.d.ts +6 -2
  233. package/dist/src/evals/index.js +1 -1
  234. package/dist/src/evals/judge.d.ts +20 -0
  235. package/dist/src/evals/judge.js +1 -0
  236. package/dist/src/evals/{checks/match.d.ts → match.d.ts} +17 -18
  237. package/dist/src/evals/match.js +1 -0
  238. package/dist/src/evals/reporters/index.d.ts +1 -0
  239. package/dist/src/evals/reporters/index.js +1 -1
  240. package/dist/src/evals/requirements.d.ts +3 -0
  241. package/dist/src/evals/requirements.js +1 -0
  242. package/dist/src/evals/runner/artifacts.d.ts +7 -6
  243. package/dist/src/evals/runner/artifacts.js +3 -3
  244. package/dist/src/evals/runner/discover.d.ts +31 -10
  245. package/dist/src/evals/runner/discover.js +1 -1
  246. package/dist/src/evals/runner/execute-eval.d.ts +25 -0
  247. package/dist/src/evals/runner/execute-eval.js +1 -0
  248. package/dist/src/evals/runner/execute-task.d.ts +31 -0
  249. package/dist/src/evals/runner/execute-task.js +1 -0
  250. package/dist/src/evals/runner/reporters/braintrust.d.ts +7 -5
  251. package/dist/src/evals/runner/reporters/braintrust.js +2 -2
  252. package/dist/src/evals/runner/reporters/console.d.ts +4 -4
  253. package/dist/src/evals/runner/reporters/console.js +1 -1
  254. package/dist/src/evals/runner/reporters/junit.d.ts +10 -0
  255. package/dist/src/evals/runner/reporters/junit.js +4 -0
  256. package/dist/src/evals/runner/reporters/types.d.ts +14 -8
  257. package/dist/src/evals/runner/run-evals.d.ts +38 -0
  258. package/dist/src/evals/runner/run-evals.js +1 -0
  259. package/dist/src/evals/runner/verdict.d.ts +10 -15
  260. package/dist/src/evals/runner/verdict.js +1 -1
  261. package/dist/src/evals/session.d.ts +52 -0
  262. package/dist/src/evals/session.js +1 -0
  263. package/dist/src/evals/target.d.ts +23 -0
  264. package/dist/src/evals/target.js +1 -0
  265. package/dist/src/evals/types.d.ts +294 -219
  266. package/dist/src/execution/compaction.d.ts +14 -0
  267. package/dist/src/execution/compaction.js +1 -0
  268. package/dist/src/execution/delegated-parent-notification.js +1 -1
  269. package/dist/src/execution/dispatch-runtime-actions-step.js +1 -1
  270. package/dist/src/execution/node-step.js +1 -1
  271. package/dist/src/execution/sandbox/bash-tool.d.ts +6 -6
  272. package/dist/src/execution/sandbox/bash-tool.js +1 -1
  273. package/dist/src/execution/sandbox/bindings/local.js +1 -1
  274. package/dist/src/execution/sandbox/bindings/vercel.d.ts +2 -6
  275. package/dist/src/execution/sandbox/bindings/vercel.js +1 -1
  276. package/dist/src/execution/sandbox/glob-tool.js +3 -3
  277. package/dist/src/execution/sandbox/grep-tool.js +3 -3
  278. package/dist/src/execution/sandbox/read-file-tool.js +1 -1
  279. package/dist/src/execution/subagent-adapter.js +1 -1
  280. package/dist/src/execution/tool-auth.js +1 -1
  281. package/dist/src/execution/turn-workflow.js +1 -1
  282. package/dist/src/execution/workflow-runtime.d.ts +2 -2
  283. package/dist/src/execution/workflow-runtime.js +1 -1
  284. package/dist/src/execution/workflow-steps.js +1 -1
  285. package/dist/src/harness/action-result-helpers.js +1 -1
  286. package/dist/src/harness/authorization.d.ts +26 -0
  287. package/dist/src/harness/authorization.js +1 -1
  288. package/dist/src/harness/code-mode-lifecycle.js +1 -1
  289. package/dist/src/harness/emission.d.ts +12 -5
  290. package/dist/src/harness/emission.js +1 -1
  291. package/dist/src/harness/model-call-error.d.ts +35 -6
  292. package/dist/src/harness/model-call-error.js +1 -1
  293. package/dist/src/harness/step-hooks.d.ts +10 -4
  294. package/dist/src/harness/step-hooks.js +1 -1
  295. package/dist/src/harness/tool-loop.js +1 -1
  296. package/dist/src/harness/tools.d.ts +4 -6
  297. package/dist/src/harness/tools.js +1 -1
  298. package/dist/src/harness/turn-tag-state.d.ts +4 -0
  299. package/dist/src/harness/turn-tag-state.js +1 -1
  300. package/dist/src/harness/types.d.ts +4 -15
  301. package/dist/src/internal/application/cache-metadata.js +1 -1
  302. package/dist/src/internal/application/compiled-artifacts.js +1 -1
  303. package/dist/src/internal/application/package.js +1 -1
  304. package/dist/src/internal/application/paths.js +1 -1
  305. package/dist/src/internal/authored-definition/schema-backed.js +1 -1
  306. package/dist/src/internal/authored-module-loader.d.ts +4 -1
  307. package/dist/src/internal/authored-module-loader.js +2 -2
  308. package/dist/src/internal/authored-module-map-loader.js +1 -1
  309. package/dist/src/internal/nitro/dev-runtime-artifacts.js +1 -1
  310. package/dist/src/internal/nitro/host/build-application.js +1 -1
  311. package/dist/src/internal/nitro/host/build-vercel-agent-summary.js +1 -1
  312. package/dist/src/internal/nitro/host/configure-nitro-routes.js +3 -3
  313. package/dist/src/internal/nitro/host/create-application-nitro.js +1 -1
  314. package/dist/src/internal/nitro/host/dev-authored-source-watcher.js +1 -1
  315. package/dist/src/internal/nitro/host/dev-watcher-log.d.ts +37 -0
  316. package/dist/src/internal/nitro/host/dev-watcher-log.js +1 -0
  317. package/dist/src/internal/nitro/host/ports.d.ts +8 -0
  318. package/dist/src/internal/nitro/host/ports.js +1 -0
  319. package/dist/src/internal/nitro/host/prepare-application-host.js +1 -1
  320. package/dist/src/internal/nitro/host/server-external-packages.d.ts +1 -1
  321. package/dist/src/internal/nitro/host/server-external-packages.js +1 -1
  322. package/dist/src/internal/nitro/host/start-development-server.js +1 -1
  323. package/dist/src/internal/nitro/host/start-production-server.js +1 -1
  324. package/dist/src/internal/nitro/routes/agent-info/build-agent-info-response-from-manifest.d.ts +5 -0
  325. package/dist/src/internal/nitro/routes/agent-info/build-agent-info-response-from-manifest.js +1 -0
  326. package/dist/src/internal/nitro/routes/agent-info/build-agent-info-response.d.ts +31 -2
  327. package/dist/src/internal/nitro/routes/agent-info/build-agent-info-response.js +1 -1
  328. package/dist/src/internal/nitro/routes/agent-info/load-agent-info-data.d.ts +13 -0
  329. package/dist/src/internal/nitro/routes/agent-info/load-agent-info-data.js +1 -1
  330. package/dist/src/internal/nitro/routes/info.d.ts +2 -2
  331. package/dist/src/internal/nitro/routes/info.js +1 -1
  332. package/dist/src/internal/workflow/queue-namespace.d.ts +5 -0
  333. package/dist/src/internal/workflow/queue-namespace.js +1 -0
  334. package/dist/src/internal/workflow-bundle/builder-support.js +2 -2
  335. package/dist/src/internal/workflow-bundle/builder.js +3 -5
  336. package/dist/src/internal/workflow-bundle/vercel-workflow-output.js +1 -1
  337. package/dist/src/internal/workflow-bundle/workflow-builders.d.ts +1 -1
  338. package/dist/src/internal/workflow-bundle/workflow-builders.js +1 -1
  339. package/dist/src/node_modules/.pnpm/@clack_core@1.3.1/node_modules/@clack/core/dist/index.js +4 -4
  340. package/dist/src/protocol/message.d.ts +15 -0
  341. package/dist/src/protocol/message.js +2 -2
  342. package/dist/src/public/channels/slack/api.d.ts +8 -0
  343. package/dist/src/public/channels/slack/api.js +1 -1
  344. package/dist/src/public/channels/slack/connections.d.ts +26 -18
  345. package/dist/src/public/channels/slack/connections.js +1 -1
  346. package/dist/src/public/channels/slack/defaults.d.ts +5 -2
  347. package/dist/src/public/channels/slack/defaults.js +1 -1
  348. package/dist/src/public/channels/slack/index.d.ts +1 -1
  349. package/dist/src/public/channels/slack/slackChannel.d.ts +65 -5
  350. package/dist/src/public/channels/slack/slackChannel.js +1 -1
  351. package/dist/src/public/channels/teams/defaults.js +1 -1
  352. package/dist/src/public/connections/errors.d.ts +8 -0
  353. package/dist/src/public/definitions/tool.d.ts +0 -33
  354. package/dist/src/public/next/index.d.ts +7 -1
  355. package/dist/src/public/next/index.js +1 -1
  356. package/dist/src/public/next/server.d.ts +1 -0
  357. package/dist/src/public/next/server.js +1 -1
  358. package/dist/src/public/nuxt/dev-server.js +1 -1
  359. package/dist/src/public/sveltekit/dev-server.js +1 -1
  360. package/dist/src/public/sveltekit/index.d.ts +1 -1
  361. package/dist/src/public/tools/defaults.d.ts +2 -4
  362. package/dist/src/public/tools/defaults.js +1 -1
  363. package/dist/src/public/tools/define-bash-tool.d.ts +3 -3
  364. package/dist/src/public/tools/define-bash-tool.js +1 -1
  365. package/dist/src/public/tools/define-read-file-tool.d.ts +0 -6
  366. package/dist/src/public/tools/define-read-file-tool.js +1 -1
  367. package/dist/src/public/tools/index.d.ts +2 -2
  368. package/dist/src/public/tools/index.js +1 -1
  369. package/dist/src/public/tools/internal.js +1 -1
  370. package/dist/src/runtime/actions/types.d.ts +11 -11
  371. package/dist/src/runtime/agent/mock-model-adapter.js +1 -1
  372. package/dist/src/runtime/agent/mock-model-fixtures.js +3 -2
  373. package/dist/src/runtime/agent/mock-model-skill-selection.js +3 -4
  374. package/dist/src/runtime/connections/callback-route.js +1 -1
  375. package/dist/src/runtime/connections/mcp-client.js +1 -1
  376. package/dist/src/runtime/connections/scoped-authorization.d.ts +21 -5
  377. package/dist/src/runtime/connections/scoped-authorization.js +1 -1
  378. package/dist/src/runtime/connections/types.d.ts +33 -0
  379. package/dist/src/runtime/connections/validate-authorization.js +1 -1
  380. package/dist/src/runtime/framework-tools/bash.d.ts +3 -3
  381. package/dist/src/runtime/framework-tools/bash.js +1 -1
  382. package/dist/src/runtime/framework-tools/connection-search-dynamic.d.ts +1 -1
  383. package/dist/src/runtime/framework-tools/connection-search-dynamic.js +1 -1
  384. package/dist/src/runtime/framework-tools/file-state.d.ts +3 -3
  385. package/dist/src/runtime/framework-tools/index.js +1 -1
  386. package/dist/src/runtime/framework-tools/read-file.js +2 -2
  387. package/dist/src/runtime/framework-tools/todo.d.ts +7 -0
  388. package/dist/src/runtime/framework-tools/todo.js +2 -2
  389. package/dist/src/runtime/governance/auth/http-basic.js +1 -1
  390. package/dist/src/runtime/input/types.d.ts +1 -1
  391. package/dist/src/runtime/resolve-tool.d.ts +2 -2
  392. package/dist/src/runtime/resolve-tool.js +1 -1
  393. package/dist/src/runtime/sandbox/keys.js +1 -1
  394. package/dist/src/runtime/session-callback-route.js +1 -1
  395. package/dist/src/runtime/types.d.ts +1 -7
  396. package/dist/src/services/dev-client/client-options.d.ts +8 -0
  397. package/dist/src/services/dev-client/client-options.js +1 -0
  398. package/dist/src/services/dev-client/runtime-artifacts.d.ts +13 -0
  399. package/dist/src/services/dev-client/runtime-artifacts.js +1 -0
  400. package/dist/src/services/dev-client.d.ts +13 -46
  401. package/dist/src/services/dev-client.js +1 -1
  402. package/dist/src/setup/ask.d.ts +205 -0
  403. package/dist/src/setup/ask.js +1 -0
  404. package/dist/src/setup/boxes/add-channels.d.ts +100 -16
  405. package/dist/src/setup/boxes/add-channels.js +2 -1
  406. package/dist/src/setup/boxes/add-connections.d.ts +13 -23
  407. package/dist/src/setup/boxes/add-connections.js +1 -1
  408. package/dist/src/setup/boxes/apply-ai-gateway-credential.d.ts +2 -2
  409. package/dist/src/setup/boxes/apply-ai-gateway-credential.js +1 -1
  410. package/dist/src/setup/boxes/deploy-project.d.ts +46 -14
  411. package/dist/src/setup/boxes/deploy-project.js +1 -1
  412. package/dist/src/setup/boxes/detect-ai-gateway.d.ts +10 -3
  413. package/dist/src/setup/boxes/detect-ai-gateway.js +1 -1
  414. package/dist/src/setup/boxes/link-project.d.ts +3 -3
  415. package/dist/src/setup/boxes/link-project.js +1 -1
  416. package/dist/src/setup/boxes/one-shot-next-steps.d.ts +18 -0
  417. package/dist/src/setup/boxes/one-shot-next-steps.js +2 -0
  418. package/dist/src/setup/boxes/preflight.d.ts +14 -6
  419. package/dist/src/setup/boxes/preflight.js +1 -1
  420. package/dist/src/setup/boxes/resolve-provisioning.d.ts +36 -8
  421. package/dist/src/setup/boxes/resolve-provisioning.js +1 -1
  422. package/dist/src/setup/boxes/resolve-target.d.ts +25 -8
  423. package/dist/src/setup/boxes/resolve-target.js +1 -1
  424. package/dist/src/setup/boxes/scaffold.d.ts +12 -6
  425. package/dist/src/setup/boxes/scaffold.js +1 -1
  426. package/dist/src/setup/boxes/select-channels.d.ts +38 -9
  427. package/dist/src/setup/boxes/select-channels.js +1 -1
  428. package/dist/src/setup/boxes/select-chat.d.ts +15 -11
  429. package/dist/src/setup/boxes/select-chat.js +1 -1
  430. package/dist/src/setup/boxes/select-connections.d.ts +30 -0
  431. package/dist/src/setup/boxes/select-connections.js +1 -0
  432. package/dist/src/setup/boxes/select-model.d.ts +18 -14
  433. package/dist/src/setup/boxes/select-model.js +1 -1
  434. package/dist/src/setup/boxes/select-setup-mode.d.ts +32 -0
  435. package/dist/src/setup/boxes/select-setup-mode.js +1 -0
  436. package/dist/src/setup/channel-add-conflicts.d.ts +28 -0
  437. package/dist/src/setup/channel-add-conflicts.js +1 -0
  438. package/dist/src/setup/cli/channel-setup-prompter.d.ts +23 -0
  439. package/dist/src/setup/cli/channel-setup-prompter.js +1 -0
  440. package/dist/src/setup/cli/connection-add-prompter.d.ts +8 -0
  441. package/dist/src/setup/cli/connection-add-prompter.js +1 -0
  442. package/dist/src/setup/{scaffold/cli → cli}/index.d.ts +4 -3
  443. package/dist/src/setup/cli/index.js +1 -0
  444. package/dist/src/setup/{scaffold/cli → cli}/prompt-ui.d.ts +39 -15
  445. package/dist/src/setup/cli/prompt-ui.js +5 -0
  446. package/dist/src/setup/{scaffold/cli → cli}/rail-log.d.ts +2 -0
  447. package/dist/src/setup/{scaffold/cli → cli}/rail-log.js +2 -2
  448. package/dist/src/setup/{scaffold/cli → cli}/select-component.d.ts +18 -3
  449. package/dist/src/setup/cli/select-component.js +1 -0
  450. package/dist/src/setup/cli/select-option-codec.d.ts +12 -0
  451. package/dist/src/setup/cli/select-option-codec.js +1 -0
  452. package/dist/src/setup/{scaffold/cli → cli}/select-state.d.ts +13 -1
  453. package/dist/src/setup/cli/select-state.js +1 -0
  454. package/dist/src/setup/cli/whimsy.d.ts +16 -0
  455. package/dist/src/setup/cli/whimsy.js +1 -0
  456. package/dist/src/setup/{scaffold/steps/setup-connection.d.ts → connection-connector.d.ts} +3 -2
  457. package/dist/src/setup/connection-connector.js +1 -0
  458. package/dist/src/setup/flows/channels.d.ts +43 -0
  459. package/dist/src/setup/flows/channels.js +1 -0
  460. package/dist/src/setup/flows/deploy.d.ts +40 -0
  461. package/dist/src/setup/flows/deploy.js +1 -0
  462. package/dist/src/setup/flows/in-project.d.ts +16 -0
  463. package/dist/src/setup/flows/in-project.js +1 -0
  464. package/dist/src/setup/flows/link.d.ts +43 -0
  465. package/dist/src/setup/flows/link.js +1 -0
  466. package/dist/src/setup/flows/model.d.ts +112 -0
  467. package/dist/src/setup/flows/model.js +1 -0
  468. package/dist/src/setup/flows/vercel.d.ts +31 -0
  469. package/dist/src/setup/flows/vercel.js +2 -0
  470. package/dist/src/setup/gateway-models.js +1 -1
  471. package/dist/src/setup/headless.d.ts +1 -1
  472. package/dist/src/setup/index.d.ts +10 -4
  473. package/dist/src/setup/index.js +1 -1
  474. package/dist/src/setup/onboarding.d.ts +7 -4
  475. package/dist/src/setup/onboarding.js +1 -1
  476. package/dist/src/setup/package-manager.d.ts +27 -0
  477. package/dist/src/setup/package-manager.js +1 -0
  478. package/dist/src/setup/primitives/index.d.ts +3 -0
  479. package/dist/src/setup/primitives/index.js +1 -0
  480. package/dist/src/setup/primitives/pm/bun.d.ts +10 -0
  481. package/dist/src/setup/primitives/pm/bun.js +1 -0
  482. package/dist/src/setup/primitives/pm/index.d.ts +11 -0
  483. package/dist/src/setup/primitives/pm/index.js +1 -0
  484. package/dist/src/setup/primitives/pm/npm.d.ts +10 -0
  485. package/dist/src/setup/primitives/pm/npm.js +1 -0
  486. package/dist/src/setup/primitives/pm/pnpm.d.ts +27 -0
  487. package/dist/src/setup/primitives/pm/pnpm.js +8 -0
  488. package/dist/src/setup/primitives/pm/run.d.ts +23 -0
  489. package/dist/src/setup/primitives/pm/run.js +1 -0
  490. package/dist/src/setup/primitives/pm/shared.d.ts +8 -0
  491. package/dist/src/setup/primitives/pm/shared.js +1 -0
  492. package/dist/src/setup/primitives/pm/types.d.ts +37 -0
  493. package/dist/src/setup/primitives/pm/types.js +1 -0
  494. package/dist/src/setup/primitives/pm/yarn.d.ts +10 -0
  495. package/dist/src/setup/primitives/pm/yarn.js +1 -0
  496. package/dist/src/setup/primitives/run-pnpm.d.ts +1 -0
  497. package/dist/src/setup/primitives/run-pnpm.js +1 -0
  498. package/dist/src/setup/{scaffold/primitives → primitives}/run-vercel.d.ts +7 -0
  499. package/dist/src/setup/primitives/run-vercel.js +1 -0
  500. package/dist/src/setup/project-name.d.ts +4 -0
  501. package/dist/src/setup/project-name.js +1 -0
  502. package/dist/src/setup/project-resolution.d.ts +54 -0
  503. package/dist/src/setup/project-resolution.js +1 -0
  504. package/dist/src/setup/prompter.d.ts +52 -4
  505. package/dist/src/setup/prompter.js +1 -1
  506. package/dist/src/setup/quit-guard.d.ts +1 -1
  507. package/dist/src/setup/run-vercel-link.d.ts +1 -1
  508. package/dist/src/setup/run-vercel-link.js +1 -1
  509. package/dist/src/setup/runner.d.ts +5 -4
  510. package/dist/src/setup/runner.js +1 -1
  511. package/dist/src/setup/scaffold/channels-catalog.d.ts +3 -3
  512. package/dist/src/setup/scaffold/channels-catalog.js +1 -1
  513. package/dist/src/setup/scaffold/create/add-to-project.d.ts +26 -0
  514. package/dist/src/setup/scaffold/create/add-to-project.js +1 -0
  515. package/dist/src/setup/scaffold/create/project.d.ts +54 -0
  516. package/dist/src/setup/scaffold/create/project.js +80 -0
  517. package/dist/src/setup/scaffold/index.d.ts +4 -4
  518. package/dist/src/setup/scaffold/index.js +1 -1
  519. package/dist/src/setup/scaffold/{channels.d.ts → update/channels.d.ts} +11 -0
  520. package/dist/src/setup/scaffold/update/channels.js +7 -0
  521. package/dist/src/setup/scaffold/{connections.d.ts → update/connections.d.ts} +1 -1
  522. package/dist/src/setup/scaffold/update/connections.js +21 -0
  523. package/dist/src/setup/scaffold/version-tokens.d.ts +11 -0
  524. package/dist/src/setup/scaffold/version-tokens.js +1 -0
  525. package/dist/src/setup/{scaffold/steps/setup-slackbot.d.ts → slackbot.d.ts} +24 -20
  526. package/dist/src/setup/slackbot.js +1 -0
  527. package/dist/src/setup/state.d.ts +62 -15
  528. package/dist/src/setup/state.js +1 -1
  529. package/dist/src/setup/step.d.ts +9 -18
  530. package/dist/src/setup/vercel-project.d.ts +15 -8
  531. package/dist/src/setup/vercel-project.js +1 -1
  532. package/dist/src/shared/agent-definition.d.ts +5 -3
  533. package/dist/src/shared/default-agent-model.d.ts +5 -0
  534. package/dist/src/shared/default-agent-model.js +1 -0
  535. package/dist/src/source-change/apply-model-name.d.ts +25 -0
  536. package/dist/src/source-change/apply-model-name.js +2 -0
  537. package/dist/src/source-change/static-source-change.d.ts +36 -0
  538. package/dist/src/source-change/static-source-change.js +1 -0
  539. package/dist/src/svelte/index.js +1 -1
  540. package/dist/src/svelte/use-eve-agent.js +1 -1
  541. package/dist/src/vue/index.js +1 -1
  542. package/dist/src/vue/use-eve-agent.js +1 -1
  543. package/package.json +22 -42
  544. package/dist/docs/evals-v2-plan.md +0 -939
  545. package/dist/docs/public/advanced/dev-tui.md +0 -52
  546. package/dist/docs/public/advanced/evals.md +0 -158
  547. package/dist/docs/public/reference/faqs.md +0 -48
  548. package/dist/src/cli/commands/setup.d.ts +0 -55
  549. package/dist/src/cli/commands/setup.js +0 -1
  550. package/dist/src/cli/dev/repl/input-requests.d.ts +0 -38
  551. package/dist/src/cli/dev/repl/input-requests.js +0 -1
  552. package/dist/src/cli/dev/repl/input.d.ts +0 -19
  553. package/dist/src/cli/dev/repl/input.js +0 -1
  554. package/dist/src/cli/dev/repl/repl.d.ts +0 -62
  555. package/dist/src/cli/dev/repl/repl.js +0 -2
  556. package/dist/src/cli/dev/repl/terminal.d.ts +0 -21
  557. package/dist/src/cli/dev/repl/terminal.js +0 -5
  558. package/dist/src/compiled/_chunks/workflow/resume-hook-0Zk0zSvq.js +0 -12
  559. package/dist/src/compiled/_chunks/workflow/sleep-DXZr2BgM.js +0 -1
  560. package/dist/src/compiled/_chunks/workflow/symbols-BWCAoPHE.js +0 -48
  561. package/dist/src/evals/checks/checks.d.ts +0 -66
  562. package/dist/src/evals/checks/checks.js +0 -2
  563. package/dist/src/evals/checks/index.d.ts +0 -21
  564. package/dist/src/evals/checks/index.js +0 -1
  565. package/dist/src/evals/checks/match.js +0 -1
  566. package/dist/src/evals/define-eval-suite.d.ts +0 -18
  567. package/dist/src/evals/define-eval-suite.js +0 -1
  568. package/dist/src/evals/runner/execute-case.d.ts +0 -23
  569. package/dist/src/evals/runner/execute-case.js +0 -1
  570. package/dist/src/evals/runner/execute-suite.d.ts +0 -24
  571. package/dist/src/evals/runner/execute-suite.js +0 -1
  572. package/dist/src/evals/scorers/autoevals-client.js +0 -2
  573. package/dist/src/evals/scorers/autoevals.d.ts +0 -58
  574. package/dist/src/evals/scorers/autoevals.js +0 -1
  575. package/dist/src/evals/scorers/json.d.ts +0 -10
  576. package/dist/src/evals/scorers/json.js +0 -1
  577. package/dist/src/evals/scorers/model-marker.d.ts +0 -12
  578. package/dist/src/evals/scorers/model-marker.js +0 -1
  579. package/dist/src/evals/scorers/run.d.ts +0 -24
  580. package/dist/src/evals/scorers/run.js +0 -1
  581. package/dist/src/evals/scorers/sql.d.ts +0 -9
  582. package/dist/src/evals/scorers/sql.js +0 -1
  583. package/dist/src/evals/scorers/text.d.ts +0 -18
  584. package/dist/src/evals/scorers/text.js +0 -1
  585. package/dist/src/evals/scores/index.d.ts +0 -72
  586. package/dist/src/evals/scores/index.js +0 -1
  587. package/dist/src/execution/tool-compaction.d.ts +0 -9
  588. package/dist/src/execution/tool-compaction.js +0 -1
  589. package/dist/src/services/dev-client/stream.d.ts +0 -5
  590. package/dist/src/services/dev-client/stream.js +0 -1
  591. package/dist/src/services/dev-client/url.d.ts +0 -11
  592. package/dist/src/services/dev-client/url.js +0 -1
  593. package/dist/src/setup/channel-setup-prompter.d.ts +0 -8
  594. package/dist/src/setup/channel-setup-prompter.js +0 -1
  595. package/dist/src/setup/scaffold/channels.js +0 -7
  596. package/dist/src/setup/scaffold/cli/channel-add-prompter.d.ts +0 -12
  597. package/dist/src/setup/scaffold/cli/channel-add-prompter.js +0 -1
  598. package/dist/src/setup/scaffold/cli/channel-setup-prompter.d.ts +0 -56
  599. package/dist/src/setup/scaffold/cli/connection-add-prompter.d.ts +0 -44
  600. package/dist/src/setup/scaffold/cli/connection-add-prompter.js +0 -1
  601. package/dist/src/setup/scaffold/cli/index.js +0 -1
  602. package/dist/src/setup/scaffold/cli/prompt-ui.js +0 -5
  603. package/dist/src/setup/scaffold/cli/select-component.js +0 -1
  604. package/dist/src/setup/scaffold/cli/select-state.js +0 -1
  605. package/dist/src/setup/scaffold/connections.js +0 -21
  606. package/dist/src/setup/scaffold/pnpm-workspace.d.ts +0 -3
  607. package/dist/src/setup/scaffold/pnpm-workspace.js +0 -11
  608. package/dist/src/setup/scaffold/primitives/detect-deployment.d.ts +0 -13
  609. package/dist/src/setup/scaffold/primitives/detect-deployment.js +0 -1
  610. package/dist/src/setup/scaffold/primitives/index.d.ts +0 -3
  611. package/dist/src/setup/scaffold/primitives/index.js +0 -1
  612. package/dist/src/setup/scaffold/primitives/pnpm-invocation.d.ts +0 -12
  613. package/dist/src/setup/scaffold/primitives/pnpm-invocation.js +0 -1
  614. package/dist/src/setup/scaffold/primitives/run-pnpm.d.ts +0 -17
  615. package/dist/src/setup/scaffold/primitives/run-pnpm.js +0 -1
  616. package/dist/src/setup/scaffold/primitives/run-vercel.js +0 -1
  617. package/dist/src/setup/scaffold/project.d.ts +0 -21
  618. package/dist/src/setup/scaffold/project.js +0 -80
  619. package/dist/src/setup/scaffold/steps/deploy-to-vercel.d.ts +0 -17
  620. package/dist/src/setup/scaffold/steps/deploy-to-vercel.js +0 -1
  621. package/dist/src/setup/scaffold/steps/index.d.ts +0 -4
  622. package/dist/src/setup/scaffold/steps/index.js +0 -1
  623. package/dist/src/setup/scaffold/steps/project-resolution.d.ts +0 -19
  624. package/dist/src/setup/scaffold/steps/project-resolution.js +0 -1
  625. package/dist/src/setup/scaffold/steps/run-add-connection.d.ts +0 -40
  626. package/dist/src/setup/scaffold/steps/run-add-connection.js +0 -1
  627. package/dist/src/setup/scaffold/steps/run-add-to-agent.d.ts +0 -81
  628. package/dist/src/setup/scaffold/steps/run-add-to-agent.js +0 -2
  629. package/dist/src/setup/scaffold/steps/setup-connection.js +0 -1
  630. package/dist/src/setup/scaffold/steps/setup-slackbot.js +0 -1
  631. /package/dist/docs/public/{frontend → guides/frontend}/meta.json +0 -0
  632. /package/dist/docs/public/{advanced → guides}/remote-agents.md +0 -0
  633. /package/dist/src/{setup/scaffold/cli/channel-setup-prompter.js → cli/dev/tui/setup-flow.js} +0 -0
  634. /package/dist/src/evals/{scorers/autoevals-client.d.ts → autoevals-client.d.ts} +0 -0
  635. /package/dist/src/setup/{scaffold/cli → cli}/command-output.d.ts +0 -0
  636. /package/dist/src/setup/{scaffold/cli → cli}/command-output.js +0 -0
  637. /package/dist/src/setup/{scaffold/human-action.d.ts → human-action.d.ts} +0 -0
  638. /package/dist/src/setup/{scaffold/human-action.js → human-action.js} +0 -0
  639. /package/dist/src/setup/{scaffold/primitives → primitives}/process-output.d.ts +0 -0
  640. /package/dist/src/setup/{scaffold/primitives → primitives}/process-output.js +0 -0
  641. /package/dist/src/setup/scaffold/{web-template.d.ts → create/web-template.d.ts} +0 -0
  642. /package/dist/src/setup/scaffold/{web-template.js → create/web-template.js} +0 -0
  643. /package/dist/src/setup/scaffold/{module-files.d.ts → update/module-files.d.ts} +0 -0
  644. /package/dist/src/setup/scaffold/{module-files.js → update/module-files.js} +0 -0
  645. /package/dist/src/setup/scaffold/{package-json.d.ts → update/package-json.d.ts} +0 -0
  646. /package/dist/src/setup/scaffold/{package-json.js → update/package-json.js} +0 -0
  647. /package/dist/src/setup/scaffold/{primitives → update}/update-connection-connector.d.ts +0 -0
  648. /package/dist/src/setup/scaffold/{primitives → update}/update-connection-connector.js +0 -0
  649. /package/dist/src/setup/scaffold/{primitives → update}/update-slack-channel.d.ts +0 -0
  650. /package/dist/src/setup/scaffold/{primitives → update}/update-slack-channel.js +0 -0
@@ -1,939 +0,0 @@
1
- # Evals v2: One Runner for Quality Evals and End-to-End Verification
2
-
3
- Status: proposal
4
- Owner: framework
5
- Scope: `packages/eve` (evals, client, CLI), `e2e/`, CI
6
-
7
- ## Summary
8
-
9
- Eve's eval suites (`defineEvalSuite` + `eve eval`) and the hand-rolled e2e smoke
10
- surface (`e2e/tests/**` + `ExampleClient`) are two implementations of the same
11
- idea: drive a real agent over HTTP and judge what happened. The eval runner has
12
- the right bones — filesystem discovery, target resolution, stream capture,
13
- concurrency, scoring, reporting, artifacts — but cannot express scripted
14
- multi-turn interactions, HITL approvals, tool-call assertions, channel ingress,
15
- or hard pass/fail. The e2e surface can express all of that, but as 78
16
- framework-less scripts with a duplicated client, triplicated stream readers, and
17
- bespoke `throw new Error(...)` assertions.
18
-
19
- This plan extends the eval API into the single runner for both jobs and then
20
- deletes the hand-rolled e2e harness. We keep the declarative dataset-and-scorer
21
- core (it is the right shape for quality evals) and add three things it is
22
- missing:
23
-
24
- 1. **An imperative interaction API** — `run(ctx)` with a typed `EvalSession`
25
- driver for multi-turn control flow, HITL responses, approvals, structured
26
- output, attachments, and multi-session scenarios. Available at suite level
27
- (`task.run`, the shared default for dataset evals) and per case
28
- (`case.run`), so one suite groups many distinct scripted behaviors the way
29
- a test file groups `it` blocks.
30
- 2. **A hard-assertion tier** — `checks`, distinct from `scores`. Check failures
31
- fail the case and the process; scores remain soft, thresholded data.
32
- 3. **A URL-shaped target model with verified requirements** — a target is
33
- always just a URL. `eve eval` obtains one (boots the dev server, as today)
34
- or `--url` brings your own (a built server in CI, a preview deployment).
35
- Suites never declare _how_ to provision the agent — they declare
36
- `requires` assumptions (mock models, dev routes, sidecar env) that the
37
- runner verifies against the live target and skips or errors on, visibly.
38
- Provisioning (build, start, env injection, sidecars, secondary agents)
39
- stays external in v1: a thin script or CI step boots things and points
40
- `eve eval --url` at them — the composition that
41
- `e2e/tests/basic-runtime/evals.ts` already proves out today.
42
-
43
- The end state: every behavior currently proven by `e2e/tests/**` (except the
44
- TUI tests, see Non-goals) is an eval suite in a fixture app's `evals/`
45
- directory, CI runs `eve eval --strict` per fixture app per matrix mode, and
46
- `e2e/lib/client.ts` plus the per-file harness code are deleted. Users get the
47
- same machinery for their own agents: smoke-test your agent the way Eve
48
- smoke-tests itself.
49
-
50
- ## Why not a `describe`/`it` runner
51
-
52
- We considered replacing `defineEvalSuite` with a vitest/`node:test`-shaped API.
53
- Rejected, for these reasons:
54
-
55
- 1. **Evals have semantics tests don't.** Scores in `[0,1]`, per-scorer
56
- thresholds, LLM judges, datasets/loaders, Braintrust experiments, per-scorer
57
- averages, JSON artifacts. `describe`/`it` collapses everything to boolean
58
- pass/fail; we would immediately reinvent scorers and reporters _inside_ `it`
59
- blocks and lose the structured result model that powers `--json` and the
60
- Braintrust reporter.
61
- 2. **Suite identity is path-derived** (repo principle 5, enforced by
62
- `scripts/guard-invariants.mjs`). One suite per `evals/<path>.eval.ts` file
63
- maps cleanly onto the filesystem-first philosophy; nested describe blocks
64
- fight it.
65
- 3. **The dataset model is the right default.** "Load 200 cases from YAML, fan
66
- out with bounded concurrency, score with a judge" is the 80% case for users.
67
- A block-based runner makes that the awkward case (loops generating `it`s).
68
- 4. **What e2e actually needs is not blocks** — it is control flow _inside a
69
- case_, a typed session driver, and assertion helpers over the event stream.
70
- All three fit inside the existing suite shape: `run` at the task and case
71
- level provides the control flow, and scripted cases give the same grouping
72
- a `describe` file gives `it` blocks (see "Scripted cases" below).
73
- 5. **Wrapping vitest violates principle 3** (wrap third-party deps; don't
74
- expose them), and writing a bespoke block runner re-solves problems the
75
- eval runner already solves (discovery, concurrency, timeouts, reporting).
76
-
77
- The one philosophical gap — "low scores are data, not failures" vs. "smoke
78
- tests must fail the build" — is resolved by making the checks/scores split
79
- first-class rather than by switching paradigms.
80
-
81
- ## Goals
82
-
83
- - Scripted multi-turn evals with branching on prior responses.
84
- - HITL: assert an agent parks on `input.requested`, respond
85
- (approve/deny/select/freeform), assert resumption and approval persistence.
86
- - First-class assertions on tool calls: name, input, output, error state,
87
- order, and count — plus subagent calls, messages, structured output, and
88
- arbitrary event predicates.
89
- - Hard pass/fail semantics suitable for CI gating, coexisting with soft scores.
90
- - One target model: any URL — the runner-booted dev server, a locally built
91
- `eve start` process, or a deployed instance — with suite-declared
92
- requirements verified against the live target instead of suite-owned
93
- provisioning.
94
- - Drive surfaces beyond the session route: channels (webhook ingress) and
95
- schedules (dev dispatch), with stream consumption for sessions the suite did
96
- not create.
97
- - One HTTP client: `eve/client` absorbs everything `e2e/lib/client.ts` does;
98
- `ExampleClient` is deleted.
99
- - Replace `e2e/tests/**` (minus TUI) with eval suites; CI becomes
100
- `eve eval --strict` runs.
101
-
102
- ## Non-goals
103
-
104
- - **TUI tests** (`e2e/tests/tui-client/`) stay as scripts. They test the TUI
105
- client and renderer, not agent behavior; an agent-eval runner is the wrong
106
- tool. They already use the public `Client` and the built TUI test harness.
107
- - **Unit/integration/scenario tiers** are unchanged. Evals replace the _e2e_
108
- tier only.
109
- - **Braintrust/autoevals integration** is unchanged in shape (still wrapped
110
- per principle 3).
111
- - **No authored suite `id`/`name`.** Identity stays path-derived; the
112
- invariant guard keeps enforcing it.
113
-
114
- ## Current state (abridged; see code for detail)
115
-
116
- - Suite API: `defineEvalSuite({ cases | load, task, scores, model, thresholds,
117
- reporters, ... })` in `packages/eve/src/evals/define-eval-suite.ts` and
118
- `types.ts`. `task` is `prompt(case)` or `messages(case) => string[]` — a
119
- static list, no branching, no HITL (`runner/execute-case.ts:38` only ever
120
- sends `{ message }`).
121
- - Facts: `EveEvalDerivedFacts` exposes tool call **names only**
122
- (`runner/derive-run-facts.ts`); inputs/outputs exist in the captured events
123
- but have no typed surface.
124
- - Exit code: execution errors only; sub-threshold scores never fail the
125
- process (`evals/cli/eval.ts`). A session parked on an approval ends
126
- `"waiting"`, which `Run.didNotFail()` counts as success.
127
- - Target: `eve eval` boots a dev server in-process or hits `--url`. No built
128
- (`eve start`) target, no env injection, no mock-model switch (only the
129
- server-side `EVE_MOCK_AUTHORED_MODELS` env var), no sidecars, no secondary
130
- targets.
131
- - Dead surface: case `tags` are documented but never read; `--all` is parsed
132
- but unused.
133
- - e2e: `e2e/lib/run.ts` + `e2e/target/local-environment.ts` own
134
- build/spawn/health/teardown; `e2e/lib/client.ts` (`ExampleClient`)
135
- duplicates `ClientSession` and adds pending-input tracking, turn-failure
136
- errors, and two retry loops papering over the POST→GET stream-registration
137
- race; assertions are ad-hoc per file.
138
-
139
- ---
140
-
141
- ## The v2 API
142
-
143
- ### Suite shape
144
-
145
- ```ts
146
- import { defineEvalSuite } from "eve/evals";
147
- import { Checks } from "eve/evals/checks";
148
- import { Run, Text } from "eve/evals/scores";
149
-
150
- export default defineEvalSuite({
151
- description: "HITL approval flows: park, approve, deny, persist.",
152
-
153
- // Suite-level checks apply to every case.
154
- checks: [Checks.didNotFail()],
155
- scores: [Run.didNotFail()],
156
-
157
- cases: [
158
- {
159
- id: "approve-then-persist",
160
- async run({ session }) {
161
- await session.send("run `pwd`");
162
- const [request] = session.expectInputRequests();
163
- await session.respond({ requestId: request.requestId, optionId: "approve" });
164
- const turn = await session.send("read the `weather-codes.md` file");
165
- return turn.message;
166
- },
167
- checks: [Checks.toolCalled("bash", { input: { command: /pwd/ } }), Checks.completed()],
168
- },
169
- {
170
- id: "deny-regates",
171
- async run({ session }) {
172
- await session.send("run `pwd`");
173
- await session.respondAll("deny");
174
- await session.send("run `pwd` again");
175
- session.expectInputRequests({ toolName: "bash" }); // re-gated after denial
176
- },
177
- checks: [Checks.waiting()],
178
- },
179
- ],
180
- });
181
- ```
182
-
183
- Changes to `EveEvalSuiteInput` (`packages/eve/src/evals/types.ts`):
184
-
185
- | Field | Change |
186
- | ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
187
- | `task.run` | New third task variant, mutually exclusive with `prompt`/`messages`. `prompt` and `messages` become sugar implemented on top of `run`. |
188
- | scripted cases | `EveEvalCase` becomes a union: a **data case** (`input`/`expected`, run through the suite `task` — today's shape) or a **scripted case** (`run`, no `input` required). `case.run` overrides `suite.task`. See "Scripted cases" below. |
189
- | `checks` | New optional array of `EveEvalCheck`, at suite level and per case (case-level appends to suite-level). Hard assertions; any failure marks the case failed and flips the CLI exit code. |
190
- | `requires` | New optional requirement list (suite- and case-level): `"mockModels"`, `"devRoutes"`, `"env:<NAME>"`. Verified against the live target; cases the target cannot support are skipped with a reported verdict, so one suite runs against the dev server, a local build, and a deployed URL. See "Targets". Suites own no provisioning config — no kind/env/setup. |
191
- | `model` | **Now optional.** Required only when a model-backed scorer is present; validation moves from "always required" to "required if any scorer declares it needs a judge" (built-in autoevals scorers carry a marker; custom scorers that read `args.model` get `undefined` unless the suite provides one). Kills the `"eve-bootstrap-model"` dummy-model workaround. |
192
- | `trials` | New optional `number` (default 1). Runs each case N times; per-trial results are reported, a case passes only if every trial's checks pass, and scores aggregate as mean. For nondeterministic model-backed suites. |
193
- | `tags` filtering | Case and suite `tags` become functional (CLI `--tag`). |
194
-
195
- Everything else (`cases`/`load`, `scores`, `thresholds`, `reporters`,
196
- `maxConcurrency`, `timeoutMs`, `metadata`) is unchanged. Existing suites keep
197
- working except where noted in Breaking changes.
198
-
199
- ### `task.run(ctx)` and the run context
200
-
201
- ```ts
202
- export type EveEvalTask =
203
- | { run(ctx: EveEvalRunContext): Promise<unknown | void>; prompt?: never; messages?: never; parseOutput?: ... }
204
- | { messages(testCase: EveEvalCase): string[]; ... } // unchanged, sugar over run
205
- | { prompt(testCase: EveEvalCase): string; ... } // unchanged, sugar over run
206
- | { parseOutput?(result: EveEvalTaskResult): unknown }; // unchanged default
207
-
208
- export interface EveEvalRunContext {
209
- /** The case under execution. */
210
- readonly case: EveEvalCase;
211
- /** Primary session, created lazily on first send. Fresh per case. */
212
- readonly session: EveEvalSession;
213
- /** Create an additional independent session against the same target. */
214
- newSession(): EveEvalSession;
215
- /** Handle to the agent server under test (channels, schedules, raw routes). */
216
- readonly target: EveEvalTargetHandle;
217
- /** Case timeout signal (from suite/CLI `timeoutMs`). */
218
- readonly signal: AbortSignal;
219
- /** Structured logger; lines land in the case artifact, and on stdout with `--verbose`. */
220
- readonly log: (message: string) => void;
221
- }
222
- ```
223
-
224
- Semantics:
225
-
226
- - The runner still owns session lifecycle, full-stream capture, derived facts,
227
- timeout, and post-run checks/scores. `run` only drives the interaction.
228
- - `run`'s return value becomes `result.output` (then `parseOutput` applies as
229
- today; default remains `finalMessage` when `run` returns `undefined`).
230
- - A `throw` inside `run` marks the case **failed** with the error recorded —
231
- imperative assertions inside `run` are first-class, same as a failing check.
232
- - Events from every turn of every session created in the case are accumulated
233
- into `result.events` (tagged with session id) so checks and scorers see the
234
- whole interaction. `derived` facts are computed over the primary session by
235
- default, with per-session access on the result.
236
-
237
- ### Scripted cases: making the suite a suite
238
-
239
- A suite-level `task` alone means "one interaction script × many data rows" —
240
- the dataset shape. The e2e surface is the inverse: many distinct scripts, one
241
- execution each (`tool-approval`, `tool-denial`, and `ask-question-flow` are
242
- three different `run` functions, not three rows). Forcing each script into its
243
- own suite file with a single dummy-input case would make "suite" a misnomer
244
- and multiply provisioning cost (a provisioned target per file instead of per
245
- behavior family).
246
-
247
- So `EveEvalCase` becomes a union:
248
-
249
- ```ts
250
- export type EveEvalCase = EveEvalDataCase | EveEvalScriptedCase;
251
-
252
- /** Today's shape: data routed through the suite-level task. */
253
- export interface EveEvalDataCase {
254
- readonly id: string;
255
- readonly input: string | Record<string, unknown>;
256
- readonly expected?: unknown;
257
- readonly checks?: readonly EveEvalCheck[]; // appended to suite-level
258
- readonly scores?: readonly EveEvalScorer[]; // appended to suite-level
259
- readonly tags?: readonly string[];
260
- readonly metadata?: Readonly<Record<string, unknown>>;
261
- }
262
-
263
- /** A self-contained interaction script. No input required. */
264
- export interface EveEvalScriptedCase {
265
- readonly id: string;
266
- run(ctx: EveEvalRunContext): Promise<unknown | void>;
267
- readonly expected?: unknown;
268
- readonly checks?: readonly EveEvalCheck[];
269
- readonly scores?: readonly EveEvalScorer[];
270
- readonly tags?: readonly string[];
271
- readonly metadata?: Readonly<Record<string, unknown>>;
272
- }
273
- ```
274
-
275
- Resolution rules:
276
-
277
- - `case.run` wins over `suite.task`; a data case without a suite `task` falls
278
- back to today's default (send `input` verbatim).
279
- - Case-level `checks`/`scores` **append** to suite-level ones; suite-level
280
- expresses invariants ("never fails"), case-level expresses the specific
281
- behavior under test.
282
- - A suite may freely mix data cases and scripted cases, though in practice
283
- quality suites are all-data and smoke suites are all-scripted.
284
-
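The resolution rules above can be sketched in a few lines. This is only an illustration, not the runner's implementation: `SuiteLike`, `CaseLike`, `resolveRun`, and `resolveChecks` are hypothetical names, and the "send `input` verbatim" default is modeled as an identity function rather than a real session send.

```typescript
// Hypothetical, simplified stand-ins for the proposal's suite and case types.
type Check = (output: unknown) => boolean;
type RunFn = (input: unknown) => unknown;

interface SuiteLike {
  task?: RunFn;
  checks?: readonly Check[];
}

interface CaseLike {
  id: string;
  input?: unknown;
  run?: RunFn;
  checks?: readonly Check[];
}

// case.run wins over suite.task; with neither, fall back to passing the
// input through unchanged (a stand-in for "send `input` verbatim").
function resolveRun(suite: SuiteLike, testCase: CaseLike): RunFn {
  return testCase.run ?? suite.task ?? ((input: unknown) => input);
}

// Case-level checks are appended to suite-level ones; they never replace them.
function resolveChecks(suite: SuiteLike, testCase: CaseLike): readonly Check[] {
  return [...(suite.checks ?? []), ...(testCase.checks ?? [])];
}
```

Under this reading, a scripted case that declares one check in a suite that declares one invariant is evaluated against both, suite-level first.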
285
- The conceptual model this lands on: **the suite file is the `describe`, cases
286
- are the `it`s** — grouping, one shared target, shared baseline checks and
287
- requirements — without a block API, and with path-derived identity intact. The suite-level `task` keeps its role as the shared default for
288
- dataset evals; it is no longer the only way to define behavior.
289
-
290
- Execution semantics are uniform across both case kinds:
291
-
292
- - **Concurrency**: scripted cases join the same bounded pool as data cases
293
- (each owns its sessions, so they parallelize safely). Suites whose cases
294
- mutate shared target state (e.g. `defineState` persistence tests) set
295
- `maxConcurrency: 1`.
296
- - **Timeout**: suite/CLI `timeoutMs` applies per case per trial; the signal is
297
- `ctx.signal` inside `run` and aborts in-flight sends.
298
- - **Trials**: apply identically — a scripted case under `trials: 3` runs its
299
- `run` three times against three fresh primary sessions.
300
- - **Loaders**: `load()` may return scripted cases too (the union is the return
301
- type), though in practice loaded datasets are data cases.
302
-
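As a rough sketch of the trial semantics described above (a case passes only if every trial's checks pass, and scores aggregate as the mean), assuming a simplified per-trial result shape. `TrialResult`, `CaseAggregate`, and `aggregateTrials` are hypothetical names, and treating an empty trial list as a failure is an assumption the plan does not state.

```typescript
// Hypothetical per-trial result: whether all checks passed, plus named scores.
interface TrialResult {
  checksPassed: boolean;
  scores: Record<string, number>;
}

interface CaseAggregate {
  passed: boolean;
  meanScores: Record<string, number>;
}

function aggregateTrials(trials: readonly TrialResult[]): CaseAggregate {
  const sums: Record<string, number> = {};
  const counts: Record<string, number> = {};

  // Accumulate a running sum and sample count per scorer name.
  for (const trial of trials) {
    for (const name of Object.keys(trial.scores)) {
      sums[name] = (sums[name] ?? 0) + (trial.scores[name] ?? 0);
      counts[name] = (counts[name] ?? 0) + 1;
    }
  }

  // Scores aggregate as the arithmetic mean over the trials that reported them.
  const meanScores: Record<string, number> = {};
  for (const name of Object.keys(sums)) {
    meanScores[name] = (sums[name] ?? 0) / (counts[name] ?? 1);
  }

  // A case passes only if every trial's checks passed.
  // Assumption: zero trials counts as a failure.
  const passed = trials.length > 0 && trials.every((t) => t.checksPassed);
  return { passed, meanScores };
}
```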
303
- ### `EvalSession`: the interaction driver
304
-
305
- A thin wrapper over the public `ClientSession` (`packages/eve/src/client/`),
306
- not a parallel implementation. Everything here is also useful to end users, so
307
- the driver lives in `eve/evals` but delegates transport entirely to
308
- `eve/client`.
309
-
310
- ```ts
311
- export interface EveEvalSession {
312
- /** Send one turn. Accepts the same SendTurnInput as ClientSession.send. */
313
- send(input: SendTurnInput): Promise<EveEvalTurn>;
314
- /** Sugar: text + file attachment inlined as a data: URL (multimodal turns). */
315
- sendFile(text: string, filePath: string, mediaType?: string): Promise<EveEvalTurn>;
316
-
317
- /**
318
- * Input requests left pending by the last turn (from `input.requested`).
319
- * Empty unless the last turn parked.
320
- */
321
- readonly pendingInputRequests: readonly InputRequest[];
322
- /**
323
- * Assert the last turn parked on HITL input. Throws (failing the case) when
324
- * nothing is pending or the filter matches nothing. Returns the requests.
325
- */
326
- expectInputRequests(filter?: {
327
- toolName?: string;
328
- display?: InputRequest["display"];
329
- }): readonly InputRequest[];
330
- /** Resolve specific pending requests and run the resumed turn. */
331
- respond(...responses: InputResponse[]): Promise<EveEvalTurn>;
332
- /** Resolve every pending request with one optionId ("approve" / "deny" / ...). */
333
- respondAll(optionId: string): Promise<EveEvalTurn>;
334
-
335
- /** All events observed on this session so far. */
336
- readonly events: readonly HandleMessageStreamEvent[];
337
- readonly sessionId: string | undefined;
338
- /** Serializable cursor (continuationToken / streamIndex), as ClientSession.state. */
339
- readonly state: SessionState;
340
- }
341
-
342
- export interface EveEvalTurn {
343
- readonly status: "completed" | "waiting" | "failed";
344
- readonly message: string | undefined;
345
- /** Structured output when the turn requested an outputSchema. */
346
- readonly data: unknown;
347
- readonly events: readonly HandleMessageStreamEvent[];
348
- /** Input requests raised by this turn (parked HITL). */
349
- readonly inputRequests: readonly InputRequest[];
350
- /** Typed tool calls completed during this turn (see facts v2). */
351
- readonly toolCalls: readonly EveEvalToolCall[];
352
- /** Throw EveEvalTurnFailedError unless status is "completed" or "waiting". */
353
- expectOk(): this;
354
- }
355
- ```
356
-
357
- Notes:
358
-
359
- - `send` does **not** throw on `turn.failed`/`session.failed` by default —
360
- failure handling belongs to checks (`Checks.completed()`) or explicit
361
- `turn.expectOk()`. This keeps negative-path suites (e.g. today's
362
- `remote-agent-start-failure.ts`, `tool-throw-recover.ts`) natural to write.
363
- - HITL coverage: `needsApproval` approvals (`approve`/`deny` option ids),
364
- framework `ask_question` selects (`optionId`), freeform answers
365
- (`InputResponse` with text), and tool/connection auth parks — all are just
366
- `expectInputRequests()` + `respond(...)`. Subagent approval proxying needs
367
- nothing extra: requests surface on the parent stream.
368
- - The static `messages` task compiles to
369
- `for (const m of messages(case)) await ctx.session.send(m)` — identical
370
- behavior to today.
371
-
372
- ### Checks: hard assertions
373
-
374
- ```ts
375
- export interface EveEvalCheckResult {
376
- readonly name: string;
377
- readonly passed: boolean;
378
- /** Human-readable failure detail, shown in console + artifacts. */
379
- readonly message?: string;
380
- readonly metadata?: Readonly<Record<string, unknown>>;
381
- }
382
-
383
- export interface EveEvalCheckArgs {
384
- readonly case: EveEvalCase;
385
- readonly result: EveEvalTaskResult; // same data as scorers, no judge model
386
- /** Target handle, so checks can reference the live target (URL, info). */
387
- readonly target: EveEvalTargetHandle;
388
- }
389
-
390
- export type EveEvalCheck = (
391
- args: EveEvalCheckArgs,
392
- ) => EveEvalCheckResult | Promise<EveEvalCheckResult>;
393
- ```
394
-
395
- Matcher options on built-in checks accept a literal, RegExp, predicate, or a
396
- resolver `(args: EveEvalCheckArgs) => value` — the last form is what lets a
397
- check compare against runner-assigned values like a secondary target's URL.
398
-
399
- Built-ins, exported from `eve/evals/checks`:
400
-
401
- | Check | Asserts |
402
- | ------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
403
- | `Checks.completed()` | Final status is `"completed"` (not `"waiting"`, not `"failed"`) |
404
- | `Checks.waiting()` | Final status is `"waiting"` (for park-shaped suites) |
405
- | `Checks.didNotFail()` | Status is not `"failed"` and no `turn.failed`/`step.failed` events |
406
- | `Checks.messageIncludes(token)` | Joined `message.completed` text contains `token` (string or RegExp) |
407
- | `Checks.outputEquals(value)` / `Checks.outputMatches(schema)` | Deep-equal / Standard Schema validation of `result.output` |
408
- | `Checks.toolCalled(name, opts?)` | A tool call with `name` happened; `opts.input` partial-deep-matches the call input (values: literal, RegExp, or predicate); `opts.output` matches the result; `opts.times` constrains count; `opts.isError` constrains error state |
409
- | `Checks.toolNotCalled(name)` | No call to `name` |
410
- | `Checks.toolOrder([...names])` | Names appear in order (subsequence match) |
411
- | `Checks.noFailedActions()` | No `action.result` with `isError: true` |
412
- | `Checks.subagentCalled(name, opts?)` | Subagent delegation occurred; `opts.remoteUrl` matches `subagent.called` remote metadata; `opts.output` matches the `subagent.completed` output |
413
- | `Checks.event(predicate, label)` | Escape hatch: any predicate over the typed event stream |
414
-
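The matcher semantics in the table above (literal, RegExp, or predicate values; partial deep match on tool input; subsequence ordering) can be sketched roughly as follows. This is an illustrative sketch only, not the shipped implementation: `Matcher`, `Pattern`, `matchesValue`, `partialDeepMatch`, and `isSubsequence` are hypothetical names chosen for the example.

```typescript
// Hypothetical helpers illustrating the matcher semantics; not part of the eve API.
type Matcher = string | number | boolean | null | RegExp | ((value: unknown) => boolean);
type Pattern = Matcher | { readonly [key: string]: Pattern };

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return (
    typeof value === "object" &&
    value !== null &&
    !Array.isArray(value) &&
    !(value instanceof RegExp)
  );
}

// Leaf comparison: RegExp tests strings, predicates decide for themselves,
// everything else is compared by strict equality.
function matchesValue(matcher: Matcher, actual: unknown): boolean {
  if (matcher instanceof RegExp) return typeof actual === "string" && matcher.test(actual);
  if (typeof matcher === "function") return matcher(actual);
  return matcher === actual;
}

// Partial deep match: every key in the pattern must match; extra keys in
// `actual` are ignored.
function partialDeepMatch(pattern: Pattern, actual: unknown): boolean {
  if (isPlainObject(pattern)) {
    if (!isPlainObject(actual)) return false;
    return Object.entries(pattern).every(([key, sub]) =>
      partialDeepMatch(sub as Pattern, actual[key]),
    );
  }
  return matchesValue(pattern as Matcher, actual);
}

// Subsequence match: expected names appear in order, possibly with other
// calls in between.
function isSubsequence(expected: readonly string[], observed: readonly string[]): boolean {
  let next = 0;
  for (const name of observed) {
    if (next < expected.length && name === expected[next]) next += 1;
  }
  return next === expected.length;
}
```

Under these assumptions, a pattern such as `{ command: /^ls/ }` matches a call input of `{ command: "ls -la", cwd: "/tmp" }`, because keys absent from the pattern are not compared.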
415
- Pass/fail policy:
416
-
417
- - Any check returning `passed: false`, any throw inside `run`, and any
418
- execution error (timeout, transport) each mark the case **failed**.
419
- - Failed cases always produce a non-zero `eve eval` exit code.
420
- - Scores keep today's semantics: thresholded, reported, never gate the exit
421
- code — **unless** `--strict` is passed, which additionally fails the process
422
- when any case scores below threshold. CI for fixture smoke suites runs
423
- `--strict`; users running exploratory quality evals don't.
424
-
425
- ### Derived facts v2 (breaking)
426
-
427
- `EveEvalDerivedFacts` gains typed records; counts stay for reporters.
428
-
429
- ```ts
430
- export interface EveEvalToolCall {
431
- readonly name: string;
432
- readonly input: JsonObject; // from actions.requested
433
- readonly output: unknown; // from the matching action.result
434
- readonly isError: boolean;
435
- readonly turnIndex: number;
436
- readonly sessionId: string;
437
- }
438
-
439
- export interface EveEvalDerivedFacts {
440
- readonly toolCalls: readonly EveEvalToolCall[]; // was readonly string[]
441
- readonly toolCallCount: number;
442
- readonly subagentCalls: readonly EveEvalSubagentCall[]; // { name, remoteUrl?, ... }
443
- readonly subagentCallCount: number;
444
- readonly inputRequests: readonly InputRequest[]; // NEW: all HITL requests raised
445
- readonly parked: boolean; // NEW: ended waiting on input
446
- readonly messageCount: number;
447
- readonly reasoningBlockCount: number;
448
- readonly failureCode?: string;
449
- }
450
- ```
451
-
452
- `Run.usedTool(name)` keeps working; it gains an optional second argument with
453
- the same matcher options as `Checks.toolCalled` so scorers can grade tool-input
454
- quality fractionally where checks assert it absolutely.
455
-
456
- ### Result model, reporters, and artifacts
457
-
458
- Checks and trials need a home in the result types
459
- (`packages/eve/src/evals/types.ts`):
460
-
461
- ```ts
462
- export interface EveEvalCaseResult {
463
- readonly case: EveEvalCase;
464
- readonly result: EveEvalTaskResult; // aggregated over sessions
465
- readonly checks: readonly EveEvalCheckResult[]; // NEW
466
- readonly scores: readonly EveEvalScorerResult[];
467
- /**
468
- * NEW: per-case verdict, computed by the runner:
469
- * "passed" — no error, all checks passed
470
- * "failed" — a check failed, run() threw, or execution errored
471
- * "scored" — passed checks but at least one score below threshold
472
- * "skipped" — an unmet `requires` entry
473
- */
474
- readonly verdict: "passed" | "failed" | "scored" | "skipped";
475
- readonly trials?: readonly EveEvalTrialResult[]; // present when trials > 1
476
- readonly error?: string;
477
- readonly skipReason?: string; // unmet requirement, when skipped
478
- }
479
-
480
- export interface EveEvalSuiteResult {
481
- readonly suite: string;
482
- readonly target: EveEvalTarget;
483
- readonly cases: readonly EveEvalCaseResult[];
484
- readonly startedAt: string;
485
- readonly completedAt: string;
486
- readonly passed: number;
487
- readonly failed: number; // NEW: check failures + run throws + exec errors
488
- readonly scored: number; // NEW: below-threshold-only cases
489
- readonly skipped: number; // NEW: requirement-skipped cases
490
- readonly errored: number; // retained: the execution-error subset of failed
491
- }
492
- ```
493
-
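The verdict rules documented on `EveEvalCaseResult.verdict` can be read as a simple precedence order. The sketch below is one way to express it, under stated assumptions: `VerdictInput` and `computeVerdict` are hypothetical names, and the `score`/`threshold` fields stand in for whatever `EveEvalScorerResult` actually carries.

```typescript
type Verdict = "passed" | "failed" | "scored" | "skipped";

// Hypothetical input shape; field names are assumptions for illustration.
interface VerdictInput {
  readonly unmetRequirement?: string; // set when a `requires` entry is unmet
  readonly error?: string; // run() threw or execution errored
  readonly checks: readonly { readonly passed: boolean }[];
  readonly scores: readonly { readonly score: number; readonly threshold?: number }[];
}

function computeVerdict(input: VerdictInput): Verdict {
  // Skips win first: an unmet requirement means the case never ran.
  if (input.unmetRequirement !== undefined) return "skipped";
  // Any error or failed check is a hard failure.
  if (input.error !== undefined) return "failed";
  if (input.checks.some((check) => !check.passed)) return "failed";
  // Only scores with a threshold can demote a case to "scored".
  const belowThreshold = input.scores.some(
    (s) => s.threshold !== undefined && s.score < s.threshold,
  );
  return belowThreshold ? "scored" : "passed";
}
```

The suite-level `passed` / `failed` / `scored` / `skipped` counters would then be plain tallies of these per-case verdicts.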
494
- Downstream effects:
495
-
496
- - **Console reporter**: today's `✓ / ○ / ✗` icons map onto
497
- `passed / scored / failed`, plus `-` for `skipped` (with the unmet
498
- requirement inline); failed checks print their `message` indented under the
499
- case line (replacing the bespoke error prose e2e tests hand-craft today).
500
- Summary adds check and skip totals.
501
- - **`EvalReporter` interface**: unchanged shape (`onSuiteStart` /
502
- `onCaseComplete` / `onSuiteComplete`) — the richer `EveEvalCaseResult`
503
- flows through existing hooks, so custom reporters keep compiling.
504
- - **Braintrust reporter**: checks log as binary scores under a `check:` name
505
- prefix (e.g. `check:toolCalled(bash)`) so experiments diff check regressions
506
- the same way they diff score regressions; `verdict` and failed-check
507
- messages land in span metadata.
508
- - **Artifacts**: `cases/<id>.json` gains `checks` and `verdict`;
509
- `summary.json` gains the new counters; per-trial event streams write as
510
- `cases/<id>.trial-<n>.events.ndjson`. Multi-session cases write one events
511
- file per session, keyed by session id.
512
- - **Reporter throughput**: `onCaseComplete` is currently awaited inline inside
513
- the case pool, so a slow reporter throttles execution; v2 queues reporter
514
- callbacks off the hot path (ordering preserved per suite).
515
-
516
- ### Targets: a target is a URL
517
-
518
- There is no suite-owned provisioning config. A target is always just a base
519
- URL, and the suite's job is to interact and assert — never to describe how the
520
- agent gets built or started. This is deliberate: properties like the agent's
521
- model or the mock-model adapter are **build/start-time properties of the
522
- server process**. A suite cannot control them at run time, so an API that lets
523
- a suite "declare" them is a footgun — it works only when the runner happens to
524
- be the thing booting the server, and silently means nothing otherwise.
525
-
526
- There are exactly two ways to get a target:
527
-
528
- | Invocation | Target | Use |
529
- | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------- |
530
- | `eve eval` | Runner boots the local dev server in-process (today's behavior), with the standard `.env` cascade; `--mock-models` boots it with the deterministic mock adapter | Local iteration |
531
- | `eve eval --url <url>` | Bring your own: a locally built `eve start` process, a preview deployment, production | CI smoke legs, post-deploy verification |
532
-
533
- "Build mode" is not a runner concept — a built server is just a URL you
534
- obtained by running `eve build && eve start` yourself. Provisioning (build,
535
- start, env injection, sidecar fixtures, secondary agents for multi-agent
536
- topologies) lives in a script or CI step that boots things and then points
537
- `eve eval --url` at them. This composition already exists and works:
538
- `e2e/tests/basic-runtime/evals.ts` boots a fixture with
539
- `EVE_MOCK_AUTHORED_MODELS=1` and shells out to `eve eval --url` today. v1
540
- blesses that pattern instead of absorbing it.
541
-
542
- `EveEvalTargetHandle` (available on the run context) is how cases reach
543
- non-session surfaces:
544
-
545
- ```ts
546
- export interface EveEvalTargetHandle {
547
- readonly baseUrl: string;
548
- /** Discovered from the live target (/eve/v1/info), never declared. */
549
- readonly capabilities: { readonly devRoutes: boolean; readonly mockModels: boolean };
550
- /** Raw fetch against the target — webhook/channel ingress, health, info. */
551
- fetch(path: string, init?: RequestInit): Promise<Response>;
552
- /** Typed agent info (GET /eve/v1/info). */
553
- info(): Promise<AgentInfoResult>;
554
- /**
555
- * Dispatch a schedule via the dev-only route. Guarded by the devRoutes
556
- * capability (see Requirements below).
557
- */
558
- dispatchSchedule(scheduleId: string): Promise<{ sessionIds: readonly string[] }>;
559
- /**
560
- * Attach to a session this case did not create (channel- or
561
- * schedule-initiated). Consumes the durable stream from startIndex,
562
- * resolving at the turn boundary.
563
- */
564
- attachSession(sessionId: string, opts?: { startIndex?: number }): EveEvalSession;
565
- }
566
- ```
567
-
568
- This covers channel suites (POST a signed webhook via `target.fetch`, assert
569
- on an externally provisioned fake provider, attach to the created session) and
570
- schedule suites (`dispatchSchedule` + `attachSession`).
571
-
572
- The runner always performs the readiness/identity handshake regardless of who
573
- provisioned the target: `/eve/v1/health` polling, then `/eve/v1/info`
574
- verification that this is the expected agent (the same stale-server guard the
575
- e2e harness uses today). `/info` is extended to report mock-model state so
576
- capabilities are discovered, not assumed.
577
-
578
- ### Requirements: `requires`, verified against the live target
579
-
580
- Suites cannot control the target, but their assertions still _assume_ things
581
- about it — determinism via mock models, dev-only routes, a sidecar URL in the
582
- environment. v1 gives those assumptions exactly one surface:
583
-
584
- ```ts
585
- // Per suite (applies to all cases) and per case (additive):
586
- readonly requires?: readonly EveEvalRequirement[];
587
-
588
- type EveEvalRequirement =
589
- | "mockModels" // target runs the deterministic mock adapter (via /info)
590
- | "devRoutes" // dev-only routes are mounted (via /info)
591
- | `env:${string}`; // process env var is set in the eval process (sidecar URLs)
592
- ```
593
-
594
- Rules:
595
-
596
- 1. **Requirements are verified, never fulfilled.** The runner checks
597
- `mockModels`/`devRoutes` against the discovered capabilities and
598
- `env:<NAME>` against its own process environment. It never tries to make a
599
- requirement true.
600
- 2. **Unmet requirement → skip, visibly.** The case (or every case, for
601
- suite-level `requires`) gets `verdict: "skipped"` with the unmet
602
- requirement as the reason — reported in console, `--json`, and artifacts;
603
- never silently dropped, never failed. `--no-skips` turns skips into
604
- failures for legs that must prove full coverage. Skips don't otherwise
605
- affect the exit code.
606
- 3. **One convenience, because the runner boots the dev server:** plain
607
- `eve eval` on suites requiring `"mockModels"` needs the user to pass
608
- `--mock-models`; otherwise those suites skip with a message naming the flag.
609
- No auto-magic in v1 — explicit and predictable beats clever.
610
- 4. **Runtime guards back the declarations.** `target.dispatchSchedule` throws
611
- a requirement error when called by a case that didn't declare
612
- `"devRoutes"` — so undeclared dependencies surface as named failures on
613
- the local leg, not as mystery flakes on the remote leg.
614
-
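Rule 1 ("verified, never fulfilled") amounts to a pure lookup against discovered capabilities and the eval process environment. A minimal sketch, assuming hypothetical names (`Requirement`, `Capabilities`, `unmetRequirements`) and assuming that an empty env var counts as unset, which the text does not specify:

```typescript
type Requirement = "mockModels" | "devRoutes" | `env:${string}`;

interface Capabilities {
  readonly devRoutes: boolean;
  readonly mockModels: boolean;
}

// Returns the requirements that do not hold. It never mutates the target or
// the environment: requirements are checked, not made true.
function unmetRequirements(
  requires: readonly Requirement[],
  capabilities: Capabilities,
  env: Readonly<Record<string, string | undefined>>,
): Requirement[] {
  return requires.filter((requirement) => {
    if (requirement === "mockModels") return !capabilities.mockModels;
    if (requirement === "devRoutes") return !capabilities.devRoutes;
    // Remaining form is `env:<NAME>`.
    const name = requirement.slice("env:".length);
    const value = env[name];
    // Assumption: an empty string is treated the same as an unset variable.
    return value === undefined || value === "";
  });
}
```

A runner following rule 2 would mark a case `skipped` when this list is non-empty and use the first entry as the `skipReason`.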
615
- The deferred idea — a suite-owned `environment` block where the runner builds
616
- and starts targets, injects env, and manages sidecar lifecycles — is recorded
617
- under Open questions. It is not in v1: it duplicated CLI concerns, and it let
618
- suites express build-time properties they cannot actually own.
619
-
620
- ### The external provisioner pattern
621
-
622
- What v1 deliberately does not own — and where it lives instead:
623
-
624
- | Concern | v1 home |
625
- | ---------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
626
- | Building/starting the agent (`eve build` + `eve start`) | CI step or a thin script (the existing `e2e/lib/server.ts` logic, kept) |
627
- | Agent env (mock models, fake-provider URLs, feature flags) | The provisioner's environment for the server process |
628
- | Sidecars (fake Telegram/Discord APIs, MCP stubs, probes) | Started by the provisioner; their URLs exported as env to both the agent and the eval process |
629
- | Secondary agents (remote-subagent topologies) | Provisioner boots both servers and wires `EVE_*_HOST` env |
630
- | Data-dir isolation (`WORKFLOW_LOCAL_DATA_DIR`) | Provisioner-owned temp dirs |
631
-
632
- A provisioned smoke leg is two steps:
633
-
634
- ```sh
635
- node e2e/provision/<group>.ts & # build, sidecars, env, eve start, health-poll
636
- pnpm --filter <fixture-app> exec eve eval --strict --url "http://127.0.0.1:$PORT"
637
- ```
638
-
639
- Suites consume provisioner outputs through env (declared via `env:<NAME>`
640
- requirements): a channel suite reads `TELEGRAM_PROBE_URL` to query what the
641
- fake Bot API captured; a subagent suite reads `EVE_WEATHER_AGENT_HOST` to
642
- assert on `subagent.called` remote URLs. The probe/stub helpers
643
- (`startHttpProbe`, `startMcpStub`, generalized from `e2e/lib/`) ship in
644
- `eve/evals/environment` for provisioners — including users' own — to use, with
645
- an HTTP inspection endpoint so suites can assert on captured requests across
646
- the process boundary.
647
-
648
- ### CLI v2
649
-
650
- ```
651
- eve eval [suiteId...]
652
- --url <url> run against an existing target (built local server,
653
- preview deployment); without it, boots the dev server
654
- --mock-models boot the dev server with the deterministic mock
655
- adapter (invalid with --url; the target's mock state
656
- is discovered, not set)
657
- --tag <tag...> run only cases (or suites) carrying a tag
658
- --case <id...> run only specific case ids
659
- --strict sub-threshold scores also fail the exit code
660
- --no-skips requirement skips fail instead of skipping
661
- --trials <n> override suite trials
662
- --timeout <ms> per-case timeout (existing)
663
- --max-concurrency <n> (existing)
664
- --json structured stdout (existing)
665
- --skip-report skip suite reporters (existing)
666
- --list print discovered suites/cases without running
667
- --verbose stream per-case ctx.log and event summaries
668
- ```
669
-
670
- - Positional suite ids replace `--suite`; the dead `--all` flag is removed
671
- (no filter already means all).
672
- - Exit codes: `0` all cases passed checks (and thresholds under `--strict`);
673
- `1` any case failed (check failure, run throw, execution error, or strict
674
- threshold miss); `2` runner/configuration error. Unmet requirements skip
675
- (reported, exit-code-neutral); pass `--no-skips` to turn any skip into a
676
- failure when a leg must prove full coverage.
677
- - Console reporter renders check failures with their `message` inline (the
678
- bespoke error strings e2e tests craft today become structured output).
679
- - Artifacts gain `checks` and `verdict` per case in `cases/<id>.json`, plus the new counters in `summary.json`.
680
- - New optional JUnit reporter (`eve/evals/reporters`) for CI annotation.
681
-
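The exit-code policy in the bullets above reduces to a small function over the per-case verdicts. The sketch below is illustrative only; `exitCodeFor` and `ExitOptions` are hypothetical names, and exit code `2` (runner/configuration error) is raised elsewhere, before any verdicts exist.

```typescript
type Verdict = "passed" | "failed" | "scored" | "skipped";

interface ExitOptions {
  readonly strict: boolean; // --strict: sub-threshold scores also fail
  readonly noSkips: boolean; // --no-skips: requirement skips also fail
}

function exitCodeFor(verdicts: readonly Verdict[], opts: ExitOptions): 0 | 1 {
  const failing = verdicts.some(
    (verdict) =>
      verdict === "failed" ||
      (opts.strict && verdict === "scored") ||
      (opts.noSkips && verdict === "skipped"),
  );
  return failing ? 1 : 0;
}
```

Without either flag, `scored` and `skipped` cases are reported but leave the exit code at `0`.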
682
- ## Client consolidation: delete `ExampleClient`
683
-
684
- `ClientSession` already supports everything `ExampleClient` posts
685
- (`UserContent` messages, `clientContext`, `inputResponses`, `outputSchema` —
686
- see `packages/eve/src/client/types.ts`). What `ExampleClient` adds must move
687
- into the framework, then it dies:
688
-
689
- 1. **Pending input tracking**: `MessageResult` gains
690
- `inputRequests: readonly InputRequest[]` (collected from `input.requested`
691
- events for the consumed turn). The eval driver and the TUI both stop
692
- re-deriving it.
693
- 2. **Stream-registration race**: `fetchStreamWithRetry` /
694
- `postWithDeliverRetry` exist because the GET stream and input-response
695
- delivery race workflow registration after session-creating POSTs. Fix at
696
- the source where possible (the server should not 500 on
697
- `startIndex`-cursor GETs for a session it just acknowledged); where a
698
- genuine propagation window remains, put one bounded retry policy inside
699
- `eve/client` (`openStreamIterable` / `ClientSession.#postTurn`) so every
700
- consumer — TUI, evals, users — inherits it. Delete both e2e copies and the
701
- `tui-questions.ts` sleep.
702
- 3. **Turn failure surfacing**: export a typed
703
- `isTurnFailureEvent(event)` narrowing helper and the
704
- `EveEvalTurn.expectOk()` driver method instead of `TurnFailedError`-style
705
- throw-by-default (negative-path suites need non-throwing sends).
706
- 4. **Multimodal sugar** (`sendTextWithImage`): becomes
707
- `EvalSession.sendFile`; the data-URL/`FilePart` encoding helper is exported
708
- from `eve/client` for general use.
709
- 5. `e2e/lib/session-stream.ts` is subsumed by `target.attachSession`.
710
-
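For item 2, "one bounded retry policy" could look roughly like the sketch below: a fixed attempt budget with capped exponential backoff, shared by every caller. This is an assumption-laden illustration, not the `eve/client` implementation; `RetryPolicy`, `retryDelays`, and `withBoundedRetry` are hypothetical names, and the choice of exponential backoff is mine rather than the source's.

```typescript
interface RetryPolicy {
  readonly attempts: number; // total tries, including the first
  readonly baseDelayMs: number;
  readonly maxDelayMs: number;
}

// Delays between tries: base, 2x base, 4x base, ... capped at maxDelayMs.
function retryDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let i = 0; i < policy.attempts - 1; i += 1) {
    delays.push(Math.min(policy.baseDelayMs * 2 ** i, policy.maxDelayMs));
  }
  return delays;
}

async function withBoundedRetry<T>(
  operation: () => Promise<T>,
  policy: RetryPolicy,
  isRetryable: (error: unknown) => boolean = () => true,
): Promise<T> {
  const delays = retryDelays(policy);
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await operation();
    } catch (error) {
      const delay = delays[attempt];
      // Out of budget, or an error that is not a propagation-window symptom.
      if (delay === undefined || !isRetryable(error)) throw error;
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Keeping the policy as data means the retry constants live in one place, which is the point of deleting the duplicated e2e helpers.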
711
- ## Replacing the e2e surface
712
-
713
- ### Where suites live
714
-
715
- Each fixture app keeps owning its coverage: suites move into
716
- `e2e/fixtures/agent-*/evals/*.eval.ts` and
717
- `apps/fixtures/weather-fixture/evals/*.eval.ts`. Discovery already scans
718
- `<appRoot>/evals/`. The area-policy module becomes unnecessary — a suite can
719
- only target its own app, by construction. Provisioning scripts live next to
720
- the fixtures (`e2e/provision/`), built from today's `e2e/lib/server.ts` and
721
- `e2e/target/` logic rather than rewriting it.
722
-
723
- ### Coverage mapping
724
-
725
- Scripted cases let one suite absorb a whole e2e group: today's 78 script files
726
- consolidate into roughly one suite per behavior family (e.g.
727
- `agent-tools-hitl/evals/hitl.eval.ts` with `approve-then-persist`,
728
- `deny-regates`, `ask-question`, and `tool-auth` cases), each sharing one
729
- provisioned target. `--case` becomes the day-to-day tool for re-running a
730
- single behavior while debugging.
731
-
732
- | e2e group | v2 expression |
733
- | ------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
734
- | `basic-runtime/*` (basic, multi-turn history, client context, output schema, image, define-state) | `task.run` + `Checks.messageIncludes` / `outputMatches`; `sendFile` for image; `send({ clientContext })`; multi-turn token recall is two `send`s and one check |
735
- | `tools/*` (14 dynamic-tool files, MCP, multi-step loop, narrowing, throw-recover) | `Checks.toolCalled(name, { input, output, isError })`, `Checks.toolOrder`, `Checks.event` for ordering edge cases; MCP stub started by the provisioner (`startMcpStub`), addressed via `env:` requirement |
736
- | `tools-hitl/*` (approval, denial, ask-question, tool auth) | `expectInputRequests` + `respond`/`respondAll`; auth flows keep the IdP emulator as a provisioner sidecar |
737
- | `tools-sandbox/*` | `task.run` + `Checks.toolCalled("bash", ...)`; snapshot suite tagged `requires-credentials`, excluded in CI via `--tag` |
738
- | `channels/*` | provisioner starts the fake provider (one shared `startHttpProbe`) and exports `*_API_BASE_URL`; case does `target.fetch` webhook ingress, asserts on probe captures via its inspection endpoint, `attachSession` for the created session |
739
- | `schedules/*` | `target.dispatchSchedule` + `attachSession`; stream-resume test asserts `attachSession({ startIndex })` replay |
740
- | `subagents/*` (incl. remote delegation, callbacks, failures) | provisioner boots both agents and wires `EVE_WEATHER_AGENT_HOST`; `Checks.subagentCalled(name, { remoteUrl })`; callback retry/bypass probes as provisioner sidecars |
741
- | `codemode/*` | Unchanged suite bodies; CI matrix env (`EVE_EXPERIMENTAL_CODE_MODE=1`) is set by the CI job / provisioner when starting the server |
742
- | `tui-client/*` | **Stays a script harness** (non-goal) |
743
-
744
- ### Worked example: porting `remote-agent-delegation`
745
-
746
- Today's `e2e/tests/subagents/remote-agent-delegation.ts` is 101 lines: manual
747
- port arithmetic, two `resolveTarget` calls threading `startEnv` by hand, and a
748
- 58-line `assertRemoteDelegation` function of filter/narrow/throw prose. The v2
749
- suite, at `e2e/fixtures/agent-subagents/evals/remote-delegation.eval.ts`:
750
-
751
- ```ts
752
- import { defineEvalSuite } from "eve/evals";
753
- import { Checks } from "eve/evals/checks";
754
-
755
- const CITY = "Lisbon";
756
-
757
- export default defineEvalSuite({
758
- description: "Remote subagent delegation over HTTP to a second local agent.",
759
-
760
- // Assumptions about the target, verified by the runner. The provisioner
761
- // boots both agents with mock models and wires EVE_WEATHER_AGENT_HOST.
762
- requires: ["mockModels", "env:EVE_WEATHER_AGENT_HOST"],
763
-
764
- checks: [Checks.didNotFail()],
765
- scores: [],
766
-
767
- cases: [
768
- {
769
- id: "weather-result-reaches-parent",
770
- async run({ session }) {
771
- await session.send(
772
- `Use the weather remote agent to get the weather for ${CITY}. ` +
773
- "Include its result in the final reply.",
774
- );
775
- },
776
- checks: [
777
- Checks.subagentCalled("weather", {
778
- remoteUrl: () => process.env.EVE_WEATHER_AGENT_HOST!,
779
- output: /Sunny[\s\S]*72F/,
780
- }),
781
- Checks.messageIncludes(CITY),
782
- Checks.messageIncludes("Sunny"),
783
- Checks.messageIncludes("72F"),
784
- ],
785
- },
786
- ],
787
- });
788
- ```
789
-
790
- The two-server topology moves to a ~20-line provisioner
791
- (`e2e/provision/subagents.ts`, reusing today's `startAgentServer`): boot the
792
- weather fixture with mocks, boot the parent with mocks +
793
- `EVE_WEATHER_AGENT_HOST`, then run
794
- `eve eval --strict --url "$PARENT_URL" remote-delegation`.
795
-
796
- What the diff buys, beyond line count:
797
-
798
- - **Clean separation** — the suite holds only interaction and assertions; the
799
- topology lives in one provisioner shared by every subagent case. The suite
800
- states its assumptions (`requires`) and the runner enforces them, so running
801
- it against an unprovisioned target skips with a named reason instead of
802
- failing mysteriously.
803
- - **Structured failures** — `Checks.subagentCalled` failing reports the
804
- observed `subagent.called` events in its result metadata; today that detail
805
- exists only because someone hand-built the error string.
806
- - **Free reporting** — the case lands in `--json`, artifacts, and (if
807
- configured) Braintrust like any other eval, and `--case
808
- weather-result-reaches-parent` reruns it in isolation.
809
- - **Room to grow** — `remote-agent-callback-retry`, `-bypass`, and
810
- `-start-failure` become sibling cases in the same suite, sharing one
811
- provisioned topology instead of re-spawning per file.
812
-
813
- ### CI
814
-
815
- `.github/workflows/smoke.yml` discovery changes from globbing
816
- `e2e/tests/*/*.ts` to globbing fixture apps with `evals/` directories. Each
817
- matrix leg provisions, then runs the suites against the resulting URL:
818
-
819
- ```sh
820
- node e2e/provision/<group>.ts & # build + sidecars + env + eve start
821
- pnpm --filter <fixture-app> exec eve eval --strict --json --url "$TARGET_URL"
822
- ```
823
-
824
- Each leg runs twice (directly, and with `EVE_EXPERIMENTAL_CODE_MODE=1` set by the provisioner),
824
- with `fail-fast: false` and the JUnit reporter for annotations. Per-suite artifacts under
826
- `.eve/evals/` upload on failure — strictly better debuggability than today's
827
- stdout scraping. A post-deploy leg is the same invocation pointed at a preview
828
- deployment; requirement-incompatible cases skip visibly.
829
-
830
- ### What gets deleted (end state)
831
-
832
- - `e2e/lib/client.ts`, `e2e/lib/session-stream.ts`,
833
- `e2e/lib/schedule-dispatch.ts`, the duplicated retry helpers, the per-file
834
- fake provider servers, `e2e/lib/area-policy.ts`.
835
- - `e2e/lib/run.ts`'s assertion-side surface; `e2e/lib/server.ts` and
836
- `e2e/target/*` shrink into thin provisioners under `e2e/provision/` (their
837
- self-tests come along) instead of being absorbed into the framework.
838
- - All of `e2e/tests/**` except `tui-client/`.
839
-
840
- ## Implementation phases
841
-
842
- Each phase ships independently, keeps `pnpm test` green, and includes docs
843
- (`docs/public/advanced/evals.md`) + changesets per repo policy.
844
-
845
- ### Phase 1 — Assertions and pass/fail (no breaking interaction changes)
846
-
847
- 1. Derived facts v2: typed `toolCalls`/`subagentCalls`/`inputRequests`/`parked`
848
- (breaking type change; update `Run` scorers and Braintrust reporter).
849
- 2. `checks` field, `EveEvalCheck` types, `eve/evals/checks` built-ins with the
850
- matcher mini-language (literal / RegExp / predicate, partial deep match).
851
- 3. Exit-code policy + `--strict`; console/JSON/artifact rendering of checks.
852
- 4. Hygiene: `--tag`/`--case` filtering, `--list`, remove `--all`, make `model`
853
- conditionally required, surface `parked` so `Run.didNotFail` stops silently
854
- passing parked sessions.
855
-
856
- ### Phase 2 — Interaction API
857
-
858
- 5. Client groundwork: `MessageResult.inputRequests`, retry policy moved into
859
- `eve/client`, server-side fix for the stream-registration race, multimodal
860
- helper export.
861
- 6. `EvalSession` driver + `EveEvalTurn`; `task.run` variant and the
862
- scripted-case union (`case.run`, optional `input`, per-case
863
- `checks`/`scores`); reimplement `prompt`/`messages` as sugar over `run`;
864
- multi-session support (`newSession`), accumulated multi-session event
865
- capture.
866
- 7. HITL surface: `expectInputRequests`, `respond`, `respondAll`; checks for
867
- parked/resumed flows. Port `tools-hitl` smokes as the proving ground.
868
-
869
- ### Phase 3 — Targets, requirements, and non-session surfaces
870
-
871
- 8. `EveEvalTargetHandle`: `baseUrl`, `fetch`, `info`, `capabilities`,
872
- `dispatchSchedule`, `attachSession`; `--mock-models` for the runner-booted
873
- dev server; readiness/identity handshake for `--url` targets.
874
- 9. Requirements: suite/case `requires` (`mockModels` / `devRoutes` /
875
- `env:<NAME>`), `skipped` verdict + `--no-skips`, runtime guards on
876
- requirement-gated handle methods, and `/eve/v1/info` reporting mock-model
877
- state so requirements are verified against the live target.
878
- 10. Provisioner helpers in `eve/evals/environment`: `startHttpProbe` (with
879
- HTTP inspection endpoint) and `startMcpStub`, generalized from
880
- `e2e/lib/`; restructure `e2e/lib/server.ts` + `e2e/target/*` into
881
- `e2e/provision/` scripts.
882
- 11. `trials` + JUnit reporter.
883
-
884
- ### Phase 4 — Migration and deletion
885
-
886
- 12. Port suites group by group in the order: `basic-runtime` → `tools` →
887
- `tools-hitl` → `subagents` → `schedules` → `channels` → `codemode` →
888
- `tools-sandbox`. Each ported group flips its CI matrix entry from
889
- `node e2e/tests/...` to provision + `eve eval --strict --url` in the same
890
- PR; both runners coexist until the last group lands.
891
- 13. Delete `e2e/tests/**` (minus `tui-client/`) and `ExampleClient`; retire
892
- area policy; shrink `e2e/lib`/`e2e/target` into `e2e/provision/`; update
893
- `e2e/README.md` and AGENTS.md smoke-test guidance to point at `eve eval`.
894
- 14. Optional follow-on: add a post-deploy `--url` leg against preview
895
- deployments for the requirement-compatible subset of fixture suites.
896
-
897
- ## Breaking changes
898
-
899
- Pre-1.0, breaking is preferred over compatibility shims (principle 4). Minor
900
- changesets for each:
901
-
902
- - `EveEvalDerivedFacts.toolCalls` / `subagentCalls` change from `string[]` to
903
- typed records.
904
- - `EveEvalCase` becomes a data/scripted union; `input` is no longer required
905
- on scripted cases. Existing data cases are unaffected.
906
- - `model` becomes optional; suites passing dummy models can drop them.
907
- - `--suite` replaced by positional ids; `--all` removed.
908
- - Exit-code semantics gain check failures (suites without `checks` see no
909
- change unless `--strict`).
910
-
911
- ## Risks and open questions
912
-
913
- - **Cost/flakiness of model-backed suites in CI.** Mitigation: the
914
- `"mockModels"` requirement makes determinism a declared, verified property
915
- instead of a per-file accident; `trials` + `--strict` thresholds handle the
916
- suites that must use real models. The migration should explicitly decide,
917
- per ported smoke, mock vs real — today's split is undocumented.
918
- - **The stream-registration race** may not be fully fixable server-side in
919
- Phase 2; the client-level bounded retry is the documented fallback. Track it
920
- as its own issue rather than letting retry constants drift again.
921
- - **Channel suites with real credentials** (`slack-thread-context`) and
922
- sandbox snapshot suites stay tag-gated (`requires-credentials`) and excluded
923
- from default CI, as today.
924
- - **Provisioner drift on `--url` legs.** Beyond what `/info` exposes
925
- (identity, mode, mock-model state) and `env:<NAME>` presence checks, the
926
- runner trusts the provisioner. A suite can pass locally and skip-or-fail
927
- remotely because the target was provisioned differently; requirements make
928
- this visible but cannot make it impossible.
929
- - **Open: should the runner ever own provisioning?** A suite-owned
930
- `environment` block (runner builds/starts targets, injects env, manages
931
- sidecar lifecycles, boots secondary agents) was considered and deliberately
932
- cut from v1: it duplicated CLI concerns and let suites declare build-time
933
- properties they cannot own at run time. Revisit only with concrete friction
934
- data from the migration — the likely v2 shape, if any, is a _project-level_
935
- provisioning config consumed by `eve eval` (like Playwright's `webServer`),
936
- not per-suite config.
937
- - **User-facing naming**: `checks` vs `scores` vs `thresholds` vs `requires`
938
- needs a docs pass so the soft/hard/assumption distinction is obvious; the
939
- evals doc gets a "smoke-testing your agent" section once Phase 2 lands.