@agent-native/core 0.52.0 → 0.54.0
- package/README.md +41 -95
- package/blueprints/action/crud.md +98 -0
- package/blueprints/channel/discord.md +74 -0
- package/blueprints/provider/stripe.md +87 -0
- package/blueprints/sandbox/docker.md +78 -0
- package/dist/action.d.ts +64 -1
- package/dist/action.d.ts.map +1 -1
- package/dist/action.js +73 -2
- package/dist/action.js.map +1 -1
- package/dist/agent/index.d.ts +1 -0
- package/dist/agent/index.d.ts.map +1 -1
- package/dist/agent/index.js +1 -0
- package/dist/agent/index.js.map +1 -1
- package/dist/agent/observational-memory/compactor.d.ts +43 -0
- package/dist/agent/observational-memory/compactor.d.ts.map +1 -0
- package/dist/agent/observational-memory/compactor.js +50 -0
- package/dist/agent/observational-memory/compactor.js.map +1 -0
- package/dist/agent/observational-memory/config.d.ts +37 -0
- package/dist/agent/observational-memory/config.d.ts.map +1 -0
- package/dist/agent/observational-memory/config.js +48 -0
- package/dist/agent/observational-memory/config.js.map +1 -0
- package/dist/agent/observational-memory/index.d.ts +26 -0
- package/dist/agent/observational-memory/index.d.ts.map +1 -0
- package/dist/agent/observational-memory/index.js +25 -0
- package/dist/agent/observational-memory/index.js.map +1 -0
- package/dist/agent/observational-memory/internal-run.d.ts +37 -0
- package/dist/agent/observational-memory/internal-run.d.ts.map +1 -0
- package/dist/agent/observational-memory/internal-run.js +59 -0
- package/dist/agent/observational-memory/internal-run.js.map +1 -0
- package/dist/agent/observational-memory/message-text.d.ts +13 -0
- package/dist/agent/observational-memory/message-text.d.ts.map +1 -0
- package/dist/agent/observational-memory/message-text.js +46 -0
- package/dist/agent/observational-memory/message-text.js.map +1 -0
- package/dist/agent/observational-memory/migrations.d.ts +13 -0
- package/dist/agent/observational-memory/migrations.d.ts.map +1 -0
- package/dist/agent/observational-memory/migrations.js +43 -0
- package/dist/agent/observational-memory/migrations.js.map +1 -0
- package/dist/agent/observational-memory/observer.d.ts +37 -0
- package/dist/agent/observational-memory/observer.d.ts.map +1 -0
- package/dist/agent/observational-memory/observer.js +82 -0
- package/dist/agent/observational-memory/observer.js.map +1 -0
- package/dist/agent/observational-memory/plugin.d.ts +16 -0
- package/dist/agent/observational-memory/plugin.d.ts.map +1 -0
- package/dist/agent/observational-memory/plugin.js +26 -0
- package/dist/agent/observational-memory/plugin.js.map +1 -0
- package/dist/agent/observational-memory/prompts.d.ts +27 -0
- package/dist/agent/observational-memory/prompts.d.ts.map +1 -0
- package/dist/agent/observational-memory/prompts.js +42 -0
- package/dist/agent/observational-memory/prompts.js.map +1 -0
- package/dist/agent/observational-memory/read.d.ts +45 -0
- package/dist/agent/observational-memory/read.d.ts.map +1 -0
- package/dist/agent/observational-memory/read.js +97 -0
- package/dist/agent/observational-memory/read.js.map +1 -0
- package/dist/agent/observational-memory/reflector.d.ts +31 -0
- package/dist/agent/observational-memory/reflector.d.ts.map +1 -0
- package/dist/agent/observational-memory/reflector.js +76 -0
- package/dist/agent/observational-memory/reflector.js.map +1 -0
- package/dist/agent/observational-memory/schema.d.ts +267 -0
- package/dist/agent/observational-memory/schema.d.ts.map +1 -0
- package/dist/agent/observational-memory/schema.js +48 -0
- package/dist/agent/observational-memory/schema.js.map +1 -0
- package/dist/agent/observational-memory/store.d.ts +52 -0
- package/dist/agent/observational-memory/store.d.ts.map +1 -0
- package/dist/agent/observational-memory/store.js +197 -0
- package/dist/agent/observational-memory/store.js.map +1 -0
- package/dist/agent/observational-memory/types.d.ts +61 -0
- package/dist/agent/observational-memory/types.d.ts.map +1 -0
- package/dist/agent/observational-memory/types.js +9 -0
- package/dist/agent/observational-memory/types.js.map +1 -0
- package/dist/agent/processors.d.ts +146 -0
- package/dist/agent/processors.d.ts.map +1 -0
- package/dist/agent/processors.js +122 -0
- package/dist/agent/processors.js.map +1 -0
- package/dist/agent/production-agent.d.ts +25 -0
- package/dist/agent/production-agent.d.ts.map +1 -1
- package/dist/agent/production-agent.js +341 -1
- package/dist/agent/production-agent.js.map +1 -1
- package/dist/agent/run-loop-with-resume.d.ts.map +1 -1
- package/dist/agent/run-loop-with-resume.js +48 -0
- package/dist/agent/run-loop-with-resume.js.map +1 -1
- package/dist/agent/run-store.d.ts +17 -0
- package/dist/agent/run-store.d.ts.map +1 -1
- package/dist/agent/run-store.js +55 -0
- package/dist/agent/run-store.js.map +1 -1
- package/dist/agent/runtime-context.d.ts +30 -0
- package/dist/agent/runtime-context.d.ts.map +1 -1
- package/dist/agent/runtime-context.js +54 -1
- package/dist/agent/runtime-context.js.map +1 -1
- package/dist/agent/tool-call-journal.d.ts +99 -0
- package/dist/agent/tool-call-journal.d.ts.map +1 -0
- package/dist/agent/tool-call-journal.js +212 -0
- package/dist/agent/tool-call-journal.js.map +1 -0
- package/dist/agent/types.d.ts +35 -0
- package/dist/agent/types.d.ts.map +1 -1
- package/dist/agent/types.js.map +1 -1
- package/dist/cli/add.d.ts +109 -0
- package/dist/cli/add.d.ts.map +1 -0
- package/dist/cli/add.js +352 -0
- package/dist/cli/add.js.map +1 -0
- package/dist/cli/connect.d.ts +2 -2
- package/dist/cli/connect.d.ts.map +1 -1
- package/dist/cli/connect.js +92 -24
- package/dist/cli/connect.js.map +1 -1
- package/dist/cli/eval.d.ts +17 -0
- package/dist/cli/eval.d.ts.map +1 -0
- package/dist/cli/eval.js +121 -0
- package/dist/cli/eval.js.map +1 -0
- package/dist/cli/index.js +44 -3
- package/dist/cli/index.js.map +1 -1
- package/dist/cli/mcp.d.ts.map +1 -1
- package/dist/cli/mcp.js +11 -5
- package/dist/cli/mcp.js.map +1 -1
- package/dist/cli/plan-local.d.ts +66 -5
- package/dist/cli/plan-local.d.ts.map +1 -1
- package/dist/cli/plan-local.js +622 -21
- package/dist/cli/plan-local.js.map +1 -1
- package/dist/cli/skills.d.ts +2 -2
- package/dist/cli/skills.d.ts.map +1 -1
- package/dist/cli/skills.js +108 -62
- package/dist/cli/skills.js.map +1 -1
- package/dist/client/AssistantChat.d.ts.map +1 -1
- package/dist/client/AssistantChat.js +118 -92
- package/dist/client/AssistantChat.js.map +1 -1
- package/dist/client/agent-chat-adapter.d.ts.map +1 -1
- package/dist/client/agent-chat-adapter.js +16 -0
- package/dist/client/agent-chat-adapter.js.map +1 -1
- package/dist/client/chat/tool-call-display.d.ts +20 -1
- package/dist/client/chat/tool-call-display.d.ts.map +1 -1
- package/dist/client/chat/tool-call-display.js +32 -7
- package/dist/client/chat/tool-call-display.js.map +1 -1
- package/dist/client/sse-event-processor.d.ts +13 -0
- package/dist/client/sse-event-processor.d.ts.map +1 -1
- package/dist/client/sse-event-processor.js +21 -0
- package/dist/client/sse-event-processor.js.map +1 -1
- package/dist/coding-tools/run-code.d.ts.map +1 -1
- package/dist/coding-tools/run-code.js +18 -2
- package/dist/coding-tools/run-code.js.map +1 -1
- package/dist/db/client.d.ts +4 -2
- package/dist/db/client.d.ts.map +1 -1
- package/dist/db/client.js +6 -4
- package/dist/db/client.js.map +1 -1
- package/dist/deploy/route-discovery.d.ts.map +1 -1
- package/dist/deploy/route-discovery.js +1 -0
- package/dist/deploy/route-discovery.js.map +1 -1
- package/dist/eval/agent-runner.d.ts +63 -0
- package/dist/eval/agent-runner.d.ts.map +1 -0
- package/dist/eval/agent-runner.js +142 -0
- package/dist/eval/agent-runner.js.map +1 -0
- package/dist/eval/define-eval.d.ts +29 -0
- package/dist/eval/define-eval.d.ts.map +1 -0
- package/dist/eval/define-eval.js +43 -0
- package/dist/eval/define-eval.js.map +1 -0
- package/dist/eval/index.d.ts +18 -0
- package/dist/eval/index.d.ts.map +1 -0
- package/dist/eval/index.js +17 -0
- package/dist/eval/index.js.map +1 -0
- package/dist/eval/report.d.ts +8 -0
- package/dist/eval/report.d.ts.map +1 -0
- package/dist/eval/report.js +44 -0
- package/dist/eval/report.js.map +1 -0
- package/dist/eval/runner.d.ts +67 -0
- package/dist/eval/runner.d.ts.map +1 -0
- package/dist/eval/runner.js +256 -0
- package/dist/eval/runner.js.map +1 -0
- package/dist/eval/scorer.d.ts +83 -0
- package/dist/eval/scorer.d.ts.map +1 -0
- package/dist/eval/scorer.js +195 -0
- package/dist/eval/scorer.js.map +1 -0
- package/dist/eval/types.d.ts +162 -0
- package/dist/eval/types.d.ts.map +1 -0
- package/dist/eval/types.js +20 -0
- package/dist/eval/types.js.map +1 -0
- package/dist/extensions/fetch-tool.d.ts.map +1 -1
- package/dist/extensions/fetch-tool.js +80 -15
- package/dist/extensions/fetch-tool.js.map +1 -1
- package/dist/extensions/web-content.d.ts +61 -0
- package/dist/extensions/web-content.d.ts.map +1 -0
- package/dist/extensions/web-content.js +468 -0
- package/dist/extensions/web-content.js.map +1 -0
- package/dist/extensions/web-search-tool.js +3 -3
- package/dist/extensions/web-search-tool.js.map +1 -1
- package/dist/mcp/build-server.d.ts.map +1 -1
- package/dist/mcp/build-server.js +4 -1
- package/dist/mcp/build-server.js.map +1 -1
- package/dist/observability/traces.d.ts.map +1 -1
- package/dist/observability/traces.js +100 -1
- package/dist/observability/traces.js.map +1 -1
- package/dist/observability/tracing.d.ts +73 -0
- package/dist/observability/tracing.d.ts.map +1 -0
- package/dist/observability/tracing.js +126 -0
- package/dist/observability/tracing.js.map +1 -0
- package/dist/onboarding/default-steps.d.ts.map +1 -1
- package/dist/onboarding/default-steps.js +4 -1
- package/dist/onboarding/default-steps.js.map +1 -1
- package/dist/provider-api/actions/query-staged-dataset.d.ts +1 -1
- package/dist/provider-api/corpus-jobs.d.ts +80 -0
- package/dist/provider-api/corpus-jobs.d.ts.map +1 -1
- package/dist/provider-api/corpus-jobs.js +219 -22
- package/dist/provider-api/corpus-jobs.js.map +1 -1
- package/dist/provider-api/index.d.ts +24 -32
- package/dist/provider-api/index.d.ts.map +1 -1
- package/dist/provider-api/index.js +28 -1
- package/dist/provider-api/index.js.map +1 -1
- package/dist/scripts/agent-engines/list-agent-engines.d.ts.map +1 -1
- package/dist/scripts/agent-engines/list-agent-engines.js +10 -3
- package/dist/scripts/agent-engines/list-agent-engines.js.map +1 -1
- package/dist/server/action-discovery.d.ts.map +1 -1
- package/dist/server/action-discovery.js +4 -0
- package/dist/server/action-discovery.js.map +1 -1
- package/dist/server/agent-chat-plugin.d.ts +9 -0
- package/dist/server/agent-chat-plugin.d.ts.map +1 -1
- package/dist/server/agent-chat-plugin.js +119 -111
- package/dist/server/agent-chat-plugin.js.map +1 -1
- package/dist/server/agent-teams.d.ts +62 -0
- package/dist/server/agent-teams.d.ts.map +1 -1
- package/dist/server/agent-teams.js +99 -2
- package/dist/server/agent-teams.js.map +1 -1
- package/dist/server/better-auth-instance.d.ts +7 -0
- package/dist/server/better-auth-instance.d.ts.map +1 -1
- package/dist/server/better-auth-instance.js +90 -0
- package/dist/server/better-auth-instance.js.map +1 -1
- package/dist/server/core-routes-plugin.d.ts.map +1 -1
- package/dist/server/core-routes-plugin.js +7 -4
- package/dist/server/core-routes-plugin.js.map +1 -1
- package/dist/server/credential-provider.d.ts.map +1 -1
- package/dist/server/credential-provider.js +2 -0
- package/dist/server/credential-provider.js.map +1 -1
- package/dist/server/deep-link.d.ts +7 -0
- package/dist/server/deep-link.d.ts.map +1 -1
- package/dist/server/deep-link.js +13 -2
- package/dist/server/deep-link.js.map +1 -1
- package/dist/server/framework-request-handler.d.ts.map +1 -1
- package/dist/server/framework-request-handler.js +33 -1
- package/dist/server/framework-request-handler.js.map +1 -1
- package/dist/server/index.d.ts +2 -1
- package/dist/server/index.d.ts.map +1 -1
- package/dist/server/index.js +2 -1
- package/dist/server/index.js.map +1 -1
- package/dist/templates/default/.agents/skills/actions/SKILL.md +52 -1
- package/dist/templates/default/.agents/skills/security/SKILL.md +22 -0
- package/dist/templates/workspace-core/.agents/skills/actions/SKILL.md +52 -1
- package/dist/templates/workspace-core/.agents/skills/external-agents/SKILL.md +16 -4
- package/dist/templates/workspace-core/.agents/skills/harness-agents/SKILL.md +20 -0
- package/dist/templates/workspace-core/.agents/skills/observability/SKILL.md +31 -0
- package/dist/templates/workspace-core/.agents/skills/security/SKILL.md +22 -0
- package/docs/content/actions.md +50 -0
- package/docs/content/agent-teams.md +32 -0
- package/docs/content/blueprint-installer.md +73 -0
- package/docs/content/durable-resume.md +49 -0
- package/docs/content/evals.md +141 -0
- package/docs/content/external-agents.md +2 -2
- package/docs/content/human-approval.md +101 -0
- package/docs/content/observability.md +21 -0
- package/docs/content/observational-memory.md +63 -0
- package/docs/content/plan-plugin.md +5 -0
- package/docs/content/pr-visual-recap.md +9 -5
- package/docs/content/processors.md +99 -0
- package/docs/content/sandbox-adapters.md +134 -0
- package/docs/content/template-plan.md +97 -21
- package/package.json +10 -1
- package/src/templates/default/.agents/skills/actions/SKILL.md +52 -1
- package/src/templates/default/.agents/skills/security/SKILL.md +22 -0
- package/src/templates/workspace-core/.agents/skills/actions/SKILL.md +52 -1
- package/src/templates/workspace-core/.agents/skills/external-agents/SKILL.md +16 -4
- package/src/templates/workspace-core/.agents/skills/harness-agents/SKILL.md +20 -0
- package/src/templates/workspace-core/.agents/skills/observability/SKILL.md +31 -0
- package/src/templates/workspace-core/.agents/skills/security/SKILL.md +22 -0
@@ -80,6 +80,26 @@ existing run routes as `goalId=agent-harness`.
 Preserve `defineAction` auth, request context, timeouts, truncation, and
 read-only metadata.
 
+## Code Execution Sandbox
+
+- The `run-code` tool executes through a pluggable `SandboxAdapter`
+  (`packages/core/src/coding-tools/sandbox/`). The default
+  `LocalChildProcessAdapter` spawns a locked-down local Node child process;
+  swap it via `AGENT_NATIVE_SANDBOX` or `registerSandboxAdapter()` for a
+  Docker/remote/durable backend (the lever to exceed the hosted ~40s code-exec
+  ceiling). An adapter only runs the already-prepared, non-secret module source
+  — it never sees app secrets. See the Sandbox Adapters doc; `agent-native add
+  sandbox docker` emits a full Docker-adapter recipe.
+
+## Sub-Agent Delegation Depth
+
+- Sub-agent spawning is capped server-side (default depth `2`) so delegation
+  chains can't fan out indefinitely. Override at deploy time with
+  `AGENT_NATIVE_MAX_SUBAGENT_DEPTH` (`0` disables sub-agents; clamped to `16`).
+  Enforcement is ambient via `evaluateSubagentDepth` in
+  `packages/core/src/server/agent-teams.ts` — independent of any tool-level
+  guard. See the Agent Teams doc for the depth model.
+
 ## Don't
 
 - Don't add Claude Code, Codex, Cursor, Mastra, or Pi as an `AgentEngine`.
@@ -75,6 +75,26 @@ const criteria: EvalCriteria = {
 };
 ```
 
+#### Evals (CI gate)
+
+The three layers above score *real production runs* after the fact. For an active, deterministic gate, use the first-class `*.eval.ts` primitive from `@agent-native/core/eval` (source: `packages/core/src/eval/*`). It runs the actual agent loop against fixed inputs and exits non-zero below threshold, so it gates CI/deploys.
+
+```ts
+// evals/faq.eval.ts
+import { defineEval, contains, llmJudge } from "@agent-native/core/eval";
+
+export default defineEval({
+  name: "answers the FAQ",
+  input: { prompt: "What is your return policy?" },
+  threshold: 0.7,
+  scorers: [contains("30 days"), llmJudge({ criteria: "accuracy" })],
+});
+```
+
+- Built-in scorers: `exactMatch` / `contains` / `usesTool` (pure JS) and `llmJudge` (provider-agnostic judge).
+- Custom scorers: `createScorer` with the 4-step `preprocess → analyze → generateScore → generateReason` pipeline (only `generateScore` is required).
+- Run as a gate: `agent-native eval [pattern] [--json] [--threshold N]` — discovers `**/*.eval.ts` and `evals/*.ts`, runs the agent, and exits non-zero if any eval is below its threshold. An app with no eval files exits `0`. Complements (does not replace) the post-hoc scoring in `evals.ts`. See the Evals doc.
+
 ### 4. Experiments
 
 A/B testing with sticky user-level assignment:
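The four-step custom-scorer pipeline and the threshold gate described in the hunk above can be pictured with a small stand-alone sketch. It does not import `@agent-native/core/eval`: `ScorerSteps`, `runScorer`, `passesGate`, and `mentionsThirtyDays` are hypothetical names, and averaging the scorer results is an assumption, since the diff does not say how multiple scores are combined.

```typescript
// Hypothetical model of the four-step scorer pipeline; not the package's API.
type Analysis = Record<string, unknown>;

interface ScorerSteps {
  preprocess?: (output: string) => string;
  analyze?: (text: string) => Analysis;
  // The only required step, mirroring "only generateScore is required".
  generateScore: (ctx: { text: string; analysis: Analysis }) => number;
  generateReason?: (ctx: { score: number; analysis: Analysis }) => string;
}

interface ScoreResult {
  score: number;
  reason: string | undefined;
}

// Runs the steps in order, skipping any optional step that is not provided.
function runScorer(steps: ScorerSteps, output: string): ScoreResult {
  const text = steps.preprocess ? steps.preprocess(output) : output;
  const analysis = steps.analyze ? steps.analyze(text) : {};
  const score = steps.generateScore({ text, analysis });
  const reason = steps.generateReason
    ? steps.generateReason({ score, analysis })
    : undefined;
  return { score, reason };
}

// Assumption: an eval passes when the mean of its scores meets the threshold.
function passesGate(scores: number[], threshold: number): boolean {
  if (scores.length === 0) return false;
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return mean >= threshold;
}

// A case-insensitive substring scorer in the spirit of the built-in `contains`.
const mentionsThirtyDays: ScorerSteps = {
  preprocess: (output) => output.toLowerCase(),
  analyze: (text) => ({ found: text.includes("30 days") }),
  generateScore: ({ analysis }) => (analysis["found"] === true ? 1 : 0),
  generateReason: ({ score }) =>
    score === 1 ? "mentions 30 days" : "does not mention 30 days",
};
```

A CI wrapper built on this sketch would exit non-zero whenever `passesGate` returns `false` for any eval, which is the behavior the hunk attributes to `agent-native eval`.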
@@ -200,3 +220,14 @@ await putSetting("observability-config", {
 ```
 
 The framework emits `gen_ai.*` semantic convention spans compatible with Langfuse, Datadog, Grafana, New Relic, and any OTel-compatible backend.
+
+## Live OpenTelemetry Spans (Optional)
+
+Separate from the `exporters` config above (which ships the in-house traces to an OTLP endpoint), the agent loop can also emit **live OpenTelemetry spans** for every run, model call, and tool call, so a host that already runs an OTel collector sees agent activity alongside its other distributed traces.
+
+This layer is optional and **no-op by default**:
+
+- `@opentelemetry/api` is an **optional dependency**. If it isn't installed, the span helpers degrade to silent no-ops — they never throw into the agent loop.
+- Even with the api package installed, it ships a default no-op tracer. Spans become real only once the **host registers a `TracerProvider`** (via `@opentelemetry/sdk-node` or similar). The framework deliberately does not depend on the heavy SDK/exporter packages and never registers a provider itself — instrumentation is opt-in by the embedding app.
+
+The loop emits `agent.run` (with `agent.run_id`, `agent.thread_id`, `agent.user_id`, `agent.model`), `tool.call` (`tool.name` + status), and `llm.call` spans, each finished with OK/ERROR status. This is purely additive to the in-house `agent_trace_spans` / `agent_trace_summaries` tables. Source: `packages/core/src/observability/tracing.ts` + `traces.ts`. See the Observability doc for the full table.
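The "degrade to silent no-ops" behavior in the hunk above can be illustrated without the real OpenTelemetry packages. In this sketch `MinimalSpan`, `MinimalTracer`, and `withSpan` are hypothetical stand-ins (they are neither `@opentelemetry/api` types nor the helpers in `tracing.ts`). The only point is that a missing or misbehaving tracer never affects the wrapped work, while the work's own errors still propagate.

```typescript
// Hypothetical minimal shapes; the real package would use @opentelemetry/api.
interface MinimalSpan {
  setStatus(status: "OK" | "ERROR"): void;
  end(): void;
}

interface MinimalTracer {
  startSpan(name: string): MinimalSpan;
}

// Used whenever no tracer is available or span creation fails.
const NOOP_SPAN: MinimalSpan = {
  setStatus() {},
  end() {},
};

function safeStart(tracer: MinimalTracer | undefined, name: string): MinimalSpan {
  if (!tracer) return NOOP_SPAN;
  try {
    return tracer.startSpan(name);
  } catch {
    return NOOP_SPAN; // a broken tracer must not break the agent loop
  }
}

function safeFinish(span: MinimalSpan, status: "OK" | "ERROR"): void {
  try {
    span.setStatus(status);
    span.end();
  } catch {
    // Swallow instrumentation failures on purpose.
  }
}

// Runs `work` inside a span and finishes the span with OK or ERROR.
function withSpan<T>(
  tracer: MinimalTracer | undefined,
  name: string,
  work: () => T,
): T {
  const span = safeStart(tracer, name);
  try {
    const result = work();
    safeFinish(span, "OK");
    return result;
  } catch (err) {
    safeFinish(span, "ERROR");
    throw err; // the work's own failure is still reported to the caller
  }
}
```

Passing `undefined` as the tracer models the case where the optional dependency is not installed. A host that has registered a provider would pass a real tracer instead.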
@@ -139,6 +139,28 @@ export default defineEventHandler(async (event) => {
 
 - Never create unprotected routes that modify data.
 
+## Human-in-the-Loop Approval for High-Consequence Actions
+
+For a small set of outward-facing, hard-to-undo operations — sending an email, charging a card, deleting an account, posting publicly — auth and access control are necessary but not sufficient: you also do not want the **agent** to perform them autonomously. Set `needsApproval` on the `defineAction` so the agent cannot run the action without a human approving the specific call.
+
+```ts
+export default defineAction({
+  description: "Send an email via Gmail.",
+  schema: z.object({ to: z.string(), subject: z.string(), body: z.string() }),
+  needsApproval: true, // or (args, ctx) => boolean | Promise<boolean>
+  run: async (args) => {
+    /* ...actually send... */
+  },
+});
+```
+
+When the gate is truthy and the call is not yet approved, the loop emits an `approval_required` event and **stops the turn — `run()` never executes**. The human approves via the chat UI's Approve affordance, which re-issues the turn with the call's stable `approvalKey`; only then does the action run. A predicate gates conditionally (e.g. only external recipients) and **fails closed** — a throw is treated as "approval required".
+
+Rules:
+
+- Reach for `needsApproval` only for genuinely high-consequence operations. The default is off, and the framework intentionally keeps approvals rare — over-gating turns the agent into a click-through wizard. The canonical (and intentionally lone) framework example is Mail's `send-email`.
+- `needsApproval` is **not** a substitute for `accessFilter` / `assertAccess` or for hiding sensitive operations from the model with `agentTool: false` / `toolCallable: false`. It is the layer for "a human must explicitly bless this specific outward-facing call," not for scoping data. See the `actions` skill for the full surface.
+
 ## Custom HTTP Routes Must Apply Access Control Themselves
 
 This is the single most-failed rule in the codebase. Auto-mounted action routes (`/_agent-native/actions/...`) get a request context wired up automatically. **Hand-written `/api/*` Nitro routes do not.** If your handler queries an ownable resource (any table with `...ownableColumns()`), you MUST:
package/docs/content/actions.md CHANGED
@@ -49,6 +49,35 @@ That's it. The framework auto-discovers every file in `actions/` and mounts them
 
 The schema is converted to JSON Schema for the Claude API tool definition, _and_ used at runtime to validate inputs before `run()` fires. Invalid inputs never reach your handler.
 
+### Validating the return value {#output-schema}
+
+`schema` validates _inputs_. To also validate what an action **returns**, pass an `outputSchema` (any Standard Schema-compatible schema — Zod, Valibot, ArkType, same surface as `schema`). The framework validates the result _after_ `run()` resolves, composing with input validation: input validated before `run`, output validated after.
+
+```ts
+export default defineAction({
+  description: "Summarize a thread.",
+  schema: z.object({ threadId: z.string() }),
+  outputSchema: z.object({
+    summary: z.string(),
+    messageCount: z.number(),
+  }),
+  outputErrorStrategy: "warn", // default
+  run: async ({ threadId }) => {
+    /* ...returns { summary, messageCount } ... */
+  },
+});
+```
+
+`outputErrorStrategy` controls what happens on a mismatch:
+
+| Strategy | Behavior on mismatch |
+| ------------ | -------------------------------------------------------------------------------------------------- |
+| `"warn"` | **Default.** `console.warn` the issues and return the **original** result unchanged. Non-breaking. |
+| `"strict"` | Throw a clear error so a buggy action surfaces loudly. |
+| `"fallback"` | Return the provided `outputFallback` value in place of the invalid result. |
+
+On success, the **validated** value is returned, so any coercion or defaults defined on the `outputSchema` take effect (mirroring the input path). When no `outputSchema` is supplied, behavior is byte-for-byte unchanged — there is no wrapping. This is borrowed from Mastra/Flue structured-output and kept dependency-free on the action layer.
+
 ### HTTP config {#http}
 
 By default every action is exposed as `POST /_agent-native/actions/<name>`. Override with the `http` option:
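The three `outputErrorStrategy` behaviors in the hunk above can be sketched without any schema library. This is a minimal illustration, not the code in `action.js`: `Validator`, `Validation`, `applyOutputStrategy`, and `validateSummary` are hypothetical names, and the hand-written validator stands in for a Standard Schema-compatible `outputSchema`.

```typescript
// Hypothetical stand-in for a schema's validation result.
type Validation<T> =
  | { status: "valid"; value: T }
  | { status: "invalid"; issues: string[] };

type Validator<T> = (result: unknown) => Validation<T>;
type OutputErrorStrategy = "warn" | "strict" | "fallback";

// Applies one of the three strategies from the table to an action's result.
function applyOutputStrategy<T>(
  result: unknown,
  validate: Validator<T>,
  strategy: OutputErrorStrategy = "warn",
  outputFallback?: T,
  warn: (message: string) => void = (message) => console.warn(message),
): unknown {
  const checked = validate(result);
  if (checked.status === "valid") {
    return checked.value; // the validated value, so coercion/defaults apply
  }
  const message = "Action output failed validation: " + checked.issues.join("; ");
  switch (strategy) {
    case "strict":
      throw new Error(message);
    case "fallback":
      return outputFallback;
    default:
      warn(message);
      return result; // "warn": the original result, unchanged
  }
}

// Hand-written validator for the { summary, messageCount } shape.
const validateSummary: Validator<{ summary: string; messageCount: number }> = (
  result,
) => {
  const obj = (result ?? {}) as { summary?: unknown; messageCount?: unknown };
  const issues: string[] = [];
  if (typeof obj.summary !== "string") issues.push("summary must be a string");
  if (typeof obj.messageCount !== "number") issues.push("messageCount must be a number");
  if (issues.length > 0) return { status: "invalid", issues };
  return {
    status: "valid",
    value: {
      summary: obj.summary as string,
      messageCount: obj.messageCount as number,
    },
  };
};
```

The `warn` parameter exists only so the sketch can be exercised without touching the console. The hunk states that the framework itself calls `console.warn`.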
@@ -221,6 +250,26 @@ export default defineAction({
 
 For list and read actions, use `accessFilter` to scope the query to the current user and org. For actions that update or delete a specific row, use `assertAccess` to confirm the caller is allowed before writing. See [Security](/docs/security#access-guards) and [Sharing](/docs/sharing) for the full helper API.
 
+### Human-in-the-loop approval {#needs-approval}
+
+A handful of actions are too consequential to let the agent run autonomously — sending an email, charging a card, deleting an account. For those, set `needsApproval` to pause the loop and require a human to approve the specific call before `run()` executes:
+
+```ts
+export default defineAction({
+  description: "Send an email via Gmail.",
+  schema: z.object({ to: z.string(), subject: z.string(), body: z.string() }),
+  needsApproval: true, // pause; a human must approve this specific send
+  run: async (args) => {
+    /* ...actually send... */
+  },
+});
+```
+
+`needsApproval` accepts a boolean or a predicate `(args, ctx) => boolean | Promise<boolean>` to gate conditionally (e.g. only external recipients, only above a threshold). The predicate **fails closed**: a throw is treated as "approval required". When the gate is truthy and the call isn't yet approved, the loop emits an `approval_required` event and stops the turn — the side effect never happens — and the action runs only once a human approves via the chat UI's Approve affordance.
+
+> [!WARNING]
+> Keep approvals rare. Each gated action is a hard stop in the agent loop. The default is **off**, and almost every action should leave it off. See [Human-in-the-Loop Approvals](/docs/human-approval) for the full flow.
+
 ## Calling it from the UI {#ui}
 
 Two hooks, both in `@agent-native/core/client`. Types are inferred from your `defineAction` schemas — no manual type declarations.
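The fail-closed gate described in the hunk above can be sketched as a small decision function. This is an illustration under simplifying assumptions, not the framework's loop: the predicate here is synchronous and takes only `args` (the documented one also receives `ctx` and may return a `Promise`), and `needsHumanApproval`, `decideGate`, `GateDecision`, and the `approvedKeys` set are hypothetical names. How `approvalKey` is derived is not shown in the diff, so the sketch takes it as an input.

```typescript
// Simplified gate type: the documented predicate also takes `ctx` and may be async.
type NeedsApproval<A> = boolean | ((args: A) => boolean);

type GateDecision =
  | { kind: "run" }
  | { kind: "approval_required"; approvalKey: string };

// Evaluates the gate, treating a throwing predicate as "approval required".
function needsHumanApproval<A>(
  gate: NeedsApproval<A> | undefined,
  args: A,
): boolean {
  if (gate === undefined) return false; // the default is off
  if (typeof gate === "boolean") return gate;
  try {
    return gate(args);
  } catch {
    return true; // fail closed
  }
}

// Decides whether a call may run now or must wait for a human.
function decideGate<A>(
  gate: NeedsApproval<A> | undefined,
  args: A,
  approvalKey: string,
  approvedKeys: ReadonlySet<string>,
): GateDecision {
  if (!needsHumanApproval(gate, args)) return { kind: "run" };
  if (approvedKeys.has(approvalKey)) return { kind: "run" };
  return { kind: "approval_required", approvalKey };
}

// Example predicate: only recipients outside a made-up domain need approval.
const externalOnly = (args: { to: string }): boolean =>
  !args.to.endsWith("@example.com");
```

In this model, approving a call amounts to adding its key to `approvedKeys` and re-running the decision, which loosely mirrors the "re-issues the turn with the call's stable `approvalKey`" flow in the security skill.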
@@ -390,6 +439,7 @@ const args = parseArgs(["--name", "Steve", "--verbose", "--count=3"]);
 
 ## What's next
 
+- [**Human-in-the-Loop Approvals**](/docs/human-approval) — the `needsApproval` gate in depth
 - [**Drop-in Agent**](/docs/drop-in-agent) — `useActionMutation` / `useActionQuery` in React
 - [**Context Awareness**](/docs/context-awareness) — the `view-screen` + `navigate` pattern in depth
 - [**A2A Protocol**](/docs/a2a-protocol) — how other agents discover and call your actions
@@ -141,6 +141,38 @@ You are a meticulous code reviewer. Be terse and concrete — cite file:line whe
 
 Store at `agents/code-review.md` in the workspace. It appears in the `@mention` dropdown and is available to the main agent as a delegation target. See [Workspace — Custom Agents](/docs/workspace#custom-agents) for the full format including `tools`, `delegate-default`, and model overrides.
 
+## Delegation depth guard {#depth-guard}
+
+Sub-agents can spawn sub-agents, which is a runaway/cost risk: an unbounded chain of delegations could fan out indefinitely. The framework enforces a **hard cap on delegation depth**, server-side, independent of any tool-level guard.
+
+The top-level chat is depth `0`. A sub-agent it spawns is depth `1`; that sub-agent may spawn once more (depth `2`); a spawn that would create a depth-`3` sub-agent is **refused**. The default cap is **2**.
+
+```txt
+depth 0 top-level chat (may spawn)
+depth 1 sub-agent (may spawn)
+depth 2 sub-agent's sub-agent (at the cap — may NOT spawn)
+depth 3 refused
+```
+
+Enforcement is ambient: each sub-agent runs inside an `AsyncLocalStorage` that records its own depth, so any `spawnTask` reached transitively from that run reads its parent's depth and refuses once the cap is hit — even if the `agent-teams` tool was handed to a sub-agent that should not have had it. The decision is exposed as a pure, unit-testable `evaluateSubagentDepth(parentDepth)`. A refused spawn returns a clear error: _"Delegation depth limit reached (max N); cannot spawn another sub-agent."_
+
+### Configuring the cap {#depth-guard-config}
+
+Override the default at deploy time with `AGENT_NATIVE_MAX_SUBAGENT_DEPTH`:
+
+| Value | Effect |
+| --------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
+| _(unset)_ | Default cap of `2`. |
+| `0` | **No sub-agents may be spawned** — the top-level agent does all work. |
+| `1`…`16` | That many levels of delegation. |
+| invalid / `>16` | A non-integer / negative / NaN value falls back to `2`; anything above `16` is clamped to `16` so a typo can never disable the guard. |
+
+```bash
+AGENT_NATIVE_MAX_SUBAGENT_DEPTH=1 # sub-agents allowed, but they can't sub-delegate
+```
+
+When a sub-agent is at or below the cap, the framework injects a line into its runtime context telling it how deep it sits and whether it may delegate further, so the model spends its budget appropriately.
+
 ## What's next
 
 - [**Workspace — Custom Agents**](/docs/workspace#custom-agents) — the profile format
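The cap-resolution table and the depth model in the hunk above can be checked with a small pure-function sketch. It is not the code in `agent-teams.ts`: `resolveMaxSubagentDepth` and `evaluateDepthSketch` are hypothetical names (the real `evaluateSubagentDepth` is documented as taking only `parentDepth`), and treating an empty string the same as an unset variable is an assumption.

```typescript
const DEFAULT_MAX_DEPTH = 2;
const HARD_MAX_DEPTH = 16;

// Maps the raw AGENT_NATIVE_MAX_SUBAGENT_DEPTH value to an effective cap.
function resolveMaxSubagentDepth(raw: string | undefined): number {
  // Assumption: an empty or whitespace-only value counts as unset.
  if (raw === undefined || raw.trim() === "") return DEFAULT_MAX_DEPTH;
  const parsed = Number(raw);
  // Non-integer, negative, or NaN values fall back to the default.
  if (!Number.isInteger(parsed) || parsed < 0) return DEFAULT_MAX_DEPTH;
  // Anything above the hard maximum is clamped.
  return Math.min(parsed, HARD_MAX_DEPTH);
}

interface DepthDecision {
  allowed: boolean;
  childDepth: number;
  error: string | undefined;
}

// Decides whether an agent at `parentDepth` may spawn a sub-agent.
function evaluateDepthSketch(parentDepth: number, maxDepth: number): DepthDecision {
  const childDepth = parentDepth + 1;
  if (childDepth > maxDepth) {
    return {
      allowed: false,
      childDepth,
      error:
        "Delegation depth limit reached (max " +
        maxDepth +
        "); cannot spawn another sub-agent.",
    };
  }
  return { allowed: true, childDepth, error: undefined };
}
```

With the default cap of `2`, a parent at depth `0` or `1` may spawn and a parent at depth `2` may not, which matches the `txt` diagram in the hunk. With a cap of `0`, even the top-level chat is refused.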
@@ -0,0 +1,73 @@
|
|
|
1
|
+
---
|
|
2
|
+
title: "Blueprint Installer"
|
|
3
|
+
description: "agent-native add prints a curated Markdown integration recipe to stdout — pipe it into your coding agent, which applies the changes against your live repo."
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
# Blueprint Installer
|
|
7
|
+
|
|
8
|
+
`agent-native add` is **not** a dumb scaffolder that writes files for you. It emits a curated Markdown _integration blueprint_ to stdout. You pipe that blueprint into your own coding agent (Claude Code, Codex, …), which applies the changes against the live repo with full context.
|
|
9
|
+
|
|
10
|
+
This fits the agent-applies-changes, filesystem-first house style: the framework supplies the recipe (the canonical files to touch, the rules to honor, the verification step), and the coding agent does the editing.
|
|
11
|
+
|
|
12
|
+
```bash
|
|
13
|
+
agent-native add provider stripe | claude
|
|
14
|
+
agent-native add channel discord | codex
|
|
15
|
+
```
|
|
16
|
+
|
|
17
|
+
## Usage {#usage}
|
|
18
|
+
|
|
19
|
+
```bash
|
|
20
|
+
agent-native add <kind> <name> # print a curated blueprint
|
|
21
|
+
agent-native add <kind> <https://docs…> # research-and-integrate from a URL
|
|
22
|
+
agent-native add --list # list available kinds and blueprints
|
|
23
|
+
```
|
|
24
|
+
|
|
25
|
+
- A bare **name** resolves a curated blueprint from `blueprints/<kind>/<name>.md`.
|
|
26
|
+
- A **URL** instead of a name emits a generic _research-and-integrate_ blueprint for that kind, with the URL embedded as the research starting point (a URL is a research seed, not a known recipe).
|
|
27
|
+
- The blueprint goes to **stdout**; diagnostics go to stderr, so `… | claude` only ever receives the blueprint.
|
|
28
|
+
|
|
29
|
+
## Seeded blueprints {#seeded}
|
|
30
|
+
|
|
31
|
+
`agent-native add --list` shows what ships in the box:
|
|
32
|
+
|
|
33
|
+
| Kind | Name | What it sets up |
|
|
34
|
+
| ---------- | --------- | ---------------------------------------------------------------------------------- |
|
|
35
|
+
| `provider` | `stripe` | Wire a provider into the `provider-api` substrate (catalog / docs / request trio). |
|
|
36
|
+
| `channel` | `discord` | Implement a `PlatformAdapter` inbound webhook channel and register it. |
|
|
37
|
+
| `sandbox` | `docker` | Implement the `SandboxAdapter` seam to run `run-code` in a Docker container. |
|
|
38
|
+
| `action` | `crud` | Add a single multi-surface `defineAction` with a Zod schema (one `update` over N). |
|
|
39
|
+
|
|
40
|
+
Each blueprint is self-contained: the coding agent reading it gets the files to touch, the framework rules to honor (actions are the single source of truth, never hardcode secrets, scope ownable data, add a changeset for `packages/*` source), and a concrete **Verify** section.
|
|
41
|
+
|
|
42
|
+
## URL → research blueprint {#url}
|
|
43
|
+
|
|
44
|
+
When you pass a URL instead of a name, either because the kind has no curated recipe for that service or because you want a fresh integration, `add` emits a generic "research-and-integrate" blueprint with the URL as the seed:
|
|
45
|
+
|
|
46
|
+
```bash
|
|
47
|
+
agent-native add provider https://docs.example.com/api | claude
|
|
48
|
+
```
|
|
49
|
+
|
|
50
|
+
The generated blueprint tells the coding agent to fetch the URL (and the pages it links to) for the real endpoints, auth model, payload shapes, and signature/verification requirements — _not_ to guess from training data — then implement and verify. It also carries kind-specific guidance (e.g. a `provider` URL is steered toward the `provider-api` substrate; a `channel` URL toward a `PlatformAdapter`).
|
|
51
|
+
|
|
52
|
+
## Adding your own blueprint {#authoring}
|
|
53
|
+
|
|
54
|
+
Drop a Markdown file into `packages/core/blueprints/<kind>/<name>.md`. The kind is the subdirectory; the name is the filename without `.md`. It is picked up automatically — `--list`, name resolution, and the catalog all read the directory at runtime. No code change is needed to register it.
|
|
55
|
+
|
|
56
|
+
Blueprint `.md` files ship in the published package via the `blueprints` entry in `package.json` `files`, so they resolve at `node_modules/@agent-native/core/blueprints/**` for end users.
|
|
57
|
+
|
|
58
|
+
Write each blueprint as an instruction set for a coding agent with no other context. A good blueprint has:
|
|
59
|
+
|
|
60
|
+
1. **A one-line goal** and a "you are a coding agent in an agent-native app, apply these as real source changes" framing.
|
|
61
|
+
2. **Read first** — the exact files that _are_ the contract.
|
|
62
|
+
3. **Files to touch** — concrete paths and what each change does.
|
|
63
|
+
4. **Framework rules to honor** — actions-first, no hardcoded secrets, scope ownable data, add a changeset for publishable-package source.
|
|
64
|
+
5. **Verify** — typecheck, a focused `*.spec.ts`, and an end-to-end check.
|
|
65
|
+
|
|
66
|
+
> [!TIP]
|
|
67
|
+
> A new curated blueprint under an existing kind needs no code, and a brand-new kind directory shows up in `--list` automatically too.
|
|
68
|
+
|
|
69
|
+
## What's next
|
|
70
|
+
|
|
71
|
+
- [**Sandbox Adapters**](/docs/sandbox-adapters) — the seam the `add sandbox docker` blueprint targets
|
|
72
|
+
- [**Actions**](/docs/actions) — the single source of truth every blueprint builds on
|
|
73
|
+
- [**External Agents**](/docs/external-agents) — connecting the coding agent you pipe blueprints into
|
|
@@ -0,0 +1,49 @@
|
|
|
1
|
+
---
|
|
2
|
+
title: "Durable Resume"
|
|
3
|
+
description: "When a hosted agent run is interrupted and resumes, completed side-effecting tool calls are not re-run — a tool-call journal derived from the durable ledger blocks duplicate sends, charges, and tickets."
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
# Durable Resume
|
|
7
|
+
|
|
8
|
+
Hosted agent runs get interrupted: a serverless function hits its hard timeout mid-stream, a gateway drops the connection at 45s, a socket hangs up, the platform cold-starts. The framework already recovers from these by saving the conversation prefix and re-running the LLM call ("continue from where you left off"). But recovery alone has a sharp edge: if the interrupted attempt **already sent an email or created a ticket**, a naive resume could do it again.
|
|
9
|
+
|
|
10
|
+
Durable resume closes that gap. On resume, the framework knows which side-effecting tool calls already completed and refuses to re-run them — at two layers.
|
|
11
|
+
|
|
12
|
+
## The tool-call journal {#journal}
|
|
13
|
+
|
|
14
|
+
The journal is a **pure read over the durable run-event ledger** — there is no new recording hook in the hot path. It classifies the tool calls already recorded for the current turn:
|
|
15
|
+
|
|
16
|
+
- **Completed** — a `tool_start` with a matching `tool_done`. The call ran, its side effect happened, and its result was recorded. **Do not re-run.**
|
|
17
|
+
- **Interrupted** — a `tool_start` with **no** matching `tool_done`. The call began, its side effect may or may not have landed, and the interruption ate the result. Outcome unknown.
|
|
18
|
+
|
|
19
|
+
Matching mirrors how durable turns are rebuilt elsewhere: a `tool_done` pairs with the oldest still-open `tool_start` for the same tool name (FIFO per tool). A `clear` event (discarded partial output) resets the per-turn tally so abandoned partials don't leave phantom open calls.
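
The classification and FIFO pairing described above can be sketched as a single pass over the turn's events. The event and entry shapes here (`LedgerEvent`, `JournalEntry`, `buildJournal`) are hypothetical, since the real ledger schema is not shown in this page. The sketch also assumes a `clear` only drops still-open calls; whether completed entries are reset too is not stated.

```typescript
// Hypothetical ledger event shapes, for illustration only.
type LedgerEvent =
  | { type: "tool_start"; tool: string; input: unknown }
  | { type: "tool_done"; tool: string; result: string }
  | { type: "clear" };

interface JournalEntry {
  tool: string;
  input: unknown;
  result?: string; // set only for completed calls
}

interface Journal {
  completed: JournalEntry[];
  interrupted: JournalEntry[];
}

function buildJournal(events: readonly LedgerEvent[]): Journal {
  const completed: JournalEntry[] = [];
  // Open (started, not yet done) calls, queued per tool name in start order.
  let open = new Map<string, JournalEntry[]>();

  for (const event of events) {
    if (event.type === "clear") {
      // Assumption: discarded partial output forgets open calls only.
      open = new Map<string, JournalEntry[]>();
    } else if (event.type === "tool_start") {
      const queue = open.get(event.tool) ?? [];
      queue.push({ tool: event.tool, input: event.input });
      open.set(event.tool, queue);
    } else {
      // tool_done pairs with the oldest still-open start for the same tool.
      const entry = open.get(event.tool)?.shift();
      if (entry) completed.push({ ...entry, result: event.result });
    }
  }

  // Anything still open had no matching tool_done: outcome unknown.
  const interrupted: JournalEntry[] = [];
  open.forEach((queue) => interrupted.push(...queue));
  return { completed, interrupted };
}
```

Because the function only reads events, it fits the "pure read over the ledger" framing: nothing is recorded while tools run.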
|
|
20
|
+
|
|
21
|
+
## Layer 1: prompt-level journal note {#prompt-note}
|
|
22
|
+
|
|
23
|
+
When a run resumes (soft timeout, gateway timeout, or any resumable transport error), the framework appends a **structured journal note** to the resume prompt, right after the "continue from where you left off" nudge. The note tells the model, in plain text:
|
|
24
|
+
|
|
25
|
+
- which tool calls **already completed** (with short results) so it reuses them and does **not** re-run them, and
|
|
26
|
+
- which tool calls were **interrupted with unknown outcome** so it verifies state before assuming success or failure.
|
|
27
|
+
|
|
28
|
+
When the journal is empty (a turn with no tool activity, or a clean continuation), nothing extra is appended and resume behavior is byte-for-byte what it was before. The note is best-effort: a failed ledger read never blocks a recovery that would otherwise succeed.
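
As a rough sketch of how such a note could be assembled: the exact wording the framework appends is not shown here, so the strings, the `buildJournalNote` name, and the 80-character result truncation are all assumptions. The property worth noticing is that an empty journal yields an empty string, so nothing extra reaches the prompt.

```typescript
// Hypothetical shapes and wording; the framework's real note text may differ.
interface NoteEntry {
  tool: string;
  result?: string;
}

interface NoteJournal {
  completed: NoteEntry[];
  interrupted: NoteEntry[];
}

function shorten(text: string, max: number): string {
  return text.length <= max ? text : text.slice(0, max) + "...";
}

function buildJournalNote(journal: NoteJournal): string {
  // Empty journal: append nothing, so resume behavior is unchanged.
  if (journal.completed.length === 0 && journal.interrupted.length === 0) return "";

  const lines: string[] = [];
  if (journal.completed.length > 0) {
    lines.push("Already completed (reuse these results, do not re-run):");
    for (const entry of journal.completed) {
      lines.push(`- ${entry.tool}: ${shorten(entry.result ?? "", 80)}`);
    }
  }
  if (journal.interrupted.length > 0) {
    lines.push("Interrupted with unknown outcome (verify state before assuming):");
    for (const entry of journal.interrupted) {
      lines.push(`- ${entry.tool}`);
    }
  }
  return lines.join("\n");
}
```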
|
|
29
|
+
|
|
30
|
+
## Layer 2: tool-layer hard-block {#hard-block}
|
|
31
|
+
|
|
32
|
+
The prompt note is advisory — a well-behaved model heeds it, but a model isn't a guarantee. So the loop also enforces it at the tool layer.
|
|
33
|
+
|
|
34
|
+
Before the loop runs in a resumed chunk, it snapshots the journal once (capturing only **prior** chunks of this logical turn). When the model re-dispatches a **write** tool whose tool name **and input** match a completed journal entry, the loop short-circuits: it returns the journaled result instead of executing the action, with a note that the call already completed in an earlier interrupted attempt and was not re-run to avoid a duplicate side effect.
|
|
35
|
+
|
|
36
|
+
Key properties:
|
|
37
|
+
|
|
38
|
+
- **Write tools only.** Read-only (`readOnly` / GET) actions are never blocked — re-reading is safe and idempotent.
|
|
39
|
+
- **Content-addressed.** The match is on tool name + input signature, so a resumed call sitting at a different position in the turn still matches; a _different_ call (different args) is treated as fresh and runs normally.
|
|
40
|
+
- **Consume-once.** Each completed entry is claimed when matched, so if the model issues two genuinely distinct calls with identical input in the same turn, only one of them short-circuits on a single journaled completion; the other runs as a fresh call.
|
|
41
|
+
- **Fresh calls untouched.** A first-turn call sees an empty journal; nothing changes for normal runs.
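
The properties above can be modeled with a small guard that is built once from the journal snapshot. This is a sketch under assumptions, not the loop's real code: `ResumeGuard`, `CompletedCall`, and the sorted-key JSON signature are hypothetical stand-ins for whatever input signature the framework computes.

```typescript
// Hypothetical shape of a completed journal entry.
interface CompletedCall {
  tool: string;
  input: unknown;
  result: string;
}

// Serialize with sorted object keys so equal inputs give equal signatures.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map((v) => stableStringify(v)).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`);
    return `{${body.join(",")}}`;
  }
  return JSON.stringify(value) ?? "null";
}

class ResumeGuard {
  // signature -> journaled results not yet claimed
  private remaining = new Map<string, string[]>();

  constructor(completed: readonly CompletedCall[]) {
    for (const call of completed) {
      const key = `${call.tool}:${stableStringify(call.input)}`;
      const results = this.remaining.get(key) ?? [];
      results.push(call.result);
      this.remaining.set(key, results);
    }
  }

  // Returns the journaled result when this write call already completed,
  // or undefined when the call should execute normally.
  claim(tool: string, input: unknown, readOnly: boolean): string | undefined {
    if (readOnly) return undefined; // reads are never blocked
    const key = `${tool}:${stableStringify(input)}`;
    return this.remaining.get(key)?.shift(); // consume-once
  }
}
```

A first-turn run would construct the guard from an empty list, so every `claim` returns `undefined` and nothing changes for normal runs.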
|
|
42
|
+
|
|
43
|
+
Together the two layers mean an interrupted run that already had a real side effect resumes without repeating it — no duplicate emails, charges, or tickets — while genuinely new work still runs.
|
|
44
|
+
|
|
45
|
+
## Related
|
|
46
|
+
|
|
47
|
+
- [**Real-Time Sync**](/docs/real-time-collaboration) — how the durable run ledger streams to the client and replays on reconnect.
|
|
48
|
+
- [**Actions**](/docs/actions) — `readOnly` marks reads as safe to re-run; everything else is treated as side-effecting.
|
|
49
|
+
- [**In-Loop Processors**](/docs/processors) — another loop-internal hardening seam.
|
|
@@ -0,0 +1,141 @@
|
|
|
1
|
+
---
|
|
2
|
+
title: "Evals (CI Gate)"
|
|
3
|
+
description: "Write *.eval.ts test cases that run the real agent against fixed inputs, score the output with composable scorers, and gate CI/deploys on a threshold."
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
# Evals (CI Gate)
|
|
7
|
+
|
|
8
|
+
Evals are a first-class testing primitive: you declare a prompt plus the behavior you expect, the runner **actually runs the agent loop** against that input, scores the output with composable scorers, and exits non-zero if any case scores below its threshold. That non-zero exit makes `agent-native eval` a drop-in CI deploy gate.
|
|
9
|
+
|
|
10
|
+
This is complementary to the post-hoc scoring in [Observability](/docs/observability):
|
|
11
|
+
|
|
12
|
+
- **Observability evals** (`observability/evals.ts`) — _"how did this real run do?"_ Passive, sampled, lives next to traces.
|
|
13
|
+
- **`*.eval.ts` (this primitive)** — _"does the agent do the right thing on this fixed input?"_ Active, deterministic, a CI gate run via the CLI.
|
|
14
|
+
|
|
15
|
+
The runner resolves a provider-agnostic engine/model from the existing registry — no model is hardcoded — so the same suite runs against whatever engine the app is configured for.
|
|
16
|
+
|
|
17
|
+
## Writing an eval {#writing}
|
|
18
|
+
|
|
19
|
+
Drop a `*.eval.ts` file anywhere in the app (or an `evals/*.ts` file). Each file's default export is a `defineEval(...)` result, or an array of them:
|
|
20
|
+
|
|
21
|
+
```ts
|
|
22
|
+
// evals/greeting.eval.ts
|
|
23
|
+
import { defineEval, contains, llmJudge } from "@agent-native/core/eval";
|
|
24
|
+
|
|
25
|
+
export default defineEval({
|
|
26
|
+
name: "greets the user by name",
|
|
27
|
+
input: { prompt: "Say hi to Ada." },
|
|
28
|
+
threshold: 0.7, // per-scorer pass bar; default 0.5
|
|
29
|
+
scorers: [
|
|
30
|
+
contains("Ada"),
|
|
31
|
+
llmJudge({ criteria: "friendliness", rubric: "1.0 = warm greeting" }),
|
|
32
|
+
],
|
|
33
|
+
});
|
|
34
|
+
```
|
|
35
|
+
|
|
36
|
+
An eval passes only when **every** scorer meets the threshold. Key `defineEval` fields:
|
|
37
|
+
|
|
38
|
+
| Field | Type | Notes |
|
|
39
|
+
| ----------- | --------------------- | ------------------------------------------------------------- |
|
|
40
|
+
| `name` | string | Required. Shown in the report. |
|
|
41
|
+
| `input` | `{ prompt, history }` | Required `prompt`; optional prior `{ role, text }` turns. |
|
|
42
|
+
| `scorers` | `Scorer[]` | Required, at least one. |
|
|
43
|
+
| `threshold` | number `0..1` | Per-scorer pass bar. Default `0.5`; overridable from the CLI. |
|
|
44
|
+
| `run` | function | Optional override for custom setup (seed data, multi-turn). |
|
|
45
|
+
|
|
46
|
+
The agent run handed to scorers is small and transport-agnostic:
|
|
47
|
+
|
|
48
|
+
```ts
|
|
49
|
+
interface AgentRunOutput {
|
|
50
|
+
text: string; // concatenated assistant text
|
|
51
|
+
toolCalls: readonly string[]; // tool/action names, in call order
|
|
52
|
+
ok: boolean; // completed without a terminal error
|
|
53
|
+
error?: string;
|
|
54
|
+
runId: string;
|
|
55
|
+
durationMs: number;
|
|
56
|
+
}
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
## Built-in scorers {#built-in}
|
|
60
|
+
|
|
61
|
+
Imported from `@agent-native/core/eval`:
|
|
62
|
+
|
|
63
|
+
| Scorer | Score | Model? |
|
|
64
|
+
| ------------------------ | ----------------------------------------------------------------- | ------ |
|
|
65
|
+
| `exactMatch(expected)` | `1.0` if text equals `expected` (trimmed, case-insensitive) | No |
|
|
66
|
+
| `contains(needles)` | Fraction of required substrings present (so partial hits surface) | No |
|
|
67
|
+
| `usesTool(toolName)` | `1.0` if the agent invoked that tool/action at least once | No |
|
|
68
|
+
| `llmJudge({ criteria })` | LLM-as-judge scored against a natural-language rubric, → `0..1` | Yes |
|
|
69
|
+
|
|
70
|
+
`exactMatch` and `contains` take an optional `{ caseSensitive }`. `llmJudge` takes `{ criteria, rubric?, name?, scoreRange? }` — its output is normalized to `[0, 1]`, and the judge model is whatever the runner resolved (never a hardcoded provider).
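
To make the fractional `contains` score and the per-scorer pass bar concrete, here is a standalone sketch. It is not the library's implementation: `containsScore` and `evalPasses` are hypothetical helpers, and it assumes matching is case-insensitive unless `caseSensitive` is set and that "meets the threshold" means greater than or equal.

```typescript
// Standalone illustration of substring-fraction scoring.
function containsScore(
  text: string,
  needles: readonly string[],
  opts: { caseSensitive?: boolean } = {},
): number {
  // Assumption: with nothing required, the score is a full 1.
  if (needles.length === 0) return 1;
  const normalize = (s: string): string => (opts.caseSensitive ? s : s.toLowerCase());
  const haystack = normalize(text);
  const hits = needles.filter((needle) => haystack.indexOf(normalize(needle)) !== -1).length;
  return hits / needles.length; // partial hits surface as a fraction
}

// An eval passes only when every scorer's score meets the threshold.
function evalPasses(scores: readonly number[], threshold: number = 0.5): boolean {
  return scores.every((score) => score >= threshold);
}

const score = containsScore("Hi Ada, welcome aboard!", ["Ada", "Grace"]);
console.log(score, evalPasses([score, 1], 0.7));
```

With one of two needles present the score is `0.5`, which clears the default bar but not a `0.7` threshold, so a single weak scorer is enough to fail the eval.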
|
|
71
|
+
|
|
72
|
+
## Custom scorers: the 4-step pipeline {#custom}
|
|
73
|
+
|
|
74
|
+
`createScorer` builds a scorer from a Mastra-style 4-step pipeline. Only `generateScore` is required:
|
|
75
|
+
|
|
76
|
+
```txt
|
|
77
|
+
preprocess(run) → x transform the run/output (optional)
|
|
78
|
+
analyze(x, ctx) → analysis plain JS OR an LLM judge (optional)
|
|
79
|
+
generateScore(a) → 0..1 REQUIRED, normalized
|
|
80
|
+
generateReason(...) → string human-readable why (optional)
|
|
81
|
+
```
|
|
82
|
+
|
|
83
|
+
`preprocess` and `analyze` default to identity (the scorer sees the raw `AgentRunOutput`). The `analyze` step receives a `ctx` with a provider-agnostic `judge()` helper for LLM-backed scoring:
|
|
84
|
+
|
|
85
|
+
```ts
|
|
86
|
+
import { createScorer, clamp01 } from "@agent-native/core/eval";
|
|
87
|
+
|
|
88
|
+
// A scorer that rewards short, tool-using answers.
|
|
89
|
+
const concise = createScorer({
|
|
90
|
+
name: "concise_with_tool",
|
|
91
|
+
analyze(run) {
|
|
92
|
+
return {
|
|
93
|
+
words: run.text.trim().split(/\s+/).length,
|
|
94
|
+
usedTool: run.toolCalls.length > 0,
|
|
95
|
+
};
|
|
96
|
+
},
|
|
97
|
+
generateScore({ words, usedTool }) {
|
|
98
|
+
if (!usedTool) return 0;
|
|
99
|
+
return clamp01(1 - Math.max(0, words - 40) / 200);
|
|
100
|
+
},
|
|
101
|
+
generateReason({ analysis }) {
|
|
102
|
+
return `${analysis.words} words, tool used: ${analysis.usedTool}`;
|
|
103
|
+
},
|
|
104
|
+
});
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
## Running the gate {#cli}
|
|
108
|
+
|
|
109
|
+
```bash
|
|
110
|
+
agent-native eval # run every *.eval.ts; non-zero exit on failure
|
|
111
|
+
agent-native eval billing # only files whose path contains "billing"
|
|
112
|
+
agent-native eval --json # machine-readable report (for CI)
|
|
113
|
+
agent-native eval --threshold 0.8 # override every eval's pass threshold (0..1)
|
|
114
|
+
```
|
|
115
|
+
|
|
116
|
+
The command discovers `**/*.eval.ts` and `evals/*.ts` under the current app, runs the agent for each input, scores it, prints a readable table (or JSON), and **exits non-zero if any eval scores below its threshold**.
|
|
117
|
+
|
|
118
|
+
Exit codes:
|
|
119
|
+
|
|
120
|
+
| Code | Meaning |
|
|
121
|
+
| ---- | --------------------------------------------------------------- |
|
|
122
|
+
| `0` | All evals passed — _or_ no eval files were found (CI-friendly). |
|
|
123
|
+
| `1` | At least one eval scored below threshold, or the suite errored. |
|
|
124
|
+
| `2` | Bad arguments (e.g. `--threshold` outside `[0, 1]`). |
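
The exit-code table can be read as a small decision function. The sketch below only restates the table; `SuiteResult` and `exitCodeFor` are hypothetical names, and the order of checks (arguments first, then failures) is an assumption.

```typescript
// Hypothetical summary of one `agent-native eval` invocation.
interface SuiteResult {
  evalFilesFound: number;
  failedEvals: number; // evals with any scorer below threshold
  suiteErrored: boolean;
}

function exitCodeFor(thresholdFlag: number | undefined, result: SuiteResult): 0 | 1 | 2 {
  // Bad arguments: a --threshold outside [0, 1] (NaN also lands here).
  if (thresholdFlag !== undefined && !(thresholdFlag >= 0 && thresholdFlag <= 1)) return 2;
  // No eval files is a pass, so apps that have not adopted evals stay green.
  if (result.evalFilesFound === 0) return 0;
  return result.failedEvals > 0 || result.suiteErrored ? 1 : 0;
}
```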
|
|
125
|
+
|
|
126
|
+
### As a CI deploy gate {#ci}
|
|
127
|
+
|
|
128
|
+
Add it to the pipeline that runs before a deploy:
|
|
129
|
+
|
|
130
|
+
```yaml
|
|
131
|
+
# .github/workflows/deploy.yml (excerpt)
|
|
132
|
+
- run: npx agent-native eval --json
|
|
133
|
+
```
|
|
134
|
+
|
|
135
|
+
A regression that drops any scorer below threshold fails the step and blocks the deploy. An app with no eval files exits `0`, so adopting evals is opt-in per app.
|
|
136
|
+
|
|
137
|
+
## What's next
|
|
138
|
+
|
|
139
|
+
- [**Observability**](/docs/observability) — post-hoc scoring of real production runs (the complementary layer)
|
|
140
|
+
- [**Actions**](/docs/actions) — the tools/actions that show up in `toolCalls`
|
|
141
|
+
- [**Agent Teams**](/docs/agent-teams) — sub-agents an eval might exercise
|
|
@@ -386,7 +386,7 @@ Every allow-listed template that produces or lists a navigable resource ships a
|
|
|
386
386
|
|
|
387
387
|
## Authoring: the `link` builder {#link-builder}
|
|
388
388
|
|
|
389
|
-
This section is for template authors. `defineAction` accepts an optional `link` builder. When set, every MCP/A2A result for that tool auto-appends a markdown `[label →](absoluteUrl)` block and a structured `_meta["agent-native/openLink"] = { label, view, webUrl, desktopUrl }`. `tools/list` adds `annotations["agent-native/producesOpenLink"]` and a description suffix so the external agent knows the tool yields an openable link and should surface it.
|
|
389
|
+
This section is for template authors. `defineAction` accepts an optional `link` builder. When set, every MCP/A2A result for that tool auto-appends a markdown `[label →](absoluteUrl)` block and a structured `_meta["agent-native/openLink"] = { label, view, webUrl, desktopUrl, vscodeUrl }`. `tools/list` adds `annotations["agent-native/producesOpenLink"]` and a description suffix so the external agent knows the tool yields an openable link and should surface it.
|
|
390
390
|
|
|
391
391
|
Build the URL with `buildDeepLink(...)` — it is the single source of truth for the open-route format. Never hand-format the `/_agent-native/open` URL.
|
|
392
392
|
|
|
@@ -431,7 +431,7 @@ See [MCP Apps](/docs/mcp-apps) for the full authoring guide — `embedRoute` vs
|
|
|
431
431
|
|
|
432
432
|
The `link` builder is **pure and synchronous — no I/O, no awaits**. It runs best-effort: a throw, `null`, or `undefined` is swallowed and **never** fails the tool call. It only reads the call's `args` and `result`; it must not query the DB, read app-state, or call other actions. Return `null` when there's nothing to open.
|
|
433
433
|
|
|
434
|
-
`buildDeepLink({ app, view, params?, to?, compose? })` returns the app-relative path `/_agent-native/open?app=…&view=…&<recordId>=…`. The MCP layer turns that into an absolute web URL (`toAbsoluteOpenUrl`, using the request origin)
|
|
434
|
+
`buildDeepLink({ app, view, params?, to?, compose? })` returns the app-relative path `/_agent-native/open?app=…&view=…&<recordId>=…`. The MCP layer turns that into an absolute web URL (`toAbsoluteOpenUrl`, using the request origin), a desktop `agentnative://open?…` URL (`toDesktopOpenUrl`), and a VS Code extension `vscode://builderio.agent-native/open?url=…` URL (`toVsCodeOpenUrl`); the markdown link uses the desktop URL when the client signals `target: "desktop"`.
|
|
435
435
|
|
|
436
436
|
### The `/_agent-native/open` route {#open-route}
|
|
437
437
|
|
|
@@ -0,0 +1,101 @@
|
|
|
1
|
+
---
|
|
2
|
+
title: "Human-in-the-Loop Approvals"
|
|
3
|
+
description: "Pause the agent before a high-consequence action runs — defineAction's needsApproval gate emits an approval_required event, the human approves, and only then does the tool execute."
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
# Human-in-the-Loop Approvals
|
|
7
|
+
|
|
8
|
+
Most actions should just run. A few — sending an email, charging a card, deleting an account — are outward-facing and hard to undo, and you don't want the agent to do them autonomously. For those, `defineAction` has an opt-in **approval gate**: when the agent tries to call the action, the loop pauses, surfaces an Approve/Deny affordance to the human, and runs the action _only_ after the human approves that specific call.
|
|
9
|
+
|
|
10
|
+
> [!WARNING]
|
|
11
|
+
> Keep approvals rare. Every gated action is a hard stop in the agent loop — it interrupts the run and demands a human round-trip. Use `needsApproval` only for genuinely high-consequence, hard-to-undo, outward-facing operations. If you find yourself gating reads or routine writes, you're holding it wrong. The default is **off**, and almost every action should leave it off.
|
|
12
|
+
|
|
13
|
+
## The `needsApproval` gate {#needs-approval}
|
|
14
|
+
|
|
15
|
+
Set `needsApproval` on a `defineAction`. It accepts a boolean or a predicate:
|
|
16
|
+
|
|
17
|
+
```ts
|
|
18
|
+
// actions/send-email.ts
|
|
19
|
+
export default defineAction({
|
|
20
|
+
description: "Send an email via Gmail.",
|
|
21
|
+
schema: z.object({
|
|
22
|
+
to: z.string(),
|
|
23
|
+
subject: z.string(),
|
|
24
|
+
body: z.string(),
|
|
25
|
+
}),
|
|
26
|
+
// Sending is outward-facing and hard to undo, so the agent can never send
|
|
27
|
+
// without a human approving the specific call. Drafting/queueing is
|
|
28
|
+
// unaffected — only the real send is gated.
|
|
29
|
+
needsApproval: true,
|
|
30
|
+
run: async (args) => {
|
|
31
|
+
/* ...actually send... */
|
|
32
|
+
},
|
|
33
|
+
});
|
|
34
|
+
```
|
|
35
|
+
|
|
36
|
+
- **`needsApproval: true`** — always require approval.
|
|
37
|
+
- **`needsApproval: (args, ctx) => boolean | Promise<boolean>`** — require approval only when the predicate returns true. Gate conditionally, e.g. only for external recipients or only above a dollar threshold:
|
|
38
|
+
|
|
39
|
+
```ts
|
|
40
|
+
needsApproval: (args) => !args.to.endsWith("@your-company.com"),
|
|
41
|
+
```
|
|
42
|
+
|
|
43
|
+
Keep the predicate pure and fast. **It fails closed**: if the predicate throws, the framework treats that as "approval required" rather than silently running a high-consequence action.
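
The fail-closed rule can be sketched as follows. `requiresApproval` and the `NeedsApproval` type are hypothetical names for illustration; only the accepted shapes (a boolean or an `(args, ctx)` predicate that may return a promise) and the "a throw means approval required" behavior come from the text above.

```typescript
// Hypothetical type mirroring the two accepted forms of `needsApproval`.
type NeedsApproval<A> = boolean | ((args: A, ctx: unknown) => boolean | Promise<boolean>);

async function requiresApproval<A>(
  gate: NeedsApproval<A> | undefined,
  args: A,
  ctx: unknown,
): Promise<boolean> {
  if (gate === undefined) return false; // omitted: the action just runs
  if (typeof gate === "boolean") return gate;
  try {
    return await gate(args, ctx);
  } catch {
    // Fail closed: a broken predicate must not let a risky action through.
    return true;
  }
}

// Gate only external recipients, as in the predicate example above.
const externalOnly = (args: { to: string }): boolean => !args.to.endsWith("@your-company.com");

requiresApproval(externalOnly, { to: "someone@elsewhere.example" }, {}).then((needed) => {
  console.log("approval needed:", needed);
});
```

Awaiting inside the `try` matters: it routes both synchronous throws and rejected promises from the predicate into the same fail-closed branch.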
|
|
44
|
+
|
|
45
|
+
When `needsApproval` is omitted, behavior is byte-for-byte unchanged — there is no extra cost on the common path.
|
|
46
|
+
|
|
47
|
+
This works the same for legacy `parameters`-style actions and schema-based actions, and for the in-app agent, sub-agents, A2A, and MCP callers (every agent surface routes through the same loop).
|
|
48
|
+
|
|
49
|
+
## How the loop pauses {#loop}
|
|
50
|
+
|
|
51
|
+
When the agent calls a gated action and this specific call has **not** already been approved, the loop does **not** execute `run()`. Instead it:
|
|
52
|
+
|
|
53
|
+
1. Resolves the gate. For a predicate, it calls `needsApproval(input, ctx)`; a throw is treated as "must approve" (fail closed).
|
|
54
|
+
2. Emits a `tool_start` event (so the UI shows the call) followed immediately by an **`approval_required`** event, then stops the turn. The action's side effect never happens.
|
|
55
|
+
|
|
56
|
+
The `approval_required` event carries everything the client needs to render an affordance:
|
|
57
|
+
|
|
58
|
+
| Field | Type | Notes |
|
|
59
|
+
| ------------- | -------- | ------------------------------------------------------------------- |
|
|
60
|
+
| `tool` | `string` | The action name the agent tried to call. |
|
|
61
|
+
| `input` | object | The arguments the agent passed. |
|
|
62
|
+
| `approvalKey` | `string` | **Stable key** the client echoes back to approve _this exact call_. |
|
|
63
|
+
| `toolCallId` | `string` | The model-side tool-call id, when available. |
|
|
64
|
+
|
|
65
|
+
The `approvalKey` is derived deterministically from the tool name plus its input, so the same logical call always produces the same key. The model never sees or sets it — it is purely a handshake between the framework and the human's Approve affordance.
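
One way to get a deterministic, content-addressed key is to canonicalize the input and hash it together with the tool name. The framework's real derivation is not documented here, so everything below is an assumption made for illustration: the `approvalKey` and `isApproved` helpers, the sorted-key canonical form, and the FNV-1a-style hash (picked only to keep the sketch dependency-free).

```typescript
// Rebuild a value with object keys in sorted order so key order never matters.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => canonicalize(item));
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) sorted[key] = canonicalize(source[key]);
    return sorted;
  }
  return value;
}

// FNV-1a-style 32-bit hash over UTF-16 code units, returned as 8 hex digits.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Hypothetical key format: "<tool>:<hash of tool name + canonical input>".
function approvalKey(tool: string, input: unknown): string {
  return `${tool}:${fnv1a(`${tool}\n${JSON.stringify(canonicalize(input))}`)}`;
}

// The gate lets a call run only if its key was echoed back by the client.
function isApproved(tool: string, input: unknown, approvedToolCalls: readonly string[]): boolean {
  return approvedToolCalls.indexOf(approvalKey(tool, input)) !== -1;
}
```

A production derivation would more likely use a cryptographic hash; the point of the sketch is only that the same tool and arguments always map to the same key, while any change in arguments maps to a new one.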
|
|
66
|
+
|
|
67
|
+
The paused tool returns a result telling the model the turn is paused and not to retry, so the model doesn't spin.
|
|
68
|
+
|
|
69
|
+
## How the human approves {#approve}
|
|
70
|
+
|
|
71
|
+
On `approval_required`, the chat UI renders an **Approve / Deny** affordance on the paused tool call. This is wired automatically in `AssistantChat` — you don't build it per template.
|
|
72
|
+
|
|
73
|
+
- **Approve** re-issues the turn (an ordinary continuation message) carrying the call's key in `approvedToolCalls: [approvalKey]`. On the re-issued turn, the gate sees the key in the approved set and lets that specific call run normally.
|
|
74
|
+
- **Deny** dismisses the affordance locally; nothing is re-issued, so the action never runs.
|
|
75
|
+
|
|
76
|
+
`approvedToolCalls` is a field on the chat request (`AgentChatRequest.approvedToolCalls`). Keys not present in it stay paused, so approving one call never blanket-approves others. Because the key is content-addressed, an approval authorizes _that call with those arguments_; if the model later proposes a different send, that's a new key and a fresh approval.
|
|
77
|
+
|
|
78
|
+
## End-to-end {#flow}
|
|
79
|
+
|
|
80
|
+
```txt
|
|
81
|
+
agent calls send-email
|
|
82
|
+
│
|
|
83
|
+
▼
|
|
84
|
+
needsApproval truthy, call not yet approved
|
|
85
|
+
│ loop emits tool_start + approval_required { tool, input, approvalKey }
|
|
86
|
+
▼
|
|
87
|
+
turn pauses — run() did NOT execute
|
|
88
|
+
│
|
|
89
|
+
human clicks Approve in the chat UI
|
|
90
|
+
│ client re-issues the turn with approvedToolCalls: [approvalKey]
|
|
91
|
+
▼
|
|
92
|
+
gate sees the key → run() executes → email sends
|
|
93
|
+
```
|
|
94
|
+
|
|
95
|
+
The canonical (and intentionally rare) use of this gate in the framework is the Mail template's `send-email` action, which sets `needsApproval: true` so the agent can draft and queue freely but can never actually send a message without a human approving the specific send.
|
|
96
|
+
|
|
97
|
+
## Related
|
|
98
|
+
|
|
99
|
+
- [**Actions**](/docs/actions#needs-approval) — the full `defineAction` surface, including `outputSchema` for validating return values.
|
|
100
|
+
- [**Security**](/docs/security) — when to reach for an approval gate vs. hiding an action from the model.
|
|
101
|
+
- [**Mail template**](/docs/template-mail) — `send-email` is the reference example.
|
|
@@ -188,6 +188,27 @@ await putSetting("observability-config", {
|
|
|
188
188
|
|
|
189
189
|
The framework emits `gen_ai.*` semantic convention spans compatible with the OpenTelemetry GenAI spec.
|
|
190
190
|
|
|
191
|
+
## OpenTelemetry spans {#otel}
|
|
192
|
+
|
|
193
|
+
Separate from the `exporters` config above (which ships the in-house traces to an OTLP endpoint), the agent loop can also emit **live OpenTelemetry spans** for every run, model call, and tool call — so a host that already runs an OTel collector sees agent activity alongside the rest of its distributed traces.
|
|
194
|
+
|
|
195
|
+
This layer is **optional and no-op by default**:
|
|
196
|
+
|
|
197
|
+
- `@opentelemetry/api` is an **optional dependency**. If it isn't installed, the helpers degrade to silent no-ops — nothing here ever throws into the agent loop.
|
|
198
|
+
- Even when the api package _is_ present, it ships a default no-op tracer. Spans only become real once the **host registers a `TracerProvider`** (via `@opentelemetry/sdk-node` or similar). The framework deliberately does **not** depend on the heavy SDK/exporter packages or register a provider itself — instrumentation is opt-in by the embedding app.
|
|
199
|
+
|
|
200
|
+
So the cost when you haven't wired OTel is a couple of cached property reads per call. To turn it on, install the api package plus your SDK and register a provider at server startup the same way you would for any other Node service.
|
|
201
|
+
|
|
202
|
+
The agent loop emits three span kinds:
|
|
203
|
+
|
|
204
|
+
| Span | When | Attributes |
|
|
205
|
+
| ----------- | -------------------------- | ----------------------------------------------------------------- |
|
|
206
|
+
| `agent.run` | once per agent run | `agent.run_id`, `agent.thread_id`, `agent.user_id`, `agent.model` |
|
|
207
|
+
| `tool.call` | once per action invocation | `tool.name`, plus success/error status |
|
|
208
|
+
| `llm.call` | per model call | timing + OK/error status |
|
|
209
|
+
|
|
210
|
+
Spans are finished with OK/ERROR status and record the error message on failure. Zero/sentinel attribute values are pruned so spans aren't cluttered with noise. This OTel layer is purely additive to the in-house `agent_trace_spans` / `agent_trace_summaries` tables that power the dashboard above — both are produced from the same run events.
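
Attribute pruning of this kind can be sketched as a filter applied before attributes are set on a span. Which values count as "zero/sentinel" is not spelled out above, so this sketch assumes `undefined`, `null`, the empty string, `0`, and `NaN`; `pruneAttributes` is a hypothetical helper name.

```typescript
type AttributeValue = string | number | boolean;

// Drop empty or placeholder values so spans carry only meaningful attributes.
function pruneAttributes(
  attrs: Record<string, AttributeValue | null | undefined>,
): Record<string, AttributeValue> {
  const kept: Record<string, AttributeValue> = {};
  for (const key of Object.keys(attrs)) {
    const value = attrs[key];
    // Assumed sentinel set: undefined, null, "", 0, NaN.
    if (value === undefined || value === null || value === "" || value === 0) continue;
    if (typeof value === "number" && isNaN(value)) continue;
    kept[key] = value;
  }
  return kept;
}

console.log(
  pruneAttributes({
    "agent.run_id": "run_1",
    "agent.thread_id": "",
    "agent.user_id": undefined,
    "agent.model": "example-model",
  }),
);
```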
|
|
211
|
+
|
|
191
212
|
## Error reporting (Sentry) {#sentry}
|
|
192
213
|
|
|
193
214
|
Server-side errors that escape Nitro route handlers are reported to Sentry when a DSN is configured. Without it the SDK silently no-ops, so it's safe to leave the env vars unset in dev. Browser and server events can go to the same Sentry project; split them into separate projects only when you want operational separation for ownership, volume, quotas, or alert routing.
|