@elevasis/sdk 0.5.13 → 0.5.15

@@ -1,147 +1,163 @@
1
- ---
2
- title: Roadmap
3
- description: Planned SDK features -- error taxonomy, retry semantics, circuit breaker, metrics, alerting, and resource lifecycle extensions
4
- loadWhen: "Asking about future features or planned capabilities"
5
- ---
6
-
7
- **Status:** Planned -- none of these features are implemented yet.
8
-
9
- For currently implemented behavior, see [Runtime](runtime.mdx).
10
-
11
- ---
12
-
13
- ## Structured Error Taxonomy
14
-
15
- The current runtime reports errors as plain strings. A future SDK version will introduce a structured error taxonomy. All SDK errors will extend `ResourceError`, the base class for errors surfaced through the execution protocol.
16
-
17
- Every error carries: `message` (string), `code` (string enum), `details` (optional structured data), and `retryable` (boolean).
18
-
19
- **Error types:**
20
-
21
- - **`ResourceError`** -- Base class for all SDK errors
22
- - **`ValidationError`** -- Input or output schema validation failed. Thrown automatically when Zod `.parse()` fails. Code: `VALIDATION_ERROR`. Not retryable.
23
- - **`StepError`** -- A workflow step handler threw. Includes `stepId` and `stepName`. Code: `STEP_ERROR`. Retryable (transient failures may succeed on retry).
24
- - **`ToolError`** -- A tool execution failed. Includes `toolName`. Code: `TOOL_ERROR`. Retryability depends on the underlying error.
25
- - **`TimeoutError`** -- Execution exceeded the deadline. Code: `TIMEOUT`. Retryable.
26
- - **`CancellationError`** -- Execution was cancelled by the platform or user. Code: `CANCELLED`. Not retryable.
27
-
28
- ---
29
-
30
- ## Retry Semantics
31
-
32
- Retries are platform-side only -- workers are ephemeral and never retry internally.
33
-
34
- - **Configuration:** Per-resource via `maxRetries` (default: 0) and `backoffStrategy` (exponential with jitter)
35
- - **Retryable conditions:** Worker crash or timeout (worker terminated by `AbortSignal`)
36
- - **Non-retryable conditions:** Worker reports `status: 'failed'` (handler ran and returned an error -- application logic, not infrastructure failure), user cancellation
37
- - **Idempotency:** On retry, the same `executionId` is reused. Design handlers to be idempotent where possible.
38
-
39
- ---
40
-
41
- ## Workflow Step Failure
42
-
43
- Default behavior is fail-fast: when a step throws, the workflow fails immediately.
44
-
45
- - **Error handler:** Optional `onError` callback per step. The callback receives the error and can return a recovery value or rethrow to propagate the failure.
46
- - **Partial output:** Steps completed before the failure are included in the error response. The platform receives: `failedStepId`, `completedSteps[]` (IDs of successfully completed steps), and `partialOutput` (the last successful step's output).
47
-
48
- **Step error response format:**
49
-
50
- ```json
51
- {
52
- "status": "failed",
53
- "error": {
54
- "code": "STEP_ERROR",
55
- "message": "Email delivery failed: invalid address",
56
- "stepId": "send-welcome",
57
- "stepName": "Send Welcome Email"
58
- },
59
- "completedSteps": ["validate"],
60
- "partialOutput": { "clientName": "Jane", "isValid": true }
61
- }
62
- ```
63
-
64
- ---
65
-
66
- ## Agent Failure Modes
67
-
68
- Agents have multiple failure paths due to the LLM loop:
69
-
70
- - **Max iterations reached:** The agent returns the best output produced so far, plus a warning flag (`maxIterationsReached: true`). This is a graceful termination, not an error.
71
- - **Tool crash:** Tool errors are caught by the SDK runtime, formatted as a tool result, and sent back to the LLM. The LLM decides whether to retry, try a different approach, or give up.
72
- - **Model refusal:** If the model refuses the prompt, the SDK retries once with an adjusted system prompt. If the retry also refuses, the agent fails with `code: 'MODEL_REFUSAL'`.
73
- - **Model API error:** Network errors, rate limits, or server errors from the model provider. The SDK retries with exponential backoff (3 attempts, 1s/2s/4s), then fails with `code: 'MODEL_ERROR'`.
74
- - **Agent error response includes:** `iterationCount` (LLM iterations completed), `toolCallHistory` (array of tool calls made), `lastModelResponse` (final response from the LLM before failure).
75
-
76
- ---
77
-
78
- ## Circuit Breaker
79
-
80
- The platform will implement a circuit breaker to prevent runaway failures:
81
-
82
- - **Trip condition:** 5 consecutive failures on the same resource within a 10-minute window
83
- - **Action:** Pause executions for 60 seconds. New execution requests for that resource return `503` with: "Resource temporarily unavailable (circuit breaker tripped)"
84
- - **Auto-recovery:** After the 60-second pause, the next execution attempt is allowed through. If it succeeds, the circuit breaker resets. If it fails, the pause extends (120s, then 240s, capped at 5 minutes).
85
- - **Alerting:** You are notified via webhook callback or email when the circuit breaker trips. Configurable per organization.
86
-
87
- ---
88
-
89
- ## Metrics
90
-
91
- ### Auto-Collected
92
-
93
- The SDK runtime and platform will automatically collect these metrics for every execution:
94
-
95
- - `execution_duration_ms` -- Total wall-clock time from request received to result sent
96
- - `step_duration_ms` -- Per-step timing for workflows (array of `{ stepId, durationMs }`)
97
- - `iteration_count` -- Number of LLM loop iterations for agents
98
- - `ai_token_usage` -- Token counts per model call: `{ prompt_tokens, completion_tokens, total_tokens }`
99
- - `ai_cost_usd` -- Calculated from model pricing multiplied by token usage
100
- - `tool_call_count` -- Total number of tool invocations during the execution
101
- - `tool_call_duration_ms` -- Per-tool timing (array of `{ toolName, durationMs }`)
102
- - `error_count` -- Number of errors encountered (including recovered errors)
103
-
104
- ### Cost Attribution
105
-
106
- Metrics are aggregated at multiple levels:
107
-
108
- - **Per-execution:** Total AI spend and total duration
109
- - **Per-resource:** Aggregated over configurable time periods (daily, weekly, monthly)
110
- - **Per-organization:** Total platform cost (execution time + AI spend + managed hosting compute)
111
- - **Visibility:** Platform dashboard and the CLI via `elevasis-sdk executions <resourceId>`
112
-
113
- ### Developer-Defined Metrics
114
-
115
- A future SDK version will support custom metrics emitted from your handlers:
116
-
117
- - `sdk.metrics.counter('custom_name', value)` -- Increment a counter
118
- - `sdk.metrics.gauge('queue_depth', value)` -- Set a point-in-time gauge value
119
- - Custom metrics are stored alongside auto-collected metrics and queryable through the same APIs
120
-
121
- ---
122
-
123
- ## Alerting
124
-
125
- Developer-configurable alerts for production monitoring:
126
-
127
- - **Error rate threshold:** Notify when more than X% of executions fail within Y minutes
128
- - **Latency percentile:** Notify when p95 execution duration exceeds a threshold
129
- - **Cost budget:** Notify when daily or weekly AI spend exceeds a configured limit
130
- - **Channel:** Webhook callback to a developer-provided URL (integrates with Slack, PagerDuty, and similar services via webhook)
131
-
132
- ---
133
-
134
- ## Resource Lifecycle Extensions
135
-
136
- ### Deprecation Status
137
-
138
- Beyond the current `dev` and `production` statuses, two additional statuses are planned:
139
-
140
- - **`deprecated`** -- Marked via the platform UI. Existing executions continue working. New executions show a warning: "This resource is deprecated." The resource still appears in the platform and can still be triggered.
141
- - **`offline`** -- Set automatically when a deployment is unregistered (for example, after a failed deploy or explicit deletion). Clears automatically on the next successful deploy. No executions are accepted while a resource is offline.
142
-
143
- Deprecation requires no automatic removal -- the developer must explicitly delete the resource to remove it.
144
-
145
- ---
146
-
147
- **Last Updated:** 2026-02-25
1
+ ---
2
+ title: Roadmap
3
+ description: Planned SDK features -- error taxonomy, retry semantics, circuit breaker, metrics, alerting, and resource lifecycle extensions
4
+ loadWhen: "Asking about future features or planned capabilities"
5
+ ---
6
+
7
+ **Status:** Mixed -- some features below are implemented, others remain planned. Each section notes its current status.
8
+
9
+ For currently implemented behavior, see [Runtime](runtime.mdx).
10
+
11
+ ---
12
+
13
+ ## Structured Error Taxonomy
14
+
15
+ **Status: Partially implemented.** The runtime has a structured error hierarchy (`ExecutionError`, `PlatformToolError`, `ToolingError`) with error codes and context fields. The taxonomy below describes a planned _redesign_ that is not yet implemented.
16
+
17
+ The current runtime reports errors as plain strings. A future SDK version will introduce a structured error taxonomy. All SDK errors will extend `ResourceError`, the base class for errors surfaced through the execution protocol.
18
+
19
+ Every error carries: `message` (string), `code` (string enum), `details` (optional structured data), and `retryable` (boolean).
20
+
21
+ **Error types:**
22
+
23
+ - **`ResourceError`** -- Base class for all SDK errors
24
+ - **`ValidationError`** -- Input or output schema validation failed. Thrown automatically when Zod `.parse()` fails. Code: `VALIDATION_ERROR`. Not retryable.
25
+ - **`StepError`** -- A workflow step handler threw. Includes `stepId` and `stepName`. Code: `STEP_ERROR`. Retryable (transient failures may succeed on retry).
26
+ - **`ToolError`** -- A tool execution failed. Includes `toolName`. Code: `TOOL_ERROR`. Retryability depends on the underlying error.
27
+ - **`TimeoutError`** -- Execution exceeded the deadline. Code: `TIMEOUT`. Retryable.
28
+ - **`CancellationError`** -- Execution was cancelled by the platform or user. Code: `CANCELLED`. Not retryable.
29
+
30
+ ---
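Annotation (not part of the package contents): the planned taxonomy in the added lines above could look roughly like the following sketch. The constructor shapes are assumptions; the diff only names the classes, their codes, their extra fields, and whether they are retryable. Only three of the five subclasses are shown.

```typescript
// Hypothetical sketch of the planned error taxonomy; not the SDK's actual classes.
class ResourceError extends Error {
  constructor(
    message: string,
    readonly code: string,
    readonly retryable: boolean,
    readonly details?: Record<string, unknown>,
  ) {
    super(message);
    this.name = new.target.name;
    // Keeps `instanceof` working even when compiling to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Schema validation failed: never retryable.
class ValidationError extends ResourceError {
  constructor(message: string, details?: Record<string, unknown>) {
    super(message, 'VALIDATION_ERROR', false, details);
  }
}

// A workflow step handler threw: retryable, carries step identity.
class StepError extends ResourceError {
  constructor(
    message: string,
    readonly stepId: string,
    readonly stepName: string,
  ) {
    super(message, 'STEP_ERROR', true);
  }
}

// Execution exceeded the deadline: retryable.
class TimeoutError extends ResourceError {
  constructor(message = 'Execution exceeded the deadline') {
    super(message, 'TIMEOUT', true);
  }
}
```

A caller could then branch on `err.retryable` or `err.code` instead of parsing message strings.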
31
+
32
+ ## Retry Semantics
33
+
34
+ **Status: Planned.**
35
+
36
+ Retries are platform-side only -- workers are ephemeral and never retry internally.
37
+
38
+ - **Configuration:** Per-resource via `maxRetries` (default: 0) and `backoffStrategy` (exponential with jitter)
39
+ - **Retryable conditions:** Worker crash or timeout (worker terminated by `AbortSignal`)
40
+ - **Non-retryable conditions:** Worker reports `status: 'failed'` (handler ran and returned an error -- application logic, not infrastructure failure), user cancellation
41
+ - **Idempotency:** On retry, the same `executionId` is reused. Design handlers to be idempotent where possible.
42
+
43
+ ---
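Annotation (not part of the package contents): the retry rules above reduce to a small decision function. The sketch below assumes a hypothetical `WorkerOutcome` label for how an execution ended; only `maxRetries` and its default of 0 come from the diff.

```typescript
// Hypothetical sketch of the planned platform-side retry decision; not SDK code.
type WorkerOutcome = 'completed' | 'failed' | 'cancelled' | 'crashed' | 'timeout';

interface RetryConfig {
  maxRetries: number; // per-resource setting; the documented default is 0
}

function shouldRetry(
  outcome: WorkerOutcome,
  retriesSoFar: number,
  config: RetryConfig = { maxRetries: 0 },
): boolean {
  // Only infrastructure failures qualify. A handler that ran and reported
  // 'failed', or a user cancellation, is never retried.
  const infrastructureFailure = outcome === 'crashed' || outcome === 'timeout';
  return infrastructureFailure && retriesSoFar < config.maxRetries;
}
```

Because a retry reuses the same `executionId`, a handler can use that ID as an idempotency key for its side effects.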
44
+
45
+ ## Workflow Step Failure
46
+
47
+ **Status: Planned.** The current runtime uses fail-fast behavior with `WorkflowStepError`. The `onError` callback, `completedSteps`, and `partialOutput` features described below are not yet implemented.
48
+
49
+ Default behavior is fail-fast: when a step throws, the workflow fails immediately.
50
+
51
+ - **Error handler:** Optional `onError` callback per step. The callback receives the error and can return a recovery value or rethrow to propagate the failure.
52
+ - **Partial output:** Steps completed before the failure are included in the error response. The platform receives: `failedStepId`, `completedSteps[]` (IDs of successfully completed steps), and `partialOutput` (the last successful step's output).
53
+
54
+ **Proposed step error response format** (not yet implemented -- subject to change):
55
+
56
+ ```json
57
+ {
58
+ "status": "failed",
59
+ "error": {
60
+ "code": "STEP_ERROR",
61
+ "message": "Email delivery failed: invalid address",
62
+ "stepId": "send-welcome",
63
+ "stepName": "Send Welcome Email"
64
+ },
65
+ "completedSteps": ["validate"],
66
+ "partialOutput": { "clientName": "Jane", "isValid": true }
67
+ }
68
+ ```
69
+
70
+ ---
71
+
72
+ ## Agent Failure Modes
73
+
74
+ **Status: Implemented** (SDK 0.4.2). Agent execution runs in ephemeral worker threads with full tool calling support via `PostMessageLLMAdapter`.
75
+
76
+ Agents have multiple failure paths due to the LLM loop:
77
+
78
+ - **Max iterations reached:** The agent returns the best output produced so far, plus a warning flag (`maxIterationsReached: true`). This is a graceful termination, not an error.
79
+ - **Tool crash:** Tool errors are caught by the SDK runtime, formatted as a tool result, and sent back to the LLM. The LLM decides whether to retry, try a different approach, or give up.
80
+ - **Model refusal:** If the model refuses the prompt, the SDK retries once with an adjusted system prompt. If the retry also refuses, the agent fails with `code: 'MODEL_REFUSAL'`.
81
+ - **Model API error:** Network errors, rate limits, or server errors from the model provider. The SDK retries with exponential backoff (3 attempts, 1s/2s/4s), then fails with `code: 'MODEL_ERROR'`.
82
+ - **Agent error response includes:** `iterationCount` (LLM iterations completed), `toolCallHistory` (array of tool calls made), `lastModelResponse` (final response from the LLM before failure).
83
+
84
+ ---
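Annotation (not part of the package contents): a minimal sketch of the model API backoff described above. It reads "3 attempts, 1s/2s/4s" as three retries after the initial call, which is an assumption. `callModelWithBackoff`, `backoffDelayMs`, and `ModelError` are hypothetical names, not SDK exports.

```typescript
// Hypothetical sketch; names are illustrative and not part of @elevasis/sdk.
const MAX_RETRIES = 3; // assumed: three retries after the initial call

/** Delay before retry number `retryIndex` (0-based): 1s, 2s, 4s. */
function backoffDelayMs(retryIndex: number): number {
  return 1000 * 2 ** retryIndex;
}

class ModelError extends Error {
  readonly code = 'MODEL_ERROR';
}

async function callModelWithBackoff<T>(
  call: () => Promise<T>,
  // Injectable so tests can skip real waiting.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final one.
      if (attempt < MAX_RETRIES) await sleep(backoffDelayMs(attempt));
    }
  }
  throw new ModelError(
    `Model call failed after ${MAX_RETRIES} retries: ${String(lastError)}`,
  );
}
```

Passing a no-op `sleep` makes the retry loop testable without waiting seven seconds.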
85
+
86
+ ## Circuit Breaker
87
+
88
+ **Status: Planned.**
89
+
90
+ The platform will implement a circuit breaker to prevent runaway failures:
91
+
92
+ - **Trip condition:** 5 consecutive failures on the same resource within a 10-minute window
93
+ - **Action:** Pause executions for 60 seconds. New execution requests for that resource return `503` with: "Resource temporarily unavailable (circuit breaker tripped)"
94
+ - **Auto-recovery:** After the 60-second pause, the next execution attempt is allowed through. If it succeeds, the circuit breaker resets. If it fails, the pause extends (120s, then 240s, capped at 5 minutes).
95
+ - **Alerting:** You are notified via webhook callback or email when the circuit breaker trips. Configurable per organization.
96
+
97
+ ---
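Annotation (not part of the package contents): the trip and recovery rules above can be sketched as a small state machine. The thresholds come from the added lines; the class shape, method names, and the choice to pass timestamps in explicitly are assumptions made for illustration.

```typescript
// Hypothetical sketch of the planned circuit breaker; not platform code.
const FAILURE_THRESHOLD = 5;            // consecutive failures that trip the breaker
const FAILURE_WINDOW_MS = 10 * 60_000;  // failures must fall within 10 minutes
const BASE_PAUSE_MS = 60_000;           // first pause: 60 seconds
const MAX_PAUSE_MS = 5 * 60_000;        // pause is capped at 5 minutes

/** Pause after the breaker trips for the `tripCount`-th time in a row (1-based). */
function pauseMs(tripCount: number): number {
  return Math.min(BASE_PAUSE_MS * 2 ** (tripCount - 1), MAX_PAUSE_MS);
}

class CircuitBreaker {
  private failureTimes: number[] = [];
  private tripCount = 0;
  private pausedUntil = 0;

  /** False while paused; the platform would answer 503 in that case. */
  allows(now: number): boolean {
    return now >= this.pausedUntil;
  }

  /** Any success resets the breaker completely. */
  recordSuccess(): void {
    this.failureTimes = [];
    this.tripCount = 0;
    this.pausedUntil = 0;
  }

  recordFailure(now: number): void {
    if (this.tripCount > 0) {
      // The probe execution after a pause failed: extend the pause.
      this.tripCount += 1;
      this.pausedUntil = now + pauseMs(this.tripCount);
      return;
    }
    // Keep only failures inside the window, then add this one.
    this.failureTimes = this.failureTimes.filter((t) => now - t < FAILURE_WINDOW_MS);
    this.failureTimes.push(now);
    if (this.failureTimes.length >= FAILURE_THRESHOLD) {
      this.tripCount = 1;
      this.pausedUntil = now + pauseMs(1);
    }
  }
}
```

Under these assumptions the pause sequence is 60s, 120s, 240s, then 300s for every later trip until an execution succeeds.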
98
+
99
+ ## Metrics
100
+
101
+ **Status: Planned.**
102
+
103
+ ### Auto-Collected
104
+
105
+ The SDK runtime and platform will automatically collect these metrics for every execution:
106
+
107
+ - `execution_duration_ms` -- Total wall-clock time from request received to result sent
108
+ - `step_duration_ms` -- Per-step timing for workflows (array of `{ stepId, durationMs }`)
109
+ - `iteration_count` -- Number of LLM loop iterations for agents
110
+ - `ai_token_usage` -- Token counts per model call: `{ prompt_tokens, completion_tokens, total_tokens }`
111
+ - `ai_cost_usd` -- Calculated from model pricing multiplied by token usage
112
+ - `tool_call_count` -- Total number of tool invocations during the execution
113
+ - `tool_call_duration_ms` -- Per-tool timing (array of `{ toolName, durationMs }`)
114
+ - `error_count` -- Number of errors encountered (including recovered errors)
115
+
116
+ ### Cost Attribution
117
+
118
+ Metrics are aggregated at multiple levels:
119
+
120
+ - **Per-execution:** Total AI spend and total duration
121
+ - **Per-resource:** Aggregated over configurable time periods (daily, weekly, monthly)
122
+ - **Per-organization:** Total platform cost (execution time + AI spend + managed hosting compute)
123
+ - **Visibility:** Platform dashboard and the CLI via `elevasis-sdk executions <resourceId>`
124
+
125
+ ### Developer-Defined Metrics
126
+
127
+ A future SDK version will support custom metrics emitted from your handlers:
128
+
129
+ - `sdk.metrics.counter('custom_name', value)` -- Increment a counter
130
+ - `sdk.metrics.gauge('queue_depth', value)` -- Set a point-in-time gauge value
131
+ - Custom metrics are stored alongside auto-collected metrics and queryable through the same APIs
132
+
133
+ ---
134
+
135
+ ## Alerting
136
+
137
+ **Status: Planned.**
138
+
139
+ Developer-configurable alerts for production monitoring:
140
+
141
+ - **Error rate threshold:** Notify when more than X% of executions fail within Y minutes
142
+ - **Latency percentile:** Notify when p95 execution duration exceeds a threshold
143
+ - **Cost budget:** Notify when daily or weekly AI spend exceeds a configured limit
144
+ - **Channel:** Webhook callback to a developer-provided URL (integrates with Slack, PagerDuty, and similar services via webhook)
145
+
146
+ ---
147
+
148
+ ## Resource Lifecycle Extensions
149
+
150
+ **Status: Planned.**
151
+
152
+ ### Deprecation Status
153
+
154
+ Beyond the current `dev` and `prod` statuses, two additional statuses are planned:
155
+
156
+ - **`deprecated`** -- Marked via the platform UI. Existing executions continue working. New executions show a warning: "This resource is deprecated." The resource still appears in the platform and can still be triggered.
157
+ - **`offline`** -- Set automatically when a deployment is unregistered (for example, after a failed deploy or explicit deletion). Clears automatically on the next successful deploy. No executions are accepted while a resource is offline.
158
+
159
+ Deprecation requires no automatic removal -- the developer must explicitly delete the resource to remove it.
160
+
161
+ ---
162
+
163
+ **Last Updated:** 2026-03-08
@@ -35,14 +35,9 @@ Timeouts are enforced by the platform. You do not handle them explicitly in your
35
35
  - **Per-resource override:** Agents can configure `constraints.timeout` in the agent definition. Workflows use the platform default.
36
36
  - **Enforcement:** When a timeout fires, the worker is terminated immediately -- even if it is stuck in a synchronous loop. No special handler signature is required on your part.
37
37
 
38
- ### Cancellation Protocol
38
+ ### Cancellation
39
39
 
40
- Cancellation is initiated by the platform and requires no special code in your handler:
41
-
42
- 1. A cancellation request is received -- via CLI, API, or platform timeout.
43
- 2. The platform aborts the execution controller registered for that `executionId`.
44
- 3. The worker is terminated immediately (kills the worker even if stuck in a synchronous loop).
45
- 4. The platform records the cancellation in the execution record.
40
+ Cancellation is initiated by the platform and requires no special code in your handler. The worker is terminated immediately and the execution is recorded as cancelled.
46
41
 
47
42
  **What triggers cancellation:**
48
43
 
@@ -121,15 +116,6 @@ v1 versioning is intentionally simple:
121
116
 
122
117
  ---
123
118
 
124
- ## Platform Storage
125
-
126
- Your deployed bundle is stored in Supabase Storage. On API restart, the platform re-registers all active deployments from the database and re-downloads bundles if needed. This means:
127
-
128
- - **No data loss on restart:** Your resources are automatically re-registered after platform restarts.
129
- - **No action required from you:** The platform handles recovery transparently.
130
-
131
- ---
132
-
133
119
  ## Resource Limits & Quotas
134
120
 
135
121
  ### Per-Worker Limits
@@ -168,15 +154,6 @@ Hard limits to prevent abuse and ensure platform stability:
168
154
 
169
155
  ---
170
156
 
171
- ## Platform Maintenance
172
-
173
- - **Rolling updates:** Platform upgrades re-register all active deployments on startup. No executions are lost.
174
- - **Node.js updates:** When the server's Node.js version is updated, worker threads pick up the new version on the next execution with no action required from you.
175
- - **In-flight executions during restart:** Already-running workers complete normally. New executions use the newly reloaded registry after restart.
176
- - **Advance notice:** 24-hour advance notice is provided for breaking maintenance windows. These are rare and reserved for major infrastructure changes.
177
-
178
- ---
179
-
180
157
  ## v1 Limitations
181
158
 
182
159
  | Limitation | Reason | Future path |