elasticdash-test 0.1.26 → 0.1.27

Files changed (85)
  1. package/README.md +100 -0
  2. package/dist/cli.js +175 -0
  3. package/dist/cli.js.map +1 -1
  4. package/dist/index.cjs +62 -1
  5. package/dist/index.d.ts +2 -0
  6. package/dist/index.d.ts.map +1 -1
  7. package/dist/index.js +2 -0
  8. package/dist/index.js.map +1 -1
  9. package/dist/tool-registry.d.ts +31 -0
  10. package/dist/tool-registry.d.ts.map +1 -0
  11. package/dist/tool-registry.js +73 -0
  12. package/dist/tool-registry.js.map +1 -0
  13. package/dist/tool-runner-worker.js +19 -2
  14. package/dist/tool-runner-worker.js.map +1 -1
  15. package/dist/utils/debug.d.ts +1 -1
  16. package/dist/utils/debug.d.ts.map +1 -1
  17. package/dist/utils/debug.js +2 -2
  18. package/dist/utils/debug.js.map +1 -1
  19. package/docs/observability_contract.md +192 -0
  20. package/package.json +2 -2
  21. package/src/cli.ts +184 -0
  22. package/src/index.ts +4 -0
  23. package/src/tool-registry.ts +94 -0
  24. package/src/tool-runner-worker.ts +17 -2
  25. package/src/utils/debug.ts +2 -2
  26. package/dist/cloud-client.d.ts +0 -34
  27. package/dist/cloud-client.d.ts.map +0 -1
  28. package/dist/cloud-client.js +0 -103
  29. package/dist/cloud-client.js.map +0 -1
  30. package/dist/evaluators/determinism.d.ts +0 -3
  31. package/dist/evaluators/determinism.d.ts.map +0 -1
  32. package/dist/evaluators/determinism.js +0 -116
  33. package/dist/evaluators/determinism.js.map +0 -1
  34. package/dist/evaluators/index.d.ts +0 -4
  35. package/dist/evaluators/index.d.ts.map +0 -1
  36. package/dist/evaluators/index.js +0 -61
  37. package/dist/evaluators/index.js.map +0 -1
  38. package/dist/evaluators/latency-budget.d.ts +0 -3
  39. package/dist/evaluators/latency-budget.d.ts.map +0 -1
  40. package/dist/evaluators/latency-budget.js +0 -45
  41. package/dist/evaluators/latency-budget.js.map +0 -1
  42. package/dist/evaluators/llm-judge.d.ts +0 -3
  43. package/dist/evaluators/llm-judge.d.ts.map +0 -1
  44. package/dist/evaluators/llm-judge.js +0 -125
  45. package/dist/evaluators/llm-judge.js.map +0 -1
  46. package/dist/evaluators/output-contains.d.ts +0 -3
  47. package/dist/evaluators/output-contains.d.ts.map +0 -1
  48. package/dist/evaluators/output-contains.js +0 -52
  49. package/dist/evaluators/output-contains.js.map +0 -1
  50. package/dist/evaluators/output-schema.d.ts +0 -3
  51. package/dist/evaluators/output-schema.d.ts.map +0 -1
  52. package/dist/evaluators/output-schema.js +0 -58
  53. package/dist/evaluators/output-schema.js.map +0 -1
  54. package/dist/evaluators/token-budget.d.ts +0 -3
  55. package/dist/evaluators/token-budget.d.ts.map +0 -1
  56. package/dist/evaluators/token-budget.js +0 -45
  57. package/dist/evaluators/token-budget.js.map +0 -1
  58. package/dist/evaluators/types.d.ts +0 -104
  59. package/dist/evaluators/types.d.ts.map +0 -1
  60. package/dist/evaluators/types.js +0 -6
  61. package/dist/evaluators/types.js.map +0 -1
  62. package/dist/test-group/cli.d.ts +0 -8
  63. package/dist/test-group/cli.d.ts.map +0 -1
  64. package/dist/test-group/cli.js +0 -162
  65. package/dist/test-group/cli.js.map +0 -1
  66. package/dist/test-group/git-context.d.ts +0 -3
  67. package/dist/test-group/git-context.d.ts.map +0 -1
  68. package/dist/test-group/git-context.js +0 -59
  69. package/dist/test-group/git-context.js.map +0 -1
  70. package/dist/test-group/reporter.d.ts +0 -4
  71. package/dist/test-group/reporter.d.ts.map +0 -1
  72. package/dist/test-group/reporter.js +0 -54
  73. package/dist/test-group/reporter.js.map +0 -1
  74. package/dist/test-group/runner.d.ts +0 -18
  75. package/dist/test-group/runner.d.ts.map +0 -1
  76. package/dist/test-group/runner.js +0 -234
  77. package/dist/test-group/runner.js.map +0 -1
  78. package/dist/tracing-universal.d.ts +0 -13
  79. package/dist/tracing-universal.d.ts.map +0 -1
  80. package/dist/tracing-universal.js +0 -33
  81. package/dist/tracing-universal.js.map +0 -1
  82. package/docs/backend_rerun_alignment.md +0 -291
  83. package/docs/backend_traceid_update.md +0 -141
  84. package/docs/observability_backend_contract.md +0 -577
  85. package/docs/observability_rerun_backend_plan.md +0 -596
package/docs/observability_backend_contract.md +0 -577
@@ -1,577 +0,0 @@
1
- # Observability Backend & Dashboard Contract
2
-
3
- This document specifies exactly what the backend and dashboard frontend need to implement to support the SDK's new observability mode. The SDK implementation is complete — it sends events to the endpoints described below.
4
-
5
- ---
6
-
7
- ## Part 1: Backend API Contract
8
-
9
- ### Authentication
10
-
11
- All observability endpoints require a project-scoped API key passed as:
12
-
13
- ```
14
- Authorization: Bearer <ELASTICDASH_API_KEY>
15
- ```
16
-
17
- Return `401 Unauthorized` if the key is missing or invalid. The key should resolve to a `projectId` used for multi-tenant scoping.
18
-
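The authentication rule above can be sketched as a small header check. This is a minimal illustration, not the backend's actual code: `extractBearerToken`, `authenticate`, and the in-memory key map are hypothetical names, and a real backend would look keys up in its own store.

```typescript
// Hypothetical helper: pulls the key out of "Authorization: Bearer <key>".
// Returns null when the header is missing or not a Bearer credential.
function extractBearerToken(header: string | undefined): string | null {
  if (header === undefined) return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match?.[1] ?? null;
}

type AuthResult =
  | { ok: true; projectId: string }
  | { ok: false; status: 401; error: "unauthorized" };

// Assumption: API keys map directly to project IDs (here, an in-memory Map).
function authenticate(
  header: string | undefined,
  keyToProject: Map<string, string>,
): AuthResult {
  const token = extractBearerToken(header);
  const projectId = token === null ? undefined : keyToProject.get(token);
  if (projectId === undefined) {
    // Missing or invalid key: the contract calls for 401 Unauthorized.
    return { ok: false, status: 401, error: "unauthorized" };
  }
  return { ok: true, projectId };
}
```

The resolved `projectId` is what the rest of the contract uses for multi-tenant scoping.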
19
- ---
20
-
21
- ### Endpoint 1: `POST /api/observability/events`
22
-
23
- **Purpose:** Batch ingest of trace events from the SDK's `TelemetryBatcher`.
24
-
25
- **Request:**
26
-
27
- ```json
28
- {
29
- "sessionId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
30
- "serviceId": "my-ai-app",
31
- "events": [
32
- {
33
- "id": 1,
34
- "type": "ai",
35
- "name": "gpt-4o",
36
- "input": { "messages": [{ "role": "user", "content": "Hello" }] },
37
- "output": { "choices": [{ "message": { "role": "assistant", "content": "Hi!" } }] },
38
- "timestamp": 1712851200000,
39
- "durationMs": 1230,
40
- "usage": { "inputTokens": 5, "outputTokens": 3, "totalTokens": 8 },
41
- "streamed": false,
42
- "schemaVersion": 1,
43
- "traceId": "req-abc-123"
44
- },
45
- {
46
- "id": 2,
47
- "type": "tool",
48
- "name": "searchDB",
49
- "input": { "query": "pikachu" },
50
- "output": { "results": ["..."] },
51
- "timestamp": 1712851201230,
52
- "durationMs": 45,
53
- "schemaVersion": 1,
54
- "traceId": "req-abc-123"
55
- },
56
- {
57
- "id": 3,
58
- "type": "side_effect",
59
- "name": "__heartbeat__",
60
- "input": { "sessionId": "f47ac10b-...", "serviceId": "my-ai-app" },
61
- "output": { "uptime": 3600.5 },
62
- "timestamp": 1712851230000,
63
- "durationMs": 0,
64
- "schemaVersion": 1
65
- }
66
- ]
67
- }
68
- ```
69
-
70
- **Response:**
71
-
72
- ```
73
- 202 Accepted
74
- { "ok": true, "ingested": 3 }
75
- ```
76
-
77
- **Error responses:**
78
-
79
- | Status | Body | Meaning |
80
- |--------|------|---------|
81
- | `400` | `{ "ok": false, "error": "..." }` | Malformed payload (missing sessionId, invalid events array) |
82
- | `401` | `{ "ok": false, "error": "unauthorized" }` | Missing or invalid API key |
83
- | `429` | `{ "ok": false, "error": "rate_limited" }` | Per-serviceId rate limit exceeded — SDK retries with backoff |
84
- | `500+` | `{ "ok": false, "error": "..." }` | Server error — SDK retries with backoff |
85
-
86
- **SDK retry behavior:** On `429` or `5xx`, the SDK retries up to 3 times with exponential backoff (1s, 2s, 4s). After 3 failures, events are dropped. The backend should return `202` quickly and process events asynchronously.
87
-
88
- **Rate limiting recommendation:** 1000 events/min per `serviceId`.
89
-
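The retry behavior described above could look roughly like the following on the client side. This is a sketch under stated assumptions, not the SDK's implementation: `sendWithRetry`, `isRetryable`, and `backoffDelaysMs` are hypothetical names, `send` and `sleep` are injected so the loop can run without a network, and the sketch reads "retries up to 3 times" as one initial attempt followed by at most three retries.

```typescript
// Retry only on 429 and 5xx, per the error table above.
function isRetryable(status: number): boolean {
  return status === 429 || status >= 500;
}

// Delays before each retry: 1000, 2000, 4000 ms for the defaults.
function backoffDelaysMs(maxRetries = 3, baseMs = 1000): number[] {
  return Array.from({ length: maxRetries }, (_, i) => baseMs * 2 ** i);
}

// `send` resolves to the HTTP status code of one POST attempt (hypothetical signature).
async function sendWithRetry(
  send: () => Promise<number>,
  sleep: (ms: number) => Promise<void>,
): Promise<"accepted" | "rejected" | "dropped"> {
  const delays = backoffDelaysMs();
  for (let attempt = 0; ; attempt++) {
    const status = await send();
    if (status >= 200 && status < 300) return "accepted";
    if (!isRetryable(status)) return "rejected"; // e.g. 400 or 401: retrying will not help
    const delay = delays[attempt];
    if (delay === undefined) return "dropped"; // retries exhausted: events are dropped
    await sleep(delay);
  }
}
```

Because the client gives up after a few seconds of backoff, the recommendation to return `202` quickly and process asynchronously directly reduces the chance of dropped batches.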
90
- ---
91
-
92
- ### Endpoint 2: `POST /api/observability/sessions`
93
-
94
- **Purpose:** Register a new observability session. Called by the SDK on `initObservability()` (not yet implemented in SDK — backend should auto-create sessions from the first event batch as a fallback).
95
-
96
- **Request:**
97
-
98
- ```json
99
- {
100
- "sessionId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
101
- "serviceId": "my-ai-app",
102
- "metadata": {
103
- "nodeVersion": "20.11.0",
104
- "platform": "linux"
105
- },
106
- "startedAt": 1712851200000
107
- }
108
- ```
109
-
110
- **Response:**
111
-
112
- ```
113
- 201 Created
114
- { "ok": true, "sessionId": "f47ac10b-..." }
115
- ```
116
-
117
- ---
118
-
119
- ### Endpoint 3: `GET /api/observability/sessions`
120
-
121
- **Purpose:** List active and recent sessions for a service.
122
-
123
- **Query params:**
124
-
125
- | Param | Required | Description |
126
- |-------|----------|-------------|
127
- | `serviceId` | No | Filter by service |
128
- | `status` | No | `active` or `ended` (default: all) |
129
- | `limit` | No | Max results (default 50) |
130
- | `offset` | No | Pagination offset |
131
-
132
- **Response:**
133
-
134
- ```json
135
- {
136
- "sessions": [
137
- {
138
- "sessionId": "f47ac10b-...",
139
- "serviceId": "my-ai-app",
140
- "startedAt": 1712851200000,
141
- "lastHeartbeat": 1712854800000,
142
- "endedAt": null,
143
- "eventCount": 1247,
144
- "metadata": {}
145
- }
146
- ],
147
- "total": 1
148
- }
149
- ```
150
-
151
- **Session status inference:**
152
- - `active`: `endedAt` is null AND `lastHeartbeat` was within the last 90 seconds (3x heartbeat interval)
153
- - `ended`: `endedAt` is set OR `lastHeartbeat` is older than 90 seconds
154
-
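The status inference rule above reduces to a single comparison. A minimal sketch, assuming the 30-second heartbeat interval listed in the event reference and treating a heartbeat exactly 90 seconds old as still active; `SessionRow` and `inferSessionStatus` are hypothetical names.

```typescript
interface SessionRow {
  lastHeartbeat: number; // Unix ms
  endedAt: number | null; // Unix ms, or null while the session is open
}

const HEARTBEAT_INTERVAL_MS = 30000; // SDK default, configurable
const STALE_AFTER_MS = 3 * HEARTBEAT_INTERVAL_MS; // 90 seconds

function inferSessionStatus(session: SessionRow, nowMs: number): "active" | "ended" {
  // An explicit end always wins over heartbeat freshness.
  if (session.endedAt !== null) return "ended";
  return nowMs - session.lastHeartbeat <= STALE_AFTER_MS ? "active" : "ended";
}
```

If the SDK heartbeat interval is reconfigured, the 90-second threshold would presumably need to change with it.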
155
- ---
156
-
157
- ### Endpoint 4: `GET /api/observability/events`
158
-
159
- **Purpose:** Query stored events with filters.
160
-
161
- **Query params:**
162
-
163
- | Param | Required | Description |
164
- |-------|----------|-------------|
165
- | `sessionId` | Yes* | Filter by session |
166
- | `serviceId` | Yes* | Filter by service (*at least one required) |
167
- | `type` | No | `ai`, `tool`, `http`, `db`, `side_effect` |
168
- | `name` | No | Event name (tool/model name) |
169
- | `traceId` | No | Request-level trace ID |
170
- | `from` | No | Start timestamp (ms) |
171
- | `to` | No | End timestamp (ms) |
172
- | `limit` | No | Max results (default 100, max 1000) |
173
- | `offset` | No | Pagination offset |
174
- | `sort` | No | `asc` or `desc` by timestamp (default `desc`) |
175
-
176
- **Response:**
177
-
178
- ```json
179
- {
180
- "events": [ /* WorkflowEvent[] */ ],
181
- "total": 1247,
182
- "hasMore": true
183
- }
184
- ```
185
-
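A client of this endpoint might assemble the query string as below. This is an illustrative sketch: `EventsQuery` and `buildEventsQuery` are hypothetical names, and clamping `limit` on the client (rather than relying on the server) is an assumption.

```typescript
interface EventsQuery {
  sessionId?: string;
  serviceId?: string;
  type?: "ai" | "tool" | "http" | "db" | "side_effect";
  name?: string;
  traceId?: string;
  from?: number; // Unix ms
  to?: number; // Unix ms
  limit?: number;
  offset?: number;
  sort?: "asc" | "desc";
}

function buildEventsQuery(q: EventsQuery): string {
  // The contract requires at least one of sessionId / serviceId.
  if (q.sessionId === undefined && q.serviceId === undefined) {
    throw new Error("sessionId or serviceId is required");
  }
  // Default 100, max 1000 per the table; the lower bound of 1 is an assumption.
  const limit = Math.min(Math.max(q.limit ?? 100, 1), 1000);
  const params: Array<[string, string | number | undefined]> = [
    ["sessionId", q.sessionId],
    ["serviceId", q.serviceId],
    ["type", q.type],
    ["name", q.name],
    ["traceId", q.traceId],
    ["from", q.from],
    ["to", q.to],
    ["limit", limit],
    ["offset", q.offset],
    ["sort", q.sort],
  ];
  const query = params
    .filter(([, value]) => value !== undefined)
    .map(([key, value]) => `${key}=${encodeURIComponent(String(value))}`)
    .join("&");
  return `/api/observability/events?${query}`;
}
```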
186
- ---
187
-
188
- ### Endpoint 5: `GET /api/observability/stats`
189
-
190
- **Purpose:** Aggregated metrics for a service over a time window.
191
-
192
- **Query params:**
193
-
194
- | Param | Required | Description |
195
- |-------|----------|-------------|
196
- | `serviceId` | Yes | Service to aggregate |
197
- | `window` | No | `1h`, `6h`, `24h`, `7d` (default `1h`) |
198
- | `from` | No | Custom window start (overrides `window`) |
199
- | `to` | No | Custom window end |
200
-
201
- **Response:**
202
-
203
- ```json
204
- {
205
- "serviceId": "my-ai-app",
206
- "window": { "from": 1712847600000, "to": 1712851200000 },
207
- "summary": {
208
- "totalEvents": 1247,
209
- "aiCalls": 340,
210
- "toolCalls": 890,
211
- "errors": 17,
212
- "activeSessions": 2
213
- },
214
- "ai": {
215
- "byModel": {
216
- "gpt-4o": {
217
- "count": 200,
218
- "avgDurationMs": 1230,
219
- "p95DurationMs": 3400,
220
- "p99DurationMs": 5200,
221
- "totalInputTokens": 150000,
222
- "totalOutputTokens": 45000,
223
- "errorCount": 5
224
- },
225
- "claude-sonnet-4-6": {
226
- "count": 140,
227
- "avgDurationMs": 980,
228
- "p95DurationMs": 2100,
229
- "p99DurationMs": 3800,
230
- "totalInputTokens": 120000,
231
- "totalOutputTokens": 38000,
232
- "errorCount": 2
233
- }
234
- }
235
- },
236
- "tools": {
237
- "byName": {
238
- "searchDB": { "count": 450, "avgDurationMs": 45, "p95DurationMs": 120, "errorCount": 3 },
239
- "sendEmail": { "count": 440, "avgDurationMs": 200, "p95DurationMs": 500, "errorCount": 7 }
240
- }
241
- }
242
- }
243
- ```
244
-
245
- ---
246
-
247
- ### Storage Schema
248
-
249
- #### Table: `observability_events`
250
-
251
- | Column | Type | Index | Description |
252
- |--------|------|-------|-------------|
253
- | `id` | UUID / BIGSERIAL | PK | Internal row ID |
254
- | `project_id` | UUID | YES | Resolved from API key |
255
- | `session_id` | UUID | YES | From payload |
256
- | `service_id` | VARCHAR(255) | YES | From payload |
257
- | `event_id` | INT | — | The SDK-assigned sequential ID |
258
- | `event_type` | VARCHAR(20) | YES | `ai`, `tool`, `http`, `db`, `side_effect` |
259
- | `event_name` | VARCHAR(500) | YES | Tool/model name |
260
- | `trace_id` | VARCHAR(255) | YES | Request-level grouping |
261
- | `input` | JSONB | — | Event input payload |
262
- | `output` | JSONB | — | Event output payload |
263
- | `timestamp` | BIGINT | YES | Unix ms from SDK |
264
- | `duration_ms` | INT | — | Execution time |
265
- | `usage_input_tokens` | INT | — | LLM input tokens |
266
- | `usage_output_tokens` | INT | — | LLM output tokens |
267
- | `usage_total_tokens` | INT | — | LLM total tokens |
268
- | `streamed` | BOOLEAN | — | Was response streamed |
269
- | `stream_raw` | TEXT | — | Buffered stream text |
270
- | `schema_version` | SMALLINT | — | Default 1 |
271
- | `created_at` | TIMESTAMPTZ | — | Row insert time |
272
-
273
- **Composite indexes:**
274
- - `(project_id, session_id, timestamp)`
275
- - `(project_id, service_id, timestamp)`
276
- - `(project_id, service_id, event_type, timestamp)`
277
-
278
- #### Table: `observability_sessions`
279
-
280
- | Column | Type | Index | Description |
281
- |--------|------|-------|-------------|
282
- | `session_id` | UUID | PK | From SDK |
283
- | `project_id` | UUID | YES | Resolved from API key |
284
- | `service_id` | VARCHAR(255) | YES | From SDK |
285
- | `started_at` | BIGINT | YES | Unix ms |
286
- | `last_heartbeat` | BIGINT | — | Updated on `__heartbeat__` events |
287
- | `ended_at` | BIGINT | — | Set on `__session_end__` event |
288
- | `event_count` | INT | — | Running counter |
289
- | `metadata` | JSONB | — | Session metadata |
290
- | `created_at` | TIMESTAMPTZ | — | Row insert time |
291
-
292
- ---
293
-
294
- ### Event Processing Logic
295
-
296
- When receiving a batch at `POST /api/observability/events`:
297
-
298
- 1. Validate payload schema
299
- 2. Resolve `projectId` from API key
300
- 3. Upsert session in `observability_sessions` (create if first batch for this sessionId)
301
- 4. For each event:
302
- - If `name === '__heartbeat__'`: update `last_heartbeat` on the session row, do not store as event
303
- - If `name === '__session_end__'`: set `ended_at` on the session row, do not store as event
304
- - Otherwise: insert into `observability_events`
305
- 5. Increment `event_count` on the session row
306
- 6. Return `202 Accepted`
307
-
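Steps 3 to 6 above can be sketched against an in-memory store. This is only an illustration of the control flow: `InMemoryStore` and `processBatch` are hypothetical names, payload validation and API-key resolution (steps 1 and 2) are omitted, and several details are assumptions: `event_count` is incremented by the number of stored events, `last_heartbeat` and `ended_at` take the SDK timestamp, and `ingested` reports the batch size as in the example response.

```typescript
interface IncomingEvent {
  id: number;
  type: string;
  name: string;
  timestamp: number; // Unix ms from the SDK
}

interface Session {
  sessionId: string;
  serviceId: string;
  lastHeartbeat: number | null;
  endedAt: number | null;
  eventCount: number;
}

class InMemoryStore {
  sessions = new Map<string, Session>();
  events: Array<IncomingEvent & { sessionId: string }> = [];

  // Step 3: create the session row on the first batch for this sessionId.
  upsertSession(sessionId: string, serviceId: string): Session {
    const existing = this.sessions.get(sessionId);
    if (existing !== undefined) return existing;
    const created: Session = { sessionId, serviceId, lastHeartbeat: null, endedAt: null, eventCount: 0 };
    this.sessions.set(sessionId, created);
    return created;
  }
}

function processBatch(
  store: InMemoryStore,
  sessionId: string,
  serviceId: string,
  events: IncomingEvent[],
): { status: 202; ingested: number } {
  const session = store.upsertSession(sessionId, serviceId);
  let stored = 0;
  for (const event of events) {
    if (event.name === "__heartbeat__") {
      session.lastHeartbeat = event.timestamp; // liveness only, not stored
    } else if (event.name === "__session_end__") {
      session.endedAt = event.timestamp; // marks the session ended, not stored
    } else {
      store.events.push({ ...event, sessionId });
      stored++;
    }
  }
  session.eventCount += stored; // step 5
  return { status: 202, ingested: events.length }; // step 6
}
```

A production implementation would return `202` before doing this work and would run the inserts in a queue or background job.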
308
- ---
309
-
310
- ## Part 2: Dashboard Frontend Contract
311
-
312
- ### Navigation
313
-
314
- Add a top-level **Observability** nav item alongside the existing test/workflow views.
315
-
316
- ### Page 1: Services Overview (`/observability`)
317
-
318
- **Data source:** `GET /api/observability/sessions?status=active`
319
-
320
- **Display:**
321
- - Table of services with columns: Service ID, Active Sessions, Last Heartbeat (relative time), Total Events (24h)
322
- - Each row links to the service detail page
323
- - Auto-refresh every 10 seconds
324
-
325
- ### Page 2: Service Detail (`/observability/:serviceId`)
326
-
327
- **Data sources:**
328
- - `GET /api/observability/sessions?serviceId=X`
329
- - `GET /api/observability/stats?serviceId=X&window=1h`
330
-
331
- **Display:**
332
- - **Header:** Service name, active session count, status badge (active/idle)
333
- - **Metrics cards:** Total events, AI calls, tool calls, error rate (from stats endpoint)
334
- - **Charts:**
335
- - AI call volume over time (line chart, grouped by model)
336
- - Average latency over time per model (line chart)
337
- - Token usage breakdown (stacked bar: input vs output, by model)
338
- - Error rate trend (line chart)
339
- - **Sessions table:** List of recent sessions with sessionId, startedAt, duration, eventCount, status
340
- - Time window selector: 1h / 6h / 24h / 7d
341
-
342
- ### Page 3: Session Detail (`/observability/:serviceId/:sessionId`)
343
-
344
- **Data source:** `GET /api/observability/events?sessionId=X&limit=100`
345
-
346
- **Display:**
347
- - **Timeline/waterfall:** All events displayed chronologically as a vertical timeline
348
- - Each event shows: type icon (AI/tool/http/db), name, duration bar, timestamp
349
- - Color-coded by type
350
- - Errors highlighted in red
351
- - **Event detail panel** (click to expand):
352
- - Input viewer (syntax-highlighted JSON, collapsible)
353
- - Output viewer (syntax-highlighted JSON, collapsible)
354
- - Duration, token usage (for AI events)
355
- - Stream raw content viewer (for streamed responses)
356
- - **Filters:** Type dropdown, name search, time range
357
- - **Grouping:** Option to group by `traceId` (shows request-level waterfall)
358
- - **Live mode:** SSE/WebSocket connection to stream new events in real-time for active sessions
359
-
360
- ### Page 4: Live Trace View (`/observability/:serviceId/:sessionId/live`)
361
-
362
- **Data source:** SSE endpoint `GET /api/observability/events/stream?sessionId=X`
363
-
364
- **Display:**
365
- - Real-time event stream (newest at top or bottom, configurable)
366
- - Auto-scroll with pause button
367
- - Same event cards as session detail but streaming in
368
- - Connection status indicator (connected/reconnecting/disconnected)
369
-
370
- ---
371
-
372
- ## Part 3: SDK → Backend Event Reference
373
-
374
- ### Event Types the SDK Sends
375
-
376
- | `type` | `name` | When | Key fields |
377
- |--------|--------|------|------------|
378
- | `ai` | Model name (e.g. `gpt-4o`) | Every `wrapAI` call | `input`, `output`, `usage`, `durationMs`, `streamed` |
379
- | `tool` | Tool name (e.g. `searchDB`) | Every `wrapTool` call | `input`, `output`, `durationMs`, `streamed` |
380
- | `side_effect` | `__heartbeat__` | Every 30s (configurable) | `input.sessionId`, `output.uptime` |
381
- | `side_effect` | `__session_end__` | On `shutdownObservability()` | `input.sessionId`, `output.uptime` |
382
-
383
- ### Special Events (do not display in trace UI)
384
-
385
- - `__heartbeat__` — update session liveness, do not store as event
386
- - `__session_end__` — mark session as ended, do not store as event
387
-
388
- ### Streamed Events
389
-
390
- When `streamed === true`:
391
- - `output` is `null`
392
- - `streamRaw` contains the full buffered text of the stream
393
- - Display `streamRaw` as the output in the UI
394
-
395
- ### Error Events
396
-
397
- When a tool or AI call throws:
398
- - `output` is `{ "error": "Error message string" }`
399
- - `durationMs` reflects time until failure
400
- - Display with error styling in the UI
401
-
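The streamed and error conventions above suggest two small helpers on the dashboard side. A minimal sketch with hypothetical names (`displayOutput`, `isErrorEvent`); note that the error check would also match a successful call whose output happens to be an object with a string `error` field, which the contract does not rule out.

```typescript
interface DisplayableEvent {
  output: unknown;
  streamed?: boolean;
  streamRaw?: string | null;
}

// For streamed events `output` is null, so show the buffered stream text instead.
function displayOutput(event: DisplayableEvent): unknown {
  if (event.streamed === true) return event.streamRaw ?? "";
  return event.output;
}

// Failed calls carry output of the shape { "error": "<message>" }.
function isErrorEvent(event: DisplayableEvent): boolean {
  const out = event.output;
  return (
    typeof out === "object" &&
    out !== null &&
    typeof (out as { error?: unknown }).error === "string"
  );
}
```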
402
- ---
403
-
404
- ## Part 4: Portal (Remote Rerun Queue) Contract
405
-
406
- The SDK's `elasticdash portal` command starts an HTTP server that the backend can push rerun tasks to. The backend also needs endpoints to receive results.
407
-
408
- ### SDK Portal Endpoints (hosted on user's machine, default port 4574)
409
-
410
- These endpoints are served by the SDK. The backend calls them.
411
-
412
- #### `POST /api/portal/tasks` — Push a single rerun task
413
-
414
- **Request:**
415
- ```json
416
- {
417
- "taskId": "task-uuid-from-backend",
418
- "type": "tool",
419
- "name": "searchDB",
420
- "input": { "query": "pikachu" },
421
- "metadata": { "testGroupId": 42, "expectationIds": [1, 2, 3] }
422
- }
423
- ```
424
-
425
- For AI tasks:
426
- ```json
427
- {
428
- "taskId": "task-uuid-from-backend",
429
- "type": "ai",
430
- "name": "gpt-4o",
431
- "input": { "messages": [{ "role": "user", "content": "Hello" }] },
432
- "model": "gpt-4o",
433
- "provider": "openai",
434
- "modelParameters": { "temperature": 0.7, "max_tokens": 512 },
435
- "metadata": { "testGroupId": 42 }
436
- }
437
- ```
438
-
439
- **Response:** `202 Accepted`
440
- ```json
441
- { "ok": true, "taskId": "task-uuid-from-backend", "position": 3 }
442
- ```
443
-
444
- **Auth:** `Authorization: Bearer <api_key>` (validated if portal was started with `--api-key`)
445
-
446
- #### `POST /api/portal/tasks/batch` — Push multiple tasks
447
-
448
- **Request:**
449
- ```json
450
- { "tasks": [ /* PortalTask[] */ ] }
451
- ```
452
-
453
- **Response:** `202 Accepted`
454
- ```json
455
- { "ok": true, "tasks": [{ "taskId": "...", "position": 1 }, { "taskId": "...", "position": 2 }] }
456
- ```
457
-
458
- #### `GET /api/portal/status` — Health check
459
-
460
- **Response:**
461
- ```json
462
- {
463
- "ok": true,
464
- "queueLength": 5,
465
- "processing": "task-uuid-123",
466
- "completed": 12,
467
- "failed": 1
468
- }
469
- ```
470
-
471
- #### `DELETE /api/portal/tasks/:taskId` — Cancel a pending task
472
-
473
- **Response:** `200` if removed, `404` if not found or already processing.
474
-
475
- ---
476
-
477
- ### Backend Endpoints (needed for portal to work)
478
-
479
- These endpoints must be implemented on the backend. The SDK calls them.
480
-
481
- #### `POST /api/portal/register` — Portal registration
482
-
483
- Called by the SDK when `elasticdash portal` starts.
484
-
485
- **Request:**
486
- ```json
487
- {
488
- "portalUrl": "http://localhost:4574"
489
- }
490
- ```
491
-
492
- **Auth:** `Authorization: Bearer <api_key>`
493
-
494
- **Response:** `200 OK`
495
- ```json
496
- { "ok": true }
497
- ```
498
-
499
- The backend should store this portal URL and use it to push tasks. The registration should be scoped to the project resolved from the API key.
500
-
501
- #### `POST /api/portal/results/:taskId` — Receive task result
502
-
503
- Called by the SDK after each task completes (success or failure).
504
-
505
- **Request:**
506
- ```json
507
- {
508
- "taskId": "task-uuid-from-backend",
509
- "ok": true,
510
- "output": "The search returned 3 results for pikachu...",
511
- "durationMs": 245,
512
- "usage": {
513
- "inputTokens": 150,
514
- "outputTokens": 45,
515
- "totalTokens": 195
516
- },
517
- "metadata": { "testGroupId": 42, "expectationIds": [1, 2, 3] }
518
- }
519
- ```
520
-
521
- For failed tasks:
522
- ```json
523
- {
524
- "taskId": "task-uuid-from-backend",
525
- "ok": false,
526
- "output": null,
527
- "error": "Tool not found: \"searchDB\". Available tools: fetchData, sendEmail",
528
- "durationMs": 0,
529
- "metadata": { "testGroupId": 42 }
530
- }
531
- ```
532
-
533
- **Auth:** `Authorization: Bearer <api_key>`
534
-
535
- **Response:** `200 OK`
536
- ```json
537
- { "ok": true }
538
- ```
539
-
540
- **Backend responsibilities on receiving a result:**
541
- 1. Store the result (output, durationMs, usage, error)
542
- 2. Run evaluations (llm-judge, output-schema, token-budget, latency-budget, etc.)
543
- 3. Update the test group run status
544
- 4. If all tasks for a test group are complete, mark the run as finished
545
-
546
- ---
547
-
548
- ### Error Results Reference
549
-
550
- The SDK sends these error patterns — the backend should handle them gracefully:
551
-
552
- | Error pattern | Meaning |
553
- |--------------|---------|
554
- | `Tool not found: "<name>". Available tools: ...` | Tool doesn't exist in `ed_tools.ts` |
555
- | `Cannot find ed_tools.ts/js in workspace root.` | No tools module in the project |
556
- | `Unsupported AI provider: "<name>"` | Unknown provider string |
557
- | `Missing API key for provider "<name>". Expected environment variable: <VAR>` | LLM API key not configured |
558
- | `AI task input is empty; cannot execute.` | No prompt could be extracted from input |
559
- | `AI execution failed: <message>` | LLM API call failed (rate limit, network, invalid model) |
560
- | `Tool subprocess produced no output.` | Subprocess exited without result |
561
- | `Failed to spawn tool subprocess: <message>` | Could not start subprocess |
562
- | `Missing tool name on task.` | Task had no `name` field |
563
- | `Unknown task type: <type>` | Task type was neither `tool` nor `ai` |
564
-
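One way for the backend to handle these patterns gracefully is to map each message to a stable category by its fixed prefix. The sketch below is an assumption about how that could be done, not part of the contract: `classifyPortalError` and the category labels are hypothetical, and only the prefixes come from the table above.

```typescript
// [message prefix from the table above, hypothetical category label]
const PORTAL_ERROR_PREFIXES: ReadonlyArray<readonly [string, string]> = [
  ['Tool not found: "', "tool_not_found"],
  ["Cannot find ed_tools.ts/js in workspace root.", "tools_module_missing"],
  ['Unsupported AI provider: "', "unsupported_provider"],
  ['Missing API key for provider "', "missing_api_key"],
  ["AI task input is empty; cannot execute.", "empty_ai_input"],
  ["AI execution failed: ", "ai_execution_failed"],
  ["Tool subprocess produced no output.", "no_subprocess_output"],
  ["Failed to spawn tool subprocess: ", "spawn_failed"],
  ["Missing tool name on task.", "missing_tool_name"],
  ["Unknown task type: ", "unknown_task_type"],
];

// Returns the category for a known pattern, or "unknown" so that
// unrecognized messages are still stored rather than rejected.
function classifyPortalError(message: string): string {
  for (const [prefix, category] of PORTAL_ERROR_PREFIXES) {
    if (message.startsWith(prefix)) return category;
  }
  return "unknown";
}
```

Matching on prefixes keeps the classifier tolerant of the variable parts of each message, such as the tool name or the upstream error text.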
565
- ---
566
-
567
- ## Part 5: Existing Endpoint Compatibility
568
-
569
- The existing test-run endpoints remain unchanged:
570
-
571
- | Endpoint | Used by | Change |
572
- |----------|---------|--------|
573
- | `POST /api/trace-events` | HTTP workflow test mode | No change — still fire-and-forget per event |
574
- | `GET /api/run-configs/:runId` | HTTP workflow test mode | No change |
575
- | `POST /api/validate-workflow` | Dashboard test runs | No change |
576
-
577
- Observability mode uses a completely separate set of endpoints (`/api/observability/*`). There is no overlap or conflict with test-run endpoints.