elasticdash-test 0.1.26 → 0.1.27

Files changed (85)
  1. package/README.md +100 -0
  2. package/dist/cli.js +175 -0
  3. package/dist/cli.js.map +1 -1
  4. package/dist/index.cjs +62 -1
  5. package/dist/index.d.ts +2 -0
  6. package/dist/index.d.ts.map +1 -1
  7. package/dist/index.js +2 -0
  8. package/dist/index.js.map +1 -1
  9. package/dist/tool-registry.d.ts +31 -0
  10. package/dist/tool-registry.d.ts.map +1 -0
  11. package/dist/tool-registry.js +73 -0
  12. package/dist/tool-registry.js.map +1 -0
  13. package/dist/tool-runner-worker.js +19 -2
  14. package/dist/tool-runner-worker.js.map +1 -1
  15. package/dist/utils/debug.d.ts +1 -1
  16. package/dist/utils/debug.d.ts.map +1 -1
  17. package/dist/utils/debug.js +2 -2
  18. package/dist/utils/debug.js.map +1 -1
  19. package/docs/observability_contract.md +192 -0
  20. package/package.json +2 -2
  21. package/src/cli.ts +184 -0
  22. package/src/index.ts +4 -0
  23. package/src/tool-registry.ts +94 -0
  24. package/src/tool-runner-worker.ts +17 -2
  25. package/src/utils/debug.ts +2 -2
  26. package/dist/cloud-client.d.ts +0 -34
  27. package/dist/cloud-client.d.ts.map +0 -1
  28. package/dist/cloud-client.js +0 -103
  29. package/dist/cloud-client.js.map +0 -1
  30. package/dist/evaluators/determinism.d.ts +0 -3
  31. package/dist/evaluators/determinism.d.ts.map +0 -1
  32. package/dist/evaluators/determinism.js +0 -116
  33. package/dist/evaluators/determinism.js.map +0 -1
  34. package/dist/evaluators/index.d.ts +0 -4
  35. package/dist/evaluators/index.d.ts.map +0 -1
  36. package/dist/evaluators/index.js +0 -61
  37. package/dist/evaluators/index.js.map +0 -1
  38. package/dist/evaluators/latency-budget.d.ts +0 -3
  39. package/dist/evaluators/latency-budget.d.ts.map +0 -1
  40. package/dist/evaluators/latency-budget.js +0 -45
  41. package/dist/evaluators/latency-budget.js.map +0 -1
  42. package/dist/evaluators/llm-judge.d.ts +0 -3
  43. package/dist/evaluators/llm-judge.d.ts.map +0 -1
  44. package/dist/evaluators/llm-judge.js +0 -125
  45. package/dist/evaluators/llm-judge.js.map +0 -1
  46. package/dist/evaluators/output-contains.d.ts +0 -3
  47. package/dist/evaluators/output-contains.d.ts.map +0 -1
  48. package/dist/evaluators/output-contains.js +0 -52
  49. package/dist/evaluators/output-contains.js.map +0 -1
  50. package/dist/evaluators/output-schema.d.ts +0 -3
  51. package/dist/evaluators/output-schema.d.ts.map +0 -1
  52. package/dist/evaluators/output-schema.js +0 -58
  53. package/dist/evaluators/output-schema.js.map +0 -1
  54. package/dist/evaluators/token-budget.d.ts +0 -3
  55. package/dist/evaluators/token-budget.d.ts.map +0 -1
  56. package/dist/evaluators/token-budget.js +0 -45
  57. package/dist/evaluators/token-budget.js.map +0 -1
  58. package/dist/evaluators/types.d.ts +0 -104
  59. package/dist/evaluators/types.d.ts.map +0 -1
  60. package/dist/evaluators/types.js +0 -6
  61. package/dist/evaluators/types.js.map +0 -1
  62. package/dist/test-group/cli.d.ts +0 -8
  63. package/dist/test-group/cli.d.ts.map +0 -1
  64. package/dist/test-group/cli.js +0 -162
  65. package/dist/test-group/cli.js.map +0 -1
  66. package/dist/test-group/git-context.d.ts +0 -3
  67. package/dist/test-group/git-context.d.ts.map +0 -1
  68. package/dist/test-group/git-context.js +0 -59
  69. package/dist/test-group/git-context.js.map +0 -1
  70. package/dist/test-group/reporter.d.ts +0 -4
  71. package/dist/test-group/reporter.d.ts.map +0 -1
  72. package/dist/test-group/reporter.js +0 -54
  73. package/dist/test-group/reporter.js.map +0 -1
  74. package/dist/test-group/runner.d.ts +0 -18
  75. package/dist/test-group/runner.d.ts.map +0 -1
  76. package/dist/test-group/runner.js +0 -234
  77. package/dist/test-group/runner.js.map +0 -1
  78. package/dist/tracing-universal.d.ts +0 -13
  79. package/dist/tracing-universal.d.ts.map +0 -1
  80. package/dist/tracing-universal.js +0 -33
  81. package/dist/tracing-universal.js.map +0 -1
  82. package/docs/backend_rerun_alignment.md +0 -291
  83. package/docs/backend_traceid_update.md +0 -141
  84. package/docs/observability_backend_contract.md +0 -577
  85. package/docs/observability_rerun_backend_plan.md +0 -596
@@ -1,596 +0,0 @@
1
- # Backend Plan: Observability Rerun Pipeline
2
-
3
- This document describes everything the backend needs to implement so that the SDK's observability mode can receive rerun triggers, execute them, and return results for evaluation.
4
-
5
- The SDK side is complete. This plan covers only backend and dashboard work.
6
-
7
- ---
8
-
9
- ## How It Works (end-to-end)
10
-
11
- ```
12
- SDK (user's server) Backend
13
- ──────────────────── ────────
14
-
15
- 1. SDK sends event batch
16
- POST /api/observability/events ──────────> Ingests events into observability_events
17
- Checks sampling_configs for matches
18
- If match: creates trigger, includes in response
19
-
20
- 2. <── Response with trigger ────────────── { ok: true, ingested: 5, trigger: {
21
- triggerId: 42, runCount: 3,
22
- steps: [{ eventType, eventName,
23
- input, model, provider,
24
- originalEventDbId }]
25
- }}
26
-
27
- 3. SDK pre-validates each step:
28
- - Tool: checks ed_tools.ts has it
29
- - AI: checks API key env var exists
30
- Executes available steps × runCount
31
- Collects output, durationMs, token usage
32
-
33
- 4. SDK POSTs results
34
- POST /api/observability/triggers/42/results -> Receives results
35
- Runs evaluations per sampling config
36
- Stores results + evaluations as JSONB
37
- Alerts if thresholds breached
38
- ```
39
-
40
- ---
41
-
42
- ## Database Schema (2 tables)
43
-
44
- ### Table: `sampling_configs`
45
-
46
- Persistent user-defined rules. One config can produce many triggers over time.
47
-
48
- ```sql
49
- CREATE TABLE sampling_configs (
50
- id SERIAL PRIMARY KEY,
51
- project_id UUID NOT NULL REFERENCES projects(id),
52
- name VARCHAR(255) NOT NULL,
53
- enabled BOOLEAN NOT NULL DEFAULT true,
54
-
55
- -- Which traces to match
56
- trace_filter JSONB NOT NULL,
57
- -- Which steps within matched traces to rerun
58
- step_selector JSONB NOT NULL,
59
- -- How many times to re-execute each step
60
- run_count INT NOT NULL DEFAULT 1 CHECK (run_count BETWEEN 1 AND 50),
61
- -- Fraction of matching traces that actually trigger (0.0–1.0)
62
- sample_rate FLOAT NOT NULL DEFAULT 1.0 CHECK (sample_rate BETWEEN 0.0 AND 1.0),
63
- -- Evaluations to run on results
64
- evaluations JSONB NOT NULL DEFAULT '[]',
65
- -- Alert configuration
66
- alerting JSONB NOT NULL DEFAULT '{}',
67
-
68
- created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
69
- updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
70
- );
71
-
72
- CREATE INDEX idx_sampling_configs_project ON sampling_configs(project_id, enabled);
73
- ```
74
-
75
- **`trace_filter` format:**
76
-
77
- ```json
78
- {
79
- "serviceId": "my-ai-app",
80
- "eventTypes": ["tool", "ai"],
81
- "eventNames": ["searchDB", "gpt-4o"]
82
- }
83
- ```
84
-
85
- All fields are optional. Omitted fields match everything. Rules:
86
- - `serviceId` — exact match on session's service ID
87
- - `eventTypes` — match events of these types
88
- - `eventNames` — match events with these names (tool name or model name)
89
-
90
- **`step_selector` format:**
91
-
92
- ```json
93
- { "mode": "by_name", "names": ["searchDB", "gpt-4o"] }
94
- ```
95
- Or:
96
- ```json
97
- { "mode": "all", "types": ["tool", "ai"] }
98
- ```
99
-
100
- `by_name` selects specific steps. `all` selects all steps of the given types from the matched trace.
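As a rough sketch of the two selector modes, the function below filters a batch either by name or by type. The `StepSelector` and `CandidateEvent` types and the `selectSteps` name are hypothetical, introduced only for illustration.

```typescript
// Hypothetical types mirroring the two step_selector JSON shapes above.
type StepSelector =
  | { mode: "by_name"; names: string[] }
  | { mode: "all"; types: string[] };

interface CandidateEvent {
  id: number;
  type: string;
  name: string;
}

// Picks the events from a matched batch that should be rerun.
function selectSteps(selector: StepSelector, batch: CandidateEvent[]): CandidateEvent[] {
  if (selector.mode === "by_name") {
    // by_name: keep only events whose name is listed.
    const names = selector.names;
    return batch.filter((e) => names.indexOf(e.name) !== -1);
  }
  // all: keep every event whose type is listed.
  const types = selector.types;
  return batch.filter((e) => types.indexOf(e.type) !== -1);
}
```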
101
-
102
- **`evaluations` format:**
103
-
104
- ```json
105
- [
106
- { "type": "latency-budget", "maxDurationMs": 5000 },
107
- { "type": "token-budget", "maxTokens": 1000 },
108
- { "type": "output-contains", "containsText": "pikachu", "notContainsText": "error" },
109
- { "type": "output-schema", "jsonSchema": { "type": "object", "required": ["results"] } },
110
- { "type": "determinism", "similarityThreshold": 0.9 },
111
- {
112
- "type": "llm-judge",
113
- "judgePrompt": "Rate the quality and accuracy of this output on a scale of 1-10. Output format: SCORE: <number>",
114
- "threshold": 7.0,
115
- "judgeModel": "gpt-4o",
116
- "judgeProvider": "openai"
117
- }
118
- ]
119
- ```
120
-
121
- **`alerting` format:**
122
-
123
- ```json
124
- {
125
- "onFailure": true,
126
- "webhookUrl": "https://hooks.slack.com/...",
127
- "emailTo": ["dev@company.com"]
128
- }
129
- ```
130
-
131
- ### Table: `triggers`
132
-
133
- One-time execution instance of a sampling config. Stores the full lifecycle: what was sent, what came back, and evaluation outcomes — all as JSONB.
134
-
135
- ```sql
136
- CREATE TABLE triggers (
137
- id SERIAL PRIMARY KEY,
138
- project_id UUID NOT NULL REFERENCES projects(id),
139
- sampling_config_id INT NOT NULL REFERENCES sampling_configs(id),
140
- session_id UUID NOT NULL,
141
- status VARCHAR(20) NOT NULL DEFAULT 'pending',
142
- -- pending: created, not yet included in a batch response
143
- -- sent: included in batch response to SDK, waiting for results
144
- -- completed: results received and evaluations stored
145
- -- timed_out: SDK didn't respond within 10 minutes
146
- run_count INT NOT NULL,
147
-
148
- -- What was sent to the SDK
149
- steps_sent JSONB NOT NULL,
150
-
151
- -- What came back from the SDK (null until completed)
152
- results JSONB,
153
-
154
- -- Evaluation outcomes computed by the backend (null until completed)
155
- evaluations JSONB,
156
-
157
- -- Summary counts (denormalized for list queries)
158
- steps_count INT,
159
- steps_available INT,
160
- steps_unavailable INT,
161
- evaluations_passed INT,
162
- evaluations_failed INT,
163
-
164
- created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
165
- sent_at TIMESTAMPTZ,
166
- completed_at TIMESTAMPTZ
167
- );
168
-
169
- CREATE INDEX idx_triggers_project_status ON triggers(project_id, status);
170
- CREATE INDEX idx_triggers_session ON triggers(session_id);
171
- CREATE INDEX idx_triggers_config ON triggers(sampling_config_id);
172
- CREATE INDEX idx_triggers_sent_timeout ON triggers(status, sent_at)
173
- WHERE status = 'sent';
174
- ```
175
-
176
- **`steps_sent` format (what goes to SDK):**
177
-
178
- ```json
179
- [
180
- {
181
- "eventId": 1,
182
- "eventType": "tool",
183
- "eventName": "searchDB",
184
- "originalEventDbId": 102,
185
- "input": { "query": "pikachu" }
186
- },
187
- {
188
- "eventId": 2,
189
- "eventType": "ai",
190
- "eventName": "gpt-4o",
191
- "originalEventDbId": 103,
192
- "input": { "messages": [{"role": "user", "content": "..."}] },
193
- "model": "gpt-4o",
194
- "provider": "openai"
195
- }
196
- ]
197
- ```
198
-
199
- **`results` format (what comes back from SDK):**
200
-
201
- ```json
202
- [
203
- {
204
- "originalEventDbId": 102,
205
- "eventType": "tool",
206
- "eventName": "searchDB",
207
- "available": true,
208
- "runs": [
209
- { "runIndex": 0, "input": {...}, "output": {...}, "durationMs": 45 },
210
- { "runIndex": 1, "input": {...}, "output": {...}, "durationMs": 52 }
211
- ]
212
- },
213
- {
214
- "originalEventDbId": 103,
215
- "eventType": "ai",
216
- "eventName": "gpt-4o",
217
- "available": false,
218
- "unavailableReason": "Missing API key for provider \"openai\". Expected: OPENAI_API_KEY",
219
- "runs": []
220
- }
221
- ]
222
- ```
223
-
224
- **`evaluations` format (computed by backend):**
225
-
226
- ```json
227
- [
228
- {
229
- "originalEventDbId": 102,
230
- "eventName": "searchDB",
231
- "results": [
232
- { "type": "latency-budget", "passed": true, "detail": { "maxDurationMs": 5000, "actualMaxMs": 52 } },
233
- { "type": "determinism", "passed": true, "detail": { "threshold": 0.9, "similarity": 0.95 } }
234
- ]
235
- },
236
- {
237
- "originalEventDbId": 103,
238
- "eventName": "gpt-4o",
239
- "results": [
240
- { "type": "availability", "passed": false, "detail": { "reason": "Missing API key..." } }
241
- ]
242
- }
243
- ]
244
- ```
245
-
246
- ---
247
-
248
- ## API Endpoints
249
-
250
- ### 1. `POST /api/observability/events` — Modify Existing
251
-
252
- After ingesting the event batch, check sampling configs:
253
-
254
- ```
255
- function handleObservabilityEvents(req, res):
256
- 1. Validate & ingest events (existing logic)
257
- 2. responseBody = { ok: true, ingested: events.length }
258
-
259
- 3. Find matching sampling_configs:
260
- SELECT * FROM sampling_configs
261
- WHERE project_id = :projectId
262
- AND enabled = true
263
-
264
- 4. For each config, check if trace_filter matches any event in this batch:
265
- - serviceId matches session's serviceId?
266
- - eventTypes includes any event.type in batch?
267
- - eventNames includes any event.name in batch?
268
-
269
- 5. If match AND Math.random() < config.sample_rate:
270
- a. Check no pending/sent trigger exists for this config + session:
271
- SELECT 1 FROM triggers
272
- WHERE sampling_config_id = :configId
273
- AND session_id = :sessionId
274
- AND status IN ('pending', 'sent')
275
- → If exists, skip (don't double-trigger)
276
-
277
- b. Select steps to rerun based on step_selector:
278
- - mode "by_name": find events in batch matching names
279
- - mode "all": find events in batch matching types
280
- For each selected event, look up its full input from observability_events:
281
- SELECT input, event_name, event_type
282
- FROM observability_events
283
- WHERE id = :eventDbId
284
-
285
- c. Build trigger steps with full input data:
286
- steps = [{
287
- eventId: <sequential>,
288
- eventType: event.event_type,
289
- eventName: event.event_name,
290
- originalEventDbId: event.id,
291
- input: event.input, // REQUIRED: SDK needs this to re-execute
292
- model: event.event_name, // For AI events
293
- provider: <inferred> // For AI events: infer from model name
294
- }]
295
-
296
- d. INSERT INTO triggers (
297
- project_id, sampling_config_id, session_id,
298
- status, run_count, steps_sent, steps_count, sent_at
299
- ) VALUES (
300
- :projectId, :configId, :sessionId,
301
- 'sent', :config.run_count, :steps, :steps.length, NOW()
302
- )
303
-
304
- e. Add trigger to response:
305
- responseBody.trigger = {
306
- triggerId: trigger.id,
307
- runCount: config.run_count,
308
- steps: steps
309
- }
310
-
311
- 6. Return responseBody (status 202)
312
- ```
313
-
314
- **Important:** Return only one trigger per response. If multiple configs match, pick the highest-priority or oldest one and queue the others for the next batch.
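One possible way to combine the `sample_rate` gate from step 5 with the one-trigger-per-response rule is sketched below. The schema shown earlier has no priority column, so this sketch falls back to creation order only. `MatchedConfig`, `passesSampling`, and `pickConfigForResponse` are hypothetical names, and the random source is injectable purely so the decision can be tested.

```typescript
// Hypothetical view of a sampling config that already passed trace_filter.
interface MatchedConfig {
  id: number;
  sampleRate: number; // 0.0 to 1.0, as in sampling_configs.sample_rate
  createdAt: Date;
}

// Mirrors "Math.random() < config.sample_rate" from the pseudocode.
function passesSampling(sampleRate: number, rng: () => number = Math.random): boolean {
  return rng() < sampleRate;
}

// Returns the single config to trigger for this response, or null.
// Assumption: with no priority field available, the oldest sampled config wins.
function pickConfigForResponse(
  matched: MatchedConfig[],
  rng: () => number = Math.random,
): MatchedConfig | null {
  const sampled = matched.filter((c) => passesSampling(c.sampleRate, rng));
  if (sampled.length === 0) {
    return null;
  }
  return sampled.reduce((oldest, c) =>
    c.createdAt.getTime() < oldest.createdAt.getTime() ? c : oldest,
  );
}
```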
315
-
316
- ### 2. `POST /api/observability/triggers/:triggerId/results` — New
317
-
318
- Receives execution results from the SDK.
319
-
320
- **Request body (from SDK):**
321
-
322
- ```json
323
- {
324
- "steps": [
325
- {
326
- "originalEventDbId": 102,
327
- "eventType": "tool",
328
- "eventName": "searchDB",
329
- "available": true,
330
- "runs": [
331
- { "runIndex": 0, "input": {...}, "output": {...}, "durationMs": 45 },
332
- { "runIndex": 1, "input": {...}, "output": {...}, "durationMs": 52 }
333
- ]
334
- },
335
- {
336
- "originalEventDbId": 103,
337
- "eventType": "ai",
338
- "eventName": "gpt-4o",
339
- "available": false,
340
- "unavailableReason": "Missing API key for provider \"openai\". Expected: OPENAI_API_KEY",
341
- "runs": []
342
- }
343
- ]
344
- }
345
- ```
346
-
347
- **Processing logic:**
348
-
349
- ```
350
- function handleTriggerResults(triggerId, body):
351
- 1. Find trigger:
352
- SELECT * FROM triggers WHERE id = :triggerId
353
- → 404 if not found
354
- → 409 if status != 'sent' (already completed or timed out)
355
-
356
- 2. Load sampling config:
357
- SELECT evaluations FROM sampling_configs WHERE id = trigger.sampling_config_id
358
-
359
- 3. Store raw results:
360
- UPDATE triggers SET results = body.steps WHERE id = :triggerId
361
-
362
- 4. Run evaluations for each step:
363
- allEvaluations = []
364
- passedCount = 0
365
- failedCount = 0
366
- availableCount = 0
367
- unavailableCount = 0
368
-
369
- For each step in body.steps:
370
- stepEvals = { originalEventDbId: step.originalEventDbId, eventName: step.eventName, results: [] }
371
-
372
- If step.available == false:
373
- unavailableCount++
374
- stepEvals.results.push({ type: 'availability', passed: false, detail: { reason: step.unavailableReason } })
375
- failedCount++
376
- Else:
377
- availableCount++
378
- For each evaluation in sampling_config.evaluations:
379
- result = evaluate(evaluation, step.runs)
380
- stepEvals.results.push(result)
381
- if result.passed: passedCount++ else: failedCount++
382
-
383
- allEvaluations.push(stepEvals)
384
-
385
- 5. Update trigger with everything:
386
- UPDATE triggers SET
387
- status = 'completed',
388
- completed_at = NOW(),
389
- results = :body.steps,
390
- evaluations = :allEvaluations,
391
- steps_available = :availableCount,
392
- steps_unavailable = :unavailableCount,
393
- evaluations_passed = :passedCount,
394
- evaluations_failed = :failedCount
395
- WHERE id = :triggerId
396
-
397
- 6. If failedCount > 0: fire alert (see Alerting below)
398
-
399
- 7. Return { ok: true, triggerId, evaluationsRun: passedCount + failedCount, passed: passedCount, failed: failedCount }
400
- ```
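Step 4 of the processing logic (the per-step evaluation loop and its counters) could look roughly like this in TypeScript. It is a sketch under assumptions: all type names and `summarizeTrigger` are hypothetical, the `evaluate` callback stands in for the evaluation engine, and database access is left out entirely.

```typescript
// Hypothetical types based on the request body and evaluations JSON above.
interface StepRun { runIndex: number; output: unknown; durationMs: number; }
interface StepResult {
  originalEventDbId: number;
  eventName: string;
  available: boolean;
  unavailableReason?: string;
  runs: StepRun[];
}
interface EvaluationConfig { type: string; }
interface EvaluationOutcome { type: string; passed: boolean; detail: Record<string, unknown>; }
interface StepEvaluations {
  originalEventDbId: number;
  eventName: string;
  results: EvaluationOutcome[];
}
interface TriggerSummary {
  evaluations: StepEvaluations[];
  passed: number;
  failed: number;
  available: number;
  unavailable: number;
}

// Builds the evaluations JSONB payload and the denormalized counters.
function summarizeTrigger(
  steps: StepResult[],
  configs: EvaluationConfig[],
  evaluate: (config: EvaluationConfig, runs: StepRun[]) => EvaluationOutcome,
): TriggerSummary {
  const summary: TriggerSummary = {
    evaluations: [], passed: 0, failed: 0, available: 0, unavailable: 0,
  };
  for (const step of steps) {
    const results: EvaluationOutcome[] = [];
    if (!step.available) {
      // An unavailable step counts as one failed "availability" evaluation.
      summary.unavailable++;
      summary.failed++;
      results.push({
        type: "availability",
        passed: false,
        detail: { reason: step.unavailableReason },
      });
    } else {
      summary.available++;
      for (const config of configs) {
        const outcome = evaluate(config, step.runs);
        results.push(outcome);
        if (outcome.passed) summary.passed++; else summary.failed++;
      }
    }
    summary.evaluations.push({
      originalEventDbId: step.originalEventDbId,
      eventName: step.eventName,
      results,
    });
  }
  return summary;
}
```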
401
-
402
- ### 3. `GET /api/observability/triggers` — New
403
-
404
- List triggers for a project.
405
-
406
- **Query params:** `samplingConfigId?`, `sessionId?`, `status?`, `from?`, `to?`, `limit=50`, `offset=0`
407
-
408
- **Response:**
409
-
410
- ```json
411
- {
412
- "triggers": [
413
- {
414
- "id": 42,
415
- "samplingConfigId": 1,
416
- "samplingConfigName": "Monitor searchDB latency",
417
- "sessionId": "f47ac10b-...",
418
- "status": "completed",
419
- "runCount": 3,
420
- "stepsCount": 2,
421
- "stepsAvailable": 1,
422
- "stepsUnavailable": 1,
423
- "evaluationsPassed": 3,
424
- "evaluationsFailed": 1,
425
- "createdAt": "2026-04-11T10:00:00Z",
426
- "completedAt": "2026-04-11T10:00:05Z"
427
- }
428
- ],
429
- "total": 15
430
- }
431
- ```
432
-
433
- Note: summary counts come from the denormalized columns on `triggers` — no need to parse JSONB for list queries.
434
-
435
- ### 4. `GET /api/observability/triggers/:triggerId` — New
436
-
437
- Full detail. Returns the trigger row with `steps_sent`, `results`, and `evaluations` JSONB fields unpacked.
438
-
439
- ### 5. Sampling Config CRUD — New
440
-
441
- | Endpoint | Method | Purpose |
442
- |----------|--------|---------|
443
- | `/api/observability/sampling-configs` | POST | Create config |
444
- | `/api/observability/sampling-configs` | GET | List configs for project |
445
- | `/api/observability/sampling-configs/:id` | GET | Get config detail |
446
- | `/api/observability/sampling-configs/:id` | PUT | Update config |
447
- | `/api/observability/sampling-configs/:id` | DELETE | Delete config |
448
-
449
- All endpoints are scoped to the project via API key authentication.
450
-
451
- ---
452
-
453
- ## Evaluation Engine
454
-
455
- Implement each evaluation type as a function:
456
-
457
- ### `latency-budget`
458
-
459
- ```
460
- Input: runs[], config: { maxDurationMs }
461
- Logic: passed = runs.every(r => r.durationMs <= maxDurationMs)
462
- Detail: { maxDurationMs, actualMaxMs: Math.max(...runs.map(r => r.durationMs)) }
463
- ```
464
-
465
- ### `token-budget`
466
-
467
- ```
468
- Input: runs[], config: { maxTokens }
469
- Logic: passed = runs.every(r => (r.usageTotalTokens ?? 0) <= maxTokens)
470
- Detail: { maxTokens, actualMaxTokens: Math.max(...) }
471
- ```
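The `latency-budget` and `token-budget` checks above are nearly identical, so a single sketch can cover both. The type and function names are hypothetical. One deliberate difference from the pseudocode: for an empty `runs` array this sketch reports `null` as the observed maximum, since `Math.max()` with no arguments would yield `-Infinity`.

```typescript
// Hypothetical run shape; usageTotalTokens is absent for tool runs.
interface BudgetRun { runIndex: number; durationMs: number; usageTotalTokens?: number; }
interface BudgetResult {
  type: string;
  passed: boolean;
  detail: Record<string, number | null>;
}

// Largest value, or null for an empty list (an assumption of this sketch).
function maxOrNull(values: number[]): number | null {
  return values.length > 0 ? Math.max(...values) : null;
}

// Passes when every run finished within maxDurationMs.
function latencyBudget(runs: BudgetRun[], maxDurationMs: number): BudgetResult {
  const durations = runs.map((r) => r.durationMs);
  return {
    type: "latency-budget",
    passed: durations.every((d) => d <= maxDurationMs),
    detail: { maxDurationMs, actualMaxMs: maxOrNull(durations) },
  };
}

// Passes when every run used at most maxTokens; missing usage counts as 0.
function tokenBudget(runs: BudgetRun[], maxTokens: number): BudgetResult {
  const tokens = runs.map((r) => r.usageTotalTokens ?? 0);
  return {
    type: "token-budget",
    passed: tokens.every((t) => t <= maxTokens),
    detail: { maxTokens, actualMaxTokens: maxOrNull(tokens) },
  };
}
```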
472
-
473
- ### `output-contains`
474
-
475
- ```
476
- Input: runs[], config: { containsText?, notContainsText? }
477
- Logic (applied to each run; the evaluation passes only if every run passes):
478
- outputStr = JSON.stringify(run.output)
479
- if containsText: passed &= outputStr.includes(containsText)
480
- if notContainsText: passed &= !outputStr.includes(notContainsText)
481
- Detail: { containsText, notContainsText, failedRunIndices: [...] }
482
- ```
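The `output-contains` pseudocode refers to a single `run`, so the sketch below makes the per-run loop explicit and collects `failedRunIndices`. It assumes the evaluation passes only when every run passes, in line with the other evaluators. The names are hypothetical.

```typescript
// Hypothetical shapes for this illustration.
interface OutputRun { runIndex: number; output: unknown; }
interface ContainsConfig { containsText?: string; notContainsText?: string; }

// Checks each run's serialized output for required and forbidden text.
function outputContains(runs: OutputRun[], config: ContainsConfig) {
  const failedRunIndices: number[] = [];
  for (const run of runs) {
    // JSON.stringify returns undefined for an undefined output at runtime.
    const serialized: string | undefined = JSON.stringify(run.output);
    const outputStr = serialized === undefined ? "" : serialized;
    let ok = true;
    if (config.containsText !== undefined && outputStr.indexOf(config.containsText) === -1) {
      ok = false;
    }
    if (config.notContainsText !== undefined && outputStr.indexOf(config.notContainsText) !== -1) {
      ok = false;
    }
    if (!ok) {
      failedRunIndices.push(run.runIndex);
    }
  }
  return {
    type: "output-contains",
    passed: failedRunIndices.length === 0,
    detail: {
      containsText: config.containsText,
      notContainsText: config.notContainsText,
      failedRunIndices,
    },
  };
}
```

Because the match runs against the JSON-serialized output, `notContainsText: "error"` also matches a key named `error`, not only values.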
483
-
484
- ### `output-schema`
485
-
486
- ```
487
- Input: runs[], config: { jsonSchema }
488
- Logic: validate each run.output against JSON schema (use ajv)
489
- passed = runs.every(r => ajv.validate(jsonSchema, r.output))
490
- Detail: { failedRunIndices: [...], errors: [...] }
491
- ```
492
-
493
- ### `determinism`
494
-
495
- ```
496
- Input: runs[] (needs runCount >= 2), config: { similarityThreshold }
497
- Logic:
498
- Compare all pairs of run outputs
499
- similarity = average pairwise similarity (cosine, jaccard, or string match ratio)
500
- passed = similarity >= similarityThreshold
501
- Detail: { threshold, actualSimilarity, pairComparisons: [...] }
502
- ```
503
-
504
- A simple approach is to normalize outputs to strings and compare them using a Levenshtein-distance or longest-common-subsequence ratio. For structured outputs, use a deep-equality check with a tolerance.
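A minimal sketch of the string-based variant, using a Levenshtein ratio averaged over all pairs of runs. It does not implement the deep-equal path for structured outputs, and it serializes with `JSON.stringify`, so objects that differ only in key order score below 1. Treating fewer than two runs as a failure is an assumption of this sketch. All names are hypothetical.

```typescript
// Classic dynamic-programming edit distance, keeping only two rows.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// 1.0 for identical strings, approaching 0.0 as they diverge.
function similarityRatio(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(a, b) / maxLen;
}

interface DeterminismResult {
  type: string;
  passed: boolean;
  detail: { threshold: number; actualSimilarity: number | null; pairComparisons: number[] };
}

function determinism(outputs: unknown[], similarityThreshold: number): DeterminismResult {
  const strings = outputs.map((o) => {
    const s: string | undefined = JSON.stringify(o);
    return s === undefined ? "" : s;
  });
  const pairComparisons: number[] = [];
  for (let i = 0; i < strings.length; i++) {
    for (let j = i + 1; j < strings.length; j++) {
      pairComparisons.push(similarityRatio(strings[i], strings[j]));
    }
  }
  if (pairComparisons.length === 0) {
    // Assumption: with fewer than two runs there is nothing to compare, so fail.
    return {
      type: "determinism",
      passed: false,
      detail: { threshold: similarityThreshold, actualSimilarity: null, pairComparisons },
    };
  }
  const total = pairComparisons.reduce((sum, s) => sum + s, 0);
  const actualSimilarity = total / pairComparisons.length;
  return {
    type: "determinism",
    passed: actualSimilarity >= similarityThreshold,
    detail: { threshold: similarityThreshold, actualSimilarity, pairComparisons },
  };
}
```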
505
-
506
- ### `llm-judge`
507
-
508
- ```
509
- Input: runs[], config: { judgePrompt, threshold, judgeModel, judgeProvider }
510
- Logic:
511
- For each run:
512
- prompt = judgePrompt + "\n\nOutput to evaluate:\n" + JSON.stringify(run.output)
513
- response = callLLM(judgeModel, judgeProvider, prompt)
514
- score = parseScore(response) // extract number from "SCORE: 8.5"
515
- avgScore = average(scores)
516
- passed = avgScore >= threshold
517
- Detail: { threshold, avgScore, perRunScores: [...] }
518
- ```
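The `parseScore` step and the averaging can be sketched without committing to any LLM client. The regular expression and the handling of unparseable responses are assumptions: here a response with no `SCORE:` marker yields `null` and is left out of the average, and an evaluation with no usable scores fails. The function names other than `parseScore` are hypothetical.

```typescript
// Extracts the number from text such as "SCORE: 8.5"; null if absent.
function parseScore(response: string): number | null {
  const match = /SCORE:\s*(\d+(?:\.\d+)?)/i.exec(response);
  return match === null ? null : parseFloat(match[1]);
}

interface JudgeResult {
  type: string;
  passed: boolean;
  detail: { threshold: number; avgScore: number | null; perRunScores: Array<number | null> };
}

// Averages the parsed per-run scores and compares against the threshold.
function aggregateJudgeScores(
  perRunScores: Array<number | null>,
  threshold: number,
): JudgeResult {
  const valid: number[] = [];
  for (const s of perRunScores) {
    if (s !== null) valid.push(s);
  }
  if (valid.length === 0) {
    // Assumption: no parseable score means the evaluation fails.
    return { type: "llm-judge", passed: false, detail: { threshold, avgScore: null, perRunScores } };
  }
  const avgScore = valid.reduce((sum, s) => sum + s, 0) / valid.length;
  return {
    type: "llm-judge",
    passed: avgScore >= threshold,
    detail: { threshold, avgScore, perRunScores },
  };
}
```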
519
-
520
- ---
521
-
522
- ## Background Jobs
523
-
524
- ### Trigger Timeout Sweep
525
-
526
- Run every 5 minutes:
527
-
528
- ```sql
529
- UPDATE triggers
530
- SET status = 'timed_out'
531
- WHERE status = 'sent'
532
- AND sent_at < NOW() - INTERVAL '10 minutes';
533
- ```
534
-
535
- Optionally alert on timeouts (the SDK may have crashed, or its process may have exited before completing the rerun).
536
-
537
- ### Stale Session Cleanup
538
-
539
- Run every hour:
540
-
541
- ```sql
542
- UPDATE observability_sessions
543
- SET ended_at = last_heartbeat
544
- WHERE ended_at IS NULL
545
- AND last_heartbeat < NOW() - INTERVAL '5 minutes';
546
- ```
547
-
548
- ---
549
-
550
- ## Alerting
551
-
552
- When evaluations fail, fire alerts:
553
-
554
- 1. **Dashboard notification** — mark trigger as having failures, show badge in UI
555
- 2. **Webhook** (future) — POST to `alerting.webhookUrl` with trigger ID + failed evaluations
556
- 3. **Email** (future) — send to `alerting.emailTo` addresses
557
-
558
- ---
559
-
560
- ## Dashboard Pages
561
-
562
- ### Sampling Configs Page (`/observability/sampling`)
563
-
564
- - Table: name, enabled toggle, filter summary, run count, sample rate, evaluation count
565
- - Create/edit modal with form fields for all config options
566
- - Evaluation builder: add rows with type dropdown + type-specific fields
567
-
568
- ### Trigger History Page (`/observability/triggers`)
569
-
570
- - Table: ID, config name, status badge, steps count, available/unavailable, pass/fail counts, timestamps
571
- - Filter by: config, status, date range
572
- - Click row → trigger detail page
573
-
574
- ### Trigger Detail Page (`/observability/triggers/:id`)
575
-
576
- - Header: status, config name, run count, timestamps
577
- - Per-step cards:
578
- - Tool/AI icon + name
579
- - Available badge (green / red with reason)
580
- - Run results table: runIndex, output preview, durationMs, tokens, error
581
- - Evaluation results: type, passed badge, detail expandable
582
- - Collapsible JSON viewers for input/output
583
-
584
- ---
585
-
586
- ## Implementation Order
587
-
588
- 1. **Schema**: Create `sampling_configs` + `triggers` tables
589
- 2. **Sampling configs CRUD**: endpoints + validation
590
- 3. **Modify `POST /api/observability/events`**: add sampling config matching + trigger creation + response body
591
- 4. **`POST /api/observability/triggers/:id/results`**: result storage + evaluation engine
592
- 5. **Evaluation engine**: implement all 6 types
593
- 6. **Query endpoints**: `GET /api/observability/triggers` list + detail
594
- 7. **Timeout sweep**: cron job
595
- 8. **Dashboard**: sampling configs page, trigger history, trigger detail
596
- 9. **Alerting**: dashboard notifications, then webhook/email