elasticdash-test 0.1.26 → 0.1.27
- package/README.md +100 -0
- package/dist/cli.js +175 -0
- package/dist/cli.js.map +1 -1
- package/dist/index.cjs +62 -1
- package/dist/index.d.ts +2 -0
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +2 -0
- package/dist/index.js.map +1 -1
- package/dist/tool-registry.d.ts +31 -0
- package/dist/tool-registry.d.ts.map +1 -0
- package/dist/tool-registry.js +73 -0
- package/dist/tool-registry.js.map +1 -0
- package/dist/tool-runner-worker.js +19 -2
- package/dist/tool-runner-worker.js.map +1 -1
- package/dist/utils/debug.d.ts +1 -1
- package/dist/utils/debug.d.ts.map +1 -1
- package/dist/utils/debug.js +2 -2
- package/dist/utils/debug.js.map +1 -1
- package/docs/observability_contract.md +192 -0
- package/package.json +2 -2
- package/src/cli.ts +184 -0
- package/src/index.ts +4 -0
- package/src/tool-registry.ts +94 -0
- package/src/tool-runner-worker.ts +17 -2
- package/src/utils/debug.ts +2 -2
- package/dist/cloud-client.d.ts +0 -34
- package/dist/cloud-client.d.ts.map +0 -1
- package/dist/cloud-client.js +0 -103
- package/dist/cloud-client.js.map +0 -1
- package/dist/evaluators/determinism.d.ts +0 -3
- package/dist/evaluators/determinism.d.ts.map +0 -1
- package/dist/evaluators/determinism.js +0 -116
- package/dist/evaluators/determinism.js.map +0 -1
- package/dist/evaluators/index.d.ts +0 -4
- package/dist/evaluators/index.d.ts.map +0 -1
- package/dist/evaluators/index.js +0 -61
- package/dist/evaluators/index.js.map +0 -1
- package/dist/evaluators/latency-budget.d.ts +0 -3
- package/dist/evaluators/latency-budget.d.ts.map +0 -1
- package/dist/evaluators/latency-budget.js +0 -45
- package/dist/evaluators/latency-budget.js.map +0 -1
- package/dist/evaluators/llm-judge.d.ts +0 -3
- package/dist/evaluators/llm-judge.d.ts.map +0 -1
- package/dist/evaluators/llm-judge.js +0 -125
- package/dist/evaluators/llm-judge.js.map +0 -1
- package/dist/evaluators/output-contains.d.ts +0 -3
- package/dist/evaluators/output-contains.d.ts.map +0 -1
- package/dist/evaluators/output-contains.js +0 -52
- package/dist/evaluators/output-contains.js.map +0 -1
- package/dist/evaluators/output-schema.d.ts +0 -3
- package/dist/evaluators/output-schema.d.ts.map +0 -1
- package/dist/evaluators/output-schema.js +0 -58
- package/dist/evaluators/output-schema.js.map +0 -1
- package/dist/evaluators/token-budget.d.ts +0 -3
- package/dist/evaluators/token-budget.d.ts.map +0 -1
- package/dist/evaluators/token-budget.js +0 -45
- package/dist/evaluators/token-budget.js.map +0 -1
- package/dist/evaluators/types.d.ts +0 -104
- package/dist/evaluators/types.d.ts.map +0 -1
- package/dist/evaluators/types.js +0 -6
- package/dist/evaluators/types.js.map +0 -1
- package/dist/test-group/cli.d.ts +0 -8
- package/dist/test-group/cli.d.ts.map +0 -1
- package/dist/test-group/cli.js +0 -162
- package/dist/test-group/cli.js.map +0 -1
- package/dist/test-group/git-context.d.ts +0 -3
- package/dist/test-group/git-context.d.ts.map +0 -1
- package/dist/test-group/git-context.js +0 -59
- package/dist/test-group/git-context.js.map +0 -1
- package/dist/test-group/reporter.d.ts +0 -4
- package/dist/test-group/reporter.d.ts.map +0 -1
- package/dist/test-group/reporter.js +0 -54
- package/dist/test-group/reporter.js.map +0 -1
- package/dist/test-group/runner.d.ts +0 -18
- package/dist/test-group/runner.d.ts.map +0 -1
- package/dist/test-group/runner.js +0 -234
- package/dist/test-group/runner.js.map +0 -1
- package/dist/tracing-universal.d.ts +0 -13
- package/dist/tracing-universal.d.ts.map +0 -1
- package/dist/tracing-universal.js +0 -33
- package/dist/tracing-universal.js.map +0 -1
- package/docs/backend_rerun_alignment.md +0 -291
- package/docs/backend_traceid_update.md +0 -141
- package/docs/observability_backend_contract.md +0 -577
- package/docs/observability_rerun_backend_plan.md +0 -596
@@ -1,596 +0,0 @@

# Backend Plan: Observability Rerun Pipeline

This document describes everything the backend needs to implement so that the SDK's observability mode can receive rerun triggers, execute them, and return results for evaluation.

The SDK side is complete. This plan covers only backend and dashboard work.

---

## How It Works (end-to-end)

```
SDK (user's server)                            Backend
────────────────────                           ────────

1. SDK sends event batch
   POST /api/observability/events ──────────>  Ingests events into observability_events
                                               Checks sampling_configs for matches
                                               If match: creates trigger, includes in response

2. <── Response with trigger ──────────────    { ok: true, ingested: 5, trigger: {
                                                 triggerId: 42, runCount: 3,
                                                 steps: [{ eventType, eventName,
                                                           input, model, provider,
                                                           originalEventDbId }]
                                               }}

3. SDK pre-validates each step:
   - Tool: checks ed_tools.ts has it
   - AI: checks API key env var exists
   Executes available steps × runCount
   Collects output, durationMs, token usage

4. SDK POSTs results
   POST /api/observability/triggers/42/results ->  Receives results
                                               Runs evaluations per sampling config
                                               Stores results + evaluations as JSONB
                                               Alerts if thresholds breached
```
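
The response shape in step 2 can be sketched as TypeScript types. This is only an illustration: the type names and the `extractTrigger` helper are hypothetical, and only the field names come from the diagram above.

```typescript
// Field names follow the diagram; the type names are assumptions for illustration.
interface TriggerStep {
  eventType: string;
  eventName: string;
  input: unknown;
  originalEventDbId: number;
  model?: string; // present for AI steps
  provider?: string; // present for AI steps
}

interface RerunTrigger {
  triggerId: number;
  runCount: number;
  steps: TriggerStep[];
}

interface EventBatchResponse {
  ok: boolean;
  ingested: number;
  trigger?: RerunTrigger; // absent when no sampling config matched
}

// Hypothetical helper: returns the trigger to execute, or null when there is nothing to rerun.
function extractTrigger(body: EventBatchResponse): RerunTrigger | null {
  if (!body.ok || !body.trigger) return null;
  return body.trigger.steps.length > 0 ? body.trigger : null;
}
```

A caller on the SDK side would check the result for `null` before starting step 3.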

---

## Database Schema (2 tables)

### Table: `sampling_configs`

Persistent user-defined rules. One config can produce many triggers over time.

```sql
CREATE TABLE sampling_configs (
  id            SERIAL PRIMARY KEY,
  project_id    UUID NOT NULL REFERENCES projects(id),
  name          VARCHAR(255) NOT NULL,
  enabled       BOOLEAN NOT NULL DEFAULT true,

  -- Which traces to match
  trace_filter  JSONB NOT NULL,
  -- Which steps within matched traces to rerun
  step_selector JSONB NOT NULL,
  -- How many times to re-execute each step
  run_count     INT NOT NULL DEFAULT 1 CHECK (run_count BETWEEN 1 AND 50),
  -- Fraction of matching traces that actually trigger (0.0–1.0)
  sample_rate   FLOAT NOT NULL DEFAULT 1.0 CHECK (sample_rate BETWEEN 0.0 AND 1.0),
  -- Evaluations to run on results
  evaluations   JSONB NOT NULL DEFAULT '[]',
  -- Alert configuration
  alerting      JSONB NOT NULL DEFAULT '{}',

  created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_sampling_configs_project ON sampling_configs(project_id, enabled);
```

**`trace_filter` format:**

```json
{
  "serviceId": "my-ai-app",
  "eventTypes": ["tool", "ai"],
  "eventNames": ["searchDB", "gpt-4o"]
}
```

All fields are optional. Omitted fields match everything. Rules:
- `serviceId`: exact match on the session's service ID
- `eventTypes`: match events of these types
- `eventNames`: match events with these names (tool name or model name)
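
A minimal sketch of these matching rules, assuming the present fields are combined with AND and that `eventTypes` and `eventNames` may each be satisfied by any event in the batch. The `TraceFilter`, `BatchEvent`, and `matchesTraceFilter` names are hypothetical.

```typescript
interface TraceFilter {
  serviceId?: string;
  eventTypes?: string[];
  eventNames?: string[];
}

// Assumed minimal shape of an ingested event.
interface BatchEvent {
  type: string;
  name: string;
}

// An omitted field matches everything; a present field must match.
function matchesTraceFilter(
  filter: TraceFilter,
  sessionServiceId: string,
  events: BatchEvent[],
): boolean {
  if (filter.serviceId !== undefined && filter.serviceId !== sessionServiceId) {
    return false;
  }
  const { eventTypes, eventNames } = filter;
  if (eventTypes !== undefined && !events.some((e) => eventTypes.includes(e.type))) {
    return false;
  }
  if (eventNames !== undefined && !events.some((e) => eventNames.includes(e.name))) {
    return false;
  }
  return true;
}
```

Under these assumptions an empty filter (`{}`) matches every batch, which is consistent with "omitted fields match everything".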

**`step_selector` format:**

```json
{ "mode": "by_name", "names": ["searchDB", "gpt-4o"] }
```
Or:
```json
{ "mode": "all", "types": ["tool", "ai"] }
```

`by_name` selects specific steps. `all` selects all steps of the given types from the matched trace.
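
The two selector modes map naturally onto a discriminated union. The sketch below is illustrative: `StepSelector`, `TraceEvent`, and `selectSteps` are hypothetical names, and the event shape is assumed.

```typescript
type StepSelector =
  | { mode: "by_name"; names: string[] }
  | { mode: "all"; types: string[] };

// Assumed minimal shape of an event in the matched trace.
interface TraceEvent {
  id: number;
  type: string;
  name: string;
}

function selectSteps(selector: StepSelector, events: TraceEvent[]): TraceEvent[] {
  if (selector.mode === "by_name") {
    const names = selector.names;
    return events.filter((e) => names.includes(e.name));
  }
  const types = selector.types;
  return events.filter((e) => types.includes(e.type));
}
```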

**`evaluations` format:**

```json
[
  { "type": "latency-budget", "maxDurationMs": 5000 },
  { "type": "token-budget", "maxTokens": 1000 },
  { "type": "output-contains", "containsText": "pikachu", "notContainsText": "error" },
  { "type": "output-schema", "jsonSchema": { "type": "object", "required": ["results"] } },
  { "type": "determinism", "similarityThreshold": 0.9 },
  {
    "type": "llm-judge",
    "judgePrompt": "Rate the quality and accuracy of this output on a scale of 1-10. Output format: SCORE: <number>",
    "threshold": 7.0,
    "judgeModel": "gpt-4o",
    "judgeProvider": "openai"
  }
]
```

**`alerting` format:**

```json
{
  "onFailure": true,
  "webhookUrl": "https://hooks.slack.com/...",
  "emailTo": ["dev@company.com"]
}
```

### Table: `triggers`

One-time execution instance of a sampling config. Stores the full lifecycle (what was sent, what came back, and the evaluation outcomes), all as JSONB.

```sql
CREATE TABLE triggers (
  id                  SERIAL PRIMARY KEY,
  project_id          UUID NOT NULL REFERENCES projects(id),
  sampling_config_id  INT NOT NULL REFERENCES sampling_configs(id),
  session_id          UUID NOT NULL,
  status              VARCHAR(20) NOT NULL DEFAULT 'pending',
    -- pending:   created, not yet included in a batch response
    -- sent:      included in batch response to SDK, waiting for results
    -- completed: results received and evaluations stored
    -- timed_out: SDK didn't respond within 10 minutes
  run_count           INT NOT NULL,

  -- What was sent to the SDK
  steps_sent          JSONB NOT NULL,

  -- What came back from the SDK (null until completed)
  results             JSONB,

  -- Evaluation outcomes computed by the backend (null until completed)
  evaluations         JSONB,

  -- Summary counts (denormalized for list queries)
  steps_count         INT,
  steps_available     INT,
  steps_unavailable   INT,
  evaluations_passed  INT,
  evaluations_failed  INT,

  created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  sent_at             TIMESTAMPTZ,
  completed_at        TIMESTAMPTZ
);

CREATE INDEX idx_triggers_project_status ON triggers(project_id, status);
CREATE INDEX idx_triggers_session ON triggers(session_id);
CREATE INDEX idx_triggers_config ON triggers(sampling_config_id);
CREATE INDEX idx_triggers_sent_timeout ON triggers(status, sent_at)
  WHERE status = 'sent';
```

**`steps_sent` format (what goes to SDK):**

```json
[
  {
    "eventId": 1,
    "eventType": "tool",
    "eventName": "searchDB",
    "originalEventDbId": 102,
    "input": { "query": "pikachu" }
  },
  {
    "eventId": 2,
    "eventType": "ai",
    "eventName": "gpt-4o",
    "originalEventDbId": 103,
    "input": { "messages": [{"role": "user", "content": "..."}] },
    "model": "gpt-4o",
    "provider": "openai"
  }
]
```

**`results` format (what comes back from SDK):**

```json
[
  {
    "originalEventDbId": 102,
    "eventType": "tool",
    "eventName": "searchDB",
    "available": true,
    "runs": [
      { "runIndex": 0, "input": {...}, "output": {...}, "durationMs": 45 },
      { "runIndex": 1, "input": {...}, "output": {...}, "durationMs": 52 }
    ]
  },
  {
    "originalEventDbId": 103,
    "eventType": "ai",
    "eventName": "gpt-4o",
    "available": false,
    "unavailableReason": "Missing API key for provider \"openai\". Expected: OPENAI_API_KEY",
    "runs": []
  }
]
```

**`evaluations` format (computed by backend):**

```json
[
  {
    "originalEventDbId": 102,
    "eventName": "searchDB",
    "results": [
      { "type": "latency-budget", "passed": true, "detail": { "maxDurationMs": 5000, "actualMaxMs": 52 } },
      { "type": "determinism", "passed": true, "detail": { "threshold": 0.9, "actualSimilarity": 0.95 } }
    ]
  },
  {
    "originalEventDbId": 103,
    "eventName": "gpt-4o",
    "results": [
      { "type": "availability", "passed": false, "detail": { "reason": "Missing API key..." } }
    ]
  }
]
```

---

## API Endpoints

### 1. `POST /api/observability/events` (modify existing)

After ingesting the event batch, check sampling configs:

```
function handleObservabilityEvents(req, res):
  1. Validate & ingest events (existing logic)
  2. responseBody = { ok: true, ingested: events.length }

  3. Find matching sampling_configs:
     SELECT * FROM sampling_configs
     WHERE project_id = :projectId
       AND enabled = true

  4. For each config, check if trace_filter matches any event in this batch:
     - serviceId matches session's serviceId?
     - eventTypes includes any event.type in batch?
     - eventNames includes any event.name in batch?

  5. If match AND Math.random() < config.sample_rate:
     a. Check no pending/sent trigger exists for this config + session:
        SELECT 1 FROM triggers
        WHERE sampling_config_id = :configId
          AND session_id = :sessionId
          AND status IN ('pending', 'sent')
        → If exists, skip (don't double-trigger)

     b. Select steps to rerun based on step_selector:
        - mode "by_name": find events in batch matching names
        - mode "all": find events in batch matching types
        For each selected event, look up its full input from observability_events:
          SELECT input, event_name, event_type
          FROM observability_events
          WHERE id = :eventDbId

     c. Build trigger steps with full input data:
        steps = [{
          eventId: <sequential>,
          eventType: event.event_type,
          eventName: event.event_name,
          originalEventDbId: event.id,
          input: event.input,        // REQUIRED: SDK needs this to re-execute
          model: event.event_name,   // For AI events
          provider: <inferred>       // For AI events: infer from model name
        }]

     d. INSERT INTO triggers (
          project_id, sampling_config_id, session_id,
          status, run_count, steps_sent, steps_count, sent_at
        ) VALUES (
          :projectId, :configId, :sessionId,
          'sent', :config.run_count, :steps, :steps.length, NOW()
        )

     e. Add trigger to response:
        responseBody.trigger = {
          triggerId: trigger.id,
          runCount: config.run_count,
          steps: steps
        }

  6. Return responseBody (status 202)
```

**Important:** Include at most one trigger per response. If multiple configs match, pick the highest-priority or oldest one and queue the others for the next batch.
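
The sample-rate gate in step 5 and the one-trigger-per-response rule can be sketched as follows. This is a sketch under stated assumptions: the schema above has no priority column, so it picks the oldest config by `created_at`, and the `MatchedConfig`, `passesSampleRate`, and `pickConfigForResponse` names are hypothetical.

```typescript
// Assumed projection of a sampling_configs row.
interface MatchedConfig {
  id: number;
  sampleRate: number;
  createdAt: Date;
}

// rng is injectable so the gate can be tested deterministically.
function passesSampleRate(sampleRate: number, rng: () => number = Math.random): boolean {
  return rng() < sampleRate;
}

// Picks the single config that gets a trigger in this response: the oldest one.
function pickConfigForResponse(matched: MatchedConfig[]): MatchedConfig | null {
  if (matched.length === 0) return null;
  return matched.reduce((oldest, c) =>
    c.createdAt.getTime() < oldest.createdAt.getTime() ? c : oldest,
  );
}
```

Because `Math.random()` returns a value in [0, 1), a `sample_rate` of 1.0 always triggers and 0.0 never does.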

### 2. `POST /api/observability/triggers/:triggerId/results` (new)

Receives execution results from the SDK.

**Request body (from SDK):**

```json
{
  "steps": [
    {
      "originalEventDbId": 102,
      "eventType": "tool",
      "eventName": "searchDB",
      "available": true,
      "runs": [
        { "runIndex": 0, "input": {...}, "output": {...}, "durationMs": 45 },
        { "runIndex": 1, "input": {...}, "output": {...}, "durationMs": 52 }
      ]
    },
    {
      "originalEventDbId": 103,
      "eventType": "ai",
      "eventName": "gpt-4o",
      "available": false,
      "unavailableReason": "Missing API key for provider \"openai\". Expected: OPENAI_API_KEY",
      "runs": []
    }
  ]
}
```

**Processing logic:**

```
function handleTriggerResults(triggerId, body):
  1. Find trigger:
     SELECT * FROM triggers WHERE id = :triggerId
     → 404 if not found
     → 409 if status != 'sent' (already completed or timed out)

  2. Load sampling config:
     SELECT evaluations FROM sampling_configs WHERE id = trigger.sampling_config_id

  3. Store raw results:
     UPDATE triggers SET results = body.steps WHERE id = :triggerId

  4. Run evaluations for each step:
     allEvaluations = []
     passedCount = 0
     failedCount = 0
     availableCount = 0
     unavailableCount = 0

     For each step in body.steps:
       stepEvals = { originalEventDbId: step.originalEventDbId, eventName: step.eventName, results: [] }

       If step.available == false:
         unavailableCount++
         stepEvals.results.push({ type: 'availability', passed: false, detail: { reason: step.unavailableReason } })
         failedCount++
       Else:
         availableCount++
         For each evaluation in sampling_config.evaluations:
           result = evaluate(evaluation, step.runs)
           stepEvals.results.push(result)
           if result.passed: passedCount++ else: failedCount++

       allEvaluations.push(stepEvals)

  5. Update trigger with everything:
     UPDATE triggers SET
       status = 'completed',
       completed_at = NOW(),
       results = :body.steps,
       evaluations = :allEvaluations,
       steps_available = :availableCount,
       steps_unavailable = :unavailableCount,
       evaluations_passed = :passedCount,
       evaluations_failed = :failedCount
     WHERE id = :triggerId

  6. If failedCount > 0: fire alert (see Alerting below)

  7. Return { ok: true, triggerId, evaluationsRun: passedCount + failedCount, passed: passedCount, failed: failedCount }
```
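
Step 4 of the processing logic, the per-step evaluation loop with its counters, might look like the sketch below. It is an in-memory illustration only, with no database access; the `Evaluator` signature and the `evaluateSteps` name are assumptions.

```typescript
interface Run {
  runIndex: number;
  output: unknown;
  durationMs: number;
}

interface StepResult {
  originalEventDbId: number;
  eventName: string;
  available: boolean;
  unavailableReason?: string;
  runs: Run[];
}

interface EvalResult {
  type: string;
  passed: boolean;
  detail: Record<string, unknown>;
}

// Hypothetical signature: one configured evaluation applied to a step's runs.
type Evaluator = (runs: Run[]) => EvalResult;

function evaluateSteps(steps: StepResult[], evaluators: Evaluator[]) {
  let passed = 0;
  let failed = 0;
  let available = 0;
  let unavailable = 0;
  const evaluations = steps.map((step) => {
    const results: EvalResult[] = [];
    if (!step.available) {
      // An unavailable step counts as one failed "availability" evaluation.
      unavailable++;
      failed++;
      results.push({ type: "availability", passed: false, detail: { reason: step.unavailableReason } });
    } else {
      available++;
      for (const evaluate of evaluators) {
        const result = evaluate(step.runs);
        results.push(result);
        if (result.passed) passed++;
        else failed++;
      }
    }
    return { originalEventDbId: step.originalEventDbId, eventName: step.eventName, results };
  });
  return { evaluations, passed, failed, available, unavailable };
}
```

The returned counters correspond to the denormalized `steps_available`, `steps_unavailable`, `evaluations_passed`, and `evaluations_failed` columns.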

### 3. `GET /api/observability/triggers` (new)

List triggers for a project.

**Query params:** `samplingConfigId?`, `sessionId?`, `status?`, `from?`, `to?`, `limit=50`, `offset=0`

**Response:**

```json
{
  "triggers": [
    {
      "id": 42,
      "samplingConfigId": 1,
      "samplingConfigName": "Monitor searchDB latency",
      "sessionId": "f47ac10b-...",
      "status": "completed",
      "runCount": 3,
      "stepsCount": 2,
      "stepsAvailable": 1,
      "stepsUnavailable": 1,
      "evaluationsPassed": 3,
      "evaluationsFailed": 1,
      "createdAt": "2026-04-11T10:00:00Z",
      "completedAt": "2026-04-11T10:00:05Z"
    }
  ],
  "total": 15
}
```

Note: summary counts come from the denormalized columns on `triggers`, so list queries do not need to parse JSONB.

### 4. `GET /api/observability/triggers/:triggerId` (new)

Returns the full trigger row, with the `steps_sent`, `results`, and `evaluations` JSONB fields unpacked.

### 5. Sampling Config CRUD (new)

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/api/observability/sampling-configs` | POST | Create config |
| `/api/observability/sampling-configs` | GET | List configs for project |
| `/api/observability/sampling-configs/:id` | GET | Get config detail |
| `/api/observability/sampling-configs/:id` | PUT | Update config |
| `/api/observability/sampling-configs/:id` | DELETE | Delete config |

All endpoints are scoped to the project via API key auth.

---

## Evaluation Engine

Implement each evaluation type as a function:

### `latency-budget`

```
Input:  runs[], config: { maxDurationMs }
Logic:  passed = runs.every(r => r.durationMs <= maxDurationMs)
Detail: { maxDurationMs, actualMaxMs: Math.max(...runs.map(r => r.durationMs)) }
```
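
A minimal TypeScript sketch of this evaluator. The function name and the `null` handling for an empty run list are assumptions; the pseudocode above does not say what to report when there are no runs.

```typescript
interface TimedRun {
  durationMs: number;
}

function latencyBudget(runs: TimedRun[], maxDurationMs: number) {
  const durations = runs.map((r) => r.durationMs);
  // Math.max() of an empty list is -Infinity, so report null when there are no runs.
  const actualMaxMs = durations.length > 0 ? Math.max(...durations) : null;
  return {
    type: "latency-budget",
    passed: durations.every((d) => d <= maxDurationMs),
    detail: { maxDurationMs, actualMaxMs },
  };
}
```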

### `token-budget`

```
Input:  runs[], config: { maxTokens }
Logic:  passed = runs.every(r => (r.usageTotalTokens ?? 0) <= maxTokens)
Detail: { maxTokens, actualMaxTokens: Math.max(...) }
```

### `output-contains`

```
Input:  runs[], config: { containsText?, notContainsText? }
Logic:
  For each run:
    outputStr = JSON.stringify(run.output)
    runPassed = true
    if containsText:    runPassed &= outputStr.includes(containsText)
    if notContainsText: runPassed &= !outputStr.includes(notContainsText)
  passed = every run passed
Detail: { containsText, notContainsText, failedRunIndices: [...] }
```
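
The same check written out in TypeScript, as a sketch. The `outputContains` name is hypothetical, and treating an `undefined` output as an empty string is an assumption.

```typescript
interface OutputRun {
  runIndex: number;
  output: unknown;
}

interface ContainsConfig {
  containsText?: string;
  notContainsText?: string;
}

function outputContains(runs: OutputRun[], config: ContainsConfig) {
  const failedRunIndices: number[] = [];
  for (const run of runs) {
    // JSON.stringify(undefined) returns undefined at runtime, so fall back to "".
    const outputStr: string = JSON.stringify(run.output) ?? "";
    const hasRequired =
      config.containsText === undefined || outputStr.includes(config.containsText);
    const hasForbidden =
      config.notContainsText !== undefined && outputStr.includes(config.notContainsText);
    if (!hasRequired || hasForbidden) failedRunIndices.push(run.runIndex);
  }
  return {
    type: "output-contains",
    passed: failedRunIndices.length === 0,
    detail: { ...config, failedRunIndices },
  };
}
```

Note that the match runs against the serialized JSON, so text in keys as well as values counts.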

### `output-schema`

```
Input:  runs[], config: { jsonSchema }
Logic:  validate each run.output against JSON schema (use ajv)
        passed = runs.every(r => ajv.validate(jsonSchema, r.output))
Detail: { failedRunIndices: [...], errors: [...] }
```

### `determinism`

```
Input:  runs[] (needs runCount >= 2), config: { similarityThreshold }
Logic:
  Compare all pairs of run outputs
  similarity = average pairwise similarity (cosine, jaccard, or string match ratio)
  passed = similarity >= similarityThreshold
Detail: { threshold, actualSimilarity, pairComparisons: [...] }
```

Simple approach: normalize outputs to strings, compare with Levenshtein distance / longest common subsequence ratio. For structured outputs, deep-equal check with tolerance.
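
One way to realize the simple approach is a normalized Levenshtein ratio averaged over all pairs. This is a sketch, not the prescribed metric: the function names are hypothetical, outputs are normalized with `JSON.stringify`, and a single run is treated as trivially deterministic.

```typescript
// Classic two-row dynamic-programming edit distance.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// 1.0 for identical strings, approaching 0.0 as they diverge.
function similarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

function determinism(outputs: unknown[], similarityThreshold: number) {
  const strings = outputs.map((o): string => JSON.stringify(o) ?? "");
  const pairComparisons: number[] = [];
  for (let i = 0; i < strings.length; i++) {
    for (let j = i + 1; j < strings.length; j++) {
      pairComparisons.push(similarity(strings[i], strings[j]));
    }
  }
  // With fewer than two runs there is nothing to compare; this sketch treats that as a pass.
  const actualSimilarity =
    pairComparisons.length === 0
      ? 1
      : pairComparisons.reduce((sum, x) => sum + x, 0) / pairComparisons.length;
  return {
    type: "determinism",
    passed: actualSimilarity >= similarityThreshold,
    detail: { threshold: similarityThreshold, actualSimilarity, pairComparisons },
  };
}
```

Character-level edit distance is sensitive to key order in serialized objects, so a real implementation may want to canonicalize structured outputs first.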

### `llm-judge`

```
Input:  runs[], config: { judgePrompt, threshold, judgeModel, judgeProvider }
Logic:
  For each run:
    prompt = judgePrompt + "\n\nOutput to evaluate:\n" + JSON.stringify(run.output)
    response = callLLM(judgeModel, judgeProvider, prompt)
    score = parseScore(response)  // extract number from "SCORE: 8.5"
  avgScore = average(scores)
  passed = avgScore >= threshold
Detail: { threshold, avgScore, perRunScores: [...] }
```
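
The score-extraction and averaging steps can be sketched without calling a model. The regular expression and the handling of unparseable responses (skipped when averaging, and a failure when no response parses) are assumptions; `judgeVerdict` is a hypothetical name that takes the raw judge responses as input.

```typescript
// Extracts the number from a response such as "SCORE: 8.5"; returns null when absent.
function parseScore(response: string): number | null {
  const match = /SCORE:\s*(-?\d+(?:\.\d+)?)/i.exec(response);
  return match ? Number(match[1]) : null;
}

function judgeVerdict(responses: string[], threshold: number) {
  const perRunScores = responses.map(parseScore);
  const valid = perRunScores.filter((s): s is number => s !== null);
  const avgScore =
    valid.length > 0 ? valid.reduce((sum, s) => sum + s, 0) / valid.length : null;
  return {
    type: "llm-judge",
    passed: avgScore !== null && avgScore >= threshold,
    detail: { threshold, avgScore, perRunScores },
  };
}
```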

---

## Background Jobs

### Trigger Timeout Sweep

Run every 5 minutes:

```sql
UPDATE triggers
SET status = 'timed_out'
WHERE status = 'sent'
  AND sent_at < NOW() - INTERVAL '10 minutes';
```

Optionally alert on timeouts (the SDK may have crashed, or the process may have exited before completing).

### Stale Session Cleanup

Run every hour:

```sql
UPDATE observability_sessions
SET ended_at = last_heartbeat
WHERE ended_at IS NULL
  AND last_heartbeat < NOW() - INTERVAL '5 minutes';
```

---

## Alerting

When evaluations fail, fire alerts:

1. **Dashboard notification**: mark the trigger as having failures and show a badge in the UI
2. **Webhook** (future): POST to `alerting.webhookUrl` with the trigger ID and failed evaluations
3. **Email** (future): send to the `alerting.emailTo` addresses

---

## Dashboard Pages

### Sampling Configs Page (`/observability/sampling`)

- Table: name, enabled toggle, filter summary, run count, sample rate, evaluation count
- Create/edit modal with form fields for all config options
- Evaluation builder: add rows with type dropdown + type-specific fields

### Trigger History Page (`/observability/triggers`)

- Table: ID, config name, status badge, steps count, available/unavailable, pass/fail counts, timestamps
- Filter by: config, status, date range
- Click row → trigger detail page

### Trigger Detail Page (`/observability/triggers/:id`)

- Header: status, config name, run count, timestamps
- Per-step cards:
  - Tool/AI icon + name
  - Available badge (green / red with reason)
  - Run results table: runIndex, output preview, durationMs, tokens, error
  - Evaluation results: type, passed badge, detail expandable
- Collapsible JSON viewers for input/output

---

## Implementation Order

1. **Schema**: Create `sampling_configs` + `triggers` tables
2. **Sampling configs CRUD**: endpoints + validation
3. **Modify `POST /api/observability/events`**: add sampling config matching + trigger creation + response body
4. **`POST /api/observability/triggers/:id/results`**: result storage + evaluation engine
5. **Evaluation engine**: implement all 6 types
6. **Query endpoints**: `GET /api/observability/triggers` list + detail
7. **Timeout sweep**: cron job
8. **Dashboard**: sampling configs page, trigger history, trigger detail
9. **Alerting**: dashboard notifications, then webhook/email