agent-regression-lab 0.2.0 → 0.3.0
- package/README.md +53 -7
- package/dist/agent/factory.js +20 -6
- package/dist/agent/httpAdapter.js +5 -4
- package/dist/config.js +186 -3
- package/dist/evaluators.js +56 -1
- package/dist/index.js +143 -11
- package/dist/lib/id.js +3 -0
- package/dist/runOutput.js +46 -0
- package/dist/runner.js +31 -9
- package/dist/scenarios.js +90 -2
- package/dist/scoring.js +2 -2
- package/dist/storage.js +117 -7
- package/dist/tools.js +38 -0
- package/dist/trace.js +4 -2
- package/dist/ui/App.js +28 -2
- package/dist/ui-assets/client.js +82 -0
- package/docs/agents.md +143 -8
- package/docs/golden-suites.md +74 -0
- package/docs/integrations-and-live-services.md +58 -0
- package/docs/memory-and-stateful-agents.md +51 -0
- package/docs/release-checklist.md +30 -0
- package/docs/runtime-profiles.md +67 -0
- package/docs/scenarios.md +303 -56
- package/docs/troubleshooting.md +138 -0
- package/docs/variant-sets.md +63 -0
- package/package.json +2 -2
package/docs/scenarios.md
CHANGED
@@ -2,28 +2,38 @@
 
 Scenarios are YAML files under `scenarios/`. They are the core authoring interface for the product.
 
-
+agentlab supports two scenario types:
 
-
+- `task` — a single-instruction job for a tool-using agent (default, no `type` field needed)
+- `conversation` — a multi-turn dialog with an HTTP agent
 
-
+---
+
+## Task Scenarios
+
+Task scenarios are the default format. They describe a single job for an agent that uses tools to complete it.
+
+### Required Shape
+
+Each task scenario should define:
 
 - `id`
 - `name`
 - `suite`
 - `task`
 - `tools`
-- `runtime`
 - `evaluators`
 
-Common optional fields
+Common optional fields:
 
 - `description`
 - `difficulty`
 - `tags`
+- `runtime_profile`
+- `runtime`
 - task `context`
 
-
+### Example
 
 ```yaml
 id: support.refund-correct-order
@@ -65,32 +75,52 @@ evaluators:
   - ord_1024
 ```
 
-
+### Runtime Profiles
 
-
+Task scenarios can reference a named `runtime_profile` from `agentlab.config.yaml`.
 
-
-
-
-- `ops`
+```yaml
+runtime_profile: timeout-orders-tool
+```
 
-
+Runtime profiles let you apply reusable degraded-tool conditions without duplicating them across scenarios. Current shipped behavior:
 
-
-
-```
+- task scenarios: tool fault injection is active
+- conversation scenarios: config reference is allowed for shared authoring, but ARL does not yet inject faults into the HTTP agent's internal tools
 
-
+### Evaluators
 
-
-
+Use deterministic evaluators only.
+
+| Type | Description |
+|------|-------------|
+| `tool_call_assertion` | Assert a specific tool was called with specific input |
+| `forbidden_tool` | Fail if a tool was called that should not have been |
+| `final_answer_contains` | Check that the final output contains required substrings |
+| `exact_final_answer` | Require an exact match on the final output |
+| `step_count_max` | Fail if the agent used more steps than allowed |
+| `tool_call_count_max` | Fail if the total number of tool calls exceeds a budget |
+| `tool_repeat_max` | Fail if one tool is overused |
+| `cost_max` | Fail if the run cost exceeds a configured USD budget |
+
+Evaluator modes:
+
+- `hard_gate` — failure immediately fails the run, regardless of other evaluators
+- `weighted` — contributes to the weighted score (0–100)
+
+### Runtime Limits
+
+```yaml
+runtime:
+  max_steps: 8
+  timeout_seconds: 60
 ```
 
-
+Both are optional. `max_steps` defaults to 8. `timeout_seconds` is uncapped if not set.
 
-
+### Tools
 
-Each scenario declares its allowed tools:
+Each task scenario declares its allowed tools:
 
 ```yaml
 tools:
@@ -98,72 +128,289 @@ tools:
     - crm.search_customer
     - orders.list
     - orders.refund
+  forbidden:
+    - orders.delete
 ```
 
-Keep the
+Keep the allowlist as narrow as possible. Broad allowlists weaken the benchmark.
 
-
+### Budget And Governance Checks
 
-
-- repo-local custom tools registered in `agentlab.config.yaml`
+Operational regressions are often just as important as correctness regressions. Use budget evaluators to encode "technically worked, but unacceptable in production":
 
-
+```yaml
+evaluators:
+  - id: total-tool-budget
+    type: tool_call_count_max
+    mode: hard_gate
+    config:
+      max: 2
+  - id: no-repeat-order-list
+    type: tool_repeat_max
+    mode: hard_gate
+    config:
+      tool: orders.list
+      max: 1
+```
 
-
-- coding
-- research
-- ops
+Use `cost_max` only where the run records cost metadata.
 
-
+---
+
+## Conversation Scenarios
+
+Conversation scenarios test HTTP agents through multi-turn dialogs. They require `type: conversation` and work exclusively with `provider: http` agents. The agent is responsible for maintaining its own conversation history.
+
+### Required Shape
+
+```yaml
+type: conversation
+id: internal-teams.memory-followup-recall
+name: Follow-Up Recall Within Conversation
+suite: internal-teams
+steps:
+  - role: user
+    message: "I'm traveling next Tuesday and I prefer aisle seats. Please remember that."
+  - role: user
+    message: "What seat preference did I mention earlier?"
+```
+
+Each step must have:
+
+- `role: user`
+- `message` — the message sent to the agent this turn
+
+### Per-Step Evaluators
+
+Evaluators can be attached to individual steps. They run immediately after the agent replies to that step.
+
+```yaml
+steps:
+  - role: user
+    message: "Where's my order #ORD-001?"
+    evaluators:
+      - type: response_contains
+        mode: hard_gate
+        config:
+          keywords: [shipped, tracking]
+      - type: response_latency_max
+        mode: hard_gate
+        config:
+          ms: 3000
+  - role: user
+    message: "What's the tracking number?"
+    evaluators:
+      - type: response_not_contains
+        mode: weighted
+        weight: 1
+        config:
+          keywords: ["don't know", error]
+```
+
+If a `hard_gate` per-step evaluator fails, the run stops immediately and remaining steps are skipped.
+
+### Per-Step Evaluator Types
+
+| Type | Config | Behavior |
+|------|--------|----------|
+| `response_contains` | `keywords: string[]` | Passes if ALL keywords appear in the reply (case-insensitive) |
+| `response_not_contains` | `keywords: string[]` | Passes if NONE of the keywords appear in the reply (case-insensitive) |
+| `response_matches_regex` | `pattern: string` | Passes if the reply matches the regex pattern (case-insensitive) |
+| `response_latency_max` | `ms: number` | Passes if the HTTP response arrived within the time limit |
+
+### Scenario Quality Rules
+
+- prefer `hard_gate` for business-critical assertions
+- use `weighted` checks for quality gradients, not for the single condition that makes the scenario trustworthy
+- conversation scenarios must use `config.keywords` for `response_contains` and `response_not_contains`
+- stale `config.text` authoring is rejected
+- use conversation scenarios when the agent owns memory, tool execution, or conversation history internally
+- keep golden suites focused on repeatable workflows, historical regressions, and ugly edge cases rather than one-off demos
+
+### End-of-Run Evaluators
+
+End-of-run evaluators run after all steps complete. They apply to the final reply.
+
+```yaml
+evaluators:
+  - type: step_count_max
+    mode: hard_gate
+    config:
+      max: 10
+  - type: final_answer_contains
+    mode: weighted
+    weight: 1
+    config:
+      keywords: [resolved, confirmed]
+```
 
-
+End-of-run evaluator types:
 
-
+| Type | Config | Behavior |
+|------|--------|----------|
+| `step_count_max` | `max: number` | Passes if the number of completed turns is within the limit |
+| `final_answer_contains` | `keywords: string[]` | Passes if ALL keywords appear in the final reply |
+| `exact_final_answer` | `expected: string` | Passes if the final reply exactly matches the expected string |
 
-
-
+### Conversation State
+
+agentlab auto-generates a UUID `conversation_id` for each run. It is sent in every step request. The agent uses it to look up and maintain its own conversation history.
+
+The `state` block is optional:
+
+```yaml
+state:
+  conversation_id: auto
+```
+
+`auto` is the only supported value. The UUID is always generated regardless of whether the `state` block is present.
+
+### Restrictions
+
+Conversation scenarios must not define a `tools:` field. HTTP agents manage their own tools internally. If `tools:` is present, validation will fail with a clear error.
+
+Conversation scenarios may define `runtime_profile`, but today that is for shared scenario organization and future stateful hooks. ARL does not inject tool faults into HTTP agents.
+
+---
+
+## Suite Definitions
+
+Scenario `suite` still groups related files, but operational launch workflows should use config-level `suite_definitions`.
 
 Example:
 
 ```yaml
-
-
-
+suite_definitions:
+  - name: pre_merge
+    include:
+      tags:
+        - smoke
+        - regression
 ```
 
-
+Run one with:
 
-
+```bash
+agentlab run --suite-def pre_merge --agent mock-default
+agentlab run --suite-def pre_merge --variant-set refund-agent-model-comparison
+```
 
-Use
+Use suite definitions for stable workflow units like:
+
+- `smoke`
+- `pre_merge`
+- `release`
+- `incident_regressions`
 
-
+### Full Example
+
+```yaml
+type: conversation
+id: internal-teams.memory-followup-recall
+name: Follow-Up Recall Within Conversation
+suite: internal-teams
+description: Memoryful agent should recall a user-provided fact later in the same conversation.
+difficulty: medium
+tags:
+  - internal-teams
+  - conversation
+
+steps:
+  - role: user
+    message: "I'm traveling next Tuesday and I prefer aisle seats. Please remember that."
+    evaluators:
+      - type: response_contains
+        mode: weighted
+        config:
+          keywords:
+            - aisle
+
+  - role: user
+    message: "What seat preference did I mention earlier?"
+    evaluators:
+      - type: response_contains
+        mode: hard_gate
+        config:
+          keywords:
+            - aisle
 
-
-
-
-
-
+evaluators:
+  - type: step_count_max
+    mode: hard_gate
+    config:
+      max: 2
+```
 
-
+Run it with:
 
-
-
-
+```bash
+agentlab run internal-teams.memory-followup-recall --agent my-production-agent
+```
 
-
+Where `my-production-agent` is a named `http` agent in `agentlab.config.yaml`. See [agents.md](agents.md) for HTTP agent config.
 
-
+### CLI Output
+
+Conversation runs print a different output format from task runs:
+
+```
+run internal-teams.memory-followup-recall — PASS
+agent: my-production-agent (http://localhost:3000/api/chat)
+turns completed: 2/2
+step 1: pass (response_contains ✓)
+step 2: pass (response_contains ✓)
+run id: run_20260407_001234
+```
+
+If a hard-gate fails mid-run:
+
+```
+run internal-teams.memory-followup-recall — FAIL
+agent: my-production-agent (http://localhost:3000/api/chat)
+turns completed: 1/2
+step 1: FAIL (response_contains ✗)
+run stopped (evaluator_failed)
+run id: run_20260407_001235
+```
+
+---
+
+## Suites
+
+Both task and conversation scenarios can belong to a suite.
+
+```yaml
+suite: support
+```
+
+Run an entire suite:
+
+```bash
+agentlab run --suite support --agent mock-default
+```
+
+`run --suite` skips conversation scenarios when using non-HTTP agents (conversation scenarios require `provider: http`). Task scenarios and conversation scenarios can coexist in the same suite directory.
+
+`run --suite` prints a suite batch id at the end. That id is used for suite comparison:
+
+```bash
+agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
+```
+
+---
+
+## Authoring Conventions
 
 - `id` format: `<suite>.<short-name>`
 - keep scenario jobs narrow and concrete
-- keep fixture-backed context in `task.context`
+- keep fixture-backed context in `task.context` (task scenarios)
 - prefer deterministic fixture references over open-ended prompts
 - include `difficulty`, `description`, and `tags` for every launch scenario
+- for conversation scenarios, keep step count low (2–5) and evaluators specific
 
 ## Current Examples
 
-
+Task scenario references in this repo:
 
 - support: `scenarios/support/refund-correct-order.yaml`
 - support with config tool: `scenarios/support/refund-via-config-tool.yaml`
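The keyword semantics in the per-step and end-of-run evaluator tables above (ALL keywords for `response_contains`, NONE for `response_not_contains`, case-insensitive matching throughout) can be sketched as a behavioral reference. This is an illustrative reimplementation for readers, not the package's actual code:

```python
import re

def response_contains(reply: str, keywords: list[str]) -> bool:
    """Pass only if ALL keywords appear in the reply (case-insensitive)."""
    lowered = reply.lower()
    return all(k.lower() in lowered for k in keywords)

def response_not_contains(reply: str, keywords: list[str]) -> bool:
    """Pass only if NONE of the keywords appear in the reply (case-insensitive)."""
    lowered = reply.lower()
    return not any(k.lower() in lowered for k in keywords)

def response_matches_regex(reply: str, pattern: str) -> bool:
    """Pass if the reply matches the pattern (case-insensitive)."""
    return re.search(pattern, reply, re.IGNORECASE) is not None

reply = "Your order has SHIPPED; tracking to follow."
print(response_contains(reply, ["shipped", "tracking"]))      # True
print(response_not_contains(reply, ["don't know", "error"]))  # True
```

Note how `response_contains` with two keywords fails if either one is missing, which is why the docs recommend keeping keyword lists specific.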
package/docs/troubleshooting.md
CHANGED
@@ -25,6 +25,8 @@ Or skip linking and use:
 npm run start -- --help
 ```
 
+---
+
 ## `OPENAI_API_KEY is required`
 
 You used an OpenAI-backed agent without exporting the API key.
@@ -36,6 +38,8 @@ export OPENAI_API_KEY=...
 agentlab run support.refund-correct-order --agent openai-cheap
 ```
 
+---
+
 ## `No scenarios found for suite ...`
 
 The suite id must match a suite under `scenarios/`.
@@ -52,6 +56,9 @@ Current built-in suites in this repo include:
 - `coding`
 - `research`
 - `ops`
+- `internal-teams`
+
+---
 
 ## `Run '<id>' not found`
 
@@ -70,6 +77,8 @@ agentlab show <run-id>
 agentlab compare <baseline-run-id> <candidate-run-id>
 ```
 
+---
+
 ## `Missing baseline or candidate suite batch id`
 
 `compare --suite` does not use run ids. It uses suite batch ids printed by `run --suite`.
@@ -82,6 +91,8 @@ agentlab run --suite support --agent mock-default
 agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
 ```
 
+---
+
 ## Cross-suite suite comparison errors
 
 Suite batch comparison is strict. Compare batches from the same suite only.
@@ -99,6 +110,8 @@ This is not valid:
 
 If you are unsure which batch came from which suite, rerun the suite and record the printed batch ids.
 
+---
+
 ## `agentlab ui` fails to load assets
 
 Installed packages should already include the built UI assets.
@@ -116,6 +129,8 @@ If the problem persists, verify that these files exist:
 - `dist/ui-assets/client.js`
 - `dist/ui-assets/client.css`
 
+---
+
 ## Config tool or agent not found
 
 Typical reasons:
@@ -131,6 +146,127 @@ Working references in this repo:
 - custom tool: `user_tools/findDuplicateCharge.ts`
 - external agents: `custom_agents/node_agent.mjs`, `custom_agents/python_agent.py`
 
+---
+
+## HTTP agent errors
+
+### `HTTP agents require a configured url`
+
+You ran a conversation scenario with `--provider http` but no HTTP agent config was found.
+
+Fix: define a named http agent in `agentlab.config.yaml`:
+
+```yaml
+agents:
+  - name: my-agent
+    provider: http
+    url: http://localhost:3000/api/chat
+```
+
+Then run with:
+
+```bash
+agentlab run internal-teams.memory-followup-recall --agent my-agent
+```
+
+### `termination_reason: http_connection_failed`
+
+agentlab could not connect to your agent's URL. The most common cause is that the agent service is not running.
+
+Check:
+
+- is the service running on the configured port?
+- is the URL in `agentlab.config.yaml` correct?
+- is there a firewall or proxy blocking the connection?
+
+### `termination_reason: http_error`
+
+Your agent returned an HTTP 4xx or 5xx response.
+
+Check:
+
+- is the route path correct?
+- does your agent expect a different request shape? Use `request_template` if so.
+- are there auth errors? Check `headers` config.
+
+### `termination_reason: timeout_exceeded`
+
+Your agent did not respond within `timeout_ms` (default 30 seconds).
+
+Fix options:
+
+- increase `timeout_ms` in the agent config
+- investigate why the agent is slow for the given input
+
+### `termination_reason: invalid_response_format`
+
+Your agent either returned non-JSON or did not include the expected field.
+
+Defaults: agentlab reads the `message` field from the JSON response. Override with `response_field` if your agent uses a different name:
+
+```yaml
+agents:
+  - name: my-agent
+    provider: http
+    url: http://localhost:3000/api/chat
+    response_field: reply
+```
+
+---
+
+## `database is locked`
+
+You hit SQLite write contention on the local artifacts DB.
+
+Most common cause:
+
+- multiple `agentlab` runs writing to the same `artifacts/agentlab.db` at the same time
+
+Fix:
+
+- wait for the current run to finish
+- rerun sequentially instead of in parallel
+- keep live HTTP fixture verification serialized when using the same local project directory
+
+The product now uses a busy timeout, but sequential execution is still the safest path for local live verification.
+
+---
+
+## Conversation scenario errors
+
+### `Scenario '...' is a conversation scenario and requires provider: http`
+
+You tried to run a `type: conversation` scenario with a non-HTTP agent (`mock`, `openai`, or `external_process`).
+
+Conversation scenarios only work with `provider: http`. Configure an HTTP agent in `agentlab.config.yaml` and use `--agent <name>`.
+
+### `Conversation scenario '...' must not define 'tools'`
+
+Your conversation scenario YAML has a `tools:` field. HTTP agents manage their own tools internally — remove the `tools:` block.
+
+### `Conversation scenario '...' must define at least one step`
+
+The `steps:` list is empty or missing. Add at least one step:
+
+```yaml
+steps:
+  - role: user
+    message: "Hello"
+```
+
+### Per-step evaluator type rejected
+
+Only these evaluator types are valid inside `steps[].evaluators`:
+
+- `response_contains`
+- `response_not_contains`
+- `response_matches_regex`
+- `response_latency_max`
+
+End-of-run types (`step_count_max`, `final_answer_contains`, `exact_final_answer`) belong at the top-level `evaluators:` block, not inside individual steps.
+
+---
+
 ## Global install behaves differently from repo mode
 
 That usually means the current working directory is wrong.
@@ -143,6 +279,8 @@ The CLI operates on the current working directory and expects:
 
 Run it from the project root you want to evaluate.
 
+---
+
 ## Release Verification
 
 Before publishing or cutting a release, run:
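The HTTP agent contract described in the troubleshooting entries above (each step request carries a `conversation_id` and the user `message`; the reply is JSON whose text is read from the `message` field by default) can be sketched as a minimal in-memory handler. The field names follow this doc; the reply text itself is purely illustrative:

```python
from collections import defaultdict

# Conversation history keyed by the conversation_id agentlab sends with every step.
HISTORY: dict[str, list[str]] = defaultdict(list)

def handle_step(request: dict) -> dict:
    """Handle one conversation step and return the JSON body agentlab expects."""
    conv_id = request["conversation_id"]
    HISTORY[conv_id].append(request["message"])
    # Illustrative reply: acknowledge the message and report the turn count,
    # demonstrating that history is looked up per conversation_id.
    turn = len(HISTORY[conv_id])
    return {"message": f"turn {turn}: received {request['message']!r}"}
```

A real agent would wrap this in an HTTP route (the `url` you configure, e.g. `/api/chat`) and put its actual model call where the illustrative reply is built; if it names the reply field something other than `message`, set `response_field` as shown above.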
package/docs/variant-sets.md
ADDED
|
@@ -0,0 +1,63 @@
+# Variant Sets
+
+Variant sets are named comparison groups defined in `agentlab.config.yaml`.
+
+They are the Tier 1 mechanism for prompt, model, tool-schema, and config experiments without turning every comparison into manual CLI bookkeeping.
+
+## Why They Exist
+
+Named agents remain the executable unit.
+
+Variant sets sit on top of named agents so you can run the same scenario or suite against multiple variants and compare the results intentionally.
+
+## Config Shape
+
+```yaml
+variant_sets:
+  - name: refund-agent-model-comparison
+    variants:
+      - agent: mock-default
+        label: baseline
+        prompt_version: prompt-v3
+        model_version: mock-model
+        tool_schema_version: support-tools-v1
+        config_label: baseline-refund-flow
+      - agent: mock-compact
+        label: concise
+        prompt_version: prompt-v4
+        model_version: mock-model
+        tool_schema_version: support-tools-v1
+        config_label: concise-refund-flow
+```
+
+## CLI Usage
+
+Run one scenario against all variants:
+
+```bash
+agentlab run support.refund-correct-order --variant-set refund-agent-model-comparison
+```
+
+Run one suite definition against all variants:
+
+```bash
+agentlab run --suite-def pre_merge --variant-set refund-agent-model-comparison
+```
+
+## Stored Identity
+
+Each resulting run stores and surfaces:
+
+- `variant_set_name`
+- `variant_label`
+- `prompt_version`
+- `model_version`
+- `tool_schema_version`
+- `config_label`
+- `config_hash`
+
+Those fields appear in CLI run summaries, `agentlab show`, run history, comparisons, and the UI.
+
+## Design Rule
+
+Use variant sets for intentional experiments. Keep named agents stable, and treat the variant set as the comparison layer.