agent-regression-lab 0.1.1 → 0.3.0

@@ -0,0 +1,51 @@
# Memory And Stateful Agents

Memoryful agents are a distinct category in ARL.

Use `type: conversation` scenarios when the agent owns:

- conversation history
- internal memory
- internal tool execution
- session or conversation identifiers

## What ARL Owns

For conversation scenarios, ARL owns:

- the ordered user steps
- the generated `conversation_id`
- per-step and end-of-run evaluation
- trace capture
- run storage and comparison

## What The Agent Owns

For conversation scenarios, the agent owns:

- how it stores conversation state
- how it interprets `conversation_id`
- what internal tools it calls
- how it handles memory and recall across turns

## How To Test Memoryful Agents

Good memory-focused scenarios should cover:

- follow-up recall within one conversation
- refusal to leak identity or state across sessions
- correct handling of repeated turns
- graceful behavior when earlier turns are ambiguous or incomplete

## Recommended Stateful Regression Cases

- follow-up recall after two or more turns
- cross-session contamination
- stale memory overriding fresh input
- memory surviving the right turns but not the wrong sessions

## Design Rule

Use task scenarios when the runner should stay authoritative for tools and turn control.

Use conversation scenarios when the agent itself is being tested for memory, session behavior, or internal orchestration.
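
The cross-session contamination case above can be written as an ordinary conversation scenario. A hypothetical sketch (the id, name, and message are invented for illustration; the evaluator shapes follow the conventions in `docs/scenarios.md`):

```yaml
# Hypothetical scenario: a brand-new conversation gets a fresh
# conversation_id, so a well-behaved agent should have nothing to recall.
type: conversation
id: internal-teams.no-cross-session-recall
name: No Recall Across Sessions
suite: internal-teams
steps:
  - role: user
    message: "What seat preference did I mention earlier?"
    evaluators:
      - type: response_not_contains
        mode: hard_gate
        config:
          keywords: [aisle]
```

Run it after a scenario that stored the fact in a different conversation; a memory leak across sessions shows up as the hard gate failing.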
@@ -0,0 +1,94 @@
# Release Checklist

Use this before publishing a new npm version or telling users to upgrade.

## Verification

Run the full release gate:

```bash
npm run check
npm test
npm run build
npm run smoke:cli
npm pack --dry-run
```

## Manual CLI Flow

Verify the canonical workflow:

```bash
agentlab list scenarios
agentlab run support.refund-correct-order --agent mock-default
agentlab show <run-id>
agentlab run support.refund-correct-order --agent mock-default
agentlab compare <baseline-run-id> <candidate-run-id>
agentlab run --suite support --agent mock-default
agentlab run --suite support --agent mock-default
agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
agentlab ui
```

## Extension Smoke

Verify at least one extension path:

- run `support.refund-via-config-tool` with `custom-node-agent`, or
- verify a repo-local custom tool still loads from `agentlab.config.yaml`

## HTTP Provider Smoke

Verify the HTTP provider path for conversation scenarios:

1. Start a minimal echo server (or any running HTTP agent service)
2. Add a named `http` agent to `agentlab.config.yaml`:

   ```yaml
   agents:
     - name: my-agent
       provider: http
       url: http://localhost:3000/api/chat
   ```

3. Run a conversation scenario:

   ```bash
   agentlab run internal-teams.memory-followup-recall --agent my-agent
   ```

4. Confirm the run produces a pass/fail result and the CLI output shows turn-by-turn step status

If no live HTTP service is available, confirm the HTTP error paths work correctly:

```bash
agentlab run internal-teams.memory-followup-recall --agent my-agent
# (with no service running)
# Expected: status: error, terminationReason: http_connection_failed
```

## Docs Verification

Confirm these files match current behavior:

- `README.md`
- `docs/scenarios.md`
- `docs/tools.md`
- `docs/agents.md`
- `docs/troubleshooting.md`

Requirements:

- every command works as written
- every referenced path exists
- limitations are stated honestly
- `compare --suite` is documented using suite batch ids, not run ids

## Publish Hygiene

Before `npm publish`:

- confirm the package version is correct
- confirm the git tree contains the intended release changes
- confirm packaged UI assets are included in the tarball
- confirm the npm metadata still points at the correct repo, homepage, and issues URL
@@ -0,0 +1,67 @@
# Runtime Profiles

Runtime profiles are reusable test-environment overlays defined in `agentlab.config.yaml`.

They let you keep degraded-tool conditions and state-related authoring metadata out of individual scenarios.

## Why They Exist

Use a runtime profile when multiple scenarios should run under the same bad condition or seeded state instead of repeating that setup inline.

Typical uses:

- force one tool to time out
- return malformed or partial tool output
- keep a named profile for memory-related scenario setup

## Config Shape

```yaml
runtime_profiles:
  - name: timeout-orders-tool
    tool_faults:
      - tool: orders.list
        mode: timeout
        timeout_ms: 1500

  - name: malformed-docs-read
    tool_faults:
      - tool: docs.read
        mode: malformed_output
```

Supported tool fault modes:

- `timeout`
- `error`
- `malformed_output`
- `partial_output`
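
The two remaining modes follow the same shape. A sketch only: the profile names and target tools below are invented, and it is an assumption that `error` and `partial_output` need no extra keys (unlike `timeout`, which takes `timeout_ms`):

```yaml
runtime_profiles:
  # Hypothetical profiles illustrating the remaining fault modes.
  - name: error-crm-search
    tool_faults:
      - tool: crm.search_customer
        mode: error

  - name: partial-orders-list
    tool_faults:
      - tool: orders.list
        mode: partial_output
```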

## Scenario Usage

Reference the profile from the scenario:

```yaml
runtime_profile: timeout-orders-tool
```

Example command:

```bash
agentlab run internal-teams.tool-timeout-profile --agent mock-default
```

## Current Execution Scope

Today, runtime-profile fault injection is active only for task scenarios where ARL owns the tool loop.

That means:

- task scenarios: tool faults are injected deterministically by the runner
- conversation scenarios: the reference is allowed, but ARL does not intercept the HTTP agent's internal tools

The `state` block is available in config for reusable authoring metadata, but automatic seeded-state execution is not yet applied by the runner.

## Design Rule

Use runtime profiles for reusable conditions, not one-off scenario-specific quirks.
@@ -0,0 +1,419 @@
# Scenarios

Scenarios are YAML files under `scenarios/`. They are the core authoring interface for the product.

agentlab supports two scenario types:

- `task` — a single-instruction job for a tool-using agent (default, no `type` field needed)
- `conversation` — a multi-turn dialog with an HTTP agent

---

## Task Scenarios

Task scenarios are the default format. They describe a single job for an agent that uses tools to complete it.

### Required Shape

Each task scenario should define:

- `id`
- `name`
- `suite`
- `task`
- `tools`
- `evaluators`

Common optional fields:

- `description`
- `difficulty`
- `tags`
- `runtime_profile`
- `runtime`
- task `context`

### Example

```yaml
id: support.refund-correct-order
name: Refund The Correct Order
suite: support
difficulty: easy
description: Refund only the duplicated charge.
tags:
  - refund
  - support
task:
  instructions: |
    The customer says they were charged twice.
    Find the duplicated charge and refund only that order.
  context:
    customer_email: alice@example.com
tools:
  allowed:
    - crm.search_customer
    - orders.list
    - orders.refund
runtime:
  max_steps: 8
  timeout_seconds: 60
evaluators:
  - id: refund-created
    type: tool_call_assertion
    mode: hard_gate
    config:
      tool: orders.refund
      match:
        order_id: ord_1024
  - id: mentions-order
    type: final_answer_contains
    mode: weighted
    weight: 1
    config:
      required_substrings:
        - ord_1024
```

### Runtime Profiles

Task scenarios can reference a named `runtime_profile` from `agentlab.config.yaml`.

```yaml
runtime_profile: timeout-orders-tool
```

Runtime profiles let you apply reusable degraded-tool conditions without duplicating them across scenarios. Current shipped behavior:

- task scenarios: tool fault injection is active
- conversation scenarios: config reference is allowed for shared authoring, but ARL does not yet inject faults into the HTTP agent's internal tools

### Evaluators

Use deterministic evaluators only.

| Type | Description |
|------|-------------|
| `tool_call_assertion` | Assert a specific tool was called with specific input |
| `forbidden_tool` | Fail if a tool was called that should not have been |
| `final_answer_contains` | Check that the final output contains required substrings |
| `exact_final_answer` | Require an exact match on the final output |
| `step_count_max` | Fail if the agent used more steps than allowed |
| `tool_call_count_max` | Fail if the total number of tool calls exceeds a budget |
| `tool_repeat_max` | Fail if one tool is overused |
| `cost_max` | Fail if the run cost exceeds a configured USD budget |

Evaluator modes:

- `hard_gate` — failure immediately fails the run, regardless of other evaluators
- `weighted` — contributes to the weighted score (0–100)
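
As an illustration of how the two modes interact, here is a sketch, not ARL's actual scoring code; in particular, the fail score of 0, the default weight of 1, and the rounding are assumptions:

```javascript
// Illustrative combination of evaluator modes: any failed hard_gate
// fails the run outright; weighted evaluators produce a 0-100 score.
function scoreRun(results) {
  // results: [{ mode: 'hard_gate' | 'weighted', weight?, passed }]
  if (results.some((r) => r.mode === 'hard_gate' && !r.passed)) {
    return { status: 'fail', score: 0 }; // assumed: a gate failure zeroes the score
  }
  const weighted = results.filter((r) => r.mode === 'weighted');
  const total = weighted.reduce((sum, r) => sum + (r.weight ?? 1), 0);
  if (total === 0) return { status: 'pass', score: 100 };
  const earned = weighted.reduce(
    (sum, r) => sum + (r.passed ? (r.weight ?? 1) : 0),
    0
  );
  return { status: 'pass', score: Math.round((100 * earned) / total) };
}
```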

### Runtime Limits

```yaml
runtime:
  max_steps: 8
  timeout_seconds: 60
```

Both are optional. `max_steps` defaults to 8. `timeout_seconds` is uncapped if not set.

### Tools

Each task scenario declares its allowed tools:

```yaml
tools:
  allowed:
    - crm.search_customer
    - orders.list
    - orders.refund
  forbidden:
    - orders.delete
```

Keep the allowlist as narrow as possible. Broad allowlists weaken the benchmark.

### Budget And Governance Checks

Operational regressions are often just as important as correctness regressions. Use budget evaluators to encode "technically worked, but unacceptable in production":

```yaml
evaluators:
  - id: total-tool-budget
    type: tool_call_count_max
    mode: hard_gate
    config:
      max: 2
  - id: no-repeat-order-list
    type: tool_repeat_max
    mode: hard_gate
    config:
      tool: orders.list
      max: 1
```

Use `cost_max` only where the run records cost metadata.
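
Where cost metadata is recorded, a cost gate can be authored like the other budget evaluators. A sketch only: the exact `config` key for `cost_max` is an assumption based on the `max` convention above, so check the evaluator reference before relying on it:

```yaml
evaluators:
  - id: run-cost-budget
    type: cost_max
    mode: hard_gate
    config:
      max: 0.25  # USD; assumed key name, mirrors the other *_max evaluators
```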

---

## Conversation Scenarios

Conversation scenarios test HTTP agents through multi-turn dialogs. They require `type: conversation` and work exclusively with `provider: http` agents. The agent is responsible for maintaining its own conversation history.

### Required Shape

```yaml
type: conversation
id: internal-teams.memory-followup-recall
name: Follow-Up Recall Within Conversation
suite: internal-teams
steps:
  - role: user
    message: "I'm traveling next Tuesday and I prefer aisle seats. Please remember that."
  - role: user
    message: "What seat preference did I mention earlier?"
```

Each step must have:

- `role: user`
- `message` — the message sent to the agent this turn

### Per-Step Evaluators

Evaluators can be attached to individual steps. They run immediately after the agent replies to that step.

```yaml
steps:
  - role: user
    message: "Where's my order #ORD-001?"
    evaluators:
      - type: response_contains
        mode: hard_gate
        config:
          keywords: [shipped, tracking]
      - type: response_latency_max
        mode: hard_gate
        config:
          ms: 3000
  - role: user
    message: "What's the tracking number?"
    evaluators:
      - type: response_not_contains
        mode: weighted
        weight: 1
        config:
          keywords: ["don't know", error]
```

If a `hard_gate` per-step evaluator fails, the run stops immediately and remaining steps are skipped.

### Per-Step Evaluator Types

| Type | Config | Behavior |
|------|--------|----------|
| `response_contains` | `keywords: string[]` | Passes if ALL keywords appear in the reply (case-insensitive) |
| `response_not_contains` | `keywords: string[]` | Passes if NONE of the keywords appear in the reply (case-insensitive) |
| `response_matches_regex` | `pattern: string` | Passes if the reply matches the regex pattern (case-insensitive) |
| `response_latency_max` | `ms: number` | Passes if the HTTP response arrived within the time limit |
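
The keyword and regex checks are plain case-insensitive tests. A re-implementation sketch of the documented semantics (illustrative, not ARL source):

```javascript
// response_contains: ALL keywords must appear (case-insensitive).
function responseContains(reply, keywords) {
  const haystack = reply.toLowerCase();
  return keywords.every((k) => haystack.includes(k.toLowerCase()));
}

// response_not_contains: NONE of the keywords may appear.
function responseNotContains(reply, keywords) {
  const haystack = reply.toLowerCase();
  return keywords.every((k) => !haystack.includes(k.toLowerCase()));
}

// response_matches_regex: the reply must match the pattern, case-insensitively.
function responseMatchesRegex(reply, pattern) {
  return new RegExp(pattern, 'i').test(reply);
}
```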

### Scenario Quality Rules

- prefer `hard_gate` for business-critical assertions
- use `weighted` checks for quality gradients, not for the single condition that makes the scenario trustworthy
- conversation scenarios must use `config.keywords` for `response_contains` and `response_not_contains`
- stale `config.text` authoring is rejected
- use conversation scenarios when the agent owns memory, tool execution, or conversation history internally
- keep golden suites focused on repeatable workflows, historical regressions, and ugly edge cases rather than one-off demos

### End-of-Run Evaluators

End-of-run evaluators run after all steps complete. They apply to the final reply.

```yaml
evaluators:
  - type: step_count_max
    mode: hard_gate
    config:
      max: 10
  - type: final_answer_contains
    mode: weighted
    weight: 1
    config:
      keywords: [resolved, confirmed]
```

End-of-run evaluator types:

| Type | Config | Behavior |
|------|--------|----------|
| `step_count_max` | `max: number` | Passes if the number of completed turns is within the limit |
| `final_answer_contains` | `keywords: string[]` | Passes if ALL keywords appear in the final reply |
| `exact_final_answer` | `expected: string` | Passes if the final reply exactly matches the expected string |

### Conversation State

agentlab auto-generates a UUID `conversation_id` for each run. It is sent in every step request. The agent uses it to look up and maintain its own conversation history.

The `state` block is optional:

```yaml
state:
  conversation_id: auto
```

`auto` is the only supported value. The UUID is always generated regardless of whether the `state` block is present.
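
On the agent side, the simplest honest implementation of that contract is a history map keyed by `conversation_id`. A hypothetical sketch (field names are assumed, and the model call is stubbed out):

```javascript
// Hypothetical agent-side state: one history per conversation_id.
const histories = new Map();

function handleTurn({ conversation_id: id, message }) {
  if (!histories.has(id)) histories.set(id, []);
  const history = histories.get(id);
  history.push({ role: 'user', content: message });
  // A real agent would call its model with `history` here; this stub
  // just reports which user turn it is seeing for that conversation.
  const userTurns = history.filter((m) => m.role === 'user').length;
  const reply = `turn ${userTurns}: ${message}`;
  history.push({ role: 'assistant', content: reply });
  return { reply };
}
```

Because histories are keyed by the run's UUID, two runs never share state, which is exactly the isolation the cross-session regression cases probe for.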

### Restrictions

Conversation scenarios must not define a `tools:` field. HTTP agents manage their own tools internally. If `tools:` is present, validation will fail with a clear error.

Conversation scenarios may define `runtime_profile`, but today that is for shared scenario organization and future stateful hooks. ARL does not inject tool faults into HTTP agents.

---

## Suite Definitions

Scenario `suite` still groups related files, but operational launch workflows should use config-level `suite_definitions`.

Example:

```yaml
suite_definitions:
  - name: pre_merge
    include:
      tags:
        - smoke
        - regression
```

Run one with:

```bash
agentlab run --suite-def pre_merge --agent mock-default
agentlab run --suite-def pre_merge --variant-set refund-agent-model-comparison
```

Use suite definitions for stable workflow units like:

- `smoke`
- `pre_merge`
- `release`
- `incident_regressions`

### Full Example

```yaml
type: conversation
id: internal-teams.memory-followup-recall
name: Follow-Up Recall Within Conversation
suite: internal-teams
description: Memoryful agent should recall a user-provided fact later in the same conversation.
difficulty: medium
tags:
  - internal-teams
  - conversation

steps:
  - role: user
    message: "I'm traveling next Tuesday and I prefer aisle seats. Please remember that."
    evaluators:
      - type: response_contains
        mode: weighted
        config:
          keywords:
            - aisle

  - role: user
    message: "What seat preference did I mention earlier?"
    evaluators:
      - type: response_contains
        mode: hard_gate
        config:
          keywords:
            - aisle

evaluators:
  - type: step_count_max
    mode: hard_gate
    config:
      max: 2
```

Run it with:

```bash
agentlab run internal-teams.memory-followup-recall --agent my-production-agent
```

Where `my-production-agent` is a named `http` agent in `agentlab.config.yaml`. See [agents.md](agents.md) for HTTP agent config.
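
For reference, the agent entry follows the shape shown in this package's release checklist (the URL here is illustrative; point it at your running service):

```yaml
agents:
  - name: my-production-agent
    provider: http
    url: http://localhost:3000/api/chat
```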

### CLI Output

Conversation runs print a different output format from task runs:

```
run internal-teams.memory-followup-recall — PASS
agent: my-production-agent (http://localhost:3000/api/chat)
turns completed: 2/2
step 1: pass (response_contains ✓)
step 2: pass (response_contains ✓)
run id: run_20260407_001234
```

If a hard gate fails mid-run:

```
run internal-teams.memory-followup-recall — FAIL
agent: my-production-agent (http://localhost:3000/api/chat)
turns completed: 1/2
step 1: FAIL (response_contains ✗)
run stopped (evaluator_failed)
run id: run_20260407_001235
```

---

## Suites

Both task and conversation scenarios can belong to a suite.

```yaml
suite: support
```

Run an entire suite:

```bash
agentlab run --suite support --agent mock-default
```

`run --suite` skips conversation scenarios when using non-HTTP agents (conversation scenarios require `provider: http`). Task scenarios and conversation scenarios can coexist in the same suite directory.

`run --suite` prints a suite batch id at the end. That id is used for suite comparison:

```bash
agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
```

---

## Authoring Conventions

- `id` format: `<suite>.<short-name>`
- keep scenario jobs narrow and concrete
- keep fixture-backed context in `task.context` (task scenarios)
- prefer deterministic fixture references over open-ended prompts
- include `difficulty`, `description`, and `tags` for every launch scenario
- for conversation scenarios, keep step count low (2–5) and evaluators specific

## Current Examples

Task scenario references in this repo:

- support: `scenarios/support/refund-correct-order.yaml`
- support with config tool: `scenarios/support/refund-via-config-tool.yaml`
- coding: `scenarios/coding/fix-add-function.yaml`
- research: `scenarios/research/remote-work-policy.yaml`
- ops: `scenarios/ops/payments-api-alert.yaml`