agent-regression-lab 0.2.0 → 0.4.0

Files changed (40)
  1. package/README.md +78 -11
  2. package/bin/agentlab.js +2 -0
  3. package/dist/agent/factory.js +20 -6
  4. package/dist/agent/httpAdapter.js +5 -4
  5. package/dist/config.js +199 -12
  6. package/dist/evaluators.js +56 -1
  7. package/dist/index.js +157 -11
  8. package/dist/init.js +88 -0
  9. package/dist/lib/id.js +3 -0
  10. package/dist/runOutput.js +46 -0
  11. package/dist/runner.js +31 -9
  12. package/dist/scenarios.js +90 -2
  13. package/dist/scoring.js +2 -2
  14. package/dist/storage.js +117 -7
  15. package/dist/tools.js +56 -2
  16. package/dist/trace.js +4 -2
  17. package/dist/ui/App.js +75 -7
  18. package/dist/ui-assets/client.css +92 -0
  19. package/dist/ui-assets/client.js +183 -19
  20. package/docs/agents.md +143 -8
  21. package/docs/coding-agents.md +74 -0
  22. package/docs/golden-suites.md +74 -0
  23. package/docs/integrations-and-live-services.md +58 -0
  24. package/docs/memory-and-stateful-agents.md +51 -0
  25. package/docs/release-checklist.md +30 -0
  26. package/docs/runtime-profiles.md +67 -0
  27. package/docs/scenarios.md +303 -56
  28. package/docs/superpowers/plans/2026-04-13-phase-2-lite-phase-3-plan.md +160 -0
  29. package/docs/superpowers/plans/2026-04-13-phase-one-npm-tools-plan.md +502 -0
  30. package/docs/superpowers/specs/2026-04-13-phase-2-lite-phase-3-design.md +164 -0
  31. package/docs/tools.md +34 -3
  32. package/docs/troubleshooting.md +193 -0
  33. package/docs/variant-sets.md +63 -0
  34. package/examples/coding-tools/README.md +21 -0
  35. package/examples/coding-tools/index.js +11 -0
  36. package/examples/coding-tools/package.json +8 -0
  37. package/examples/support-tools/README.md +21 -0
  38. package/examples/support-tools/index.js +8 -0
  39. package/examples/support-tools/package.json +8 -0
  40. package/package.json +7 -5
package/docs/scenarios.md CHANGED
@@ -2,28 +2,38 @@
 
  Scenarios are YAML files under `scenarios/`. They are the core authoring interface for the product.
 
- Each scenario should describe one narrow job for the agent, not a vague capability test.
+ agentlab supports two scenario types:
 
- ## Required Shape
+ - `task` — a single-instruction job for a tool-using agent (default, no `type` field needed)
+ - `conversation` — a multi-turn dialog with an HTTP agent
 
- Each scenario should define:
+ ---
+
+ ## Task Scenarios
+
+ Task scenarios are the default format. They describe a single job for an agent that uses tools to complete it.
+
+ ### Required Shape
+
+ Each task scenario should define:
 
  - `id`
  - `name`
  - `suite`
  - `task`
  - `tools`
- - `runtime`
  - `evaluators`
 
- Common optional fields already used in this repo:
+ Common optional fields:
 
  - `description`
  - `difficulty`
  - `tags`
+ - `runtime_profile`
+ - `runtime`
  - task `context`
 
- ## Example
+ ### Example
 
  ```yaml
  id: support.refund-correct-order
@@ -65,32 +75,52 @@ evaluators:
  - ord_1024
  ```
 
- ## Suites In This Repo
+ ### Runtime Profiles
 
- Current benchmark domains:
+ Task scenarios can reference a named `runtime_profile` from `agentlab.config.yaml`.
 
- - `support`
- - `coding`
- - `research`
- - `ops`
+ ```yaml
+ runtime_profile: timeout-orders-tool
+ ```
 
- Use a suite when scenarios belong to one behavior family and should be runnable together with:
+ Runtime profiles let you apply reusable degraded-tool conditions without duplicating them across scenarios. Current shipped behavior:
 
- ```bash
- agentlab run --suite support --agent mock-default
- ```
+ - task scenarios: tool fault injection is active
+ - conversation scenarios: config reference is allowed for shared authoring, but ARL does not yet inject faults into the HTTP agent's internal tools
 
- `run --suite` creates a suite batch id. That id is later used for:
+ ### Evaluators
 
- ```bash
- agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
+ Use deterministic evaluators only.
+
+ | Type | Description |
+ |------|-------------|
+ | `tool_call_assertion` | Assert a specific tool was called with specific input |
+ | `forbidden_tool` | Fail if a tool was called that should not have been |
+ | `final_answer_contains` | Check that the final output contains required substrings |
+ | `exact_final_answer` | Require an exact match on the final output |
+ | `step_count_max` | Fail if the agent used more steps than allowed |
+ | `tool_call_count_max` | Fail if the total number of tool calls exceeds a budget |
+ | `tool_repeat_max` | Fail if one tool is overused |
+ | `cost_max` | Fail if the run cost exceeds a configured USD budget |
+
+ Evaluator modes:
+
+ - `hard_gate` — failure immediately fails the run, regardless of other evaluators
+ - `weighted` — contributes to the weighted score (0–100)
+
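
The two evaluator modes listed above can be read as a gate plus a score. The TypeScript sketch below is one hedged interpretation, not agentlab's scoring code: `EvaluatorResult`, `RunVerdict`, and `scoreRun` are hypothetical names, and the weight-proportional formula, the default weight of 1, and the score of 100 when no weighted evaluators exist are all assumptions. The diff itself only states that a failed `hard_gate` fails the run and that `weighted` evaluators contribute to a 0 to 100 score.

```typescript
// Hypothetical result shape; agentlab's internal types are not shown in this diff.
interface EvaluatorResult {
  id: string;
  mode: "hard_gate" | "weighted";
  passed: boolean;
  weight?: number; // assumed to default to 1 for weighted evaluators
}

interface RunVerdict {
  passed: boolean; // false as soon as any hard gate failed
  score: number;   // 0-100, computed from weighted evaluators only
}

function scoreRun(results: EvaluatorResult[]): RunVerdict {
  // A failed hard gate fails the run regardless of every other evaluator.
  const gateFailed = results.some((r) => r.mode === "hard_gate" && !r.passed);

  // Assumed formula: share of total weight earned by passing weighted evaluators.
  const weighted = results.filter((r) => r.mode === "weighted");
  const total = weighted.reduce((sum, r) => sum + (r.weight ?? 1), 0);
  const earned = weighted.reduce((sum, r) => sum + (r.passed ? (r.weight ?? 1) : 0), 0);
  const score = total === 0 ? 100 : Math.round((earned / total) * 100);

  return { passed: !gateFailed, score };
}
```

Under these assumptions a run with every gate passing can still carry a low score, which is the distinction the two modes are meant to express.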
+ ### Runtime Limits
+
+ ```yaml
+ runtime:
+   max_steps: 8
+   timeout_seconds: 60
  ```
 
- Suite comparison is strict. Only compare batches from the same suite.
+ Both are optional. `max_steps` defaults to 8. `timeout_seconds` is uncapped if not set.
 
- ## Tools
+ ### Tools
 
- Each scenario declares its allowed tools:
+ Each task scenario declares its allowed tools:
 
  ```yaml
  tools:
@@ -98,72 +128,289 @@ tools:
      - crm.search_customer
      - orders.list
      - orders.refund
+   forbidden:
+     - orders.delete
  ```
 
- Keep the tool allowlist as narrow as possible. A broad allowlist weakens the benchmark and makes regressions harder to interpret.
+ Keep the allowlist as narrow as possible. Broad allowlists weaken the benchmark.
 
- This repo supports both:
+ ### Budget And Governance Checks
 
- - built-in deterministic tools
- - repo-local custom tools registered in `agentlab.config.yaml`
+ Operational regressions are often just as important as correctness regressions. Use budget evaluators to encode "technically worked, but unacceptable in production":
 
- The launch benchmark now includes built-in tools for:
+ ```yaml
+ evaluators:
+   - id: total-tool-budget
+     type: tool_call_count_max
+     mode: hard_gate
+     config:
+       max: 2
+   - id: no-repeat-order-list
+     type: tool_repeat_max
+     mode: hard_gate
+     config:
+       tool: orders.list
+       max: 1
+ ```
 
- - support
- - coding
- - research
- - ops
+ Use `cost_max` only where the run records cost metadata.
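
To illustrate what the two budget evaluators in the added YAML check, here is a minimal TypeScript sketch. It is not agentlab's implementation: `ToolCall`, `toolCallCountMax`, and `toolRepeatMax` are hypothetical names, and treating a count equal to `max` as passing for `tool_repeat_max` is an assumption (the evaluator table only says it fails when a tool is "overused").

```typescript
// Hypothetical trace entry; only the tool name matters for budget checks.
interface ToolCall {
  tool: string;
}

// tool_call_count_max: fails when the total number of calls exceeds the budget.
function toolCallCountMax(calls: ToolCall[], max: number): boolean {
  return calls.length <= max;
}

// tool_repeat_max: fails when one named tool is called more than `max` times (assumed boundary).
function toolRepeatMax(calls: ToolCall[], tool: string, max: number): boolean {
  const uses = calls.filter((c) => c.tool === tool).length;
  return uses <= max;
}

// A run that reaches the right answer but lists orders twice.
const wastefulTrace: ToolCall[] = [
  { tool: "crm.search_customer" },
  { tool: "orders.list" },
  { tool: "orders.list" },
  { tool: "orders.refund" },
];

const withinTotalBudget = toolCallCountMax(wastefulTrace, 2);           // false: 4 calls > 2
const withinRepeatBudget = toolRepeatMax(wastefulTrace, "orders.list", 1); // false: 2 uses > 1
```

With both evaluators in `hard_gate` mode, either `false` result would fail the run even if the final answer were correct.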
 
- See [tools.md](tools.md) for custom tool registration.
+ ---
+
+ ## Conversation Scenarios
+
+ Conversation scenarios test HTTP agents through multi-turn dialogs. They require `type: conversation` and work exclusively with `provider: http` agents. The agent is responsible for maintaining its own conversation history.
+
+ ### Required Shape
+
+ ```yaml
+ type: conversation
+ id: internal-teams.memory-followup-recall
+ name: Follow-Up Recall Within Conversation
+ suite: internal-teams
+ steps:
+   - role: user
+     message: "I'm traveling next Tuesday and I prefer aisle seats. Please remember that."
+   - role: user
+     message: "What seat preference did I mention earlier?"
+ ```
+
+ Each step must have:
+
+ - `role: user`
+ - `message` — the message sent to the agent this turn
+
+ ### Per-Step Evaluators
+
+ Evaluators can be attached to individual steps. They run immediately after the agent replies to that step.
+
+ ```yaml
+ steps:
+   - role: user
+     message: "Where's my order #ORD-001?"
+     evaluators:
+       - type: response_contains
+         mode: hard_gate
+         config:
+           keywords: [shipped, tracking]
+       - type: response_latency_max
+         mode: hard_gate
+         config:
+           ms: 3000
+   - role: user
+     message: "What's the tracking number?"
+     evaluators:
+       - type: response_not_contains
+         mode: weighted
+         weight: 1
+         config:
+           keywords: ["don't know", error]
+ ```
+
+ If a `hard_gate` per-step evaluator fails, the run stops immediately and remaining steps are skipped.
+
+ ### Per-Step Evaluator Types
+
+ | Type | Config | Behavior |
+ |------|--------|----------|
+ | `response_contains` | `keywords: string[]` | Passes if ALL keywords appear in the reply (case-insensitive) |
+ | `response_not_contains` | `keywords: string[]` | Passes if NONE of the keywords appear in the reply (case-insensitive) |
+ | `response_matches_regex` | `pattern: string` | Passes if the reply matches the regex pattern (case-insensitive) |
+ | `response_latency_max` | `ms: number` | Passes if the HTTP response arrived within the time limit |
+
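
The pass conditions in the per-step evaluator table above can be sketched as a single function. This is a hedged TypeScript illustration rather than the package's evaluator code: the `StepEvaluator` and `AgentReply` types and the `evaluateStep` function are hypothetical, the `latencyMs` field name is invented for the example, and reading "within the time limit" as less-than-or-equal is an assumption.

```typescript
// Hypothetical types: the `type` and `config` fields mirror the YAML in the diff,
// but agentlab's internal representation is not shown there.
type StepEvaluator =
  | { type: "response_contains"; config: { keywords: string[] } }
  | { type: "response_not_contains"; config: { keywords: string[] } }
  | { type: "response_matches_regex"; config: { pattern: string } }
  | { type: "response_latency_max"; config: { ms: number } };

interface AgentReply {
  text: string;      // the agent's reply for this turn
  latencyMs: number; // measured HTTP round-trip time (assumed field name)
}

function evaluateStep(evaluator: StepEvaluator, reply: AgentReply): boolean {
  const text = reply.text.toLowerCase();
  switch (evaluator.type) {
    case "response_contains":
      // Passes only if ALL keywords appear, ignoring case.
      return evaluator.config.keywords.every((k) => text.includes(k.toLowerCase()));
    case "response_not_contains":
      // Passes only if NONE of the keywords appear, ignoring case.
      return !evaluator.config.keywords.some((k) => text.includes(k.toLowerCase()));
    case "response_matches_regex":
      // The "i" flag provides the case-insensitive match the table describes.
      return new RegExp(evaluator.config.pattern, "i").test(reply.text);
    case "response_latency_max":
      // Boundary handling (<=) is an assumption.
      return reply.latencyMs <= evaluator.config.ms;
  }
}
```

Note the asymmetry between the two keyword checks: `response_contains` requires every keyword, while a single match is enough to fail `response_not_contains`.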
+ ### Scenario Quality Rules
+
+ - prefer `hard_gate` for business-critical assertions
+ - use `weighted` checks for quality gradients, not for the single condition that makes the scenario trustworthy
+ - conversation scenarios must use `config.keywords` for `response_contains` and `response_not_contains`
+ - stale `config.text` authoring is rejected
+ - use conversation scenarios when the agent owns memory, tool execution, or conversation history internally
+ - keep golden suites focused on repeatable workflows, historical regressions, and ugly edge cases rather than one-off demos
+
+ ### End-of-Run Evaluators
+
+ End-of-run evaluators run after all steps complete. They apply to the final reply.
+
+ ```yaml
+ evaluators:
+   - type: step_count_max
+     mode: hard_gate
+     config:
+       max: 10
+   - type: final_answer_contains
+     mode: weighted
+     weight: 1
+     config:
+       keywords: [resolved, confirmed]
+ ```
 
- ## Runtime Limits
+ End-of-run evaluator types:
 
- Scenarios can enforce:
+ | Type | Config | Behavior |
+ |------|--------|----------|
+ | `step_count_max` | `max: number` | Passes if the number of completed turns is within the limit |
+ | `final_answer_contains` | `keywords: string[]` | Passes if ALL keywords appear in the final reply |
+ | `exact_final_answer` | `expected: string` | Passes if the final reply exactly matches the expected string |
 
- - `max_steps`
- - `timeout_seconds`
+ ### Conversation State
+
+ agentlab auto-generates a UUID `conversation_id` for each run. It is sent in every step request. The agent uses it to look up and maintain its own conversation history.
+
+ The `state` block is optional:
+
+ ```yaml
+ state:
+   conversation_id: auto
+ ```
+
+ `auto` is the only supported value. The UUID is always generated regardless of whether the `state` block is present.
+
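
Because the agent owns its history, the agent side of this contract might look like the TypeScript sketch below. Everything here is illustrative: the `StepRequest` shape (a `conversation_id` plus a `message`) is an assumption, since the actual request format is defined in `agents.md` and not shown in this diff, and `handleStep`, `recallSeat`, and the in-memory `Map` are hypothetical stand-ins for a real agent.

```typescript
// Assumed request shape; the real HTTP payload is not shown in this diff.
interface StepRequest {
  conversation_id: string;
  message: string;
}

interface Turn {
  role: "user" | "agent";
  text: string;
}

// In-memory history keyed by conversation_id. A real agent would likely persist this.
const histories = new Map<string, Turn[]>();

function handleStep(req: StepRequest, respond: (history: Turn[]) => string): string {
  // Look up (or start) the history for this run's conversation_id.
  const history = histories.get(req.conversation_id) ?? [];
  history.push({ role: "user", text: req.message });
  const reply = respond(history);
  history.push({ role: "agent", text: reply });
  histories.set(req.conversation_id, history);
  return reply;
}

// Toy responder: recalls a seat preference if the user mentioned one earlier.
function recallSeat(history: Turn[]): string {
  const mentioned = history.some((t) => t.role === "user" && /aisle/i.test(t.text));
  return mentioned
    ? "You said you prefer aisle seats."
    : "I don't have a seat preference on record.";
}
```

Since each run gets a fresh UUID, two runs of the same scenario never share history in a design like this one.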
+ ### Restrictions
+
+ Conversation scenarios must not define a `tools:` field. HTTP agents manage their own tools internally. If `tools:` is present, validation will fail with a clear error.
+
+ Conversation scenarios may define `runtime_profile`, but today that is for shared scenario organization and future stateful hooks. ARL does not inject tool faults into HTTP agents.
+
+ ---
+
+ ## Suite Definitions
+
+ Scenario `suite` still groups related files, but operational launch workflows should use config-level `suite_definitions`.
 
  Example:
 
  ```yaml
- runtime:
-   max_steps: 8
-   timeout_seconds: 60
+ suite_definitions:
+   - name: pre_merge
+     include:
+       tags:
+         - smoke
+         - regression
  ```
 
- These limits are enforced by the runner. Use them to keep runs bounded and comparisons meaningful.
+ Run one with:
 
- ## Evaluators
+ ```bash
+ agentlab run --suite-def pre_merge --agent mock-default
+ agentlab run --suite-def pre_merge --variant-set refund-agent-model-comparison
+ ```
 
- Use deterministic evaluators only.
+ Use suite definitions for stable workflow units like:
+
+ - `smoke`
+ - `pre_merge`
+ - `release`
+ - `incident_regressions`
 
- The current evaluator set includes:
+ ### Full Example
+
+ ```yaml
+ type: conversation
+ id: internal-teams.memory-followup-recall
+ name: Follow-Up Recall Within Conversation
+ suite: internal-teams
+ description: Memoryful agent should recall a user-provided fact later in the same conversation.
+ difficulty: medium
+ tags:
+   - internal-teams
+   - conversation
+
+ steps:
+   - role: user
+     message: "I'm traveling next Tuesday and I prefer aisle seats. Please remember that."
+     evaluators:
+       - type: response_contains
+         mode: weighted
+         config:
+           keywords:
+             - aisle
+
+   - role: user
+     message: "What seat preference did I mention earlier?"
+     evaluators:
+       - type: response_contains
+         mode: hard_gate
+         config:
+           keywords:
+             - aisle
 
- - `tool_call_assertion`
- - `forbidden_tool`
- - `final_answer_contains`
- - `exact_final_answer`
- - `step_count_max`
+ evaluators:
+   - type: step_count_max
+     mode: hard_gate
+     config:
+       max: 2
+ ```
 
- Guidance:
+ Run it with:
 
- - use hard gates for non-negotiable behavior
- - use weighted evaluators for softer quality checks
- - prefer tool assertions or exact output checks over vague answer checks when possible
+ ```bash
+ agentlab run internal-teams.memory-followup-recall --agent my-production-agent
+ ```
 
- ## Authoring Conventions
+ Where `my-production-agent` is a named `http` agent in `agentlab.config.yaml`. See [agents.md](agents.md) for HTTP agent config.
 
- Use these defaults:
+ ### CLI Output
+
+ Conversation runs print a different output format from task runs:
+
+ ```
+ run internal-teams.memory-followup-recall — PASS
+ agent: my-production-agent (http://localhost:3000/api/chat)
+ turns completed: 2/2
+ step 1: pass (response_contains ✓)
+ step 2: pass (response_contains ✓)
+ run id: run_20260407_001234
+ ```
+
+ If a hard-gate fails mid-run:
+
+ ```
+ run internal-teams.memory-followup-recall — FAIL
+ agent: my-production-agent (http://localhost:3000/api/chat)
+ turns completed: 1/2
+ step 1: FAIL (response_contains ✗)
+ run stopped (evaluator_failed)
+ run id: run_20260407_001235
+ ```
+
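
The early-stop behavior visible in the failing output above (one turn completed out of two, then `run stopped (evaluator_failed)`) can be sketched as a loop. This TypeScript sketch is an assumption-laden illustration, not the runner's code: `StepCheck`, `ConversationStep`, `ConversationResult`, and `runConversation` are hypothetical names, `sendToAgent` is a synchronous stand-in for the HTTP call, and the effect of weighted evaluators on the final status is deliberately left out.

```typescript
type Mode = "hard_gate" | "weighted";

// Stand-in for any per-step evaluator: a mode plus a predicate on the reply text.
interface StepCheck {
  mode: Mode;
  check: (reply: string) => boolean;
}

interface ConversationStep {
  message: string;
  evaluators: StepCheck[];
}

interface ConversationResult {
  status: "PASS" | "FAIL";
  turnsCompleted: number;
  totalTurns: number;
  stopReason?: "evaluator_failed";
}

function runConversation(
  steps: ConversationStep[],
  sendToAgent: (message: string) => string,
): ConversationResult {
  let turnsCompleted = 0;
  for (const step of steps) {
    const reply = sendToAgent(step.message);
    // The failed turn still counts as completed, matching "turns completed: 1/2".
    turnsCompleted += 1;
    const gateFailed = step.evaluators.some((e) => e.mode === "hard_gate" && !e.check(reply));
    if (gateFailed) {
      // Remaining steps are skipped: no further messages are sent to the agent.
      return { status: "FAIL", turnsCompleted, totalTurns: steps.length, stopReason: "evaluator_failed" };
    }
  }
  return { status: "PASS", turnsCompleted, totalTurns: steps.length };
}
```

Stopping at the first failed gate keeps later turns from running against a conversation that has already gone wrong.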
+ ---
+
+ ## Suites
+
+ Both task and conversation scenarios can belong to a suite.
+
+ ```yaml
+ suite: support
+ ```
+
+ Run an entire suite:
+
+ ```bash
+ agentlab run --suite support --agent mock-default
+ ```
+
+ `run --suite` skips conversation scenarios when using non-HTTP agents (conversation scenarios require `provider: http`). Task scenarios and conversation scenarios can coexist in the same suite directory.
+
+ `run --suite` prints a suite batch id at the end. That id is used for suite comparison:
+
+ ```bash
+ agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
+ ```
+
+ ---
+
+ ## Authoring Conventions
 
  - `id` format: `<suite>.<short-name>`
  - keep scenario jobs narrow and concrete
- - keep fixture-backed context in `task.context`
+ - keep fixture-backed context in `task.context` (task scenarios)
  - prefer deterministic fixture references over open-ended prompts
  - include `difficulty`, `description`, and `tags` for every launch scenario
+ - for conversation scenarios, keep step count low (2–5) and evaluators specific
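
The `id` convention in the list above lends itself to a quick lint. The TypeScript sketch below is a hypothetical helper (`hasConventionalId` is not part of agentlab, and nothing in the diff says the tool enforces this convention); it only checks that the id starts with the suite name followed by a dot and a non-empty short name.

```typescript
// Hypothetical lint helper for the `<suite>.<short-name>` id convention.
function hasConventionalId(scenario: { id: string; suite: string }): boolean {
  const prefix = scenario.suite + ".";
  // The id must start with "<suite>." and have something after the dot.
  return scenario.id.startsWith(prefix) && scenario.id.length > prefix.length;
}

const conventional = hasConventionalId({ id: "support.refund-correct-order", suite: "support" });
const missingSuite = hasConventionalId({ id: "refund-correct-order", suite: "support" });
```

Both scenario ids that appear in the documentation diff, `support.refund-correct-order` and `internal-teams.memory-followup-recall`, satisfy this check for their respective suites.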
 
  ## Current Examples
 
- Useful scenario references in this repo:
+ Task scenario references in this repo:
 
  - support: `scenarios/support/refund-correct-order.yaml`
  - support with config tool: `scenarios/support/refund-via-config-tool.yaml`
package/docs/superpowers/plans/2026-04-13-phase-2-lite-phase-3-plan.md ADDED
@@ -0,0 +1,160 @@
+ # Phase 2 Lite And Phase 3 Implementation Plan
+
+ > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+ **Goal:** Deliver a minimal integration story for new users, then improve the UI enough that ARL is easier to demo, screenshot, and understand visually.
+
+ **Architecture:** Keep Phase 2-lite focused on assets that clarify adoption: README routing, one coding-agent path, and one CI path. Keep Phase 3 focused on UI clarity instead of new product surface area by improving the runs dashboard, comparison screens, and trace presentation inside the existing React UI.
+
+ **Tech Stack:** TypeScript, React, node:test, esbuild, Markdown, GitHub Actions YAML
+
+ ---
+
+ ## File Map
+
+ **Roadmap and product docs**
+ - Modify: `.claude/active-tasks.md`
+ - Modify: `.claude/project.md`
+ - Modify: `README.md`
+
+ **Phase 2-lite assets**
+ - Create: `docs/coding-agents.md`
+ - Create: `.github/workflows/agentlab-pre-merge.yml`
+
+ **UI**
+ - Modify: `src/ui/App.tsx`
+ - Modify: `src/ui/styles.css`
+
+ **Tests**
+ - Modify: `tests/launch/ui-smoke.test.ts`
+
+ ---
+
+ ### Task 1: Reframe Roadmap To Phase 2-lite Then Phase 3
+
+ **Files:**
+ - Modify: `.claude/active-tasks.md`
+ - Modify: `.claude/project.md`
+
+ - [ ] Update active task tracking so the current next phase is `Phase 2-lite`, not the original full Phase 2.
+ - [ ] Update project memory so Phase 2-lite is the minimal integration-story pass and Phase 3 is the main visual/demo workstream.
+ - [ ] Keep the scope explicit:
+   - HTTP example via `arl-test`
+   - CI example
+   - coding-agent example
+   - then UI polish
+
+ Verification:
+
+ ```bash
+ rg -n "Phase 2-lite|Phase 3" .claude/active-tasks.md .claude/project.md
+ ```
+
+ ---
+
+ ### Task 2: Add Phase 2-lite Integration Assets
+
+ **Files:**
+ - Modify: `README.md`
+ - Create: `docs/coding-agents.md`
+ - Create: `.github/workflows/agentlab-pre-merge.yml`
+
+ - [ ] Add README routing sections:
+   - if your agent runs as an HTTP service
+   - if you are validating coding-agent changes
+   - if you want pre-merge regression checks in CI
+ - [ ] Add one coding-agent guide using the existing coding scenarios and current tool-loop model.
+ - [ ] Add one GitHub Actions example that runs:
+
+ ```bash
+ npm ci
+ npm run build
+ node dist/index.js run --suite-def pre_merge --agent mock-default
+ ```
+
+ - [ ] Keep this section narrow and copy-pasteable. No broad framework matrix.
+
+ Verification:
+
+ ```bash
+ rg -n "HTTP service|coding-agent|pre-merge|GitHub Actions" README.md docs/coding-agents.md .github/workflows/agentlab-pre-merge.yml
+ ```
+
+ ---
+
+ ### Task 3: Improve Runs Dashboard And Comparison UX
+
+ **Files:**
+ - Modify: `src/ui/App.tsx`
+ - Modify: `src/ui/styles.css`
+ - Modify: `tests/launch/ui-smoke.test.ts`
+
+ - [ ] Add a stronger runs dashboard summary at the top of the list page:
+   - total runs shown
+   - pass/fail/error counts
+   - most recent suite/context hint
+ - [ ] Redesign the compare page to make regressions visually obvious:
+   - top classification banner
+   - clearer delta cards
+   - evaluator/tool diff blocks with stronger hierarchy
+   - more obvious baseline vs candidate sections
+ - [ ] Make the suite compare page easier to scan:
+   - headline regression/improvement counts
+   - clearer scenario groupings
+
+ Verification:
+
+ ```bash
+ npx tsx --test tests/launch/ui-smoke.test.ts
+ ```
+
+ ---
+
+ ### Task 4: Improve Trace And Detail Presentation
+
+ **Files:**
+ - Modify: `src/ui/App.tsx`
+ - Modify: `src/ui/styles.css`
+ - Modify: `tests/launch/ui-smoke.test.ts`
+
+ - [ ] Replace the plain trace list with a more intentional timeline treatment:
+   - event badges or type labels
+   - stronger step grouping
+   - clearer source metadata
+ - [ ] Keep failure-first behavior intact.
+ - [ ] Preserve readability on narrow screens.
+
+ Verification:
+
+ ```bash
+ npx tsx --test tests/launch/ui-smoke.test.ts
+ ```
+
+ ---
+
+ ### Task 5: Full Verification
+
+ **Files:**
+ - Modify only if verification exposes issues
+
+ - [ ] Run focused UI/docs-related verification:
+
+ ```bash
+ npx tsx --test tests/launch/ui-smoke.test.ts tests/cliPackaging.test.ts
+ ```
+
+ - [ ] Run full suite:
+
+ ```bash
+ npm test
+ ```
+
+ - [ ] Run release gates:
+
+ ```bash
+ npm run check
+ npm run build
+ npm run smoke:cli
+ npm_config_cache=/tmp/agentlab-npm-cache npm pack --dry-run
+ ```
+