agent-regression-lab 0.1.0 → 0.2.0

package/docs/agents.md ADDED
@@ -0,0 +1,152 @@
+ # Agents
+
+ Named agents are configured in `agentlab.config.yaml`.
+
+ This repo currently supports three provider modes:
+
+ - `mock`
+ - `openai`
+ - `external_process`
+
+ ## Named Agent Config
+
+ Example:
+
+ ```yaml
+ agents:
+   - name: mock-default
+     provider: mock
+     label: mock-default
+
+   - name: openai-cheap
+     provider: openai
+     model: gpt-4o-mini
+     label: openai-cheap
+
+   - name: custom-node-agent
+     provider: external_process
+     command: node
+     args:
+       - custom_agents/node_agent.mjs
+     label: custom-node-agent
+ ```
+
+ Run a named agent with:
+
+ ```bash
+ agentlab run support.refund-correct-order --agent mock-default
+ ```
+
+ ## Mock
+
+ The built-in mock adapter is the best path for deterministic smoke tests and baseline examples.
+
+ Use it when you want:
+
+ - fast local verification
+ - stable docs examples
+ - predictable benchmark behavior
+
+ ## OpenAI
+
+ The OpenAI path uses your API key and a configured model.
+
+ Requirements:
+
+ - `OPENAI_API_KEY` in the environment
+ - a named `openai` agent in `agentlab.config.yaml`, or equivalent CLI runtime settings
+
+ Example:
+
+ ```bash
+ export OPENAI_API_KEY=...
+ agentlab run support.refund-correct-order --agent openai-cheap
+ ```
+
+ The OpenAI path calls a live model, so results can vary between runs; it is less deterministic than the mock path.
+
+ ## External Process
+
+ External-process agents communicate with the runner over line-delimited JSON on stdin/stdout.
+
+ The runner stays in control of:
+
+ - tool execution
+ - stopping conditions
+ - runtime limits
+ - persisted run state
+
+ The external agent decides what tool to call next or when to return a final answer.
+
+ ### Protocol
+
+ Runner events:
+
+ - `run_started`
+ - `tool_result`
+ - `runner_error`
+
+ Agent responses:
+
+ - `tool_call`
+ - `final`
+ - `error`
+
+ Minimal flow:
+
+ 1. the runner sends `run_started`
+ 2. the agent returns `tool_call` or `final`
+ 3. the runner executes the tool and sends `tool_result`
+ 4. the agent continues until it returns `final` or `error`
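
The minimal flow above can be sketched as a small agent-side handler. Only the event and response names (`run_started`, `tool_result`, `runner_error`, `tool_call`, `final`, `error`) come from this page. The `type` discriminator, every other field name, and the empty `orders.list` input are assumptions made for illustration, so check `custom_agents/node_agent.mjs` for the real message shapes.

```typescript
// Sketch of an agent-side handler for a line-delimited JSON protocol.
// ASSUMPTION: the `type` field and all payload fields are hypothetical;
// only the event/response names come from the docs.
type RunnerEvent =
  | { type: "run_started"; instructions: string }
  | { type: "tool_result"; tool: string; output: unknown }
  | { type: "runner_error"; message: string };

type AgentResponse =
  | { type: "tool_call"; tool: string; input: Record<string, unknown> }
  | { type: "final"; answer: string }
  | { type: "error"; message: string };

// Decide the next response for one runner event.
function nextResponse(event: RunnerEvent): AgentResponse {
  switch (event.type) {
    case "run_started":
      // First step: ask the runner to execute a tool on the agent's behalf.
      return { type: "tool_call", tool: "orders.list", input: {} };
    case "tool_result":
      // The runner executed the tool; finish with an answer built from its output.
      return {
        type: "final",
        answer: `Result from ${event.tool}: ${JSON.stringify(event.output)}`,
      };
    case "runner_error":
      return { type: "error", message: `runner reported: ${event.message}` };
    default:
      // Unknown event types are answered with an error instead of crashing.
      return { type: "error", message: "unknown runner event" };
  }
}

// Turn one input line into exactly one output line (without the trailing newline).
function handleLine(line: string): string {
  let event: RunnerEvent;
  try {
    event = JSON.parse(line) as RunnerEvent;
  } catch {
    return JSON.stringify({ type: "error", message: "could not parse runner event" });
  }
  return JSON.stringify(nextResponse(event));
}
```

In a real agent, each line read from stdin (for example through Node's `readline` module) would pass through `handleLine`, and the returned string would be written to stdout followed by a newline.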
+
+ Working examples:
+
+ - `custom_agents/node_agent.mjs`
+ - `custom_agents/python_agent.py`
+
+ Run one of them with:
+
+ ```bash
+ agentlab run support.refund-via-config-tool --agent custom-node-agent
+ ```
+
+ ## Environment Allowlist
+
+ External-process agents can optionally define `envAllowlist`.
+
+ Use it when a child process needs specific environment variables passed through.
+
+ Example shape:
+
+ ```yaml
+ agents:
+   - name: custom-agent
+     provider: external_process
+     command: node
+     args:
+       - custom_agents/node_agent.mjs
+     envAllowlist:
+       - OPENAI_API_KEY
+ ```
+
+ Only allow through what the child actually needs.
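
One way to picture the allowlist is as a filter applied to the parent environment before the child is spawned. The sketch below is an illustration only: `pickAllowedEnv` is a hypothetical helper, and whether the runner also passes baseline variables such as `PATH` is not stated here.

```typescript
// Hypothetical helper: keep only the allowlisted variables that are actually set.
function pickAllowedEnv(
  parentEnv: Record<string, string | undefined>,
  allowlist: readonly string[],
): Record<string, string> {
  const childEnv: Record<string, string> = {};
  for (const name of allowlist) {
    const value = parentEnv[name];
    if (value !== undefined) {
      childEnv[name] = value;
    }
  }
  return childEnv;
}

// Example: only OPENAI_API_KEY survives; the other secret is dropped.
const exampleChildEnv = pickAllowedEnv(
  { OPENAI_API_KEY: "sk-example", AWS_SECRET_ACCESS_KEY: "do-not-leak" },
  ["OPENAI_API_KEY"],
);
console.log(exampleChildEnv);
```

A filtered object like this could then be supplied as the `env` option of Node's `child_process.spawn`.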
+
+ ## Best Practices
+
+ - use named agents instead of ad hoc local command strings
+ - keep labels stable so compare output stays readable
+ - prefer the mock path for smoke tests and docs
+ - use external-process agents when you want to wrap a local Node or Python agent implementation
+ - keep the runner authoritative for tools and termination
+
+ ## Common Errors
+
+ Typical failures:
+
+ - missing `OPENAI_API_KEY`
+ - unsupported provider name
+ - missing external-process `command`
+ - invalid `args` or `envAllowlist`
+ - child process returning invalid JSON
+
+ See [troubleshooting.md](troubleshooting.md) for fixes.
package/docs/release-checklist.md ADDED
@@ -0,0 +1,64 @@
+ # Release Checklist
+
+ Use this before publishing a new npm version or telling users to upgrade.
+
+ ## Verification
+
+ Run the full release gate:
+
+ ```bash
+ npm run check
+ npm test
+ npm run build
+ npm run smoke:cli
+ npm pack --dry-run
+ ```
+
+ ## Manual CLI Flow
+
+ Verify the canonical workflow:
+
+ ```bash
+ agentlab list scenarios
+ agentlab run support.refund-correct-order --agent mock-default
+ agentlab show <run-id>
+ agentlab run support.refund-correct-order --agent mock-default
+ agentlab compare <baseline-run-id> <candidate-run-id>
+ agentlab run --suite support --agent mock-default
+ agentlab run --suite support --agent mock-default
+ agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
+ agentlab ui
+ ```
+
+ ## Extension Smoke
+
+ Verify at least one extension path:
+
+ - run `support.refund-via-config-tool` with `custom-node-agent`, or
+ - verify a repo-local custom tool still loads from `agentlab.config.yaml`
+
+ ## Docs Verification
+
+ Confirm these files match current behavior:
+
+ - `README.md`
+ - `docs/scenarios.md`
+ - `docs/tools.md`
+ - `docs/agents.md`
+ - `docs/troubleshooting.md`
+
+ Requirements:
+
+ - every command works as written
+ - every referenced path exists
+ - limitations are stated honestly
+ - `compare --suite` is documented using suite batch ids, not run ids
+
+ ## Publish Hygiene
+
+ Before `npm publish`:
+
+ - confirm the package version is correct
+ - confirm the git tree contains the intended release changes
+ - confirm packaged UI assets are included in the tarball
+ - confirm the npm metadata still points at the correct repo, homepage, and issues URL
package/docs/scenarios.md ADDED
@@ -0,0 +1,172 @@
+ # Scenarios
+
+ Scenarios are YAML files under `scenarios/`. They are the core authoring interface for the product.
+
+ Each scenario should describe one narrow job for the agent, not a vague capability test.
+
+ ## Required Shape
+
+ Each scenario should define:
+
+ - `id`
+ - `name`
+ - `suite`
+ - `task`
+ - `tools`
+ - `runtime`
+ - `evaluators`
+
+ Common optional fields already used in this repo:
+
+ - `description`
+ - `difficulty`
+ - `tags`
+ - task `context`
+
+ ## Example
+
+ ```yaml
+ id: support.refund-correct-order
+ name: Refund The Correct Order
+ suite: support
+ difficulty: easy
+ description: Refund only the duplicated charge.
+ tags:
+   - refund
+   - support
+ task:
+   instructions: |
+     The customer says they were charged twice.
+     Find the duplicated charge and refund only that order.
+   context:
+     customer_email: alice@example.com
+ tools:
+   allowed:
+     - crm.search_customer
+     - orders.list
+     - orders.refund
+ runtime:
+   max_steps: 8
+   timeout_seconds: 60
+ evaluators:
+   - id: refund-created
+     type: tool_call_assertion
+     mode: hard_gate
+     config:
+       tool: orders.refund
+       match:
+         order_id: ord_1024
+   - id: mentions-order
+     type: final_answer_contains
+     mode: weighted
+     weight: 1
+     config:
+       required_substrings:
+         - ord_1024
+ ```
+
+ ## Suites In This Repo
+
+ Current benchmark domains:
+
+ - `support`
+ - `coding`
+ - `research`
+ - `ops`
+
+ Use a suite when scenarios belong to one behavior family and should be runnable together with:
+
+ ```bash
+ agentlab run --suite support --agent mock-default
+ ```
+
+ `run --suite` creates a suite batch id. That id is later used for:
+
+ ```bash
+ agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
+ ```
+
+ Suite comparison is strict. Only compare batches from the same suite.
+
+ ## Tools
+
+ Each scenario declares its allowed tools:
+
+ ```yaml
+ tools:
+   allowed:
+     - crm.search_customer
+     - orders.list
+     - orders.refund
+ ```
+
+ Keep the tool allowlist as narrow as possible. A broad allowlist weakens the benchmark and makes regressions harder to interpret.
+
+ This repo supports both:
+
+ - built-in deterministic tools
+ - repo-local custom tools registered in `agentlab.config.yaml`
+
+ The launch benchmark now includes built-in tools for:
+
+ - support
+ - coding
+ - research
+ - ops
+
+ See [tools.md](tools.md) for custom tool registration.
+
+ ## Runtime Limits
+
+ Scenarios can enforce:
+
+ - `max_steps`
+ - `timeout_seconds`
+
+ Example:
+
+ ```yaml
+ runtime:
+   max_steps: 8
+   timeout_seconds: 60
+ ```
+
+ These limits are enforced by the runner. Use them to keep runs bounded and comparisons meaningful.
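
As a rough mental model, enforcing both limits amounts to a bounded loop. The sketch below is not the runner's implementation: `runBounded`, the `step` callback, and the stop reasons are hypothetical names, and only `max_steps` and `timeout_seconds` come from the scenario format. It checks the deadline between steps only, so a single slow step could still overrun.

```typescript
// ASSUMPTION: a hypothetical bounded loop; not the actual runner code.
interface RuntimeLimits {
  max_steps: number;
  timeout_seconds: number;
}

type StopReason = "finished" | "max_steps" | "timeout";

// `step` performs one agent turn and resolves to true when the agent is done.
// `now` is injectable so the loop can be exercised without real waiting.
async function runBounded(
  limits: RuntimeLimits,
  step: (index: number) => Promise<boolean>,
  now: () => number = Date.now,
): Promise<{ reason: StopReason; steps: number }> {
  const deadline = now() + limits.timeout_seconds * 1000;
  for (let i = 0; i < limits.max_steps; i++) {
    // Stop before starting a new step once the time budget is spent.
    if (now() >= deadline) {
      return { reason: "timeout", steps: i };
    }
    if (await step(i)) {
      return { reason: "finished", steps: i + 1 };
    }
  }
  // The agent never finished within the step budget.
  return { reason: "max_steps", steps: limits.max_steps };
}
```

Because both limits live in the scenario file, two runs of the same scenario are bounded identically, which is what keeps their comparison meaningful.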
+
+ ## Evaluators
+
+ Use deterministic evaluators only.
+
+ The current evaluator set includes:
+
+ - `tool_call_assertion`
+ - `forbidden_tool`
+ - `final_answer_contains`
+ - `exact_final_answer`
+ - `step_count_max`
+
+ Guidance:
+
+ - use hard gates for non-negotiable behavior
+ - use weighted evaluators for softer quality checks
+ - prefer tool assertions or exact output checks over vague answer checks when possible
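
To make the hard-gate versus weighted distinction concrete, here is a sketch of how results might be combined. The aggregation rule (any failed hard gate fails the run; the score is the weighted fraction of passing weighted evaluators) is an assumption for illustration, as are the `EvaluatorResult` shape and the `summarize` name. The actual scoring is defined by the runner.

```typescript
// ASSUMPTION: hypothetical result shape and aggregation rule, for illustration only.
interface EvaluatorResult {
  id: string;
  mode: "hard_gate" | "weighted";
  weight?: number; // only meaningful for weighted evaluators
  passed: boolean;
}

// A deterministic check in the spirit of `final_answer_contains`.
function finalAnswerContains(answer: string, requiredSubstrings: string[]): boolean {
  return requiredSubstrings.every((s) => answer.includes(s));
}

function summarize(results: EvaluatorResult[]): { passed: boolean; score: number } {
  // Every hard gate must pass for the run to pass.
  const gatesPassed = results
    .filter((r) => r.mode === "hard_gate")
    .every((r) => r.passed);

  // Weighted evaluators contribute their weight (default 1) when they pass.
  const weighted = results.filter((r) => r.mode === "weighted");
  const total = weighted.reduce((sum, r) => sum + (r.weight ?? 1), 0);
  const earned = weighted
    .filter((r) => r.passed)
    .reduce((sum, r) => sum + (r.weight ?? 1), 0);

  // With no weighted evaluators, treat the score as 1 so only the gates matter.
  const score = total === 0 ? 1 : earned / total;
  return { passed: gatesPassed, score };
}
```

Under this rule, a run that mentions the right order id but never issues the refund still fails, which is the behavior a hard gate is meant to guarantee.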
+
+ ## Authoring Conventions
+
+ Use these defaults:
+
+ - `id` format: `<suite>.<short-name>`
+ - keep scenario jobs narrow and concrete
+ - keep fixture-backed context in `task.context`
+ - prefer deterministic fixture references over open-ended prompts
+ - include `difficulty`, `description`, and `tags` for every launch scenario
+
+ ## Current Examples
+
+ Useful scenario references in this repo:
+
+ - support: `scenarios/support/refund-correct-order.yaml`
+ - support with config tool: `scenarios/support/refund-via-config-tool.yaml`
+ - coding: `scenarios/coding/fix-add-function.yaml`
+ - research: `scenarios/research/remote-work-policy.yaml`
+ - ops: `scenarios/ops/payments-api-alert.yaml`
package/docs/tools.md ADDED
@@ -0,0 +1,102 @@
+ # Custom Tools
+
+ Custom tools are registered in `agentlab.config.yaml` and loaded from repo-local JS or TS modules.
+
+ This is the main extension point when built-in tools are not enough.
+
+ ## What A Tool Registration Needs
+
+ Each tool entry must define:
+
+ - `name`
+ - `modulePath`
+ - `exportName`
+ - `description`
+ - `inputSchema`
+
+ Example:
+
+ ```yaml
+ tools:
+   - name: support.find_duplicate_charge
+     modulePath: user_tools/findDuplicateCharge.ts
+     exportName: findDuplicateCharge
+     description: Find the duplicated charge order id for a given customer.
+     inputSchema:
+       type: object
+       additionalProperties: false
+       properties:
+         customer_id:
+           type: string
+           description: Customer id to inspect for duplicated charges.
+       required:
+         - customer_id
+ ```
+
+ ## Tool Module Shape
+
+ The exported function should be async and should return JSON-serializable output.
+
+ Minimal example:
+
+ ```ts
+ export async function myTool(input: unknown): Promise<{ ok: boolean }> {
+   return { ok: true };
+ }
+ ```
+
+ The existing working example is:
+
+ - `user_tools/findDuplicateCharge.ts`
+
+ ## Important Constraints
+
+ - `modulePath` must stay within the repo
+ - the module must exist at load time
+ - the named export must exist
+ - tool input should be validated defensively inside the tool
+ - tool output should be deterministic and JSON-serializable
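
The first constraint can be illustrated with a path-containment check. This is a sketch under assumptions, not the loader's actual validation: `isInsideRepo` is a hypothetical helper, and it does not resolve symlinks.

```typescript
import * as path from "node:path";

// Hypothetical helper: true when `modulePath` resolves to a location inside `repoRoot`.
function isInsideRepo(repoRoot: string, modulePath: string): boolean {
  const root = path.resolve(repoRoot);
  const target = path.resolve(root, modulePath);
  const relative = path.relative(root, target);
  // An empty relative path is the repo root itself, which is not a module file.
  if (relative === "") return false;
  // Paths that climb out of the root start with a ".." segment.
  if (relative === ".." || relative.startsWith(".." + path.sep)) return false;
  // On Windows, a target on another drive comes back as an absolute path.
  return !path.isAbsolute(relative);
}
```

Resolving first and comparing afterwards means that inputs such as `user_tools/../../elsewhere.ts` are judged by where they end up, not by how they are spelled.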
+
+ For launch usage, treat tools as fixture-backed local functions, not live integrations.
+
+ ## Recommended Pattern
+
+ Use this approach:
+
+ 1. read fixture data from `fixtures/`
+ 2. validate the input shape
+ 3. return a small structured result
+ 4. throw a clear error for missing fixture state or invalid input
+
+ The current `findDuplicateCharge` tool shows that pattern.
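
The four steps can also be sketched as a standalone module. This is not the `findDuplicateCharge` implementation: the `lookupOrderStatus` name, the `fixtures/orders.json` path, and the fixture shape are all hypothetical, and the optional `load` parameter exists only so the function can be exercised without a real file.

```typescript
import { readFile } from "node:fs/promises";

// ASSUMPTION: hypothetical fixture shape, e.g. { "ord_1024": "refunded" }.
type OrderFixture = Record<string, string>;

export async function lookupOrderStatus(
  input: unknown,
  load: (file: string) => Promise<string> = (file) => readFile(file, "utf8"),
): Promise<{ order_id: string; status: string }> {
  // Step 2: validate the input shape defensively before touching any data.
  if (
    typeof input !== "object" ||
    input === null ||
    typeof (input as { order_id?: unknown }).order_id !== "string"
  ) {
    throw new Error("lookupOrderStatus: expected { order_id: string }");
  }
  const orderId = (input as { order_id: string }).order_id;

  // Step 1: read fixture data (hypothetical path under fixtures/).
  const fixture = JSON.parse(await load("fixtures/orders.json")) as OrderFixture;

  // Step 4: throw a clear error when the fixture has no matching state.
  const status = fixture[orderId];
  if (status === undefined) {
    throw new Error(`lookupOrderStatus: no fixture entry for ${orderId}`);
  }

  // Step 3: return a small, JSON-serializable result.
  return { order_id: orderId, status };
}
```

Validating before reading keeps invalid calls cheap and makes the error message independent of fixture state.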
+
+ ## Wiring A Tool Into A Scenario
+
+ 1. register the tool in `agentlab.config.yaml`
+ 2. add the tool name to the scenario allowlist
+ 3. add an evaluator that confirms the tool was used correctly if the behavior is important
+
+ Example scenario:
+
+ - `scenarios/support/refund-via-config-tool.yaml`
+
+ ## Best Practices
+
+ - keep tool names stable and descriptive
+ - keep tools scenario-agnostic where possible
+ - prefer read-only or sandboxed behavior
+ - do not mutate global machine state
+ - do not call live external systems in benchmark paths
+ - keep schemas narrow so agent tool calls are easy to validate and compare
+
+ ## Common Errors
+
+ Typical config failures:
+
+ - duplicate tool names
+ - repo-external module paths
+ - missing module files
+ - missing exports
+ - invalid `inputSchema` shape
+
+ See [troubleshooting.md](troubleshooting.md) for failure examples and fixes.
package/docs/troubleshooting.md ADDED
@@ -0,0 +1,158 @@
+ # Troubleshooting
+
+ This page covers the main failure modes users hit during install, first run, and comparison.
+
+ ## `agentlab: command not found`
+
+ You are probably in one of these states:
+
+ - the package is not installed globally
+ - you have not run `npm link` from the repo
+ - your shell `PATH` does not include npm's global bin directory
+
+ Fast fixes:
+
+ ```bash
+ npm install
+ npm run build
+ npm link
+ agentlab --help
+ ```
+
+ Or skip linking and use:
+
+ ```bash
+ npm run start -- --help
+ ```
+
+ ## `OPENAI_API_KEY is required`
+
+ You used an OpenAI-backed agent without exporting the API key.
+
+ Fix:
+
+ ```bash
+ export OPENAI_API_KEY=...
+ agentlab run support.refund-correct-order --agent openai-cheap
+ ```
+
+ ## `No scenarios found for suite ...`
+
+ The suite id must match a suite under `scenarios/`.
+
+ List valid options:
+
+ ```bash
+ agentlab list scenarios
+ ```
+
+ Current built-in suites in this repo include:
+
+ - `support`
+ - `coding`
+ - `research`
+ - `ops`
+
+ ## `Run '<id>' not found`
+
+ `show` and run-to-run `compare` require run ids from completed runs.
+
+ Get a fresh run id by executing a scenario:
+
+ ```bash
+ agentlab run support.refund-correct-order --agent mock-default
+ ```
+
+ Then use:
+
+ ```bash
+ agentlab show <run-id>
+ agentlab compare <baseline-run-id> <candidate-run-id>
+ ```
+
+ ## `Missing baseline or candidate suite batch id`
+
+ `compare --suite` does not use run ids. It uses suite batch ids printed by `run --suite`.
+
+ Example:
+
+ ```bash
+ agentlab run --suite support --agent mock-default
+ agentlab run --suite support --agent mock-default
+ agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
+ ```
+
+ ## Cross-suite comparison errors
+
+ Suite batch comparison is strict. Compare batches from the same suite only.
+
+ This is valid:
+
+ ```bash
+ agentlab compare --suite suite_...support_batch_a suite_...support_batch_b
+ ```
+
+ This is not valid:
+
+ - a `support` batch compared against an `ops` batch
+ - mixed or malformed suite batch selections
+
+ If you are unsure which batch came from which suite, rerun the suite and record the printed batch ids.
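
The strictness rule can be pictured as a guard that runs before any comparison. The batch record shape and the `assertSameSuite` name below are hypothetical; the sketch only illustrates rejecting a cross-suite pair with a message that names both sides.

```typescript
// ASSUMPTION: hypothetical batch record; the real stored shape may differ.
interface SuiteBatch {
  batchId: string;
  suite: string;
}

// Throws when the two batches were produced by different suites.
function assertSameSuite(baseline: SuiteBatch, candidate: SuiteBatch): void {
  if (baseline.suite !== candidate.suite) {
    throw new Error(
      `Cannot compare suite '${baseline.suite}' (${baseline.batchId}) ` +
        `with suite '${candidate.suite}' (${candidate.batchId})`,
    );
  }
}
```

Recording the suite name next to each printed batch id makes this kind of mismatch easy to spot before running `compare --suite`.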
+
+ ## `agentlab ui` fails to load assets
+
+ Installed packages should already include the built UI assets.
+
+ If you are running from a repo checkout, build first:
+
+ ```bash
+ npm install
+ npm run build
+ agentlab ui
+ ```
+
+ If the problem persists, verify that these files exist:
+
+ - `dist/ui-assets/client.js`
+ - `dist/ui-assets/client.css`
+
+ ## Config tool or agent not found
+
+ Typical reasons:
+
+ - `agentlab.config.yaml` is missing
+ - the configured `name` does not match the CLI `--agent` value
+ - `modulePath` points outside the repo
+ - the configured export or command does not exist
+
+ Working references in this repo:
+
+ - tool config: `agentlab.config.yaml`
+ - custom tool: `user_tools/findDuplicateCharge.ts`
+ - external agents: `custom_agents/node_agent.mjs`, `custom_agents/python_agent.py`
+
+ ## Global install behaves differently from repo mode
+
+ That usually means the current working directory is wrong.
+
+ The CLI operates on the current working directory and expects:
+
+ - `scenarios/`
+ - `fixtures/`
+ - optional `agentlab.config.yaml`
+
+ Run it from the project root you want to evaluate.
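
A quick preflight check along these lines can confirm the working directory before blaming the install. The `missingProjectEntries` helper is hypothetical and only tests for the two required directories named above; it does not look at the optional config file.

```typescript
import { existsSync } from "node:fs";
import * as path from "node:path";

// Hypothetical helper: list the required entries that are absent from `root`.
function missingProjectEntries(root: string): string[] {
  const required = ["scenarios", "fixtures"];
  return required.filter((name) => !existsSync(path.join(root, name)));
}

// Report on the directory the CLI would operate on.
const missingHere = missingProjectEntries(process.cwd());
if (missingHere.length > 0) {
  console.error(`Not a project root; missing: ${missingHere.join(", ")}`);
}
```

An empty result means the directory at least has the layout the CLI expects, so the cause is likely elsewhere.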
+
+ ## Release Verification
+
+ Before publishing or cutting a release, run:
+
+ ```bash
+ npm run check
+ npm test
+ npm run build
+ npm run smoke:cli
+ npm pack --dry-run
+ ```
+
+ For the full pre-launch checklist, see [release-checklist.md](release-checklist.md).
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "agent-regression-lab",
-   "version": "0.1.0",
+   "version": "0.2.0",
    "private": false,
    "description": "Local-first scenario-based evaluation harness for AI agents.",
    "license": "MIT",
@@ -25,13 +25,16 @@
    },
    "files": [
      "dist",
-     "README.md"
+     "dist/ui-assets",
+     "README.md",
+     "docs"
    ],
    "engines": {
      "node": ">=22"
    },
    "scripts": {
-     "build": "tsc -p tsconfig.json",
+     "build": "tsc -p tsconfig.json && npm run build:ui",
+     "build:ui": "esbuild src/ui/client.tsx --bundle --format=esm --platform=browser --outdir=dist/ui-assets --loader:.css=css --log-level=warning",
      "check": "tsc -p tsconfig.json --noEmit",
      "test": "tsx --test tests/**/*.test.ts",
      "smoke:cli": "npm run build && node dist/index.js --help && node dist/index.js version",