agent-regression-lab 0.2.0 → 0.3.0
- package/README.md +53 -7
- package/dist/agent/factory.js +20 -6
- package/dist/agent/httpAdapter.js +5 -4
- package/dist/config.js +186 -3
- package/dist/evaluators.js +56 -1
- package/dist/index.js +143 -11
- package/dist/lib/id.js +3 -0
- package/dist/runOutput.js +46 -0
- package/dist/runner.js +31 -9
- package/dist/scenarios.js +90 -2
- package/dist/scoring.js +2 -2
- package/dist/storage.js +117 -7
- package/dist/tools.js +38 -0
- package/dist/trace.js +4 -2
- package/dist/ui/App.js +28 -2
- package/dist/ui-assets/client.js +82 -0
- package/docs/agents.md +143 -8
- package/docs/golden-suites.md +74 -0
- package/docs/integrations-and-live-services.md +58 -0
- package/docs/memory-and-stateful-agents.md +51 -0
- package/docs/release-checklist.md +30 -0
- package/docs/runtime-profiles.md +67 -0
- package/docs/scenarios.md +303 -56
- package/docs/troubleshooting.md +138 -0
- package/docs/variant-sets.md +63 -0
- package/package.json +2 -2
@@ -0,0 +1,74 @@
# Golden Suites

Golden suites are the scenario portfolio internal engineering teams should keep as long-lived regression assets.

They are not just demos. They are engineering memory for the behaviors that matter before merge and before release.

## Required Launch Categories

- coding agent regressions
- support and policy agents
- incident / ops agents
- memoryful multi-turn agents
- tool-failure recovery
- ambiguity and escalation
- adversarial or malformed tool output
- cost / latency / step-discipline checks

## Recommended Portfolio Composition

- 5 golden workflows
- 5 historical regressions
- 5 ugly edge failures
- 3 degraded-tool scenarios
- 2 policy or escalation scenarios
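
The composition above adds up to 20 scenarios. A minimal sketch of a check that a portfolio still matches those counts, treating the recommended numbers as targets. It assumes each scenario carries exactly one category label; the label names and the `check_composition` helper are hypothetical and not part of ARL.

```python
from collections import Counter

# Target counts taken from the recommended composition above.
# The category keys are hypothetical labels, not ARL identifiers.
TARGET_COMPOSITION = {
    "golden_workflow": 5,
    "historical_regression": 5,
    "ugly_edge_failure": 5,
    "degraded_tool": 3,
    "policy_or_escalation": 2,
}


def check_composition(categories):
    """Return {category: (actual, target)} for every category that is off target."""
    counts = Counter(categories)
    return {
        name: (counts.get(name, 0), target)
        for name, target in TARGET_COMPOSITION.items()
        if counts.get(name, 0) != target
    }


# Example portfolio that is one "ugly edge failure" short.
portfolio = (
    ["golden_workflow"] * 5
    + ["historical_regression"] * 5
    + ["ugly_edge_failure"] * 4
    + ["degraded_tool"] * 3
    + ["policy_or_escalation"] * 2
)
print(check_composition(portfolio))  # {'ugly_edge_failure': (4, 5)}
```

An empty result means the portfolio matches the recommended mix.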

## How To Use Golden Suites

1. Keep one or two scenarios for the happy path that must always work.
2. Add scenarios from real incidents as soon as a failure is understood.
3. Add edge-case scenarios for ambiguity, degraded tools, malformed outputs, and multi-turn drift.
4. Group launch-critical workflows into config-level `suite_definitions`.
5. Run one scenario while debugging locally.
6. Run a `pre_merge` suite definition before merge.
7. Run curated `release` and `incident_regressions` suite definitions before release.

## Suggested Initial Internal-Team Scenarios

- coding destructive edit guardrails
- incident triage under noisy alerts
- escalation on ambiguity instead of guessing
- malformed tool output or partial tool output
- cross-session memory leakage
- follow-up recall across turns

## Design Rule

Treat suite composition as a product artifact.

The suite is part of the system design, not a disposable test folder.

## Recommended Suite Definitions

Use first-class `suite_definitions` instead of ad hoc tags alone:

```yaml
suite_definitions:
  - name: smoke
    include:
      tags: [smoke]

  - name: pre_merge
    include:
      tags: [smoke, regression]

  - name: release
    include:
      suites: [support, internal-teams]

  - name: incident_regressions
    include:
      tags: [incident, regression]
```

These become the operational units you wire into local verification, pre-merge checks, and release readiness.
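
How `include` resolves to a concrete scenario list is not spelled out in the golden-suites text. The sketch below assumes a scenario is selected when it carries any listed tag or belongs to any listed suite. That any-match rule, the scenario records, and the `select_scenarios` helper are all assumptions for illustration, not ARL's documented behavior.

```python
# Suite definitions mirrored from the golden-suites YAML as plain dicts.
SUITE_DEFINITIONS = {
    "smoke": {"tags": ["smoke"]},
    "pre_merge": {"tags": ["smoke", "regression"]},
    "release": {"suites": ["support", "internal-teams"]},
    "incident_regressions": {"tags": ["incident", "regression"]},
}

# Hypothetical scenario metadata; ids and fields are made up for this example.
SCENARIOS = [
    {"id": "support.example-happy-path", "suite": "support", "tags": ["smoke"]},
    {"id": "support.example-incident-replay", "suite": "support", "tags": ["incident", "regression"]},
    {"id": "coding.example-slow-tool", "suite": "coding", "tags": ["regression"]},
]


def select_scenarios(definition_name):
    """Return scenario ids for a suite definition, ASSUMING any-match semantics."""
    include = SUITE_DEFINITIONS[definition_name]
    tags = set(include.get("tags", []))
    suites = set(include.get("suites", []))
    return [
        scenario["id"]
        for scenario in SCENARIOS
        if tags & set(scenario["tags"]) or scenario["suite"] in suites
    ]


print(select_scenarios("smoke"))  # ['support.example-happy-path']
```

Under this assumption `pre_merge` is a superset of `smoke`, which fits running the smaller suite while debugging and the larger one before merge.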
@@ -0,0 +1,58 @@
# Integrations And Live Services

Use this guide to choose the right ARL provider path for the engineering question you are trying to answer.

## Provider Matrix

### `mock`

Use when you want:

- deterministic smoke tests
- stable docs examples
- baseline verification while changing the harness itself

### `openai`

Use when you want:

- real model behavior against deterministic tool surfaces
- prompt and model validation before merge
- quick local comparisons where the model is the variable

### `external_process`

Use when you want:

- a local Node or Python agent to participate in the runner-controlled tool loop
- the runner to remain authoritative for tools, step limits, and storage
- a thin adapter around an existing local agent implementation

### `http`

Use when you want:

- production-like multi-turn validation against a running service
- the agent to own memory, conversation history, and internal tool execution
- live verification of a real app instead of a deterministic wrapper

`arl-test/` is the canonical example of this path in this repo.

## Live-Service Verification

Default workflow:

1. start the service
2. run `agentlab` from the project containing the relevant scenarios and `agentlab.config.yaml`
3. run one scenario while debugging
4. run a suite before merge
5. compare candidate runs or suite batches against a known baseline

## Integration Design Rule

Choose the simplest provider that answers the engineering question you have.

- If you only need deterministic regression evidence, prefer `mock`.
- If you need real model behavior but deterministic tools, prefer `openai`.
- If you need a local agent implementation but still want runner-owned tools, prefer `external_process`.
- If you need the real running service with its own memory and orchestration, use `http`.
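
The integration design rule can be read as a short decision function. This is a sketch only: the provider names come from the provider matrix, while the `choose_provider` helper and its flag names are hypothetical and not part of ARL.

```python
def choose_provider(needs_live_service=False, needs_local_agent=False, needs_real_model=False):
    """Pick the simplest provider that covers the stated needs.

    The most demanding need is checked first, so the function falls back
    to `mock` only when nothing beyond deterministic evidence is required.
    """
    if needs_live_service:
        return "http"
    if needs_local_agent:
        return "external_process"
    if needs_real_model:
        return "openai"
    return "mock"


print(choose_provider())                       # mock
print(choose_provider(needs_real_model=True))  # openai
```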
@@ -0,0 +1,51 @@
# Memory And Stateful Agents

Memoryful agents are a distinct category in ARL.

Use `type: conversation` scenarios when the agent owns:

- conversation history
- internal memory
- internal tool execution
- session or conversation identifiers

## What ARL Owns

For conversation scenarios, ARL owns:

- the ordered user steps
- the generated `conversation_id`
- per-step and end-of-run evaluation
- trace capture
- run storage and comparison

## What The Agent Owns

For conversation scenarios, the agent owns:

- how it stores conversation state
- how it interprets `conversation_id`
- what internal tools it calls
- how it handles memory and recall across turns

## How To Test Memoryful Agents

Good memory-focused scenarios should cover:

- follow-up recall within one conversation
- refusal to leak identity or state across sessions
- correct handling of repeated turns
- graceful behavior when earlier turns are ambiguous or incomplete

## Recommended Stateful Regression Cases

- follow-up recall after two or more turns
- cross-session contamination
- stale memory overriding fresh input
- memory surviving the right turns but not the wrong sessions
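
To make these cases concrete, here is a toy agent that keys its memory by `conversation_id`, together with the behavior the regression cases above would look for. The `ToyMemoryAgent` class and its message phrasing are hypothetical; a real agent decides for itself how it stores state and interprets `conversation_id`.

```python
class ToyMemoryAgent:
    """Toy agent that stores remembered facts per conversation_id."""

    def __init__(self):
        # conversation_id -> {fact name: value}
        self._memory = {}

    def handle_turn(self, conversation_id, message):
        facts = self._memory.setdefault(conversation_id, {})
        if message.startswith("my name is "):
            # Fresh input overwrites whatever was stored earlier.
            facts["name"] = message[len("my name is "):]
            return "noted"
        if message == "what is my name?":
            return facts.get("name", "I don't know")
        return "ok"


agent = ToyMemoryAgent()
agent.handle_turn("conv-a", "my name is Ada")
agent.handle_turn("conv-a", "unrelated turn")
print(agent.handle_turn("conv-a", "what is my name?"))  # Ada
print(agent.handle_turn("conv-b", "what is my name?"))  # I don't know
```

The first print covers follow-up recall after two or more turns. The second covers cross-session contamination: a different `conversation_id` must not see the stored name.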

## Design Rule

Use task scenarios when the runner should stay authoritative for tools and turn control.

Use conversation scenarios when the agent itself is being tested for memory, session behavior, or internal orchestration.
@@ -37,6 +37,36 @@ Verify at least one extension path:
- run `support.refund-via-config-tool` with `custom-node-agent`, or
- verify a repo-local custom tool still loads from `agentlab.config.yaml`

## HTTP Provider Smoke

Verify the HTTP provider path for conversation scenarios:

1. Start a minimal echo server (or any running HTTP agent service)
2. Add a named `http` agent to `agentlab.config.yaml`:

```yaml
agents:
  - name: my-agent
    provider: http
    url: http://localhost:3000/api/chat
```

3. Run a conversation scenario:

```bash
agentlab run internal-teams.memory-followup-recall --agent my-agent
```

4. Confirm the run produces a pass/fail result and the CLI output shows turn-by-turn step status

If no live HTTP service is available, confirm the HTTP error paths work correctly:

```bash
agentlab run internal-teams.memory-followup-recall --agent my-agent
# (with no service running)
# Expected: status: error, terminationReason: http_connection_failed
```
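
Step 1 of the smoke procedure calls for a minimal echo server. A sketch of one using only the Python standard library follows. The payload ARL sends to an `http` agent and the response shape it expects are not described in this checklist, so the `echo` wrapper in the reply is a placeholder assumption, and `EchoHandler` and `make_server` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Reply to every POST with the JSON body it received."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        try:
            received = json.loads(raw or b"{}")
        except ValueError:
            # Not valid JSON (or not valid UTF-8): echo it back as text.
            received = {"raw": raw.decode("utf-8", errors="replace")}
        # Placeholder response shape: the request wrapped under "echo".
        body = json.dumps({"echo": received}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the smoke run quiet


def make_server(port=3000):
    """Build (but do not start) the echo server; port 0 picks a free port."""
    return HTTPServer(("127.0.0.1", port), EchoHandler)
```

Calling `make_server().serve_forever()` listens on port 3000. The handler ignores the request path, so the `/api/chat` URL from the config example reaches it. Adapt the reply to whatever response shape the `http` provider actually requires before relying on a pass/fail result.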

## Docs Verification

Confirm these files match current behavior:
@@ -0,0 +1,67 @@
# Runtime Profiles

Runtime profiles are reusable test-environment overlays defined in `agentlab.config.yaml`.

They let you keep degraded-tool conditions and state-related authoring metadata out of individual scenarios.

## Why They Exist

Use a runtime profile when multiple scenarios should run under the same bad condition or seeded state instead of repeating that setup inline.

Typical uses:

- force one tool to time out
- return malformed or partial tool output
- keep a named profile for memory-related scenario setup

## Config Shape

```yaml
runtime_profiles:
  - name: timeout-orders-tool
    tool_faults:
      - tool: orders.list
        mode: timeout
        timeout_ms: 1500

  - name: malformed-docs-read
    tool_faults:
      - tool: docs.read
        mode: malformed_output
```

Supported tool fault modes:

- `timeout`
- `error`
- `malformed_output`
- `partial_output`
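
To illustrate what each mode means for the code consuming a tool result, here is a sketch of a fault-injecting wrapper. It is not ARL's implementation: `call_with_fault`, the exception classes, and the `list_orders` stand-in are hypothetical, and the exact output ARL produces for `malformed_output` and `partial_output` is an assumption.

```python
import json


class ToolTimeout(Exception):
    """Raised when a simulated tool call exceeds its time budget."""


class ToolError(Exception):
    """Raised when a simulated tool call fails outright."""


def call_with_fault(tool, args, fault=None):
    """Call tool(**args) and return its JSON text, applying `fault` if given.

    `fault` is a dict shaped like one `tool_faults` entry,
    for example {"mode": "timeout", "timeout_ms": 1500}.
    """
    if fault is None:
        return json.dumps(tool(**args))
    mode = fault["mode"]
    if mode == "timeout":
        # No real sleep: the timeout is reported immediately to stay deterministic.
        raise ToolTimeout(f"timed out after {fault.get('timeout_ms', 0)} ms")
    if mode == "error":
        raise ToolError("injected tool error")
    payload = json.dumps(tool(**args))
    if mode == "malformed_output":
        return payload + "<<garbage"  # no longer valid JSON
    if mode == "partial_output":
        return payload[: len(payload) // 2]  # cut off mid-document
    raise ValueError(f"unknown fault mode: {mode}")


def list_orders(customer_id):
    """Hypothetical stand-in for the `orders.list` tool."""
    return {"customer_id": customer_id, "orders": ["A-1", "A-2"]}
```

A scenario run under a `timeout` fault would then check that the agent recovers or escalates rather than proceeding as if the tool had returned data.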

## Scenario Usage

Reference the profile from the scenario:

```yaml
runtime_profile: timeout-orders-tool
```

Example command:

```bash
agentlab run internal-teams.tool-timeout-profile --agent mock-default
```

## Current Execution Scope

Today, runtime-profile fault injection is active only for task scenarios where ARL owns the tool loop.

That means:

- task scenarios: tool faults are injected deterministically by the runner
- conversation scenarios: the reference is allowed, but ARL does not intercept the HTTP agent's internal tools

The `state` block is available in config for reusable authoring metadata, but automatic seeded-state execution is not yet applied by the runner.

## Design Rule

Use runtime profiles for reusable conditions, not one-off scenario-specific quirks.