agent-regression-lab 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,74 @@
+ # Golden Suites
+
+ Golden suites are the scenario portfolio that internal engineering teams should keep as long-lived regression assets.
+
+ They are not just demos. They are engineering memory for the behaviors that matter before merge and before release.
+
+ ## Required Launch Categories
+
+ - coding agent regressions
+ - support and policy agents
+ - incident / ops agents
+ - memoryful multi-turn agents
+ - tool-failure recovery
+ - ambiguity and escalation
+ - adversarial or malformed tool output
+ - cost / latency / step-discipline checks
+
+ ## Recommended Portfolio Composition
+
+ - 5 golden workflows
+ - 5 historical regressions
+ - 5 ugly edge failures
+ - 3 degraded-tool scenarios
+ - 2 policy or escalation scenarios
+
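One way to realize this composition is to tag each scenario with its portfolio category. A minimal sketch, assuming scenarios carry a `suite` name and a free-form `tags` list; the specific ids and tag names here are illustrative, not a fixed taxonomy:

```yaml
# Hypothetical scenario metadata; ids and tag names are illustrative.
id: support.refund-golden-path
suite: support
tags: [golden, smoke]
---
id: support.refund-incident-regression
suite: support
tags: [incident, regression]
---
id: internal-teams.tool-timeout-profile
suite: internal-teams
tags: [edge, degraded-tool]
```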
+ ## How To Use Golden Suites
+
+ 1. Keep one or two scenarios for the happy path that must always work.
+ 2. Add scenarios from real incidents as soon as a failure is understood.
+ 3. Add edge-case scenarios for ambiguity, degraded tools, malformed outputs, and multi-turn drift.
+ 4. Group launch-critical workflows into config-level `suite_definitions`.
+ 5. Run one scenario while debugging locally.
+ 6. Run a `pre_merge` suite definition before merge.
+ 7. Run curated `release` and `incident_regressions` suite definitions before release.
+
+ ## Suggested Initial Internal-Team Scenarios
+
+ - destructive-edit guardrails for coding agents
+ - incident triage under noisy alerts
+ - escalation on ambiguity instead of guessing
+ - malformed tool output or partial tool output
+ - cross-session memory leakage
+ - follow-up recall across turns
+
+ ## Design Rule
+
+ Treat suite composition as a product artifact.
+
+ The suite is part of the system design, not a disposable test folder.
+
+ ## Recommended Suite Definitions
+
+ Use first-class `suite_definitions` instead of ad hoc tags alone:
+
+ ```yaml
+ suite_definitions:
+   - name: smoke
+     include:
+       tags: [smoke]
+
+   - name: pre_merge
+     include:
+       tags: [smoke, regression]
+
+   - name: release
+     include:
+       suites: [support, internal-teams]
+
+   - name: incident_regressions
+     include:
+       tags: [incident, regression]
+ ```
+
+ These become the operational units you wire into local verification, pre-merge checks, and release readiness.
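The include rules above can be resolved mechanically into concrete scenario sets. A minimal sketch of that idea, assuming each scenario record carries a `suite` name and a `tags` list; the field names and data shapes are illustrative assumptions, not the tool's actual schema:

```python
# Sketch: resolve a suite definition's include rules to a scenario set.
# Scenario fields (suite, tags) are illustrative assumptions, not ARL's real schema.

def resolve_suite(definition, scenarios):
    """Return the scenarios matched by a suite definition's include rules."""
    include = definition.get("include", {})
    want_tags = set(include.get("tags", []))
    want_suites = set(include.get("suites", []))
    selected = []
    for scenario in scenarios:
        if want_tags & set(scenario.get("tags", [])):
            selected.append(scenario)
        elif scenario.get("suite") in want_suites:
            selected.append(scenario)
    return selected

scenarios = [
    {"id": "support.refund", "suite": "support", "tags": ["smoke"]},
    {"id": "support.policy-escalation", "suite": "support", "tags": ["regression"]},
    {"id": "internal-teams.memory-recall", "suite": "internal-teams", "tags": ["incident", "regression"]},
]

pre_merge = {"name": "pre_merge", "include": {"tags": ["smoke", "regression"]}}
release = {"name": "release", "include": {"suites": ["support", "internal-teams"]}}

# pre_merge matches anything tagged smoke or regression; release matches by suite.
print([s["id"] for s in resolve_suite(pre_merge, scenarios)])
print([s["id"] for s in resolve_suite(release, scenarios)])
```

The key design point is that a suite definition is a query over scenario metadata, so adding a new incident regression to the right tag automatically grows every suite that includes that tag.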
@@ -0,0 +1,58 @@
+ # Integrations And Live Services
+
+ Use this guide to choose the right ARL (agent-regression-lab) provider path for the engineering question you are trying to answer.
+
+ ## Provider Matrix
+
+ ### `mock`
+
+ Use when you want:
+
+ - deterministic smoke tests
+ - stable docs examples
+ - baseline verification while changing the harness itself
+
+ ### `openai`
+
+ Use when you want:
+
+ - real model behavior against deterministic tool surfaces
+ - prompt and model validation before merge
+ - quick local comparisons where the model is the variable
+
+ ### `external_process`
+
+ Use when you want:
+
+ - a local Node or Python agent to participate in the runner-controlled tool loop
+ - the runner to remain authoritative for tools, step limits, and storage
+ - a thin adapter around an existing local agent implementation
+
+ ### `http`
+
+ Use when you want:
+
+ - production-like multi-turn validation against a running service
+ - the agent to own memory, conversation history, and internal tool execution
+ - live verification of a real app instead of a deterministic wrapper
+
+ `arl-test/` is the canonical example of this path in this repo.
+
+ ## Live-Service Verification
+
+ Default workflow:
+
+ 1. start the service
+ 2. run `agentlab` from the project containing the relevant scenarios and `agentlab.config.yaml`
+ 3. run one scenario while debugging
+ 4. run a suite before merge
+ 5. compare candidate runs or suite batches against a known baseline
+
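Step 5 of the workflow can be as simple as a per-scenario status diff. A minimal sketch, assuming run results are available as scenario-id-to-status mappings; this shape is an assumption for illustration, and ARL's actual run storage format may differ:

```python
# Sketch: compare a candidate run against a known baseline, per scenario.
# The {scenario_id: "pass" | "fail"} shape is an illustrative assumption.

def compare_runs(baseline, candidate):
    """Classify each baseline scenario as a regression, a fix, or unchanged."""
    report = {"regressions": [], "fixes": [], "unchanged": []}
    for scenario_id, base_status in baseline.items():
        cand_status = candidate.get(scenario_id, "missing")
        if base_status == "pass" and cand_status != "pass":
            report["regressions"].append(scenario_id)
        elif base_status != "pass" and cand_status == "pass":
            report["fixes"].append(scenario_id)
        else:
            report["unchanged"].append(scenario_id)
    return report

baseline = {"support.refund": "pass", "internal-teams.memory-recall": "fail"}
candidate = {"support.refund": "fail", "internal-teams.memory-recall": "pass"}

report = compare_runs(baseline, candidate)
print(report["regressions"])  # ['support.refund']
print(report["fixes"])        # ['internal-teams.memory-recall']
```

The useful property is asymmetry: a new failure on a previously passing scenario blocks merge, while a new pass on a previously failing scenario is evidence the candidate fixed something.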
+ ## Integration Design Rule
+
+ Choose the simplest provider that answers the engineering question you have.
+
+ - If you only need deterministic regression evidence, prefer `mock`.
+ - If you need real model behavior but deterministic tools, prefer `openai`.
+ - If you need a local agent implementation but still want runner-owned tools, prefer `external_process`.
+ - If you need the real running service with its own memory and orchestration, use `http`.
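All four paths can coexist in one `agentlab.config.yaml` as named agents. A sketch for illustration: the `name`, `provider`, and `url` fields appear elsewhere in these docs, while the `model` and `command` fields are assumptions about the config schema, not confirmed keys:

```yaml
agents:
  - name: mock-default
    provider: mock

  - name: openai-candidate
    provider: openai
    model: gpt-4o-mini        # assumed field name

  - name: custom-node-agent
    provider: external_process
    command: node agent.js    # assumed field name

  - name: my-agent
    provider: http
    url: http://localhost:3000/api/chat
```

Keeping all providers named in one config lets you rerun the same scenario against each path with only the `--agent` selection changing.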
@@ -0,0 +1,51 @@
+ # Memory And Stateful Agents
+
+ Memoryful agents are a distinct category in ARL.
+
+ Use `type: conversation` scenarios when the agent owns:
+
+ - conversation history
+ - internal memory
+ - internal tool execution
+ - session or conversation identifiers
+
+ ## What ARL Owns
+
+ For conversation scenarios, ARL owns:
+
+ - the ordered user steps
+ - the generated `conversation_id`
+ - per-step and end-of-run evaluation
+ - trace capture
+ - run storage and comparison
+
+ ## What The Agent Owns
+
+ For conversation scenarios, the agent owns:
+
+ - how it stores conversation state
+ - how it interprets `conversation_id`
+ - what internal tools it calls
+ - how it handles memory and recall across turns
+
+ ## How To Test Memoryful Agents
+
+ Good memory-focused scenarios should cover:
+
+ - follow-up recall within one conversation
+ - refusal to leak identity or state across sessions
+ - correct handling of repeated turns
+ - graceful behavior when earlier turns are ambiguous or incomplete
+
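A follow-up recall scenario from the first bullet above might look like the following. This is a hedged sketch: `type: conversation` and the idea of ordered user steps come from this page, but the exact step and assertion field names (`steps`, `user`, `expect_contains`) are assumptions, not the tool's confirmed scenario schema:

```yaml
# Hypothetical conversation scenario; step/assertion field names are assumed.
id: internal-teams.memory-followup-recall
type: conversation
steps:
  - user: "My order number is 8841 and it arrived damaged."
  - user: "What was my order number again?"
    expect_contains: "8841"   # assumed assertion field
```

The point of the shape is that the runner only sends ordered user turns under one generated `conversation_id`; whether "8841" comes back on turn two is entirely a property of the agent's own memory.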
+ ## Recommended Stateful Regression Cases
+
+ - follow-up recall after two or more turns
+ - cross-session contamination
+ - stale memory overriding fresh input
+ - memory surviving the right turns but not the wrong sessions
+
+ ## Design Rule
+
+ Use task scenarios when the runner should stay authoritative for tools and turn control.
+
+ Use conversation scenarios when the agent itself is being tested for memory, session behavior, or internal orchestration.
@@ -37,6 +37,36 @@ Verify at least one extension path:
  - run `support.refund-via-config-tool` with `custom-node-agent`, or
  - verify a repo-local custom tool still loads from `agentlab.config.yaml`
 
+ ## HTTP Provider Smoke
+
+ Verify the HTTP provider path for conversation scenarios:
+
+ 1. Start a minimal echo server (or any running HTTP agent service)
+ 2. Add a named `http` agent to `agentlab.config.yaml`:
+
+ ```yaml
+ agents:
+   - name: my-agent
+     provider: http
+     url: http://localhost:3000/api/chat
+ ```
+
+ 3. Run a conversation scenario:
+
+ ```bash
+ agentlab run internal-teams.memory-followup-recall --agent my-agent
+ ```
+
+ 4. Confirm the run produces a pass/fail result and the CLI output shows turn-by-turn step status
+
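The minimal echo server from step 1 can be a few lines of stdlib Python. This is a sketch only: the request/response JSON shape (`message` in, `reply` out) is an assumption for illustration, so match whatever contract your HTTP agent endpoint actually uses.

```python
# Minimal echo "agent" endpoint for smoke-testing the HTTP provider path.
# The {"message": ...} -> {"reply": ...} JSON shape is an illustrative assumption.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoChatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"reply": payload.get("message", "")}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep smoke-test output quiet

def serve(port=3000):
    """Serve the echo endpoint until interrupted."""
    HTTPServer(("localhost", port), EchoChatHandler).serve_forever()
```

Call `serve()` to start it on port 3000, then point the `http` agent's `url` at `http://localhost:3000/api/chat`.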
+ If no live HTTP service is available, confirm the HTTP error paths work correctly:
+
+ ```bash
+ agentlab run internal-teams.memory-followup-recall --agent my-agent
+ # (with no service running)
+ # Expected: status: error, terminationReason: http_connection_failed
+ ```
+
  ## Docs Verification
 
  Confirm these files match current behavior:
@@ -0,0 +1,67 @@
+ # Runtime Profiles
+
+ Runtime profiles are reusable test-environment overlays defined in `agentlab.config.yaml`.
+
+ They let you keep degraded-tool conditions and state-related authoring metadata out of individual scenarios.
+
+ ## Why They Exist
+
+ Use a runtime profile when multiple scenarios should run under the same degraded condition or seeded state, instead of repeating that setup inline in each scenario.
+
+ Typical uses:
+
+ - force one tool to time out
+ - return malformed or partial tool output
+ - keep a named profile for memory-related scenario setup
+
+ ## Config Shape
+
+ ```yaml
+ runtime_profiles:
+   - name: timeout-orders-tool
+     tool_faults:
+       - tool: orders.list
+         mode: timeout
+         timeout_ms: 1500
+
+   - name: malformed-docs-read
+     tool_faults:
+       - tool: docs.read
+         mode: malformed_output
+ ```
+
+ Supported tool fault modes:
+
+ - `timeout`
+ - `error`
+ - `malformed_output`
+ - `partial_output`
+
+ ## Scenario Usage
+
+ Reference the profile from the scenario:
+
+ ```yaml
+ runtime_profile: timeout-orders-tool
+ ```
+
+ Example command:
+
+ ```bash
+ agentlab run internal-teams.tool-timeout-profile --agent mock-default
+ ```
+
+ ## Current Execution Scope
+
+ Today, runtime-profile fault injection is active only for task scenarios where ARL owns the tool loop.
+
+ That means:
+
+ - task scenarios: tool faults are injected deterministically by the runner
+ - conversation scenarios: the reference is allowed, but ARL does not intercept the HTTP agent's internal tools
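Deterministic injection in the runner-owned tool loop amounts to routing every tool call through the profile's fault table. A minimal sketch of that idea; the dict shapes mirror the config example above, but this is not ARL's actual implementation:

```python
# Sketch: apply a runtime profile's tool_faults before dispatching a tool call.
# Not ARL's real implementation; shapes mirror the runtime_profiles config above.

class ToolTimeout(Exception):
    pass

def call_with_faults(profile, tool_name, real_tool, *args):
    """Dispatch a tool call, injecting the profile's configured fault if any."""
    faults = {f["tool"]: f for f in profile.get("tool_faults", [])}
    fault = faults.get(tool_name)
    if fault is None:
        return real_tool(*args)          # no fault configured for this tool
    mode = fault["mode"]
    if mode == "timeout":
        raise ToolTimeout("%s timed out after %dms" % (tool_name, fault["timeout_ms"]))
    if mode == "error":
        raise RuntimeError("%s failed (injected error)" % tool_name)
    if mode == "malformed_output":
        return "{not-json"               # deliberately unparseable
    if mode == "partial_output":
        return real_tool(*args)[:10]     # truncate the real result
    raise ValueError("unknown fault mode: %s" % mode)

profile = {
    "name": "timeout-orders-tool",
    "tool_faults": [{"tool": "orders.list", "mode": "timeout", "timeout_ms": 1500}],
}

def orders_list():
    return ["order-8841", "order-9090"]
```

Because the wrapper sits in the runner's own tool loop, the injected faults are deterministic and replayable, which is exactly why conversation scenarios (where the agent calls its own tools internally) are out of scope today.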
+
+ The `state` block is available in config for reusable authoring metadata, but automatic seeded-state execution is not yet applied by the runner.
+
+ ## Design Rule
+
+ Use runtime profiles for reusable conditions, not one-off scenario-specific quirks.