agent-regression-lab 0.1.1 → 0.3.0

@@ -0,0 +1,51 @@
# Memory And Stateful Agents

Memoryful agents are a distinct category in ARL.

Use `type: conversation` scenarios when the agent owns:

- conversation history
- internal memory
- internal tool execution
- session or conversation identifiers

## What ARL Owns

For conversation scenarios, ARL owns:

- the ordered user steps
- the generated `conversation_id`
- per-step and end-of-run evaluation
- trace capture
- run storage and comparison

## What The Agent Owns

For conversation scenarios, the agent owns:

- how it stores conversation state
- how it interprets `conversation_id`
- what internal tools it calls
- how it handles memory and recall across turns

## How To Test Memoryful Agents

Good memory-focused scenarios should cover:

- follow-up recall within one conversation
- refusal to leak identity or state across sessions
- correct handling of repeated turns
- graceful behavior when earlier turns are ambiguous or incomplete

## Recommended Stateful Regression Cases

- follow-up recall after two or more turns
- cross-session contamination
- stale memory overriding fresh input
- memory surviving the right turns but not the wrong sessions

## Design Rule

Use task scenarios when the runner should stay authoritative for tools and turn control.

Use conversation scenarios when the agent itself is being tested for memory, session behavior, or internal orchestration.
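
The cross-session contamination case above can be written as an ordinary conversation scenario. A hypothetical sketch (the id, name, and message are invented for illustration; the evaluator shapes follow the conventions in `docs/scenarios.md`):

```yaml
# Hypothetical scenario: a brand-new conversation gets a fresh
# conversation_id, so a well-behaved agent should have nothing to recall.
type: conversation
id: internal-teams.no-cross-session-recall
name: No Recall Across Sessions
suite: internal-teams
steps:
  - role: user
    message: "What seat preference did I mention earlier?"
    evaluators:
      - type: response_not_contains
        mode: hard_gate
        config:
          keywords: [aisle]
```

Run it after a scenario that stored the fact in a different conversation; a memory leak across sessions shows up as the hard gate failing.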
@@ -0,0 +1,94 @@
# Release Checklist

Use this before publishing a new npm version or telling users to upgrade.

## Verification

Run the full release gate:

```bash
npm run check
npm test
npm run build
npm run smoke:cli
npm pack --dry-run
```

## Manual CLI Flow

Verify the canonical workflow:

```bash
agentlab list scenarios
agentlab run support.refund-correct-order --agent mock-default
agentlab show <run-id>
agentlab run support.refund-correct-order --agent mock-default
agentlab compare <baseline-run-id> <candidate-run-id>
agentlab run --suite support --agent mock-default
agentlab run --suite support --agent mock-default
agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
agentlab ui
```

## Extension Smoke

Verify at least one extension path:

- run `support.refund-via-config-tool` with `custom-node-agent`, or
- verify a repo-local custom tool still loads from `agentlab.config.yaml`

## HTTP Provider Smoke

Verify the HTTP provider path for conversation scenarios:

1. Start a minimal echo server (or any running HTTP agent service)
2. Add a named `http` agent to `agentlab.config.yaml`:

   ```yaml
   agents:
     - name: my-agent
       provider: http
       url: http://localhost:3000/api/chat
   ```

3. Run a conversation scenario:

   ```bash
   agentlab run internal-teams.memory-followup-recall --agent my-agent
   ```

4. Confirm the run produces a pass/fail result and the CLI output shows turn-by-turn step status

If no live HTTP service is available, confirm the HTTP error paths work correctly:

```bash
agentlab run internal-teams.memory-followup-recall --agent my-agent
# (with no service running)
# Expected: status: error, terminationReason: http_connection_failed
```

## Docs Verification

Confirm these files match current behavior:

- `README.md`
- `docs/scenarios.md`
- `docs/tools.md`
- `docs/agents.md`
- `docs/troubleshooting.md`

Requirements:

- every command works as written
- every referenced path exists
- limitations are stated honestly
- `compare --suite` is documented using suite batch ids, not run ids

## Publish Hygiene

Before `npm publish`:

- confirm the package version is correct
- confirm the git tree contains the intended release changes
- confirm packaged UI assets are included in the tarball
- confirm the npm metadata still points at the correct repo, homepage, and issues URL
@@ -0,0 +1,67 @@
# Runtime Profiles

Runtime profiles are reusable test-environment overlays defined in `agentlab.config.yaml`.

They let you keep degraded-tool conditions and state-related authoring metadata out of individual scenarios.

## Why They Exist

Use a runtime profile when multiple scenarios should run under the same bad condition or seeded state instead of repeating that setup inline.

Typical uses:

- force one tool to time out
- return malformed or partial tool output
- keep a named profile for memory-related scenario setup

## Config Shape

```yaml
runtime_profiles:
  - name: timeout-orders-tool
    tool_faults:
      - tool: orders.list
        mode: timeout
        timeout_ms: 1500

  - name: malformed-docs-read
    tool_faults:
      - tool: docs.read
        mode: malformed_output
```

Supported tool fault modes:

- `timeout`
- `error`
- `malformed_output`
- `partial_output`
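
The two remaining modes follow the same shape. A sketch only: the profile names and target tools below are invented, and it is an assumption that `error` and `partial_output` need no extra keys (unlike `timeout`, which takes `timeout_ms`):

```yaml
runtime_profiles:
  # Hypothetical profiles illustrating the remaining fault modes.
  - name: error-crm-search
    tool_faults:
      - tool: crm.search_customer
        mode: error

  - name: partial-orders-list
    tool_faults:
      - tool: orders.list
        mode: partial_output
```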

## Scenario Usage

Reference the profile from the scenario:

```yaml
runtime_profile: timeout-orders-tool
```

Example command:

```bash
agentlab run internal-teams.tool-timeout-profile --agent mock-default
```

## Current Execution Scope

Today, runtime-profile fault injection is active only for task scenarios where ARL owns the tool loop.

That means:

- task scenarios: tool faults are injected deterministically by the runner
- conversation scenarios: the reference is allowed, but ARL does not intercept the HTTP agent's internal tools

The `state` block is available in config for reusable authoring metadata, but automatic seeded-state execution is not yet applied by the runner.

## Design Rule

Use runtime profiles for reusable conditions, not one-off scenario-specific quirks.
@@ -0,0 +1,419 @@
# Scenarios

Scenarios are YAML files under `scenarios/`. They are the core authoring interface for the product.

agentlab supports two scenario types:

- `task` — a single-instruction job for a tool-using agent (default, no `type` field needed)
- `conversation` — a multi-turn dialog with an HTTP agent

---

## Task Scenarios

Task scenarios are the default format. They describe a single job for an agent that uses tools to complete it.

### Required Shape

Each task scenario should define:

- `id`
- `name`
- `suite`
- `task`
- `tools`
- `evaluators`

Common optional fields:

- `description`
- `difficulty`
- `tags`
- `runtime_profile`
- `runtime`
- task `context`

### Example

```yaml
id: support.refund-correct-order
name: Refund The Correct Order
suite: support
difficulty: easy
description: Refund only the duplicated charge.
tags:
  - refund
  - support
task:
  instructions: |
    The customer says they were charged twice.
    Find the duplicated charge and refund only that order.
  context:
    customer_email: alice@example.com
tools:
  allowed:
    - crm.search_customer
    - orders.list
    - orders.refund
runtime:
  max_steps: 8
  timeout_seconds: 60
evaluators:
  - id: refund-created
    type: tool_call_assertion
    mode: hard_gate
    config:
      tool: orders.refund
      match:
        order_id: ord_1024
  - id: mentions-order
    type: final_answer_contains
    mode: weighted
    weight: 1
    config:
      required_substrings:
        - ord_1024
```

### Runtime Profiles

Task scenarios can reference a named `runtime_profile` from `agentlab.config.yaml`.

```yaml
runtime_profile: timeout-orders-tool
```

Runtime profiles let you apply reusable degraded-tool conditions without duplicating them across scenarios. Current shipped behavior:

- task scenarios: tool fault injection is active
- conversation scenarios: config reference is allowed for shared authoring, but ARL does not yet inject faults into the HTTP agent's internal tools

### Evaluators

Use deterministic evaluators only.

| Type | Description |
|------|-------------|
| `tool_call_assertion` | Assert a specific tool was called with specific input |
| `forbidden_tool` | Fail if a tool was called that should not have been |
| `final_answer_contains` | Check that the final output contains required substrings |
| `exact_final_answer` | Require an exact match on the final output |
| `step_count_max` | Fail if the agent used more steps than allowed |
| `tool_call_count_max` | Fail if the total number of tool calls exceeds a budget |
| `tool_repeat_max` | Fail if one tool is overused |
| `cost_max` | Fail if the run cost exceeds a configured USD budget |

Evaluator modes:

- `hard_gate` — failure immediately fails the run, regardless of other evaluators
- `weighted` — contributes to the weighted score (0–100)
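
As an illustration of how the two modes interact, here is a sketch, not ARL's actual scoring code; in particular, the fail score of 0, the default weight of 1, and the rounding are assumptions:

```javascript
// Illustrative combination of evaluator modes: any failed hard_gate
// fails the run outright; weighted evaluators produce a 0-100 score.
function scoreRun(results) {
  // results: [{ mode: 'hard_gate' | 'weighted', weight?, passed }]
  if (results.some((r) => r.mode === 'hard_gate' && !r.passed)) {
    return { status: 'fail', score: 0 }; // assumed: a gate failure zeroes the score
  }
  const weighted = results.filter((r) => r.mode === 'weighted');
  const total = weighted.reduce((sum, r) => sum + (r.weight ?? 1), 0);
  if (total === 0) return { status: 'pass', score: 100 };
  const earned = weighted.reduce(
    (sum, r) => sum + (r.passed ? (r.weight ?? 1) : 0),
    0
  );
  return { status: 'pass', score: Math.round((100 * earned) / total) };
}
```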

### Runtime Limits

```yaml
runtime:
  max_steps: 8
  timeout_seconds: 60
```

Both are optional. `max_steps` defaults to 8. `timeout_seconds` is uncapped if not set.

### Tools

Each task scenario declares its allowed tools:

```yaml
tools:
  allowed:
    - crm.search_customer
    - orders.list
    - orders.refund
  forbidden:
    - orders.delete
```

Keep the allowlist as narrow as possible. Broad allowlists weaken the benchmark.

### Budget And Governance Checks

Operational regressions are often just as important as correctness regressions. Use budget evaluators to encode "technically worked, but unacceptable in production":

```yaml
evaluators:
  - id: total-tool-budget
    type: tool_call_count_max
    mode: hard_gate
    config:
      max: 2
  - id: no-repeat-order-list
    type: tool_repeat_max
    mode: hard_gate
    config:
      tool: orders.list
      max: 1
```

Use `cost_max` only where the run records cost metadata.
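
Where cost metadata is recorded, a cost gate can be authored like the other budget evaluators. A sketch only: the exact `config` key for `cost_max` is an assumption based on the `max` convention above, so check the evaluator reference before relying on it:

```yaml
evaluators:
  - id: run-cost-budget
    type: cost_max
    mode: hard_gate
    config:
      max: 0.25  # USD; assumed key name, mirrors the other *_max evaluators
```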

---

## Conversation Scenarios

Conversation scenarios test HTTP agents through multi-turn dialogs. They require `type: conversation` and work exclusively with `provider: http` agents. The agent is responsible for maintaining its own conversation history.

### Required Shape

```yaml
type: conversation
id: internal-teams.memory-followup-recall
name: Follow-Up Recall Within Conversation
suite: internal-teams
steps:
  - role: user
    message: "I'm traveling next Tuesday and I prefer aisle seats. Please remember that."
  - role: user
    message: "What seat preference did I mention earlier?"
```

Each step must have:

- `role: user`
- `message` — the message sent to the agent this turn

### Per-Step Evaluators

Evaluators can be attached to individual steps. They run immediately after the agent replies to that step.

```yaml
steps:
  - role: user
    message: "Where's my order #ORD-001?"
    evaluators:
      - type: response_contains
        mode: hard_gate
        config:
          keywords: [shipped, tracking]
      - type: response_latency_max
        mode: hard_gate
        config:
          ms: 3000
  - role: user
    message: "What's the tracking number?"
    evaluators:
      - type: response_not_contains
        mode: weighted
        weight: 1
        config:
          keywords: ["don't know", error]
```

If a `hard_gate` per-step evaluator fails, the run stops immediately and remaining steps are skipped.

### Per-Step Evaluator Types

| Type | Config | Behavior |
|------|--------|----------|
| `response_contains` | `keywords: string[]` | Passes if ALL keywords appear in the reply (case-insensitive) |
| `response_not_contains` | `keywords: string[]` | Passes if NONE of the keywords appear in the reply (case-insensitive) |
| `response_matches_regex` | `pattern: string` | Passes if the reply matches the regex pattern (case-insensitive) |
| `response_latency_max` | `ms: number` | Passes if the HTTP response arrived within the time limit |
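
The keyword and regex checks are plain case-insensitive tests. A re-implementation sketch of the documented semantics (illustrative, not ARL source):

```javascript
// response_contains: ALL keywords must appear (case-insensitive).
function responseContains(reply, keywords) {
  const haystack = reply.toLowerCase();
  return keywords.every((k) => haystack.includes(k.toLowerCase()));
}

// response_not_contains: NONE of the keywords may appear.
function responseNotContains(reply, keywords) {
  const haystack = reply.toLowerCase();
  return keywords.every((k) => !haystack.includes(k.toLowerCase()));
}

// response_matches_regex: the reply must match the pattern, case-insensitively.
function responseMatchesRegex(reply, pattern) {
  return new RegExp(pattern, 'i').test(reply);
}
```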

### Scenario Quality Rules

- prefer `hard_gate` for business-critical assertions
- use `weighted` checks for quality gradients, not for the single condition that makes the scenario trustworthy
- conversation scenarios must use `config.keywords` for `response_contains` and `response_not_contains`
- stale `config.text` authoring is rejected
- use conversation scenarios when the agent owns memory, tool execution, or conversation history internally
- keep golden suites focused on repeatable workflows, historical regressions, and ugly edge cases rather than one-off demos

### End-of-Run Evaluators

End-of-run evaluators run after all steps complete. They apply to the final reply.

```yaml
evaluators:
  - type: step_count_max
    mode: hard_gate
    config:
      max: 10
  - type: final_answer_contains
    mode: weighted
    weight: 1
    config:
      keywords: [resolved, confirmed]
```

End-of-run evaluator types:

| Type | Config | Behavior |
|------|--------|----------|
| `step_count_max` | `max: number` | Passes if the number of completed turns is within the limit |
| `final_answer_contains` | `keywords: string[]` | Passes if ALL keywords appear in the final reply |
| `exact_final_answer` | `expected: string` | Passes if the final reply exactly matches the expected string |

### Conversation State

agentlab auto-generates a UUID `conversation_id` for each run. It is sent in every step request. The agent uses it to look up and maintain its own conversation history.

The `state` block is optional:

```yaml
state:
  conversation_id: auto
```

`auto` is the only supported value. The UUID is always generated regardless of whether the `state` block is present.
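
On the agent side, the simplest honest implementation of that contract is a history map keyed by `conversation_id`. A hypothetical sketch (field names are assumed, and the model call is stubbed out):

```javascript
// Hypothetical agent-side state: one history per conversation_id.
const histories = new Map();

function handleTurn({ conversation_id: id, message }) {
  if (!histories.has(id)) histories.set(id, []);
  const history = histories.get(id);
  history.push({ role: 'user', content: message });
  // A real agent would call its model with `history` here; this stub
  // just reports which user turn it is seeing for that conversation.
  const userTurns = history.filter((m) => m.role === 'user').length;
  const reply = `turn ${userTurns}: ${message}`;
  history.push({ role: 'assistant', content: reply });
  return { reply };
}
```

Because histories are keyed by the run's UUID, two runs never share state, which is exactly the isolation the cross-session regression cases probe for.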

### Restrictions

Conversation scenarios must not define a `tools:` field. HTTP agents manage their own tools internally. If `tools:` is present, validation will fail with a clear error.

Conversation scenarios may define `runtime_profile`, but today that is for shared scenario organization and future stateful hooks. ARL does not inject tool faults into HTTP agents.

---

## Suite Definitions

Scenario `suite` still groups related files, but operational launch workflows should use config-level `suite_definitions`.

Example:

```yaml
suite_definitions:
  - name: pre_merge
    include:
      tags:
        - smoke
        - regression
```

Run one with:

```bash
agentlab run --suite-def pre_merge --agent mock-default
agentlab run --suite-def pre_merge --variant-set refund-agent-model-comparison
```

Use suite definitions for stable workflow units like:

- `smoke`
- `pre_merge`
- `release`
- `incident_regressions`

### Full Example

```yaml
type: conversation
id: internal-teams.memory-followup-recall
name: Follow-Up Recall Within Conversation
suite: internal-teams
description: Memoryful agent should recall a user-provided fact later in the same conversation.
difficulty: medium
tags:
  - internal-teams
  - conversation

steps:
  - role: user
    message: "I'm traveling next Tuesday and I prefer aisle seats. Please remember that."
    evaluators:
      - type: response_contains
        mode: weighted
        config:
          keywords:
            - aisle

  - role: user
    message: "What seat preference did I mention earlier?"
    evaluators:
      - type: response_contains
        mode: hard_gate
        config:
          keywords:
            - aisle

evaluators:
  - type: step_count_max
    mode: hard_gate
    config:
      max: 2
```

Run it with:

```bash
agentlab run internal-teams.memory-followup-recall --agent my-production-agent
```

Where `my-production-agent` is a named `http` agent in `agentlab.config.yaml`. See [agents.md](agents.md) for HTTP agent config.
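
For reference, the agent entry follows the shape shown in this package's release checklist (the URL here is illustrative; point it at your running service):

```yaml
agents:
  - name: my-production-agent
    provider: http
    url: http://localhost:3000/api/chat
```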

### CLI Output

Conversation runs print a different output format from task runs:

```
run internal-teams.memory-followup-recall — PASS
agent: my-production-agent (http://localhost:3000/api/chat)
turns completed: 2/2
step 1: pass (response_contains ✓)
step 2: pass (response_contains ✓)
run id: run_20260407_001234
```

If a hard gate fails mid-run:

```
run internal-teams.memory-followup-recall — FAIL
agent: my-production-agent (http://localhost:3000/api/chat)
turns completed: 1/2
step 1: FAIL (response_contains ✗)
run stopped (evaluator_failed)
run id: run_20260407_001235
```

---

## Suites

Both task and conversation scenarios can belong to a suite.

```yaml
suite: support
```

Run an entire suite:

```bash
agentlab run --suite support --agent mock-default
```

`run --suite` skips conversation scenarios when using non-HTTP agents (conversation scenarios require `provider: http`). Task scenarios and conversation scenarios can coexist in the same suite directory.

`run --suite` prints a suite batch id at the end. That id is used for suite comparison:

```bash
agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
```

---

## Authoring Conventions

- `id` format: `<suite>.<short-name>`
- keep scenario jobs narrow and concrete
- keep fixture-backed context in `task.context` (task scenarios)
- prefer deterministic fixture references over open-ended prompts
- include `difficulty`, `description`, and `tags` for every launch scenario
- for conversation scenarios, keep step count low (2–5) and evaluators specific

## Current Examples

Task scenario references in this repo:

- support: `scenarios/support/refund-correct-order.yaml`
- support with config tool: `scenarios/support/refund-via-config-tool.yaml`
- coding: `scenarios/coding/fix-add-function.yaml`
- research: `scenarios/research/remote-work-policy.yaml`
- ops: `scenarios/ops/payments-api-alert.yaml`