agent-regression-lab 0.2.0 → 0.4.0 (npm version diff, Mend)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (40)

package/README.md +78 -11
package/bin/agentlab.js +2 -0
package/dist/agent/factory.js +20 -6
package/dist/agent/httpAdapter.js +5 -4
package/dist/config.js +199 -12
package/dist/evaluators.js +56 -1
package/dist/index.js +157 -11
package/dist/init.js +88 -0
package/dist/lib/id.js +3 -0
package/dist/runOutput.js +46 -0
package/dist/runner.js +31 -9
package/dist/scenarios.js +90 -2
package/dist/scoring.js +2 -2
package/dist/storage.js +117 -7
package/dist/tools.js +56 -2
package/dist/trace.js +4 -2
package/dist/ui/App.js +75 -7
package/dist/ui-assets/client.css +92 -0
package/dist/ui-assets/client.js +183 -19
package/docs/agents.md +143 -8
package/docs/coding-agents.md +74 -0
package/docs/golden-suites.md +74 -0
package/docs/integrations-and-live-services.md +58 -0
package/docs/memory-and-stateful-agents.md +51 -0
package/docs/release-checklist.md +30 -0
package/docs/runtime-profiles.md +67 -0
package/docs/scenarios.md +303 -56
package/docs/superpowers/plans/2026-04-13-phase-2-lite-phase-3-plan.md +160 -0
package/docs/superpowers/plans/2026-04-13-phase-one-npm-tools-plan.md +502 -0
package/docs/superpowers/specs/2026-04-13-phase-2-lite-phase-3-design.md +164 -0
package/docs/tools.md +34 -3
package/docs/troubleshooting.md +193 -0
package/docs/variant-sets.md +63 -0
package/examples/coding-tools/README.md +21 -0
package/examples/coding-tools/index.js +11 -0
package/examples/coding-tools/package.json +8 -0
package/examples/support-tools/README.md +21 -0
package/examples/support-tools/index.js +8 -0
package/examples/support-tools/package.json +8 -0
package/package.json +7 -5

package/docs/scenarios.md CHANGED

@@ -2,28 +2,38 @@
 Scenarios are YAML files under `scenarios/`. They are the core authoring interface for the product.
-Each scenario should describe one narrow job for the agent, not a vague capability test.
+agentlab supports two scenario types:
-## Required Shape
+- `task` — a single-instruction job for a tool-using agent (default, no `type` field needed)
+- `conversation` — a multi-turn dialog with an HTTP agent
-Each scenario should define:
+---
+## Task Scenarios
+Task scenarios are the default format. They describe a single job for an agent that uses tools to complete it.
+### Required Shape
+Each task scenario should define:
 - `id`
 - `name`
 - `suite`
 - `task`
 - `tools`
-- `runtime`
 - `evaluators`
-Common optional fields already used in this repo:
+Common optional fields:
 - `description`
 - `difficulty`
 - `tags`
+- `runtime_profile`
+- `runtime`
 - task `context`
-## Example
+### Example
 ```yaml
 id: support.refund-correct-order
@@ -65,32 +75,52 @@ evaluators:
         - ord_1024
 ```
-## Suites In This Repo
+### Runtime Profiles
-Current benchmark domains:
+Task scenarios can reference a named `runtime_profile` from `agentlab.config.yaml`.
-- `support`
-- `coding`
-- `research`
-- `ops`
+```yaml
+runtime_profile: timeout-orders-tool
+```
-Use a suite when scenarios belong to one behavior family and should be runnable together with:
+Runtime profiles let you apply reusable degraded-tool conditions without duplicating them across scenarios. Current shipped behavior:
-```bash
-agentlab run --suite support --agent mock-default
-```
+- task scenarios: tool fault injection is active
+- conversation scenarios: config reference is allowed for shared authoring, but ARL does not yet inject faults into the HTTP agent's internal tools
-`run --suite` creates a suite batch id. That id is later used for:
+### Evaluators
-```bash
-agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
+Use deterministic evaluators only.
+| Type | Description |
+|------|-------------|
+| `tool_call_assertion` | Assert a specific tool was called with specific input |
+| `forbidden_tool` | Fail if a tool was called that should not have been |
+| `final_answer_contains` | Check that the final output contains required substrings |
+| `exact_final_answer` | Require an exact match on the final output |
+| `step_count_max` | Fail if the agent used more steps than allowed |
+| `tool_call_count_max` | Fail if the total number of tool calls exceeds a budget |
+| `tool_repeat_max` | Fail if one tool is overused |
+| `cost_max` | Fail if the run cost exceeds a configured USD budget |
+Evaluator modes:
+- `hard_gate` — failure immediately fails the run, regardless of other evaluators
+- `weighted` — contributes to the weighted score (0–100)
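Reviewer note (illustrative, not part of the package diff): a minimal sketch of how the two evaluator modes described above could combine into a run verdict. The `EvaluatorResult` shape, the `scoreRun` name, the default weight of 1, and the weight-normalized formula are all assumptions; this diff does not show how `scoring.js` computes the score.

```typescript
// Hypothetical shapes: not taken from the agent-regression-lab source.
interface EvaluatorResult {
  id: string;
  mode: "hard_gate" | "weighted";
  passed: boolean;
  weight?: number; // assumed to default to 1 for weighted evaluators
}

interface RunVerdict {
  passed: boolean;
  score: number; // 0-100
  failedGates: string[];
}

function scoreRun(results: EvaluatorResult[]): RunVerdict {
  // Any failed hard gate fails the run, regardless of other evaluators.
  const failedGates = results
    .filter((r) => r.mode === "hard_gate" && !r.passed)
    .map((r) => r.id);

  // Weighted evaluators contribute to a 0-100 score (assumed: passed weight / total weight).
  const weighted = results.filter((r) => r.mode === "weighted");
  const totalWeight = weighted.reduce((sum, r) => sum + (r.weight ?? 1), 0);
  const earned = weighted
    .filter((r) => r.passed)
    .reduce((sum, r) => sum + (r.weight ?? 1), 0);

  // Assumption: with no weighted evaluators, the score is 100.
  const score = totalWeight === 0 ? 100 : Math.round((earned / totalWeight) * 100);

  return { passed: failedGates.length === 0, score, failedGates };
}
```

Under these assumptions a run can score high on weighted checks and still fail, which is the point of keeping business-critical assertions in `hard_gate` mode.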
+### Runtime Limits
+```yaml
+runtime:
+  max_steps: 8
+  timeout_seconds: 60
 ```
-Suite comparison is strict. Only compare batches from the same suite.
+Both are optional. `max_steps` defaults to 8. `timeout_seconds` is uncapped if not set.
-## Tools
+### Tools
-Each scenario declares its allowed tools:
+Each task scenario declares its allowed tools:
 ```yaml
 tools:
@@ -98,72 +128,289 @@ tools:
     - crm.search_customer
     - orders.list
     - orders.refund
+  forbidden:
+    - orders.delete
 ```
-Keep the tool allowlist as narrow as possible. A broad allowlist weakens the benchmark and makes regressions harder to interpret.
+Keep the allowlist as narrow as possible. Broad allowlists weaken the benchmark.
-This repo supports both:
+### Budget And Governance Checks
-- built-in deterministic tools
-- repo-local custom tools registered in `agentlab.config.yaml`
+Operational regressions are often just as important as correctness regressions. Use budget evaluators to encode "technically worked, but unacceptable in production":
-The launch benchmark now includes built-in tools for:
+```yaml
+evaluators:
+  - id: total-tool-budget
+    type: tool_call_count_max
+    mode: hard_gate
+    config:
+      max: 2
+  - id: no-repeat-order-list
+    type: tool_repeat_max
+    mode: hard_gate
+    config:
+      tool: orders.list
+      max: 1
+```
-- support
-- coding
-- research
-- ops
+Use `cost_max` only where the run records cost metadata.
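Reviewer note (illustrative, not part of the package diff): a minimal sketch of what the two budget checks in the YAML above assert, assuming the run trace can be reduced to a list of tool names. The `ToolCall` shape and both function names are hypothetical; the package's actual trace format is not shown here.

```typescript
// Hypothetical trace shape: the real trace format is not shown in this diff.
interface ToolCall {
  tool: string;
}

// tool_call_count_max: passes while the total number of tool calls is within `max`.
function toolCallCountMax(calls: ToolCall[], max: number): boolean {
  return calls.length <= max;
}

// tool_repeat_max: passes while calls to the one named tool are within `max`.
function toolRepeatMax(calls: ToolCall[], tool: string, max: number): boolean {
  const count = calls.filter((c) => c.tool === tool).length;
  return count <= max;
}

// A run that reached the refund but listed orders twice.
const exampleCalls: ToolCall[] = [
  { tool: "orders.list" },
  { tool: "orders.list" },
  { tool: "orders.refund" },
];

const withinTotalBudget = toolCallCountMax(exampleCalls, 2); // three calls against a budget of two
const withinRepeatBudget = toolRepeatMax(exampleCalls, "orders.list", 1); // two calls against a budget of one
```

With the YAML budgets above (`max: 2` total, `max: 1` for `orders.list`), this example run would fail both gates even though the refund itself happened.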
-See [tools.md](tools.md) for custom tool registration.
+---
+## Conversation Scenarios
+Conversation scenarios test HTTP agents through multi-turn dialogs. They require `type: conversation` and work exclusively with `provider: http` agents. The agent is responsible for maintaining its own conversation history.
+### Required Shape
+```yaml
+type: conversation
+id: internal-teams.memory-followup-recall
+name: Follow-Up Recall Within Conversation
+suite: internal-teams
+steps:
+  - role: user
+    message: "I'm traveling next Tuesday and I prefer aisle seats. Please remember that."
+  - role: user
+    message: "What seat preference did I mention earlier?"
+```
+Each step must have:
+- `role: user`
+- `message` — the message sent to the agent this turn
+### Per-Step Evaluators
+Evaluators can be attached to individual steps. They run immediately after the agent replies to that step.
+```yaml
+steps:
+  - role: user
+    message: "Where's my order #ORD-001?"
+    evaluators:
+      - type: response_contains
+        mode: hard_gate
+        config:
+          keywords: [shipped, tracking]
+      - type: response_latency_max
+        mode: hard_gate
+        config:
+          ms: 3000
+  - role: user
+    message: "What's the tracking number?"
+    evaluators:
+      - type: response_not_contains
+        mode: weighted
+        weight: 1
+        config:
+          keywords: ["don't know", error]
+```
+If a `hard_gate` per-step evaluator fails, the run stops immediately and remaining steps are skipped.
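Reviewer note (illustrative, not part of the package diff): a minimal sketch of the stop-on-hard-gate behavior described above. All type and function names are hypothetical, and `sendToAgent` is a synchronous stand-in for the HTTP request. Counting the failing turn as completed is an assumption chosen to match the `turns completed: 1/2` line in the CLI output shown later in this file.

```typescript
// Hypothetical types: names are illustrative, not the package's API.
interface StepCheck {
  mode: "hard_gate" | "weighted";
  check: (reply: string) => boolean;
}

interface ConversationStep {
  message: string;
  evaluators: StepCheck[];
}

interface ConversationOutcome {
  turnsCompleted: number;
  stoppedEarly: boolean;
}

// `sendToAgent` stands in for the HTTP call to the agent.
function runConversation(
  steps: ConversationStep[],
  sendToAgent: (message: string) => string
): ConversationOutcome {
  let turnsCompleted = 0;
  for (const step of steps) {
    const reply = sendToAgent(step.message);
    turnsCompleted += 1; // the agent replied, so this turn counts as completed

    // Per-step evaluators run right after the reply; a failed hard gate ends the run.
    const gateFailed = step.evaluators.some(
      (e) => e.mode === "hard_gate" && !e.check(reply)
    );
    if (gateFailed) {
      return { turnsCompleted, stoppedEarly: true }; // remaining steps are skipped
    }
  }
  return { turnsCompleted, stoppedEarly: false };
}
```

A failing `weighted` check does not stop the loop in this sketch; only `hard_gate` failures short-circuit the remaining steps.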
+### Per-Step Evaluator Types
+| Type | Config | Behavior |
+|------|--------|----------|
+| `response_contains` | `keywords: string[]` | Passes if ALL keywords appear in the reply (case-insensitive) |
+| `response_not_contains` | `keywords: string[]` | Passes if NONE of the keywords appear in the reply (case-insensitive) |
+| `response_matches_regex` | `pattern: string` | Passes if the reply matches the regex pattern (case-insensitive) |
+| `response_latency_max` | `ms: number` | Passes if the HTTP response arrived within the time limit |
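Reviewer note (illustrative, not part of the package diff): a minimal sketch of the pass conditions in the table above. The function names are hypothetical. Substring matching for keywords and an inclusive `<=` latency bound are assumptions; the table only specifies ALL/NONE semantics and case-insensitivity.

```typescript
// Illustrative helpers; function names are hypothetical.

// response_contains: passes if ALL keywords appear in the reply (case-insensitive).
function responseContains(reply: string, keywords: string[]): boolean {
  const haystack = reply.toLowerCase();
  return keywords.every((k) => haystack.indexOf(k.toLowerCase()) !== -1);
}

// response_not_contains: passes if NONE of the keywords appear (case-insensitive).
function responseNotContains(reply: string, keywords: string[]): boolean {
  const haystack = reply.toLowerCase();
  return keywords.every((k) => haystack.indexOf(k.toLowerCase()) === -1);
}

// response_matches_regex: passes if the reply matches the pattern (case-insensitive).
function responseMatchesRegex(reply: string, pattern: string): boolean {
  return new RegExp(pattern, "i").test(reply);
}

// response_latency_max: passes if the response arrived within the limit (assumed inclusive).
function responseLatencyMax(elapsedMs: number, ms: number): boolean {
  return elapsedMs <= ms;
}
```

Note that `response_contains` with `keywords: [shipped, tracking]` requires both words, so a reply that mentions only one of them fails the check.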
+### Scenario Quality Rules
+- prefer `hard_gate` for business-critical assertions
+- use `weighted` checks for quality gradients, not for the single condition that makes the scenario trustworthy
+- conversation scenarios must use `config.keywords` for `response_contains` and `response_not_contains`
+- stale `config.text` authoring is rejected
+- use conversation scenarios when the agent owns memory, tool execution, or conversation history internally
+- keep golden suites focused on repeatable workflows, historical regressions, and ugly edge cases rather than one-off demos
+### End-of-Run Evaluators
+End-of-run evaluators run after all steps complete. They apply to the final reply.
+```yaml
+evaluators:
+  - type: step_count_max
+    mode: hard_gate
+    config:
+      max: 10
+  - type: final_answer_contains
+    mode: weighted
+    weight: 1
+    config:
+      keywords: [resolved, confirmed]
+```
-## Runtime Limits
+End-of-run evaluator types:
-Scenarios can enforce:
+| Type | Config | Behavior |
+|------|--------|----------|
+| `step_count_max` | `max: number` | Passes if the number of completed turns is within the limit |
+| `final_answer_contains` | `keywords: string[]` | Passes if ALL keywords appear in the final reply |
+| `exact_final_answer` | `expected: string` | Passes if the final reply exactly matches the expected string |
-- `max_steps`
-- `timeout_seconds`
+### Conversation State
+agentlab auto-generates a UUID `conversation_id` for each run. It is sent in every step request. The agent uses it to look up and maintain its own conversation history.
+The `state` block is optional:
+```yaml
+state:
+  conversation_id: auto
+```
+`auto` is the only supported value. The UUID is always generated regardless of whether the `state` block is present.
+### Restrictions
+Conversation scenarios must not define a `tools:` field. HTTP agents manage their own tools internally. If `tools:` is present, validation will fail with a clear error.
+Conversation scenarios may define `runtime_profile`, but today that is for shared scenario organization and future stateful hooks. ARL does not inject tool faults into HTTP agents.
+---
+## Suite Definitions
+Scenario `suite` still groups related files, but operational launch workflows should use config-level `suite_definitions`.
 Example:
 ```yaml
-runtime:
-  max_steps: 8
-  timeout_seconds: 60
+suite_definitions:
+  - name: pre_merge
+    include:
+      tags:
+        - smoke
+        - regression
 ```
-These limits are enforced by the runner. Use them to keep runs bounded and comparisons meaningful.
+Run one with:
-## Evaluators
+```bash
+agentlab run --suite-def pre_merge --agent mock-default
+agentlab run --suite-def pre_merge --variant-set refund-agent-model-comparison
+```
-Use deterministic evaluators only.
+Use suite definitions for stable workflow units like:
+- `smoke`
+- `pre_merge`
+- `release`
+- `incident_regressions`
-The current evaluator set includes:
+### Full Example
+```yaml
+type: conversation
+id: internal-teams.memory-followup-recall
+name: Follow-Up Recall Within Conversation
+suite: internal-teams
+description: Memoryful agent should recall a user-provided fact later in the same conversation.
+difficulty: medium
+tags:
+  - internal-teams
+  - conversation
+steps:
+  - role: user
+    message: "I'm traveling next Tuesday and I prefer aisle seats. Please remember that."
+    evaluators:
+      - type: response_contains
+        mode: weighted
+        config:
+          keywords:
+            - aisle
+  - role: user
+    message: "What seat preference did I mention earlier?"
+    evaluators:
+      - type: response_contains
+        mode: hard_gate
+        config:
+          keywords:
+            - aisle
-- `tool_call_assertion`
-- `forbidden_tool`
-- `final_answer_contains`
-- `exact_final_answer`
-- `step_count_max`
+evaluators:
+  - type: step_count_max
+    mode: hard_gate
+    config:
+      max: 2
+```
-Guidance:
+Run it with:
-- use hard gates for non-negotiable behavior
-- use weighted evaluators for softer quality checks
-- prefer tool assertions or exact output checks over vague answer checks when possible
+```bash
+agentlab run internal-teams.memory-followup-recall --agent my-production-agent
+```
-## Authoring Conventions
+Where `my-production-agent` is a named `http` agent in `agentlab.config.yaml`. See [agents.md](agents.md) for HTTP agent config.
-Use these defaults:
+### CLI Output
+Conversation runs print a different output format from task runs:
+```
+run internal-teams.memory-followup-recall — PASS
+  agent: my-production-agent (http://localhost:3000/api/chat)
+  turns completed: 2/2
+  step 1: pass (response_contains ✓)
+  step 2: pass (response_contains ✓)
+  run id: run_20260407_001234
+```
+If a hard-gate fails mid-run:
+```
+run internal-teams.memory-followup-recall — FAIL
+  agent: my-production-agent (http://localhost:3000/api/chat)
+  turns completed: 1/2
+  step 1: FAIL (response_contains ✗)
+  run stopped (evaluator_failed)
+  run id: run_20260407_001235
+```
+---
+## Suites
+Both task and conversation scenarios can belong to a suite.
+```yaml
+suite: support
+```
+Run an entire suite:
+```bash
+agentlab run --suite support --agent mock-default
+```
+`run --suite` skips conversation scenarios when using non-HTTP agents (conversation scenarios require `provider: http`). Task scenarios and conversation scenarios can coexist in the same suite directory.
+`run --suite` prints a suite batch id at the end. That id is used for suite comparison:
+```bash
+agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
+```
+---
+## Authoring Conventions
 - `id` format: `<suite>.<short-name>`
 - keep scenario jobs narrow and concrete
-- keep fixture-backed context in `task.context`
+- keep fixture-backed context in `task.context` (task scenarios)
 - prefer deterministic fixture references over open-ended prompts
 - include `difficulty`, `description`, and `tags` for every launch scenario
+- for conversation scenarios, keep step count low (2–5) and evaluators specific
 ## Current Examples
-Useful scenario references in this repo:
+Task scenario references in this repo:
 - support: `scenarios/support/refund-correct-order.yaml`
 - support with config tool: `scenarios/support/refund-via-config-tool.yaml`

package/docs/superpowers/plans/2026-04-13-phase-2-lite-phase-3-plan.md ADDED

@@ -0,0 +1,160 @@
+# Phase 2 Lite And Phase 3 Implementation Plan
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+**Goal:** Deliver a minimal integration story for new users, then improve the UI enough that ARL is easier to demo, screenshot, and understand visually.
+**Architecture:** Keep Phase 2-lite focused on assets that clarify adoption: README routing, one coding-agent path, and one CI path. Keep Phase 3 focused on UI clarity instead of new product surface area by improving the runs dashboard, comparison screens, and trace presentation inside the existing React UI.
+**Tech Stack:** TypeScript, React, node:test, esbuild, Markdown, GitHub Actions YAML
+---
+## File Map
+**Roadmap and product docs**
+- Modify: `.claude/active-tasks.md`
+- Modify: `.claude/project.md`
+- Modify: `README.md`
+**Phase 2-lite assets**
+- Create: `docs/coding-agents.md`
+- Create: `.github/workflows/agentlab-pre-merge.yml`
+**UI**
+- Modify: `src/ui/App.tsx`
+- Modify: `src/ui/styles.css`
+**Tests**
+- Modify: `tests/launch/ui-smoke.test.ts`
+---
+### Task 1: Reframe Roadmap To Phase 2-lite Then Phase 3
+**Files:**
+- Modify: `.claude/active-tasks.md`
+- Modify: `.claude/project.md`
+- [ ] Update active task tracking so the current next phase is `Phase 2-lite`, not the original full Phase 2.
+- [ ] Update project memory so Phase 2-lite is the minimal integration-story pass and Phase 3 is the main visual/demo workstream.
+- [ ] Keep the scope explicit:
+  - HTTP example via `arl-test`
+  - CI example
+  - coding-agent example
+  - then UI polish
+Verification:
+```bash
+rg -n "Phase 2-lite|Phase 3" .claude/active-tasks.md .claude/project.md
+```
+---
+### Task 2: Add Phase 2-lite Integration Assets
+**Files:**
+- Modify: `README.md`
+- Create: `docs/coding-agents.md`
+- Create: `.github/workflows/agentlab-pre-merge.yml`
+- [ ] Add README routing sections:
+  - if your agent runs as an HTTP service
+  - if you are validating coding-agent changes
+  - if you want pre-merge regression checks in CI
+- [ ] Add one coding-agent guide using the existing coding scenarios and current tool-loop model.
+- [ ] Add one GitHub Actions example that runs:
+```bash
+npm ci
+npm run build
+node dist/index.js run --suite-def pre_merge --agent mock-default
+```
+- [ ] Keep this section narrow and copy-pasteable. No broad framework matrix.
+Verification:
+```bash
+rg -n "HTTP service|coding-agent|pre-merge|GitHub Actions" README.md docs/coding-agents.md .github/workflows/agentlab-pre-merge.yml
+```
+---
+### Task 3: Improve Runs Dashboard And Comparison UX
+**Files:**
+- Modify: `src/ui/App.tsx`
+- Modify: `src/ui/styles.css`
+- Modify: `tests/launch/ui-smoke.test.ts`
+- [ ] Add a stronger runs dashboard summary at the top of the list page:
+  - total runs shown
+  - pass/fail/error counts
+  - most recent suite/context hint
+- [ ] Redesign the compare page to make regressions visually obvious:
+  - top classification banner
+  - clearer delta cards
+  - evaluator/tool diff blocks with stronger hierarchy
+  - more obvious baseline vs candidate sections
+- [ ] Make the suite compare page easier to scan:
+  - headline regression/improvement counts
+  - clearer scenario groupings
+Verification:
+```bash
+npx tsx --test tests/launch/ui-smoke.test.ts
+```
+---
+### Task 4: Improve Trace And Detail Presentation
+**Files:**
+- Modify: `src/ui/App.tsx`
+- Modify: `src/ui/styles.css`
+- Modify: `tests/launch/ui-smoke.test.ts`
+- [ ] Replace the plain trace list with a more intentional timeline treatment:
+  - event badges or type labels
+  - stronger step grouping
+  - clearer source metadata
+- [ ] Keep failure-first behavior intact.
+- [ ] Preserve readability on narrow screens.
+Verification:
+```bash
+npx tsx --test tests/launch/ui-smoke.test.ts
+```
+---
+### Task 5: Full Verification
+**Files:**
+- Modify only if verification exposes issues
+- [ ] Run focused UI/docs-related verification:
+```bash
+npx tsx --test tests/launch/ui-smoke.test.ts tests/cliPackaging.test.ts
+```
+- [ ] Run full suite:
+```bash
+npm test
+```
+- [ ] Run release gates:
+```bash
+npm run check
+npm run build
+npm run smoke:cli
+npm_config_cache=/tmp/agentlab-npm-cache npm pack --dry-run
+```