agent-regression-lab 0.1.1 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,62 +1,139 @@
1
1
  # Agent Regression Lab
2
2
 
3
- Agent Regression Lab is a local-first evaluation harness for AI agents.
3
+ Agent Regression Lab is the local-first regression spine for agent engineering teams.
4
4
 
5
- It lets you define fixed scenarios in YAML, run an agent against them repeatedly, capture a structured trace, score the result, and compare runs over time.
5
+ It gives teams a repeatable way to define expected agent behavior in YAML, replay it against deterministic tool surfaces or live HTTP agents, store traces and scores locally, and compare candidate behavior against known baselines over time.
6
6
 
7
- This is an alpha developer tool. It is useful now for local benchmarking and debugging, but it is not yet a polished platform.
7
+ This is a local-first alpha for early technical teams. It works best as a single workflow spine:
8
+
9
+ - debug a single scenario while building
10
+ - validate a branch with a suite before merge
11
+ - run curated golden suites before release
12
+ - keep incident-derived scenarios as engineering memory
13
+
14
+ ## Who It Is For
15
+
16
+ - teams shipping prompt, model, tool, workflow, and memory changes
17
+ - engineers who need repeatable before/after evidence instead of vibes
18
+ - teams validating live HTTP agents as well as deterministic local scenarios
19
+ - researchers and technical operators who want local control before adopting heavier hosted infrastructure
20
+
21
+ ## Why Teams Use It
22
+
23
+ - catch regressions before merge or release
24
+ - debug subtle behavioral changes with full traces
25
+ - compare model, prompt, tool, and workflow changes against a known baseline
26
+ - build a portfolio of golden workflows, historical regressions, and ugly edge cases
27
+ - preserve engineering memory so old failures do not quietly return
8
28
 
9
29
  ## What It Supports Today
10
30
 
11
31
  - YAML scenarios under `scenarios/`
12
- - Deterministic built-in tools plus repo-local custom tools from `agentlab.config.yaml`
13
- - Named agents from `agentlab.config.yaml`
14
- - Built-in `mock`, `openai`, and `external_process` agent modes
32
+ - deterministic built-in tools plus repo-local custom tools from `agentlab.config.yaml`
33
+ - named agents from `agentlab.config.yaml`
34
+ - built-in `mock`, `openai`, `external_process`, and `http` agent modes
35
+ - `type: conversation` multi-turn dialog scenarios for HTTP agents
15
36
  - SQLite-backed local run history under `artifacts/agentlab.db`
16
37
  - CLI commands to list, run, show, compare, and launch the UI
17
- - Local web UI for run inspection and direct run-to-run comparison
38
+ - local web UI for run inspection, run comparison, and suite batch comparison
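For the `type: conversation` mode listed above, a scenario might be sketched roughly as below. Only the `type: conversation` key is taken from the feature list; every other field name here (`turns`, `user`, `expect_reply_contains`) is a hypothetical illustration, not the real schema. docs/scenarios.md documents the authoritative fields.

```yaml
# Hypothetical sketch of a multi-turn conversation scenario.
# Field names other than `type: conversation` are assumptions.
id: support.order-tracking-dialog
name: Order Tracking Dialog
suite: support
type: conversation
turns:
  - user: "Where is my order ord_2001?"
    expect_reply_contains: "in transit"
  - user: "When will it arrive?"
    expect_reply_contains: "estimated delivery"
```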
18
39
 
19
- ## Quickstart
40
+ ## Workflow Spine
20
41
 
21
- 1. Install dependencies:
42
+ Use this as the default workflow:
22
43
 
23
- ```bash
24
- npm install
25
- ```
44
+ 1. debug locally with one scenario
45
+ 2. validate a branch with a suite
46
+ 3. run curated golden suites before release
47
+ 4. keep incident-derived scenarios as permanent regression assets
48
+
49
+ ## First 10 Minutes
50
+
51
+ The fastest path is to run the CLI from a local checkout.
26
52
 
27
- 2. Run the typecheck, tests, and build:
53
+ 1. Install dependencies and build:
28
54
 
29
55
  ```bash
56
+ npm install
30
57
  npm run check
31
58
  npm test
32
59
  npm run build
33
60
  ```
34
61
 
35
- 3. Run a scenario:
62
+ 2. Verify the CLI:
36
63
 
37
64
  ```bash
38
- npm run start -- run support.refund-correct-order --agent mock-default
65
+ agentlab --help
39
66
  ```
40
67
 
41
- 4. Inspect a run:
68
+ If you have not linked the package locally yet, use:
42
69
 
43
70
  ```bash
44
- npm run start -- show <run-id>
71
+ npm link
72
+ agentlab --help
45
73
  ```
46
74
 
47
- 5. Launch the local UI:
75
+ 3. List scenarios:
48
76
 
49
77
  ```bash
50
- npm run start -- ui
78
+ agentlab list scenarios
79
+ ```
80
+
81
+ 4. Run a deterministic sample scenario:
82
+
83
+ ```bash
84
+ agentlab run support.refund-correct-order --agent mock-default
85
+ ```
86
+
87
+ 5. Inspect the run:
88
+
89
+ ```bash
90
+ agentlab show <run-id>
91
+ ```
92
+
93
+ 6. Run the same scenario again, then compare the two runs:
94
+
95
+ ```bash
96
+ agentlab compare <baseline-run-id> <candidate-run-id>
97
+ ```
98
+
99
+ 7. Launch the local UI:
100
+
101
+ ```bash
102
+ agentlab ui
51
103
  ```
52
104
 
53
105
  The UI starts on `http://127.0.0.1:4173`.
54
106
 
55
- ## Installable CLI
107
+ 8. Run a suite and compare two suite batches:
108
+
109
+ ```bash
110
+ agentlab run --suite support --agent mock-default
111
+ agentlab run --suite support --agent mock-default
112
+ agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
113
+ ```
114
+
115
+ `run --suite` prints a `Suite batch:` id at the end. That is the id used by `compare --suite`.
116
+
117
+ ## Install
118
+
119
+ ### Installed CLI
120
+
121
+ After the package is published:
56
122
 
57
- The package can be installed as a Node CLI.
123
+ ```bash
124
+ npm install -g agent-regression-lab
125
+ agentlab --help
126
+ ```
58
127
 
59
- Local development install:
128
+ You can also use:
129
+
130
+ ```bash
131
+ npx agent-regression-lab --help
132
+ ```
133
+
134
+ ### Local Development Install
135
+
136
+ From this repo:
60
137
 
61
138
  ```bash
62
139
  npm install
@@ -65,141 +142,127 @@ npm link
65
142
  agentlab --help
66
143
  ```
67
144
 
68
- Packed or published install:
145
+ ### Repo-Local Dev Mode
146
+
147
+ If you do not want to link the package yet:
69
148
 
70
149
  ```bash
71
- npm install -g agent-regression-lab
72
- agentlab --help
150
+ npm run start -- --help
151
+ npm run start -- run support.refund-correct-order --agent mock-default
73
152
  ```
74
153
 
75
- The CLI operates on the current working directory. Run it from the root of a project that contains `scenarios/`, `fixtures/`, and optional `agentlab.config.yaml`.
76
-
77
154
  ## CLI
78
155
 
156
+ Supported command surface:
157
+
79
158
  ```text
80
159
  agentlab list scenarios
81
160
  agentlab run <scenario-id> [--agent <name>]
82
161
  agentlab run --suite <suite-id> [--agent <name>]
162
+ agentlab run --suite-def <name> [--agent <name>]
163
+ agentlab run <scenario-id> [--variant-set <name>]
83
164
  agentlab show <run-id>
84
165
  agentlab compare <baseline-run-id> <candidate-run-id>
166
+ agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
85
167
  agentlab ui
168
+ agentlab version
169
+ agentlab help
86
170
  ```
87
171
 
88
- You can also run these through `npm run start -- ...` during local development.
89
-
90
- ## Scenarios
91
-
92
- Scenarios are YAML files under `scenarios/`.
93
-
94
- Current scenario features:
95
-
96
- - task instructions
97
- - fixture references
98
- - allowed and forbidden tools
99
- - `max_steps`
100
- - `timeout_seconds`
101
- - evaluator configuration
102
-
103
- Example scenario shape:
104
-
105
- ```yaml
106
- id: support.refund-correct-order
107
- name: Refund The Correct Order
108
- suite: support
109
- task:
110
- instructions: |
111
- The customer says they were charged twice.
112
- Find the duplicated charge and refund only that order.
113
- tools:
114
- allowed:
115
- - crm.search_customer
116
- - orders.list
117
- - orders.refund
118
- runtime:
119
- max_steps: 8
120
- timeout_seconds: 60
121
- evaluators:
122
- - id: refund-created
123
- type: tool_call_assertion
124
- mode: hard_gate
125
- config:
126
- tool: orders.refund
127
- match:
128
- order_id: ord_1024
172
+ The CLI operates on the current working directory. Run it from the root of a project that contains `scenarios/`, `fixtures/`, and optional `agentlab.config.yaml`.
173
+
174
+ ## Canonical Workflow
175
+
176
+ Use this as the default mental model:
177
+
178
+ 1. list scenarios
179
+ 2. run one scenario or one suite
180
+ 3. note the run id or suite batch id
181
+ 4. inspect the run in CLI or UI
182
+ 5. compare two runs or two suite batches
183
+ 6. extend the setup with a named agent or repo-local tool when needed
184
+
185
+ ## Canonical Live HTTP Fixture
186
+
187
+ `arl-test/` is the canonical live HTTP regression fixture in this repo.
188
+
189
+ Use it to verify the production-like HTTP path end to end:
190
+
191
+ ```bash
192
+ cd arl-test
193
+ npm start
194
+ node ../dist/index.js list scenarios
195
+ node ../dist/index.js run order-tracking-in-transit --agent support-agent
129
196
  ```
130
197
 
131
- ## Custom Agents And Tools
198
+ The `arl-test` scenarios are intended to behave like a real internal-team regression fixture, not just a toy demo.
199
+
200
+ ## Config And Extension Points
132
201
 
133
- `agentlab.config.yaml` is the extension point for named agents and repo-local tools.
202
+ `agentlab.config.yaml` is the public extension point for:
203
+
204
+ - named agents
205
+ - repo-local custom tools
134
206
 
135
207
  Supported agent providers:
136
208
 
137
209
  - `mock`
138
210
  - `openai`
139
211
  - `external_process`
212
+ - `http`: point at a running HTTP service for multi-turn conversation testing
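As a rough sketch, an `http` agent entry could look like the following. The option names (`url`, `headers`, `request_template`, `response_field`, `timeout_ms`) and the `{{message}}`, `{{conversation_id}}`, and `{{env.*}}` placeholders come from the HTTP adapter in this release; their exact nesting under `agents:` is an assumption, so treat docs/agents.md as authoritative.

```yaml
# Hypothetical http agent entry; the option nesting shown here is assumed.
agents:
  - name: support-agent
    provider: http
    url: http://127.0.0.1:3000/chat
    headers:
      Authorization: "Bearer {{env.SUPPORT_AGENT_TOKEN}}"
    request_template:
      message: "{{message}}"
      conversation_id: "{{conversation_id}}"
    response_field: reply
    timeout_ms: 30000
```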
140
213
 
141
- Supported custom tool model:
142
-
143
- - repo-local JS/TS module path
144
- - named export that resolves to an async function
145
-
146
- Example config:
147
-
148
- ```yaml
149
- agents:
150
- - name: custom-node-agent
151
- provider: external_process
152
- command: node
153
- args:
154
- - custom_agents/node_agent.mjs
155
- label: custom-node-agent
156
-
157
- tools:
158
- - name: support.find_duplicate_charge
159
- modulePath: user_tools/findDuplicateCharge.ts
160
- exportName: findDuplicateCharge
161
- description: Find the duplicated charge order id for a given customer.
162
- inputSchema:
163
- type: object
164
- additionalProperties: false
165
- properties:
166
- customer_id:
167
- type: string
168
- required:
169
- - customer_id
170
- ```
214
+ Working sample assets already live in this repo:
215
+
216
+ - external agents: `custom_agents/node_agent.mjs`, `custom_agents/python_agent.py`
217
+ - custom tool: `user_tools/findDuplicateCharge.ts`
218
+ - sample config: `agentlab.config.yaml`
219
+
220
+ See:
221
+
222
+ - [docs/scenarios.md](docs/scenarios.md)
223
+ - [docs/tools.md](docs/tools.md)
224
+ - [docs/agents.md](docs/agents.md)
225
+ - [docs/troubleshooting.md](docs/troubleshooting.md)
226
+ - [docs/release-checklist.md](docs/release-checklist.md)
227
+
228
+ ## Local Data And Artifacts
171
229
 
172
- ## External Process Protocol
230
+ By default, Agent Regression Lab writes local state under `artifacts/`.
173
231
 
174
- External agents communicate with the runner over line-delimited JSON on stdin/stdout.
232
+ Important paths:
175
233
 
176
- Runner events:
234
+ - SQLite DB: `artifacts/agentlab.db`
235
+ - per-run trace output: `artifacts/<run-id>/trace.json`
236
+ - local UI assets at runtime: served from packaged `dist/ui-assets` or built into `artifacts/ui/` in repo mode
177
237
 
178
- - `run_started`
179
- - `tool_result`
180
- - `runner_error`
238
+ If you delete `artifacts/`, you remove stored run history and generated local outputs.
181
239
 
182
- Agent responses:
240
+ ## Determinism
183
241
 
184
- - `tool_call`
185
- - `final`
186
- - `error`
242
+ The benchmark is designed to be deterministic enough for repeated local evaluation:
187
243
 
188
- The runner stays in control of the loop. External agents must not execute tools directly.
244
+ - built-in tools read from local fixtures
245
+ - scenarios declare fixed tool allowlists and evaluator rules
246
+ - scoring is rule-based
247
+ - suite comparison is based on stored local runs and suite batch ids
189
248
 
190
- Minimal flow:
249
+ Agent behavior can still vary by provider. The built-in `mock` agent is the most deterministic option for smoke tests and baseline examples.
191
250
 
192
- 1. runner sends `run_started` with instructions, tool specs, context, and limits
193
- 2. agent sends back a `tool_call` or `final`
194
- 3. runner executes the tool and sends `tool_result`
195
- 4. agent sends the next `tool_call` or `final`
251
+ ## Limitations
196
252
 
197
- See `custom_agents/node_agent.mjs` and `custom_agents/python_agent.py` for working examples.
253
+ - this is a local-first alpha, not a hosted platform
254
+ - custom tool loading is limited to repo-local module paths
255
+ - external agents integrate through the local stdin/stdout protocol only
256
+ - the UI is intentionally minimal and optimized for debugging
257
+ - SQLite-backed local storage means sequential (not parallel) live verification is the safest path when runs share the same local artifacts DB
258
+ - the benchmark is broader than before, but still small compared to a mature benchmark product
198
259
 
199
- ## Honest Limitations
260
+ ## Next Docs
200
261
 
201
- - comparison is run-to-run, not full suite regression analysis yet
202
- - tool loading is limited to local repo module paths
203
- - external agents use the local stdin/stdout protocol only
204
- - the UI is intentionally minimal and optimized for debugging, not dashboards
205
- - the benchmark suite is still small
262
+ - scenario authoring: [docs/scenarios.md](docs/scenarios.md)
263
+ - golden suites: [docs/golden-suites.md](docs/golden-suites.md)
264
+ - integrations and live services: [docs/integrations-and-live-services.md](docs/integrations-and-live-services.md)
265
+ - memory and stateful agents: [docs/memory-and-stateful-agents.md](docs/memory-and-stateful-agents.md)
266
+ - custom tools: [docs/tools.md](docs/tools.md)
267
+ - named agents and external-process protocol: [docs/agents.md](docs/agents.md)
268
+ - common failure modes: [docs/troubleshooting.md](docs/troubleshooting.md)
@@ -2,6 +2,20 @@ import { ExternalProcessAgentAdapter } from "./externalProcessAdapter.js";
2
2
  import { MockAgentAdapter } from "./mockAdapter.js";
3
3
  import { OpenAIResponsesAgentAdapter } from "./openaiResponsesAdapter.js";
4
4
  import { createAgentVersionId } from "../lib/id.js";
5
+ function attachIdentityMetadata(version, config) {
6
+ return {
7
+ ...version,
8
+ variantSetName: config.variantSetName,
9
+ variantLabel: config.variantLabel,
10
+ promptVersion: config.promptVersion,
11
+ modelVersion: config.modelVersion,
12
+ toolSchemaVersion: config.toolSchemaVersion,
13
+ configLabel: config.configLabel,
14
+ configHash: config.configHash,
15
+ runtimeProfileName: config.runtimeProfileName,
16
+ suiteDefinitionName: config.suiteDefinitionName,
17
+ };
18
+ }
5
19
  class MockAgentAdapterFactory {
6
20
  createAdapter() {
7
21
  return new MockAgentAdapter();
@@ -9,13 +23,13 @@ class MockAgentAdapterFactory {
9
23
  createVersion(config) {
10
24
  const label = config.label ?? config.agentName ?? "mock-support-agent-v1";
11
25
  const payload = { adapter: "mock", domain: "support", agentName: config.agentName };
12
- return {
26
+ return attachIdentityMetadata({
13
27
  id: createAgentVersionId(label, payload),
14
28
  label,
15
29
  modelId: "mock-model",
16
30
  provider: "mock",
17
31
  config: payload,
18
- };
32
+ }, config);
19
33
  }
20
34
  }
21
35
  class OpenAIAdapterFactory {
@@ -28,13 +42,13 @@ class OpenAIAdapterFactory {
28
42
  const model = config.model ?? "gpt-4o-mini";
29
43
  const label = config.label ?? config.agentName ?? `openai-${model}`;
30
44
  const payload = { provider: "openai", model, agentName: config.agentName };
31
- return {
45
+ return attachIdentityMetadata({
32
46
  id: createAgentVersionId(label, payload),
33
47
  label,
34
48
  modelId: model,
35
49
  provider: "openai",
36
50
  config: payload,
37
- };
51
+ }, config);
38
52
  }
39
53
  }
40
54
  class ExternalProcessAdapterFactory {
@@ -53,14 +67,14 @@ class ExternalProcessAdapterFactory {
53
67
  args: config.args ?? [],
54
68
  agentName: config.agentName,
55
69
  };
56
- return {
70
+ return attachIdentityMetadata({
57
71
  id: createAgentVersionId(label, payload),
58
72
  label,
59
73
  provider: "external_process",
60
74
  command: config.command,
61
75
  args: config.args ?? [],
62
76
  config: payload,
63
- };
77
+ }, config);
64
78
  }
65
79
  }
66
80
  export function createAgentFactory(config) {
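The `attachIdentityMetadata` helper added above simply threads identity fields from the agent config onto the version object each factory builds. A standalone sketch of its behavior (the helper body is copied verbatim from this diff; the sample `version` and `config` values are made up):

```javascript
// Copied from the factory module above: merges identity metadata from the
// agent config onto a version object. Fields absent from the config simply
// come through as undefined on the result.
function attachIdentityMetadata(version, config) {
  return {
    ...version,
    variantSetName: config.variantSetName,
    variantLabel: config.variantLabel,
    promptVersion: config.promptVersion,
    modelVersion: config.modelVersion,
    toolSchemaVersion: config.toolSchemaVersion,
    configLabel: config.configLabel,
    configHash: config.configHash,
    runtimeProfileName: config.runtimeProfileName,
    suiteDefinitionName: config.suiteDefinitionName,
  };
}

// Illustrative inputs, not values from the real product:
const version = { id: "v1", label: "mock-support-agent-v1", provider: "mock" };
const config = { agentName: "mock-default", promptVersion: "p3", configHash: "abc123" };

const tagged = attachIdentityMetadata(version, config);
console.log(tagged.promptVersion); // "p3"
console.log(tagged.provider);      // "mock" (carried over from version)
```

Because the spread comes first, the identity fields always win, which is what lets `run --variant-set` and `run --suite-def` stamp every stored run with the config that produced it.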
@@ -0,0 +1,79 @@
1
+ import { performance } from "node:perf_hooks";
2
+ export function interpolateTemplate(template, message, conversationId) {
3
+ return template.replace(/\{\{([^}]+)\}\}/g, (_, key) => {
4
+ const k = key.trim();
5
+ if (k === "message")
6
+ return message;
7
+ if (k === "conversation_id")
8
+ return conversationId;
9
+ if (k.startsWith("env."))
10
+ return process.env[k.slice(4)] ?? "";
11
+ return "";
12
+ });
13
+ }
14
+ export function buildRequestBody(template, message, conversationId) {
15
+ if (!template) {
16
+ return { message, conversation_id: conversationId };
17
+ }
18
+ const result = {};
19
+ for (const [field, valueTemplate] of Object.entries(template)) {
20
+ result[field] = interpolateTemplate(valueTemplate, message, conversationId);
21
+ }
22
+ return result;
23
+ }
24
+ export function extractReply(body, responseField) {
25
+ const field = responseField ?? "message";
26
+ if (typeof body === "object" && body !== null && field in body) {
27
+ const value = body[field];
28
+ return typeof value === "string" ? value : null;
29
+ }
30
+ return null;
31
+ }
32
+ export async function callHttpAgent(input) {
33
+ const { url, message, conversationId, request_template, response_field, headers = {}, timeout_ms = 30000 } = input;
34
+ const body = buildRequestBody(request_template, message, conversationId);
35
+ const interpolatedHeaders = {};
36
+ for (const [key, value] of Object.entries(headers)) {
37
+ interpolatedHeaders[key] = interpolateTemplate(value, message, conversationId);
38
+ }
39
+ const controller = new AbortController();
40
+ const timeoutHandle = setTimeout(() => controller.abort(), timeout_ms);
41
+ const start = performance.now();
42
+ let response;
43
+ try {
44
+ response = await fetch(url, {
45
+ method: "POST",
46
+ headers: { "Content-Type": "application/json", ...interpolatedHeaders },
47
+ body: JSON.stringify(body),
48
+ signal: controller.signal,
49
+ });
50
+ }
51
+ catch (error) {
52
+ clearTimeout(timeoutHandle);
53
+ if (error instanceof Error && error.name === "AbortError") {
54
+ throw Object.assign(new Error(`Request to ${url} timed out after ${timeout_ms}ms`), { code: "timeout_exceeded" });
55
+ }
56
+ throw Object.assign(new Error(`Connection to ${url} failed: ${error instanceof Error ? error.message : String(error)}`), { code: "http_connection_failed" });
57
+ }
58
+ clearTimeout(timeoutHandle);
59
+ const latencyMs = Math.round(performance.now() - start);
60
+ if (!response.ok) {
61
+ throw Object.assign(new Error(`HTTP ${response.status} from ${url}`), {
62
+ code: "http_error",
63
+ httpStatus: response.status,
64
+ });
65
+ }
66
+ let parsed;
67
+ try {
68
+ parsed = await response.json();
69
+ }
70
+ catch {
71
+ throw Object.assign(new Error(`Response from ${url} is not valid JSON`), { code: "invalid_response_format" });
72
+ }
73
+ const reply = extractReply(parsed, response_field);
74
+ if (reply === null) {
75
+ const field = response_field ?? "message";
76
+ throw Object.assign(new Error(`Response from ${url} missing expected field '${field}'`), { code: "invalid_response_format" });
77
+ }
78
+ return { reply, latencyMs };
79
+ }
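The request-templating half of this adapter can be exercised without a running server. A standalone sketch (both helper bodies are copied verbatim from the file above; the sample template values are made up):

```javascript
// Copied from the http adapter above: substitutes {{message}},
// {{conversation_id}}, and {{env.*}} placeholders; unknown keys become "".
function interpolateTemplate(template, message, conversationId) {
  return template.replace(/\{\{([^}]+)\}\}/g, (_, key) => {
    const k = key.trim();
    if (k === "message") return message;
    if (k === "conversation_id") return conversationId;
    if (k.startsWith("env.")) return process.env[k.slice(4)] ?? "";
    return "";
  });
}

// Copied from the http adapter above: with no template, a default
// { message, conversation_id } body is sent; otherwise each field of the
// template is interpolated into the outgoing request body.
function buildRequestBody(template, message, conversationId) {
  if (!template) {
    return { message, conversation_id: conversationId };
  }
  const result = {};
  for (const [field, valueTemplate] of Object.entries(template)) {
    result[field] = interpolateTemplate(valueTemplate, message, conversationId);
  }
  return result;
}

const body = buildRequestBody(
  { input: "{{message}}", session: "conv-{{conversation_id}}" },
  "where is my order?",
  "c42",
);
console.log(body); // { input: "where is my order?", session: "conv-c42" }
```

Note that unmatched placeholders interpolate to the empty string rather than throwing, so a typo in a `request_template` key produces a blank field instead of a hard error.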