agent-regression-lab 0.1.1 → 0.2.0

This diff shows the contents of two publicly released package versions as they appear in the public registry. It is provided for informational purposes only.

Files changed (20)

package/README.md +140 -123
package/dist/agent/httpAdapter.js +78 -0
package/dist/agent/mockAdapter.js +210 -13
package/dist/config.js +37 -1
package/dist/conversationEvaluators.js +167 -0
package/dist/conversationRunner.js +199 -0
package/dist/index.js +287 -102
package/dist/lib/id.js +3 -0
package/dist/scenarios.js +121 -9
package/dist/storage.js +193 -29
package/dist/tools.js +246 -0
package/dist/ui/App.js +39 -3
package/dist/ui/server.js +18 -0
package/dist/ui-assets/client.js +83 -3
package/docs/agents.md +152 -0
package/docs/release-checklist.md +64 -0
package/docs/scenarios.md +172 -0
package/docs/tools.md +102 -0
package/docs/troubleshooting.md +158 -0
package/package.json +3 -2

package/README.md CHANGED

@@ -2,61 +2,114 @@
 Agent Regression Lab is a local-first evaluation harness for AI agents.
-It lets you define fixed scenarios in YAML, run an agent against them repeatedly, capture a structured trace, score the result, and compare runs over time.
+It gives you a repeatable way to define scenarios in YAML, run agents against deterministic tool surfaces, store traces and scores locally, and compare runs or suite batches over time.
-This is an alpha developer tool. It is useful now for local benchmarking and debugging, but it is not yet a polished platform.
+This is an alpha developer tool. It is ready for early technical users, but it is not a polished platform.
+## Who It Is For
+- engineers building or debugging agent workflows
+- researchers who want repeatable local evals
+- teams that want a simple local regression harness before investing in heavier infrastructure
 ## What It Supports Today
 - YAML scenarios under `scenarios/`
-- Deterministic built-in tools plus repo-local custom tools from `agentlab.config.yaml`
-- Named agents from `agentlab.config.yaml`
-- Built-in `mock`, `openai`, and `external_process` agent modes
+- deterministic built-in tools plus repo-local custom tools from `agentlab.config.yaml`
+- named agents from `agentlab.config.yaml`
+- built-in `mock`, `openai`, and `external_process` agent modes
 - SQLite-backed local run history under `artifacts/agentlab.db`
 - CLI commands to list, run, show, compare, and launch the UI
-- Local web UI for run inspection and direct run-to-run comparison
+- local web UI for run inspection, run comparison, and suite batch comparison
-## Quickstart
+## First 10 Minutes
-1. Install dependencies:
+The fastest path is to run the CLI from a local checkout.
+1. Install dependencies and build:
 ```bash
 npm install
+npm run check
+npm test
+npm run build
 ```
-2. Run the typecheck, tests, and build:
+2. Verify the CLI:
 ```bash
-npm run check
-npm test
-npm run build
+agentlab --help
 ```
-3. Run a scenario:
+If you have not linked the package locally yet, use:
 ```bash
-npm run start -- run support.refund-correct-order --agent mock-default
+npm link
+agentlab --help
+```
+3. List scenarios:
+```bash
+agentlab list scenarios
+```
+4. Run a deterministic sample scenario:
+```bash
+agentlab run support.refund-correct-order --agent mock-default
+```
+5. Inspect the run:
+```bash
+agentlab show <run-id>
 ```
-4. Inspect a run:
+6. Run the same scenario again, then compare the two runs:
 ```bash
-npm run start -- show <run-id>
+agentlab compare <baseline-run-id> <candidate-run-id>
 ```
-5. Launch the local UI:
+7. Launch the local UI:
 ```bash
-npm run start -- ui
+agentlab ui
 ```
 The UI starts on `http://127.0.0.1:4173`.
-## Installable CLI
+8. Run a suite and compare two suite batches:
+```bash
+agentlab run --suite support --agent mock-default
+agentlab run --suite support --agent mock-default
+agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
+```
-The package can be installed as a Node CLI.
+`run --suite` prints a `Suite batch:` id at the end. That is the id used by `compare --suite`.
-Local development install:
+## Install
+### Installed CLI
+After the package is published:
+```bash
+npm install -g agent-regression-lab
+agentlab --help
+```
+You can also use:
+```bash
+npx agent-regression-lab --help
+```
+### Local Development Install
+From this repo:
 ```bash
 npm install
@@ -65,72 +118,50 @@ npm link
 agentlab --help
 ```
-Packed or published install:
+### Repo-Local Dev Mode
+If you do not want to link the package yet:
 ```bash
-npm install -g agent-regression-lab
-agentlab --help
+npm run start -- --help
+npm run start -- run support.refund-correct-order --agent mock-default
 ```
-The CLI operates on the current working directory. Run it from the root of a project that contains `scenarios/`, `fixtures/`, and optional `agentlab.config.yaml`.
 ## CLI
+Supported command surface:
 ```text
 agentlab list scenarios
 agentlab run <scenario-id> [--agent <name>]
 agentlab run --suite <suite-id> [--agent <name>]
 agentlab show <run-id>
 agentlab compare <baseline-run-id> <candidate-run-id>
+agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
 agentlab ui
+agentlab version
+agentlab help
 ```
-You can also run these through `npm run start -- ...` during local development.
-## Scenarios
-Scenarios are YAML files under `scenarios/`.
-Current scenario features:
-- task instructions
-- fixture references
-- allowed and forbidden tools
-- `max_steps`
-- `timeout_seconds`
-- evaluator configuration
-Example scenario shape:
-```yaml
-id: support.refund-correct-order
-name: Refund The Correct Order
-suite: support
-task:
-  instructions: |
-    The customer says they were charged twice.
-    Find the duplicated charge and refund only that order.
-tools:
-  allowed:
-    - crm.search_customer
-    - orders.list
-    - orders.refund
-runtime:
-  max_steps: 8
-  timeout_seconds: 60
-evaluators:
-  - id: refund-created
-    type: tool_call_assertion
-    mode: hard_gate
-    config:
-      tool: orders.refund
-      match:
-        order_id: ord_1024
-```
+The CLI operates on the current working directory. Run it from the root of a project that contains `scenarios/`, `fixtures/`, and optional `agentlab.config.yaml`.
+## Canonical Workflow
+Use this as the default mental model:
-## Custom Agents And Tools
+1. list scenarios
+2. run one scenario or one suite
+3. note the run id or suite batch id
+4. inspect the run in CLI or UI
+5. compare two runs or two suite batches
+6. extend the setup with a named agent or repo-local tool when needed
-`agentlab.config.yaml` is the extension point for named agents and repo-local tools.
+## Config And Extension Points
+`agentlab.config.yaml` is the public extension point for:
+- named agents
+- repo-local custom tools
 Supported agent providers:
@@ -138,68 +169,54 @@ Supported agent providers:
 - `openai`
 - `external_process`
-Supported custom tool model:
-- repo-local JS/TS module path
-- named export that resolves to an async function
-Example config:
-```yaml
-agents:
-  - name: custom-node-agent
-    provider: external_process
-    command: node
-    args:
-      - custom_agents/node_agent.mjs
-    label: custom-node-agent
-tools:
-  - name: support.find_duplicate_charge
-    modulePath: user_tools/findDuplicateCharge.ts
-    exportName: findDuplicateCharge
-    description: Find the duplicated charge order id for a given customer.
-    inputSchema:
-      type: object
-      additionalProperties: false
-      properties:
-        customer_id:
-          type: string
-      required:
-        - customer_id
-```
+Working sample assets already live in this repo:
+- external agents: `custom_agents/node_agent.mjs`, `custom_agents/python_agent.py`
+- custom tool: `user_tools/findDuplicateCharge.ts`
+- sample config: `agentlab.config.yaml`
+See:
+- [docs/scenarios.md](docs/scenarios.md)
+- [docs/tools.md](docs/tools.md)
+- [docs/agents.md](docs/agents.md)
+- [docs/troubleshooting.md](docs/troubleshooting.md)
+- [docs/release-checklist.md](docs/release-checklist.md)
+## Local Data And Artifacts
-## External Process Protocol
+By default the product writes local state under `artifacts/`.
-External agents communicate with the runner over line-delimited JSON on stdin/stdout.
+Important paths:
-Runner events:
+- SQLite DB: `artifacts/agentlab.db`
+- per-run trace output: `artifacts/<run-id>/trace.json`
+- local UI assets at runtime: served from packaged `dist/ui-assets` or built into `artifacts/ui/` in repo mode
-- `run_started`
-- `tool_result`
-- `runner_error`
+If you delete `artifacts/`, you remove stored run history and generated local outputs.
-Agent responses:
+## Determinism
-- `tool_call`
-- `final`
-- `error`
+The benchmark is designed to be deterministic enough for repeated local evaluation:
-The runner stays in control of the loop. External agents must not execute tools directly.
+- built-in tools read from local fixtures
+- scenarios declare fixed tool allowlists and evaluator rules
+- scoring is rule-based
+- suite comparison is based on stored local runs and suite batch ids
-Minimal flow:
+Agent behavior can still vary depending on the provider path. The built-in `mock` path is the most deterministic path for smoke tests and baseline examples.
-1. runner sends `run_started` with instructions, tool specs, context, and limits
-2. agent sends back a `tool_call` or `final`
-3. runner executes the tool and sends `tool_result`
-4. agent sends the next `tool_call` or `final`
+## Limitations
-See `custom_agents/node_agent.mjs` and `custom_agents/python_agent.py` for working examples.
+- this is a local-first alpha, not a hosted platform
+- custom tool loading is limited to repo-local module paths
+- external agents integrate through the local stdin/stdout protocol only
+- the UI is intentionally minimal and optimized for debugging
+- the benchmark is broader than before, but still small compared to a mature benchmark product
-## Honest Limitations
+## Next Docs
-- comparison is run-to-run, not full suite regression analysis yet
-- tool loading is limited to local repo module paths
-- external agents use the local stdin/stdout protocol only
-- the UI is intentionally minimal and optimized for debugging, not dashboards
-- the benchmark suite is still small
+- scenario authoring: [docs/scenarios.md](docs/scenarios.md)
+- custom tools: [docs/tools.md](docs/tools.md)
+- named agents and external-process protocol: [docs/agents.md](docs/agents.md)
+- common failure modes: [docs/troubleshooting.md](docs/troubleshooting.md)
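
Note on the removed "External Process Protocol" section: the README text deleted in this diff describes line-delimited JSON over stdin/stdout, with runner events `run_started`, `tool_result`, `runner_error` and agent responses `tool_call`, `final`, `error`. The sketch below illustrates only the framing and the turn-taking. The diff does not show the message schema, so the `type`, `tool`, `input`, `output`, and `message` field names are assumptions made for illustration, and `encodeLine`, `decodeLines`, and `respond` are hypothetical helpers, not part of the package.

```javascript
// Line-delimited JSON framing: one JSON object per line.
function encodeLine(message) {
    return JSON.stringify(message) + "\n";
}

// Parse every complete line in `buffer` and return the unfinished tail so the
// caller can prepend it to the next chunk read from stdin.
function decodeLines(buffer) {
    const lines = buffer.split("\n");
    const rest = lines.pop(); // "" when the buffer ended on a newline
    const messages = lines
        .filter((line) => line.trim() !== "")
        .map((line) => JSON.parse(line));
    return { messages, rest };
}

// ASSUMPTION: `type`, `tool`, `input`, `output`, and `message` are hypothetical
// field names. Only the six message names come from the README text.
function respond(event) {
    switch (event.type) {
        case "run_started":
            // Ask the runner to execute a tool; the agent never runs it itself.
            return { type: "tool_call", tool: "orders.list", input: {} };
        case "tool_result":
            return { type: "final", output: "done" };
        default:
            // Covers `runner_error` and anything unrecognized.
            return { type: "error", message: `unexpected event: ${event.type}` };
    }
}

// Simulate two chunks arriving on stdin, the second completing a split line.
const first = decodeLines('{"type":"run_started"}\n{"type":"tool_');
const second = decodeLines(first.rest + 'result"}\n');
for (const event of [...first.messages, ...second.messages]) {
    process.stdout.write(encodeLine(respond(event)));
}
```

Carrying the unfinished tail between reads matters because a stdin chunk can end in the middle of a JSON line. For the real message fields, `docs/agents.md` and the sample agents under `custom_agents/` are the authoritative sources.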

package/dist/agent/httpAdapter.js ADDED

@@ -0,0 +1,78 @@
+import { performance } from "node:perf_hooks";
+export function interpolateTemplate(template, message, conversationId) {
+    return template.replace(/\{\{([^}]+)\}\}/g, (_, key) => {
+        if (key === "message")
+            return message;
+        if (key === "conversation_id")
+            return conversationId;
+        if (key.startsWith("env."))
+            return process.env[key.slice(4)] ?? "";
+        return "";
+    });
+}
+export function buildRequestBody(template, message, conversationId) {
+    if (!template) {
+        return { message, conversation_id: conversationId };
+    }
+    const result = {};
+    for (const [field, valueTemplate] of Object.entries(template)) {
+        result[field] = interpolateTemplate(valueTemplate, message, conversationId);
+    }
+    return result;
+}
+export function extractReply(body, responseField) {
+    const field = responseField ?? "message";
+    if (typeof body === "object" && body !== null && field in body) {
+        const value = body[field];
+        return typeof value === "string" ? value : null;
+    }
+    return null;
+}
+export async function callHttpAgent(input) {
+    const { url, message, conversationId, request_template, response_field, headers = {}, timeout_ms = 30000 } = input;
+    const body = buildRequestBody(request_template, message, conversationId);
+    const interpolatedHeaders = {};
+    for (const [key, value] of Object.entries(headers)) {
+        interpolatedHeaders[key] = interpolateTemplate(value, message, conversationId);
+    }
+    const controller = new AbortController();
+    const timeoutHandle = setTimeout(() => controller.abort(), timeout_ms);
+    const start = performance.now();
+    let response;
+    try {
+        response = await fetch(url, {
+            method: "POST",
+            headers: { "Content-Type": "application/json", ...interpolatedHeaders },
+            body: JSON.stringify(body),
+            signal: controller.signal,
+        });
+    }
+    catch (error) {
+        clearTimeout(timeoutHandle);
+        if (error instanceof Error && error.name === "AbortError") {
+            throw Object.assign(new Error(`Request to ${url} timed out after ${timeout_ms}ms`), { code: "timeout_exceeded" });
+        }
+        throw Object.assign(new Error(`Connection to ${url} failed: ${error instanceof Error ? error.message : String(error)}`), { code: "http_connection_failed" });
+    }
+    clearTimeout(timeoutHandle);
+    const latencyMs = Math.round(performance.now() - start);
+    if (!response.ok) {
+        throw Object.assign(new Error(`HTTP ${response.status} from ${url}`), {
+            code: "http_error",
+            httpStatus: response.status,
+        });
+    }
+    let parsed;
+    try {
+        parsed = await response.json();
+    }
+    catch {
+        throw Object.assign(new Error(`Response from ${url} is not valid JSON`), { code: "invalid_response_format" });
+    }
+    const reply = extractReply(parsed, response_field);
+    if (reply === null) {
+        const field = response_field ?? "message";
+        throw Object.assign(new Error(`Response from ${url} missing expected field '${field}'`), { code: "invalid_response_format" });
+    }
+    return { reply, latencyMs };
+}
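
Note on the templating helpers above: the following sketch exercises `interpolateTemplate`, `buildRequestBody`, and `extractReply` in isolation to show how placeholders resolve. The three functions are copied from the added file with the `export` keyword dropped so the snippet runs as a plain script. The template, the `DEMO_TOKEN` environment variable, and the sample values are hypothetical.

```javascript
// Copied from package/dist/agent/httpAdapter.js in this diff, minus `export`.
function interpolateTemplate(template, message, conversationId) {
    return template.replace(/\{\{([^}]+)\}\}/g, (_, key) => {
        if (key === "message")
            return message;
        if (key === "conversation_id")
            return conversationId;
        if (key.startsWith("env."))
            return process.env[key.slice(4)] ?? "";
        return "";
    });
}
function buildRequestBody(template, message, conversationId) {
    if (!template) {
        return { message, conversation_id: conversationId };
    }
    const result = {};
    for (const [field, valueTemplate] of Object.entries(template)) {
        result[field] = interpolateTemplate(valueTemplate, message, conversationId);
    }
    return result;
}
function extractReply(body, responseField) {
    const field = responseField ?? "message";
    if (typeof body === "object" && body !== null && field in body) {
        const value = body[field];
        return typeof value === "string" ? value : null;
    }
    return null;
}

// Hypothetical environment variable and template, for illustration only.
process.env.DEMO_TOKEN = "secret";
const demoTemplate = {
    text: "{{message}}",
    session: "{{conversation_id}}",
    auth: "Bearer {{env.DEMO_TOKEN}}",
    spaced: "{{ message }}", // key is " message " with spaces, so it resolves to ""
};

console.log(buildRequestBody(demoTemplate, "hello", "conv-1"));
// { text: 'hello', session: 'conv-1', auth: 'Bearer secret', spaced: '' }

console.log(buildRequestBody(undefined, "hello", "conv-1"));
// { message: 'hello', conversation_id: 'conv-1' }

console.log(extractReply({ reply: "ok" }, "reply")); // ok
console.log(extractReply({ message: 42 }));          // null (value is not a string)
```

Two behaviors are worth noting when reading the diff. Placeholder keys are compared without trimming, so `{{ message }}` with inner spaces resolves to an empty string, as do unknown keys and unset `env.` variables. `buildRequestBody` also calls `.replace` on every template value, so it appears to expect a flat object of strings; whether the config loader enforces that is not visible in this file.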