npm - agent-regression-lab - Versions diffs - 0.1.0 - Mend

agent-regression-lab 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (22)

package/README.md +205 -0
package/dist/agent/externalProcessAdapter.js +173 -0
package/dist/agent/factory.js +84 -0
package/dist/agent/mockAdapter.js +96 -0
package/dist/agent/openaiResponsesAdapter.js +155 -0
package/dist/config.js +123 -0
package/dist/evaluators.js +109 -0
package/dist/index.js +296 -0
package/dist/lib/fs.js +8 -0
package/dist/lib/id.js +16 -0
package/dist/runOutput.js +13 -0
package/dist/runner.js +199 -0
package/dist/scenarios.js +155 -0
package/dist/scoring.js +18 -0
package/dist/storage.js +394 -0
package/dist/tools.js +128 -0
package/dist/trace.js +30 -0
package/dist/types.js +1 -0
package/dist/ui/App.js +85 -0
package/dist/ui/client.js +10 -0
package/dist/ui/server.js +147 -0
package/package.json +53 -0

package/README.md ADDED

@@ -0,0 +1,205 @@
+# Agent Regression Lab
+Agent Regression Lab is a local-first evaluation harness for AI agents.
+It lets you define fixed scenarios in YAML, run an agent against them repeatedly, capture a structured trace, score the result, and compare runs over time.
+This is an alpha developer tool. It is useful now for local benchmarking and debugging, but it is not yet a polished platform.
+## What It Supports Today
+- YAML scenarios under `scenarios/`
+- Deterministic built-in tools plus repo-local custom tools from `agentlab.config.yaml`
+- Named agents from `agentlab.config.yaml`
+- Built-in `mock`, `openai`, and `external_process` agent modes
+- SQLite-backed local run history under `artifacts/agentlab.db`
+- CLI commands to list, run, show, compare, and launch the UI
+- Local web UI for run inspection and direct run-to-run comparison
+## Quickstart
+1. Install dependencies:
+```bash
+npm install
+```
+2. Run the typecheck, tests, and build:
+```bash
+npm run check
+npm test
+npm run build
+```
+3. Run a scenario:
+```bash
+npm run start -- run support.refund-correct-order --agent mock-default
+```
+4. Inspect a run:
+```bash
+npm run start -- show <run-id>
+```
+5. Launch the local UI:
+```bash
+npm run start -- ui
+```
+The UI starts on `http://127.0.0.1:4173`.
+## Installable CLI
+The package can be installed as a Node CLI.
+Local development install:
+```bash
+npm install
+npm run build
+npm link
+agentlab --help
+```
+Packed or published install:
+```bash
+npm install -g agent-regression-lab
+agentlab --help
+```
+The CLI operates on the current working directory. Run it from the root of a project that contains `scenarios/`, `fixtures/`, and optional `agentlab.config.yaml`.
+## CLI
+```text
+agentlab list scenarios
+agentlab run <scenario-id> [--agent <name>]
+agentlab run --suite <suite-id> [--agent <name>]
+agentlab show <run-id>
+agentlab compare <baseline-run-id> <candidate-run-id>
+agentlab ui
+```
+You can also run these through `npm run start -- ...` during local development.
+## Scenarios
+Scenarios are YAML files under `scenarios/`.
+Current scenario features:
+- task instructions
+- fixture references
+- allowed and forbidden tools
+- `max_steps`
+- `timeout_seconds`
+- evaluator configuration
+Example scenario shape:
+```yaml
+id: support.refund-correct-order
+name: Refund The Correct Order
+suite: support
+task:
+  instructions: |
+    The customer says they were charged twice.
+    Find the duplicated charge and refund only that order.
+tools:
+  allowed:
+    - crm.search_customer
+    - orders.list
+    - orders.refund
+runtime:
+  max_steps: 8
+  timeout_seconds: 60
+evaluators:
+  - id: refund-created
+    type: tool_call_assertion
+    mode: hard_gate
+    config:
+      tool: orders.refund
+      match:
+        order_id: ord_1024
+```
+## Custom Agents And Tools
+`agentlab.config.yaml` is the extension point for named agents and repo-local tools.
+Supported agent providers:
+- `mock`
+- `openai`
+- `external_process`
+Supported custom tool model:
+- repo-local JS/TS module path
+- named export that resolves to an async function
+Example config:
+```yaml
+agents:
+  - name: custom-node-agent
+    provider: external_process
+    command: node
+    args:
+      - custom_agents/node_agent.mjs
+    label: custom-node-agent
+tools:
+  - name: support.find_duplicate_charge
+    modulePath: user_tools/findDuplicateCharge.ts
+    exportName: findDuplicateCharge
+    description: Find the duplicated charge order id for a given customer.
+    inputSchema:
+      type: object
+      additionalProperties: false
+      properties:
+        customer_id:
+          type: string
+      required:
+        - customer_id
+```
+## External Process Protocol
+External agents communicate with the runner over line-delimited JSON on stdin/stdout.
+Runner events:
+- `run_started`
+- `tool_result`
+- `runner_error`
+Agent responses:
+- `tool_call`
+- `final`
+- `error`
+The runner stays in control of the loop. External agents must not execute tools directly.
+Minimal flow:
+1. runner sends `run_started` with instructions, tool specs, context, and limits
+2. agent sends back a `tool_call` or `final`
+3. runner executes the tool and sends `tool_result`
+4. agent sends the next `tool_call` or `final`
+See `custom_agents/node_agent.mjs` and `custom_agents/python_agent.py` for working examples.
+## Honest Limitations
+- comparison is run-to-run, not full suite regression analysis yet
+- tool loading is limited to local repo module paths
+- external agents use the local stdin/stdout protocol only
+- the UI is intentionally minimal and optimized for debugging, not dashboards
+- the benchmark suite is still small
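
The External Process Protocol section of the README above describes a line-delimited JSON exchange. As a rough illustration only, the agent side of that exchange might look like the sketch below. The message shapes (`run_started` carrying `input`, `tool_call` with `toolName` and `input`, `final` with a string `output`) follow the adapter code in this diff. The names `createAgent` and `handleLine` and the two-step behaviour are hypothetical, and this is not the contents of `custom_agents/node_agent.mjs`.

```javascript
// Hypothetical minimal external agent (a sketch, not the package's example agent).
// It turns one runner event line into one response line.
function createAgent() {
  let step = "start"; // tracks where this toy agent is in its two-step plan

  return function handleLine(line) {
    const event = JSON.parse(line);

    if (event.type === "run_started") {
      // The runner sends instructions, tool specs, context, and limits under `input`.
      step = "searching";
      const email = String(event.input?.context?.customer_email ?? "");
      return JSON.stringify({
        type: "tool_call",
        toolName: "crm.search_customer",
        input: { email },
      });
    }

    if (event.type === "tool_result" && step === "searching") {
      // The agent never runs the tool itself; it only reads the runner's result.
      step = "done";
      const customerId = String(event.result?.id ?? "unknown");
      return JSON.stringify({ type: "final", output: `Found customer ${customerId}.` });
    }

    if (event.type === "runner_error") {
      return JSON.stringify({ type: "error", message: String(event.message ?? "runner error") });
    }

    return JSON.stringify({ type: "error", message: `Unexpected event '${String(event.type)}'.` });
  };
}
```

To run something like this as a real process, each line read from stdin (for example through `node:readline`) would be passed to `handleLine`, and the returned string written to stdout followed by a newline.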

package/dist/agent/externalProcessAdapter.js ADDED

@@ -0,0 +1,173 @@
+import { spawn } from "node:child_process";
+import readline from "node:readline";
+class ExternalProcessAgentSession {
+    input;
+    options;
+    process;
+    stdoutLines = [];
+    stderrLines = [];
+    pendingResolver;
+    exited = false;
+    closed = false;
+    constructor(input, options) {
+        this.input = input;
+        this.options = options;
+        this.process = spawn(this.options.command, this.options.args, {
+            stdio: ["pipe", "pipe", "pipe"],
+            env: buildChildEnv(this.options.envAllowlist),
+        });
+        const stdoutReader = readline.createInterface({ input: this.process.stdout });
+        stdoutReader.on("line", (line) => this.handleStdoutLine(line));
+        this.process.stderr.on("data", (chunk) => {
+            this.stderrLines.push(String(chunk).trim());
+        });
+        this.process.on("exit", (code, signal) => {
+            this.exited = true;
+            if (this.pendingResolver) {
+                const detail = this.stderrLines.filter(Boolean).join(" | ");
+                this.pendingResolver.reject(new Error(`External agent exited before responding (code=${String(code)}, signal=${String(signal)}${detail ? `, stderr=${detail}` : ""}).`));
+                clearTimeout(this.pendingResolver.timer);
+                this.pendingResolver = undefined;
+            }
+        });
+    }
+    async next(event) {
+        if (this.exited || this.closed) {
+            return { type: "error", message: "External agent process is no longer running." };
+        }
+        try {
+            const response = await this.sendAndReceive(toProtocolEvent(event, this.input), this.options.responseTimeoutMs);
+            const parsed = parseProtocolResponse(response);
+            if (parsed.type === "final" || parsed.type === "error") {
+                this.close();
+            }
+            return parsed;
+        }
+        catch (error) {
+            this.close();
+            return { type: "error", message: error instanceof Error ? error.message : String(error) };
+        }
+    }
+    sendAndReceive(event, timeoutMs) {
+        this.process.stdin.write(`${JSON.stringify(event)}\n`);
+        if (this.stdoutLines.length > 0) {
+            return Promise.resolve(this.stdoutLines.shift());
+        }
+        return new Promise((resolve, reject) => {
+            const timer = setTimeout(() => {
+                this.pendingResolver = undefined;
+                reject(new Error(`External agent timed out after ${timeoutMs}ms waiting for a response.`));
+            }, timeoutMs);
+            this.pendingResolver = { resolve, reject, timer };
+        });
+    }
+    handleStdoutLine(line) {
+        const trimmed = line.trim();
+        if (!trimmed) {
+            return;
+        }
+        if (this.pendingResolver) {
+            const { resolve, timer } = this.pendingResolver;
+            clearTimeout(timer);
+            this.pendingResolver = undefined;
+            resolve(trimmed);
+            return;
+        }
+        this.stdoutLines.push(trimmed);
+    }
+    close() {
+        if (this.closed) {
+            return;
+        }
+        this.closed = true;
+        if (!this.exited) {
+            this.process.kill();
+        }
+    }
+}
+export class ExternalProcessAgentAdapter {
+    options;
+    constructor(options) {
+        this.options = options;
+    }
+    async startRun(input) {
+        if (!this.options.command) {
+            throw new Error("External process agent requires a command.");
+        }
+        return new ExternalProcessAgentSession(input, {
+            command: this.options.command,
+            args: this.options.args ?? [],
+            envAllowlist: this.options.envAllowlist ?? [],
+            responseTimeoutMs: this.options.responseTimeoutMs ?? 10_000,
+        });
+    }
+}
+function toProtocolEvent(event, input) {
+    if (event.type === "run_started") {
+        return { type: "run_started", input };
+    }
+    if (event.type === "tool_result") {
+        return event;
+    }
+    return event;
+}
+function parseProtocolResponse(raw) {
+    let parsed;
+    try {
+        parsed = JSON.parse(raw);
+    }
+    catch {
+        throw new Error(`External agent returned invalid JSON: ${raw}`);
+    }
+    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed) || typeof parsed.type !== "string") {
+        throw new Error("External agent returned an invalid protocol message.");
+    }
+    const type = parsed.type;
+    if (type === "tool_call") {
+        if (typeof parsed.toolName !== "string") {
+            throw new Error("External agent tool_call response is missing toolName.");
+        }
+        return {
+            type: "tool_call",
+            toolName: parsed.toolName,
+            input: parsed.input ?? {},
+            metadata: isObject(parsed.metadata) ? parsed.metadata : undefined,
+        };
+    }
+    if (type === "final") {
+        if (typeof parsed.output !== "string") {
+            throw new Error("External agent final response is missing output.");
+        }
+        return {
+            type: "final",
+            output: parsed.output,
+            metadata: isObject(parsed.metadata) ? parsed.metadata : undefined,
+        };
+    }
+    if (type === "error") {
+        if (typeof parsed.message !== "string") {
+            throw new Error("External agent error response is missing message.");
+        }
+        return {
+            type: "error",
+            message: parsed.message,
+            retryable: Boolean(parsed.retryable),
+        };
+    }
+    throw new Error(`External agent returned unsupported response type '${String(type)}'.`);
+}
+function buildChildEnv(allowlist) {
+    const env = {};
+    for (const key of allowlist) {
+        if (process.env[key] !== undefined) {
+            env[key] = process.env[key];
+        }
+    }
+    env.PATH = process.env.PATH;
+    env.PWD = process.cwd();
+    env.HOME = process.env.HOME;
+    return env;
+}
+function isObject(value) {
+    return typeof value === "object" && value !== null && !Array.isArray(value);
+}
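
The `buildChildEnv` helper in the file above gives the child process only the allowlisted variables plus `PATH`, `PWD`, and `HOME`. The sketch below restates that behaviour as a standalone function so the effect is easy to see. The name `buildChildEnvSketch` and the `sourceEnv` and `cwd` parameters are assumptions introduced here for illustration; the packaged code reads `process.env` and `process.cwd()` directly.

```javascript
// Illustrative restatement of the env allowlist idea (hypothetical helper name and parameters).
function buildChildEnvSketch(allowlist, sourceEnv, cwd) {
  const env = {};
  for (const key of allowlist) {
    // Copy only allowlisted variables that are actually set.
    if (sourceEnv[key] !== undefined) {
      env[key] = sourceEnv[key];
    }
  }
  // These three are always passed through, as in the adapter above.
  env.PATH = sourceEnv.PATH;
  env.PWD = cwd;
  env.HOME = sourceEnv.HOME;
  return env;
}

const childEnv = buildChildEnvSketch(
  ["OPENAI_API_KEY", "NOT_SET"],
  { PATH: "/usr/bin", HOME: "/home/dev", OPENAI_API_KEY: "sk-test", DB_PASSWORD: "secret" },
  "/work/project",
);
// Object.keys(childEnv) → ["OPENAI_API_KEY", "PATH", "PWD", "HOME"]
```

`DB_PASSWORD` is dropped because it is not on the allowlist, and `NOT_SET` is skipped because it has no value in the source environment.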

package/dist/agent/factory.js ADDED

@@ -0,0 +1,84 @@
+import { ExternalProcessAgentAdapter } from "./externalProcessAdapter.js";
+import { MockAgentAdapter } from "./mockAdapter.js";
+import { OpenAIResponsesAgentAdapter } from "./openaiResponsesAdapter.js";
+import { createAgentVersionId } from "../lib/id.js";
+class MockAgentAdapterFactory {
+    createAdapter() {
+        return new MockAgentAdapter();
+    }
+    createVersion(config) {
+        const label = config.label ?? config.agentName ?? "mock-support-agent-v1";
+        const payload = { adapter: "mock", domain: "support", agentName: config.agentName };
+        return {
+            id: createAgentVersionId(label, payload),
+            label,
+            modelId: "mock-model",
+            provider: "mock",
+            config: payload,
+        };
+    }
+}
+class OpenAIAdapterFactory {
+    createAdapter() {
+        return new OpenAIResponsesAgentAdapter({
+            apiKey: process.env.OPENAI_API_KEY,
+        });
+    }
+    createVersion(config) {
+        const model = config.model ?? "gpt-4o-mini";
+        const label = config.label ?? config.agentName ?? `openai-${model}`;
+        const payload = { provider: "openai", model, agentName: config.agentName };
+        return {
+            id: createAgentVersionId(label, payload),
+            label,
+            modelId: model,
+            provider: "openai",
+            config: payload,
+        };
+    }
+}
+class ExternalProcessAdapterFactory {
+    createAdapter(config = {}) {
+        return new ExternalProcessAgentAdapter({
+            command: config.command ?? "",
+            args: config.args ?? [],
+            envAllowlist: config.envAllowlist ?? [],
+        });
+    }
+    createVersion(config) {
+        const label = config.label ?? config.agentName ?? "external-process-agent";
+        const payload = {
+            provider: "external_process",
+            command: config.command,
+            args: config.args ?? [],
+            agentName: config.agentName,
+        };
+        return {
+            id: createAgentVersionId(label, payload),
+            label,
+            provider: "external_process",
+            command: config.command,
+            args: config.args ?? [],
+            config: payload,
+        };
+    }
+}
+export function createAgentFactory(config) {
+    switch (config.provider) {
+        case "mock":
+            return new MockAgentAdapterFactory();
+        case "openai":
+            return new OpenAIAdapterFactory();
+        case "external_process":
+            return {
+                createAdapter: () => new ExternalProcessAgentAdapter({
+                    command: config.command ?? "",
+                    args: config.args ?? [],
+                    envAllowlist: config.envAllowlist ?? [],
+                }),
+                createVersion: (runtimeConfig) => new ExternalProcessAdapterFactory().createVersion(runtimeConfig),
+            };
+        default:
+            throw new Error(`Unsupported provider '${String(config.provider)}'.`);
+    }
+}

package/dist/agent/mockAdapter.js ADDED

@@ -0,0 +1,96 @@
+class MockAgentSession {
+    input;
+    state = { step: "start" };
+    constructor(input) {
+        this.input = input;
+    }
+    hasTool(toolName) {
+        return this.input.availableTools.some((tool) => tool.name === toolName);
+    }
+    async next(event) {
+        if (event.type === "runner_error") {
+            return { type: "error", message: event.message };
+        }
+        if (this.state.step === "start") {
+            const email = String(this.input.context.customer_email ?? "");
+            this.state = { step: "listed_customer" };
+            return {
+                type: "tool_call",
+                toolName: "crm.search_customer",
+                input: { email },
+                metadata: { message: "Looking up customer." },
+            };
+        }
+        if (this.state.step === "listed_customer") {
+            if (event.type !== "tool_result") {
+                return { type: "error", message: "Expected customer lookup result." };
+            }
+            const result = event.result;
+            if (this.hasTool("support.find_duplicate_charge")) {
+                this.state = { step: "found_duplicate" };
+                return {
+                    type: "tool_call",
+                    toolName: "support.find_duplicate_charge",
+                    input: { customer_id: String(result.id ?? "") },
+                    metadata: { message: "Looking up the duplicated order directly." },
+                };
+            }
+            this.state = { step: "listed_orders" };
+            return {
+                type: "tool_call",
+                toolName: "orders.list",
+                input: { customer_id: String(result.id ?? "") },
+                metadata: { message: "Listing customer orders." },
+            };
+        }
+        if (this.state.step === "listed_orders") {
+            if (event.type !== "tool_result" || !Array.isArray(event.result)) {
+                return { type: "error", message: "Expected order list result." };
+            }
+            const duplicate = event.result.find((order) => typeof order === "object" && order !== null && order.id === "ord_1024");
+            if (!duplicate?.id) {
+                return { type: "error", message: "Could not identify duplicate order." };
+            }
+            this.state = { step: "done" };
+            return {
+                type: "tool_call",
+                toolName: "orders.refund",
+                input: { order_id: duplicate.id },
+                metadata: { message: "Refunding the duplicated charge." },
+            };
+        }
+        if (this.state.step === "found_duplicate") {
+            if (event.type !== "tool_result" || typeof event.result !== "object" || event.result === null) {
+                return { type: "error", message: "Expected duplicate lookup result." };
+            }
+            const result = event.result;
+            if (!result.order_id) {
+                return { type: "error", message: "Duplicate lookup did not return an order id." };
+            }
+            this.state = { step: "done" };
+            return {
+                type: "tool_call",
+                toolName: "orders.refund",
+                input: { order_id: result.order_id },
+                metadata: { message: "Refunding the duplicated charge." },
+            };
+        }
+        if (this.state.step === "done") {
+            if (event.type !== "tool_result" || typeof event.result !== "object" || event.result === null) {
+                return { type: "error", message: "Expected refund result." };
+            }
+            const refund = event.result;
+            return {
+                type: "final",
+                output: `Refunded duplicated charge on order ${refund.order_id} for ${refund.amount} ${refund.currency}.`,
+                metadata: { completed: true },
+            };
+        }
+        return { type: "error", message: "Unexpected session state." };
+    }
+}
+export class MockAgentAdapter {
+    async startRun(input) {
+        return new MockAgentSession(input);
+    }
+}
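
Both session classes above expose an async `next(event)` that returns a `tool_call`, `final`, or `error`, and the README states that the runner stays in control of the loop. A driver for such a session could be sketched roughly as follows. This is an assumption-heavy illustration: `runLoop`, the `tools` lookup object, and the handling of unknown tools and the step limit are hypothetical, and the package's own `runner.js` is not shown in this part of the diff.

```javascript
// Hypothetical driver loop for a session with an async next(event) method.
// `tools` maps tool names to async functions; it stands in for the real tool registry.
async function runLoop(session, tools, maxSteps) {
  let event = { type: "run_started" };

  for (let step = 0; step < maxSteps; step += 1) {
    const response = await session.next(event);

    if (response.type !== "tool_call") {
      return response; // "final" or "error" ends the run
    }

    const tool = tools[response.toolName];
    if (!tool) {
      // Assumed behaviour: report the problem back to the agent as a runner_error.
      event = { type: "runner_error", message: `Tool '${response.toolName}' is not available.` };
      continue;
    }

    // The runner, not the agent, executes the tool and feeds the result back.
    event = { type: "tool_result", result: await tool(response.input) };
  }

  return { type: "error", message: `Exceeded max steps (${maxSteps}).` };
}
```

With a session like `MockAgentSession`, the loop would alternate between `next` calls and tool executions until the session returns `final`, which mirrors the minimal flow listed in the README.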