npm - agent-regression-lab - Versions diffs - 0.2.0 → 0.3.0 - Mend

agent-regression-lab 0.2.0 → 0.3.0

Files changed (26)

package/README.md +53 -7
package/dist/agent/factory.js +20 -6
package/dist/agent/httpAdapter.js +5 -4
package/dist/config.js +186 -3
package/dist/evaluators.js +56 -1
package/dist/index.js +143 -11
package/dist/lib/id.js +3 -0
package/dist/runOutput.js +46 -0
package/dist/runner.js +31 -9
package/dist/scenarios.js +90 -2
package/dist/scoring.js +2 -2
package/dist/storage.js +117 -7
package/dist/tools.js +38 -0
package/dist/trace.js +4 -2
package/dist/ui/App.js +28 -2
package/dist/ui-assets/client.js +82 -0
package/docs/agents.md +143 -8
package/docs/golden-suites.md +74 -0
package/docs/integrations-and-live-services.md +58 -0
package/docs/memory-and-stateful-agents.md +51 -0
package/docs/release-checklist.md +30 -0
package/docs/runtime-profiles.md +67 -0
package/docs/scenarios.md +303 -56
package/docs/troubleshooting.md +138 -0
package/docs/variant-sets.md +63 -0
package/package.json +2 -2

package/README.md CHANGED

@@ -1,27 +1,51 @@
 # Agent Regression Lab
-Agent Regression Lab is a local-first evaluation harness for AI agents.
+Agent Regression Lab is the local-first regression spine for agent engineering teams.
-It gives you a repeatable way to define scenarios in YAML, run agents against deterministic tool surfaces, store traces and scores locally, and compare runs or suite batches over time.
+It gives teams a repeatable way to define expected agent behavior in YAML, replay it against deterministic tool surfaces or live HTTP agents, store traces and scores locally, and compare candidate behavior against known baselines over time.
-This is an alpha developer tool. It is ready for early technical users, but it is not a polished platform.
+This is a local-first alpha for early technical teams. It is strongest when used across one workflow spine:
+- debug a single scenario while building
+- validate a branch with a suite before merge
+- run curated golden suites before release
+- keep incident-derived scenarios as engineering memory
 ## Who It Is For
-- engineers building or debugging agent workflows
-- researchers who want repeatable local evals
-- teams that want a simple local regression harness before investing in heavier infrastructure
+- teams shipping prompt, model, tool, workflow, and memory changes
+- engineers who need repeatable before/after evidence instead of vibes
+- teams validating live HTTP agents as well as deterministic local scenarios
+- researchers and technical operators who want local control before adopting heavier hosted infrastructure
+## Why Teams Use It
+- catch regressions before merge or release
+- debug subtle behavioral changes with full traces
+- compare model, prompt, tool, and workflow changes against a known baseline
+- build a portfolio of golden workflows, historical regressions, and ugly edge cases
+- preserve engineering memory so old failures do not quietly return
 ## What It Supports Today
 - YAML scenarios under `scenarios/`
 - deterministic built-in tools plus repo-local custom tools from `agentlab.config.yaml`
 - named agents from `agentlab.config.yaml`
-- built-in `mock`, `openai`, and `external_process` agent modes
+- built-in `mock`, `openai`, `external_process`, and `http` agent modes
+- `type: conversation` multi-turn dialog scenarios for HTTP agents
 - SQLite-backed local run history under `artifacts/agentlab.db`
 - CLI commands to list, run, show, compare, and launch the UI
 - local web UI for run inspection, run comparison, and suite batch comparison
+## Workflow Spine
+Use this as the default product story:
+1. debug locally with one scenario
+2. validate a branch with a suite
+3. run curated golden suites before release
+4. keep incident-derived scenarios as permanent regression assets
 ## First 10 Minutes
 The fastest path is to run the CLI from a local checkout.
@@ -135,6 +159,8 @@ Supported command surface:
 agentlab list scenarios
 agentlab run <scenario-id> [--agent <name>]
 agentlab run --suite <suite-id> [--agent <name>]
+agentlab run --suite-def <name> [--agent <name>]
+agentlab run <scenario-id> [--variant-set <name>]
 agentlab show <run-id>
 agentlab compare <baseline-run-id> <candidate-run-id>
 agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
@@ -156,6 +182,21 @@ Use this as the default mental model:
 5. compare two runs or two suite batches
 6. extend the setup with a named agent or repo-local tool when needed
+## Canonical Live HTTP Fixture
+`arl-test/` is the canonical live HTTP regression fixture in this repo.
+Use it to verify the production-like HTTP path end to end:
+```bash
+cd arl-test
+npm start
+node ../dist/index.js list scenarios
+node ../dist/index.js run order-tracking-in-transit --agent support-agent
+```
+The `arl-test` scenarios are intended to behave like a real internal-team regression fixture, not just a toy demo.
 ## Config And Extension Points
 `agentlab.config.yaml` is the public extension point for:
@@ -168,6 +209,7 @@ Supported agent providers:
 - `mock`
 - `openai`
 - `external_process`
+- `http` — point at a running HTTP service for multi-turn conversation testing
 Working sample assets already live in this repo:
@@ -212,11 +254,15 @@ Agent behavior can still vary depending on the provider path. The built-in `mock
 - custom tool loading is limited to repo-local module paths
 - external agents integrate through the local stdin/stdout protocol only
 - the UI is intentionally minimal and optimized for debugging
+- SQLite-backed local storage still makes sequential live verification the safest path when reusing the same local artifacts DB
 - the benchmark is broader than before, but still small compared to a mature benchmark product
 ## Next Docs
 - scenario authoring: [docs/scenarios.md](docs/scenarios.md)
+- golden suites: [docs/golden-suites.md](docs/golden-suites.md)
+- integrations and live services: [docs/integrations-and-live-services.md](docs/integrations-and-live-services.md)
+- memory and stateful agents: [docs/memory-and-stateful-agents.md](docs/memory-and-stateful-agents.md)
 - custom tools: [docs/tools.md](docs/tools.md)
 - named agents and external-process protocol: [docs/agents.md](docs/agents.md)
 - common failure modes: [docs/troubleshooting.md](docs/troubleshooting.md)

package/dist/agent/factory.js CHANGED

@@ -2,6 +2,20 @@ import { ExternalProcessAgentAdapter } from "./externalProcessAdapter.js";
 import { MockAgentAdapter } from "./mockAdapter.js";
 import { OpenAIResponsesAgentAdapter } from "./openaiResponsesAdapter.js";
 import { createAgentVersionId } from "../lib/id.js";
+function attachIdentityMetadata(version, config) {
+    return {
+        ...version,
+        variantSetName: config.variantSetName,
+        variantLabel: config.variantLabel,
+        promptVersion: config.promptVersion,
+        modelVersion: config.modelVersion,
+        toolSchemaVersion: config.toolSchemaVersion,
+        configLabel: config.configLabel,
+        configHash: config.configHash,
+        runtimeProfileName: config.runtimeProfileName,
+        suiteDefinitionName: config.suiteDefinitionName,
+    };
+}
 class MockAgentAdapterFactory {
     createAdapter() {
         return new MockAgentAdapter();
@@ -9,13 +23,13 @@ class MockAgentAdapterFactory {
     createVersion(config) {
         const label = config.label ?? config.agentName ?? "mock-support-agent-v1";
         const payload = { adapter: "mock", domain: "support", agentName: config.agentName };
-        return {
+        return attachIdentityMetadata({
             id: createAgentVersionId(label, payload),
             label,
             modelId: "mock-model",
             provider: "mock",
             config: payload,
-        };
+        }, config);
     }
 }
 class OpenAIAdapterFactory {
@@ -28,13 +42,13 @@ class OpenAIAdapterFactory {
         const model = config.model ?? "gpt-4o-mini";
         const label = config.label ?? config.agentName ?? `openai-${model}`;
         const payload = { provider: "openai", model, agentName: config.agentName };
-        return {
+        return attachIdentityMetadata({
             id: createAgentVersionId(label, payload),
             label,
             modelId: model,
             provider: "openai",
             config: payload,
-        };
+        }, config);
     }
 }
 class ExternalProcessAdapterFactory {
@@ -53,14 +67,14 @@ class ExternalProcessAdapterFactory {
             args: config.args ?? [],
             agentName: config.agentName,
         };
-        return {
+        return attachIdentityMetadata({
             id: createAgentVersionId(label, payload),
             label,
             provider: "external_process",
             command: config.command,
             args: config.args ?? [],
             config: payload,
-        };
+        }, config);
     }
 }
 export function createAgentFactory(config) {
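
Reviewer note on the hunks above: each factory still computes `id` from `label` and `payload` before the new helper runs, so, as far as this diff shows, the identity fields are carried alongside the version record rather than hashed into its id. The sketch below is a standalone copy of the helper for illustration only. The sample `baseVersion` and config values are invented, and it assumes nothing about how the rest of the package consumes these fields.

```javascript
// Standalone copy of the helper from the diff, reproduced for illustration.
function attachIdentityMetadata(version, config) {
    return {
        ...version,
        variantSetName: config.variantSetName,
        variantLabel: config.variantLabel,
        promptVersion: config.promptVersion,
        modelVersion: config.modelVersion,
        toolSchemaVersion: config.toolSchemaVersion,
        configLabel: config.configLabel,
        configHash: config.configHash,
        runtimeProfileName: config.runtimeProfileName,
        suiteDefinitionName: config.suiteDefinitionName,
    };
}

// Hypothetical version record and config; the values are made up.
const baseVersion = { id: "agent_123", label: "mock-support-agent-v1", provider: "mock" };
const withVariant = attachIdentityMetadata(baseVersion, {
    variantSetName: "prompt-ab",
    variantLabel: "candidate",
});

console.log(withVariant.id === baseVersion.id); // true: the id is untouched
console.log(withVariant.variantLabel); // candidate
// Fields missing from the config are still present as keys, with value undefined.
console.log("promptVersion" in withVariant, withVariant.promptVersion); // true undefined
// JSON.stringify omits undefined-valued properties, so they vanish on serialization.
console.log(JSON.stringify(withVariant).includes("promptVersion")); // false
```

One consequence worth checking in review: code that tests for the presence of a key (rather than its value) will see all nine identity keys on every version object, even when no variant set or runtime profile is in use.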

package/dist/agent/httpAdapter.js CHANGED

@@ -1,12 +1,13 @@
 import { performance } from "node:perf_hooks";
 export function interpolateTemplate(template, message, conversationId) {
     return template.replace(/\{\{([^}]+)\}\}/g, (_, key) => {
-        if (key === "message")
+        const k = key.trim();
+        if (k === "message")
             return message;
-        if (key === "conversation_id")
+        if (k === "conversation_id")
             return conversationId;
-        if (key.startsWith("env."))
-            return process.env[key.slice(4)] ?? "";
+        if (k.startsWith("env."))
+            return process.env[k.slice(4)] ?? "";
         return "";
     });
 }
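
Reviewer note: the only behavioral change in this hunk is the `key.trim()` call. In 0.2.0 a placeholder written with inner spaces, such as `{{ message }}`, captured the key `" message "`, matched none of the branches, and was replaced with an empty string. The sketch below copies the 0.3.0 function without its export so it runs on its own. The template string and argument values are invented for the example.

```javascript
// Copy of the 0.3.0 function from the diff, minus the export keyword.
function interpolateTemplate(template, message, conversationId) {
    return template.replace(/\{\{([^}]+)\}\}/g, (_, key) => {
        const k = key.trim(); // new in 0.3.0: tolerate whitespace inside the braces
        if (k === "message")
            return message;
        if (k === "conversation_id")
            return conversationId;
        if (k.startsWith("env."))
            return process.env[k.slice(4)] ?? "";
        return ""; // unknown placeholders resolve to an empty string
    });
}

// Hypothetical request-body template mixing spaced and unspaced placeholders.
const body = interpolateTemplate(
    '{"text": "{{ message }}", "id": "{{conversation_id}}"}',
    "hello",
    "conv-1",
);
console.log(body); // {"text": "hello", "id": "conv-1"}

// An unrecognized key is dropped silently rather than raising an error.
console.log(JSON.stringify(interpolateTemplate("[{{ nope }}]", "hello", "conv-1"))); // "[]"
```

Note that the substituted values are inserted verbatim. Whether callers escape them before building a JSON body is not visible in this excerpt.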

package/dist/config.js CHANGED

@@ -1,12 +1,12 @@
 import { statSync, readFileSync } from "node:fs";
 import { resolve, relative, sep } from "node:path";
 import { parse } from "yaml";
-const CONFIG_PATH = resolve("agentlab.config.yaml");
 export function loadAgentLabConfig() {
-    if (!exists(CONFIG_PATH)) {
+    const configPath = resolve("agentlab.config.yaml");
+    if (!exists(configPath)) {
         return {};
     }
-    const raw = readFileSync(CONFIG_PATH, "utf8");
+    const raw = readFileSync(configPath, "utf8");
     const parsed = parse(raw);
     validateConfig(parsed);
     return parsed;
@@ -41,6 +41,47 @@ function validateConfig(value) {
             names.add(agent.name);
         }
     }
+    const agents = (value.agents ?? []);
+    const agentNames = new Set(agents.map((agent) => agent.name));
+    if (value.variant_sets !== undefined) {
+        if (!Array.isArray(value.variant_sets)) {
+            throw new Error("agentlab.config.yaml field 'variant_sets' must be an array.");
+        }
+        const names = new Set();
+        for (const variantSet of value.variant_sets) {
+            validateVariantSetDefinition(variantSet, agentNames);
+            if (names.has(variantSet.name)) {
+                throw new Error(`agentlab.config.yaml defines duplicate variant set '${variantSet.name}'.`);
+            }
+            names.add(variantSet.name);
+        }
+    }
+    if (value.runtime_profiles !== undefined) {
+        if (!Array.isArray(value.runtime_profiles)) {
+            throw new Error("agentlab.config.yaml field 'runtime_profiles' must be an array.");
+        }
+        const names = new Set();
+        for (const runtimeProfile of value.runtime_profiles) {
+            validateRuntimeProfileDefinition(runtimeProfile);
+            if (names.has(runtimeProfile.name)) {
+                throw new Error(`agentlab.config.yaml defines duplicate runtime profile '${runtimeProfile.name}'.`);
+            }
+            names.add(runtimeProfile.name);
+        }
+    }
+    if (value.suite_definitions !== undefined) {
+        if (!Array.isArray(value.suite_definitions)) {
+            throw new Error("agentlab.config.yaml field 'suite_definitions' must be an array.");
+        }
+        const names = new Set();
+        for (const suiteDefinition of value.suite_definitions) {
+            validateSuiteDefinition(suiteDefinition);
+            if (names.has(suiteDefinition.name)) {
+                throw new Error(`agentlab.config.yaml defines duplicate suite definition '${suiteDefinition.name}'.`);
+            }
+            names.add(suiteDefinition.name);
+        }
+    }
 }
 function validateToolRegistration(value) {
     if (!isObject(value)) {
@@ -145,6 +186,148 @@ export function getAgentRegistration(name) {
     }
     return match;
 }
+export function getVariantSet(name) {
+    const match = loadAgentLabConfig().variant_sets?.find((variantSet) => variantSet.name === name);
+    if (!match) {
+        throw new Error(`agentlab.config.yaml does not define variant set '${name}'.`);
+    }
+    return match;
+}
+export function getRuntimeProfile(name) {
+    const match = loadAgentLabConfig().runtime_profiles?.find((runtimeProfile) => runtimeProfile.name === name);
+    if (!match) {
+        throw new Error(`agentlab.config.yaml does not define runtime profile '${name}'.`);
+    }
+    return match;
+}
+export function getSuiteDefinition(name) {
+    const match = loadAgentLabConfig().suite_definitions?.find((suiteDefinition) => suiteDefinition.name === name);
+    if (!match) {
+        throw new Error(`agentlab.config.yaml does not define suite definition '${name}'.`);
+    }
+    return match;
+}
+function validateVariantSetDefinition(value, agentNames) {
+    if (!isObject(value)) {
+        throw new Error("Each variant set definition in agentlab.config.yaml must be an object.");
+    }
+    if (typeof value.name !== "string" || value.name.length === 0) {
+        throw new Error("Each variant set definition must define a non-empty 'name'.");
+    }
+    if (!Array.isArray(value.variants)) {
+        throw new Error(`Variant set '${value.name}' must define a 'variants' array.`);
+    }
+    const labels = new Set();
+    for (const variant of value.variants) {
+        if (!isObject(variant)) {
+            throw new Error(`Variant set '${value.name}' contains a non-object variant definition.`);
+        }
+        if (typeof variant.agent !== "string" || variant.agent.length === 0) {
+            throw new Error(`Variant set '${value.name}' contains a variant with a non-empty 'agent' required.`);
+        }
+        if (!agentNames.has(variant.agent)) {
+            throw new Error(`Variant set '${value.name}' references unknown agent '${variant.agent}'.`);
+        }
+        if (typeof variant.label !== "string" || variant.label.length === 0) {
+            throw new Error(`Variant set '${value.name}' contains a variant with a non-empty 'label' required.`);
+        }
+        if (labels.has(variant.label)) {
+            throw new Error(`Variant set '${value.name}' defines duplicate variant label '${variant.label}'.`);
+        }
+        labels.add(variant.label);
+    }
+}
+function validateRuntimeProfileDefinition(value) {
+    if (!isObject(value)) {
+        throw new Error("Each runtime profile definition in agentlab.config.yaml must be an object.");
+    }
+    if (typeof value.name !== "string" || value.name.length === 0) {
+        throw new Error("Each runtime profile definition must define a non-empty 'name'.");
+    }
+    if (value.tool_faults !== undefined) {
+        if (!Array.isArray(value.tool_faults)) {
+            throw new Error(`Runtime profile '${value.name}' field 'tool_faults' must be an array.`);
+        }
+        for (const fault of value.tool_faults) {
+            if (!isObject(fault)) {
+                throw new Error(`Runtime profile '${value.name}' contains a non-object tool fault definition.`);
+            }
+            if (typeof fault.tool !== "string" || fault.tool.length === 0) {
+                throw new Error(`Runtime profile '${value.name}' contains a tool fault with a non-empty 'tool' required.`);
+            }
+            if (fault.mode !== "timeout" && fault.mode !== "error" && fault.mode !== "malformed_output" && fault.mode !== "partial_output") {
+                throw new Error(`Runtime profile '${value.name}' uses invalid tool fault mode '${String(fault.mode)}'.`);
+            }
+            if (fault.error_message !== undefined && (typeof fault.error_message !== "string" || fault.error_message.length === 0)) {
+                throw new Error(`Runtime profile '${value.name}' tool fault for '${fault.tool}' field 'error_message' must be a non-empty string.`);
+            }
+            if (fault.timeout_ms !== undefined && (typeof fault.timeout_ms !== "number" || fault.timeout_ms <= 0)) {
+                throw new Error(`Runtime profile '${value.name}' tool fault for '${fault.tool}' field 'timeout_ms' must be a positive number.`);
+            }
+            if (fault.partial_output !== undefined && !isObject(fault.partial_output)) {
+                throw new Error(`Runtime profile '${value.name}' tool fault for '${fault.tool}' field 'partial_output' must be an object.`);
+            }
+        }
+    }
+    if (value.state !== undefined) {
+        if (!isObject(value.state)) {
+            throw new Error(`Runtime profile '${value.name}' field 'state' must be an object.`);
+        }
+        if (value.state.reset !== "per_run" && value.state.reset !== "per_variant_run" && value.state.reset !== "manual") {
+            throw new Error(`Runtime profile '${value.name}' field 'state.reset' must be one of 'per_run', 'per_variant_run', or 'manual'.`);
+        }
+        if (value.state.seeded_messages !== undefined) {
+            if (!Array.isArray(value.state.seeded_messages)) {
+                throw new Error(`Runtime profile '${value.name}' field 'state.seeded_messages' must be an array.`);
+            }
+            for (const message of value.state.seeded_messages) {
+                if (!isObject(message)) {
+                    throw new Error(`Runtime profile '${value.name}' contains a non-object seeded message.`);
+                }
+                if (message.role !== "user" && message.role !== "assistant") {
+                    throw new Error(`Runtime profile '${value.name}' seeded message role must be 'user' or 'assistant'.`);
+                }
+                if (typeof message.message !== "string" || message.message.length === 0) {
+                    throw new Error(`Runtime profile '${value.name}' seeded message must define a non-empty 'message'.`);
+                }
+            }
+        }
+        if (value.state.memory_blob !== undefined && !isObject(value.state.memory_blob)) {
+            throw new Error(`Runtime profile '${value.name}' field 'state.memory_blob' must be an object.`);
+        }
+    }
+}
+function validateSuiteDefinition(value) {
+    if (!isObject(value)) {
+        throw new Error("Each suite definition in agentlab.config.yaml must be an object.");
+    }
+    if (typeof value.name !== "string" || value.name.length === 0) {
+        throw new Error("Each suite definition must define a non-empty 'name'.");
+    }
+    if (!isObject(value.include)) {
+        throw new Error(`Suite definition '${value.name}' must define an object 'include'.`);
+    }
+    validateSuiteSelectorArray(value.include, value.name, "include.scenarios");
+    validateSuiteSelectorArray(value.include, value.name, "include.tags");
+    validateSuiteSelectorArray(value.include, value.name, "include.suites");
+    if (value.exclude !== undefined) {
+        if (!isObject(value.exclude)) {
+            throw new Error(`Suite definition '${value.name}' field 'exclude' must be an object.`);
+        }
+        validateSuiteSelectorArray(value.exclude, value.name, "exclude.scenarios");
+        validateSuiteSelectorArray(value.exclude, value.name, "exclude.tags");
+        validateSuiteSelectorArray(value.exclude, value.name, "exclude.suites");
+    }
+}
+function validateSuiteSelectorArray(value, suiteName, key) {
+    const fieldName = key.split(".")[1];
+    const selector = value[fieldName];
+    if (selector !== undefined) {
+        if (!Array.isArray(selector) || selector.some((item) => typeof item !== "string")) {
+            throw new Error(`Suite definition '${suiteName}' field '${key}' must be an array of strings.`);
+        }
+    }
+}
 function exists(path) {
     try {
         statSync(path);
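
Reviewer note: the three new top-level keys (`variant_sets`, `runtime_profiles`, `suite_definitions`) are easier to review against a concrete shape. The object below is a hypothetical parsed form of `agentlab.config.yaml` that should satisfy the rules added in this diff. Every name and value is invented, and the agent entry is abbreviated, since the validation of agent registrations is not part of this excerpt. The checks at the end re-implement three of the rules for illustration and are not the package's validator.

```javascript
// Hypothetical parsed config; all names and values are made up for the example.
const sampleConfig = {
    agents: [{ name: "support-agent", provider: "http" }], // abbreviated entry
    variant_sets: [
        {
            name: "prompt-ab",
            variants: [
                { agent: "support-agent", label: "baseline" },
                { agent: "support-agent", label: "candidate" }, // labels must be unique per set
            ],
        },
    ],
    runtime_profiles: [
        {
            name: "flaky-orders",
            tool_faults: [{ tool: "orders.lookup", mode: "timeout", timeout_ms: 500 }],
            // When 'state' is present, 'reset' is effectively required by the validator.
            state: { reset: "per_run", seeded_messages: [{ role: "user", message: "Where is my order?" }] },
        },
    ],
    suite_definitions: [
        // 'include' must be an object; 'exclude' is optional.
        { name: "pre-release", include: { tags: ["golden"] }, exclude: { scenarios: ["known-flaky"] } },
    ],
};

// Rule: every variant must reference an agent declared under 'agents'.
const agentNames = new Set(sampleConfig.agents.map((agent) => agent.name));
const unknownAgents = sampleConfig.variant_sets
    .flatMap((set) => set.variants)
    .filter((variant) => !agentNames.has(variant.agent));

// Rule: tool fault modes are limited to these four strings.
const FAULT_MODES = ["timeout", "error", "malformed_output", "partial_output"];
const badFaults = sampleConfig.runtime_profiles
    .flatMap((profile) => profile.tool_faults ?? [])
    .filter((fault) => !FAULT_MODES.includes(fault.mode));

// Rule: state.reset must be one of three values whenever 'state' is defined.
const RESET_MODES = ["per_run", "per_variant_run", "manual"];
const badResets = sampleConfig.runtime_profiles
    .filter((profile) => profile.state !== undefined && !RESET_MODES.includes(profile.state.reset));

console.log(unknownAgents.length, badFaults.length, badResets.length); // 0 0 0
```

Two details from the diff that are easy to miss: a runtime profile with a `state` block but no `reset` value is rejected, and variant sets are validated against agent names even when the `agents` list is empty, so any variant set in a config without agents will fail.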

package/dist/evaluators.js CHANGED

@@ -13,6 +13,12 @@ function evaluateOne(evaluator, bundle) {
             return evaluateExactFinalAnswer(evaluator, bundle.run.finalOutput);
         case "step_count_max":
             return evaluateStepCountMax(evaluator, bundle.run.totalSteps);
+        case "tool_call_count_max":
+            return evaluateToolCallCountMax(evaluator, bundle.run.totalToolCalls);
+        case "tool_repeat_max":
+            return evaluateToolRepeatMax(evaluator, bundle.toolCalls);
+        case "cost_max":
+            return evaluateCostMax(evaluator, bundle.run.totalCostUsd);
         default:
             return {
                 evaluatorId: evaluator.id,
@@ -86,7 +92,8 @@ function evaluateExactFinalAnswer(evaluator, finalOutput) {
     };
 }
 function evaluateStepCountMax(evaluator, stepCount) {
-    const max = Number(evaluator.config.max_steps ?? 0);
+    const rawMax = evaluator.config.max ?? evaluator.config.max_steps;
+    const max = Number(rawMax ?? 0);
     const passed = stepCount <= max;
     return {
         evaluatorId: evaluator.id,
@@ -98,6 +105,54 @@ function evaluateStepCountMax(evaluator, stepCount) {
         message: passed ? `Step count ${stepCount} is within max ${max}.` : `Step count ${stepCount} exceeds max ${max}.`,
     };
 }
+function evaluateToolCallCountMax(evaluator, totalToolCalls) {
+    const max = Number(evaluator.config.max ?? 0);
+    const passed = totalToolCalls <= max;
+    return {
+        evaluatorId: evaluator.id,
+        evaluatorType: evaluator.type,
+        mode: evaluator.mode,
+        status: passed ? "pass" : "fail",
+        weight: evaluator.weight,
+        rawScore: passed ? 1 : 0,
+        message: passed
+            ? `Tool call count ${totalToolCalls} is within max ${max}.`
+            : `Tool call count ${totalToolCalls} exceeds max ${max}.`,
+    };
+}
+function evaluateToolRepeatMax(evaluator, toolCalls) {
+    const tool = String(evaluator.config.tool ?? "");
+    const max = Number(evaluator.config.max ?? 0);
+    const count = toolCalls.filter((call) => call.toolName === tool).length;
+    const passed = count <= max;
+    return {
+        evaluatorId: evaluator.id,
+        evaluatorType: evaluator.type,
+        mode: evaluator.mode,
+        status: passed ? "pass" : "fail",
+        weight: evaluator.weight,
+        rawScore: passed ? 1 : 0,
+        message: passed
+            ? `Tool '${tool}' usage count ${count} is within max ${max}.`
+            : `Tool '${tool}' usage count ${count} exceeds max ${max}.`,
+    };
+}
+function evaluateCostMax(evaluator, totalCostUsd) {
+    const maxUsd = Number(evaluator.config.max_usd ?? 0);
+    const total = totalCostUsd ?? 0;
+    const passed = total <= maxUsd;
+    return {
+        evaluatorId: evaluator.id,
+        evaluatorType: evaluator.type,
+        mode: evaluator.mode,
+        status: passed ? "pass" : "fail",
+        weight: evaluator.weight,
+        rawScore: passed ? 1 : 0,
+        message: passed
+            ? `Total cost ${total} is within max ${maxUsd}.`
+            : `Total cost ${total} exceeds max ${maxUsd}.`,
+    };
+}
 function matches(input, match) {
     if (!isObject(input)) {
         return false;