agent-regression-lab 0.1.0 → 0.2.0 (npm version diff, Mend)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (21)

package/README.md +140 -123
package/dist/agent/httpAdapter.js +78 -0
package/dist/agent/mockAdapter.js +210 -13
package/dist/config.js +37 -1
package/dist/conversationEvaluators.js +167 -0
package/dist/conversationRunner.js +199 -0
package/dist/index.js +287 -102
package/dist/lib/id.js +3 -0
package/dist/scenarios.js +121 -9
package/dist/storage.js +193 -29
package/dist/tools.js +246 -0
package/dist/ui/App.js +39 -3
package/dist/ui/server.js +47 -3
package/dist/ui-assets/client.css +174 -0
package/dist/ui-assets/client.js +22185 -0
package/docs/agents.md +152 -0
package/docs/release-checklist.md +64 -0
package/docs/scenarios.md +172 -0
package/docs/tools.md +102 -0
package/docs/troubleshooting.md +158 -0
package/package.json +6 -3

package/docs/agents.md ADDED

@@ -0,0 +1,152 @@
+# Agents
+Named agents are configured in `agentlab.config.yaml`.
+This repo currently supports three provider modes:
+- `mock`
+- `openai`
+- `external_process`
+## Named Agent Config
+Example:
+```yaml
+agents:
+  - name: mock-default
+    provider: mock
+    label: mock-default
+  - name: openai-cheap
+    provider: openai
+    model: gpt-4o-mini
+    label: openai-cheap
+  - name: custom-node-agent
+    provider: external_process
+    command: node
+    args:
+      - custom_agents/node_agent.mjs
+    label: custom-node-agent
+```
+Run a named agent with:
+```bash
+agentlab run support.refund-correct-order --agent mock-default
+```
+## Mock
+The built-in mock adapter is the best path for deterministic smoke tests and baseline examples.
+Use it when you want:
+- fast local verification
+- stable docs examples
+- predictable benchmark behavior
+## OpenAI
+The OpenAI path uses your API key and a configured model.
+Requirements:
+- `OPENAI_API_KEY` in the environment
+- a named `openai` agent in `agentlab.config.yaml`, or equivalent CLI runtime settings
+Example:
+```bash
+export OPENAI_API_KEY=...
+agentlab run support.refund-correct-order --agent openai-cheap
+```
+The OpenAI path is useful, but less deterministic than the mock path.
+## External Process
+External-process agents communicate with the runner over line-delimited JSON on stdin/stdout.
+The runner stays in control of:
+- tool execution
+- stopping conditions
+- runtime limits
+- persisted run state
+The external agent decides what tool to call next or when to return a final answer.
+### Protocol
+Runner events:
+- `run_started`
+- `tool_result`
+- `runner_error`
+Agent responses:
+- `tool_call`
+- `final`
+- `error`
+Minimal flow:
+1. the runner sends `run_started`
+2. the agent returns `tool_call` or `final`
+3. the runner executes the tool and sends `tool_result`
+4. the agent continues until it returns `final` or `error`
+Working examples:
+- `custom_agents/node_agent.mjs`
+- `custom_agents/python_agent.py`
+Run one of them with:
+```bash
+agentlab run support.refund-via-config-tool --agent custom-node-agent
+```
+## Environment Allowlist
+External-process agents can optionally define `envAllowlist`.
+Use it when a child process needs specific environment variables passed through.
+Example shape:
+```yaml
+agents:
+  - name: custom-agent
+    provider: external_process
+    command: node
+    args:
+      - custom_agents/node_agent.mjs
+    envAllowlist:
+      - OPENAI_API_KEY
+```
+Only allow through what the child actually needs.
+## Best Practices
+- use named agents instead of ad hoc local command strings
+- keep labels stable so compare output stays readable
+- prefer the mock path for smoke tests and docs
+- use external-process agents when you want to wrap a local Node or Python agent implementation
+- keep the runner authoritative for tools and termination
+## Common Errors
+Typical failures:
+- missing `OPENAI_API_KEY`
+- unsupported provider name
+- missing external-process `command`
+- invalid `args` or `envAllowlist`
+- child process returning invalid JSON
+See [troubleshooting.md](troubleshooting.md) for fixes.
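
Reviewer note on the External Process protocol in `docs/agents.md`: the added docs name the runner events and agent responses but not their payload fields. The following is a minimal TypeScript sketch of the agent side, assuming (hypothetically) that each line is a JSON object with a `type` discriminator and that `tool`, `input`, `output`, and `message` carry the payload. The real shapes should be taken from `custom_agents/node_agent.mjs`, which this sketch does not reproduce.

```typescript
// Sketch only: the "type", "tool", "input", "output", and "message" field
// names are assumptions; the docs list the event names but not their payloads.
type RunnerEvent =
  | { type: "run_started" }
  | { type: "tool_result"; output?: unknown }
  | { type: "runner_error"; message?: string };

type AgentResponse =
  | { type: "tool_call"; tool: string; input: Record<string, unknown> }
  | { type: "final"; output: string }
  | { type: "error"; message: string };

// Decide the next step: call one tool after the run starts, then finish.
function decide(event: RunnerEvent): AgentResponse {
  switch (event.type) {
    case "run_started":
      return { type: "tool_call", tool: "orders.list", input: {} };
    case "tool_result":
      return { type: "final", output: "done" };
    case "runner_error":
      return { type: "error", message: event.message ?? "runner error" };
    default:
      return { type: "error", message: "unknown event type" };
  }
}

// Turn one line of runner input into one line of agent output.
function handleLine(line: string): string {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return JSON.stringify({ type: "error", message: "invalid JSON from runner" });
  }
  if (typeof parsed !== "object" || parsed === null) {
    return JSON.stringify({ type: "error", message: "expected a JSON object" });
  }
  return JSON.stringify(decide(parsed as RunnerEvent));
}
```

Wiring `handleLine` to stdin and stdout is left out so the sketch stays independent of the runtime. Keeping the decision logic in a pure function also makes it testable without spawning a child process.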

package/docs/release-checklist.md ADDED

@@ -0,0 +1,64 @@
+# Release Checklist
+Use this before publishing a new npm version or telling users to upgrade.
+## Verification
+Run the full release gate:
+```bash
+npm run check
+npm test
+npm run build
+npm run smoke:cli
+npm pack --dry-run
+```
+## Manual CLI Flow
+Verify the canonical workflow:
+```bash
+agentlab list scenarios
+agentlab run support.refund-correct-order --agent mock-default
+agentlab show <run-id>
+agentlab run support.refund-correct-order --agent mock-default
+agentlab compare <baseline-run-id> <candidate-run-id>
+agentlab run --suite support --agent mock-default
+agentlab run --suite support --agent mock-default
+agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
+agentlab ui
+```
+## Extension Smoke
+Verify at least one extension path:
+- run `support.refund-via-config-tool` with `custom-node-agent`, or
+- verify a repo-local custom tool still loads from `agentlab.config.yaml`
+## Docs Verification
+Confirm these files match current behavior:
+- `README.md`
+- `docs/scenarios.md`
+- `docs/tools.md`
+- `docs/agents.md`
+- `docs/troubleshooting.md`
+Requirements:
+- every command works as written
+- every referenced path exists
+- limitations are stated honestly
+- `compare --suite` is documented using suite batch ids, not run ids
+## Publish Hygiene
+Before `npm publish`:
+- confirm the package version is correct
+- confirm the git tree contains the intended release changes
+- confirm packaged UI assets are included in the tarball
+- confirm the npm metadata still points at the correct repo, homepage, and issues URL

package/docs/scenarios.md ADDED

@@ -0,0 +1,172 @@
+# Scenarios
+Scenarios are YAML files under `scenarios/`. They are the core authoring interface for the product.
+Each scenario should describe one narrow job for the agent, not a vague capability test.
+## Required Shape
+Each scenario should define:
+- `id`
+- `name`
+- `suite`
+- `task`
+- `tools`
+- `runtime`
+- `evaluators`
+Common optional fields already used in this repo:
+- `description`
+- `difficulty`
+- `tags`
+- task `context`
+## Example
+```yaml
+id: support.refund-correct-order
+name: Refund The Correct Order
+suite: support
+difficulty: easy
+description: Refund only the duplicated charge.
+tags:
+  - refund
+  - support
+task:
+  instructions: |
+    The customer says they were charged twice.
+    Find the duplicated charge and refund only that order.
+  context:
+    customer_email: alice@example.com
+tools:
+  allowed:
+    - crm.search_customer
+    - orders.list
+    - orders.refund
+runtime:
+  max_steps: 8
+  timeout_seconds: 60
+evaluators:
+  - id: refund-created
+    type: tool_call_assertion
+    mode: hard_gate
+    config:
+      tool: orders.refund
+      match:
+        order_id: ord_1024
+  - id: mentions-order
+    type: final_answer_contains
+    mode: weighted
+    weight: 1
+    config:
+      required_substrings:
+        - ord_1024
+```
+## Suites In This Repo
+Current benchmark domains:
+- `support`
+- `coding`
+- `research`
+- `ops`
+Use a suite when scenarios belong to one behavior family and should be runnable together with:
+```bash
+agentlab run --suite support --agent mock-default
+```
+`run --suite` creates a suite batch id. That id is later used for:
+```bash
+agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
+```
+Suite comparison is strict. Only compare batches from the same suite.
+## Tools
+Each scenario declares its allowed tools:
+```yaml
+tools:
+  allowed:
+    - crm.search_customer
+    - orders.list
+    - orders.refund
+```
+Keep the tool allowlist as narrow as possible. A broad allowlist weakens the benchmark and makes regressions harder to interpret.
+This repo supports both:
+- built-in deterministic tools
+- repo-local custom tools registered in `agentlab.config.yaml`
+The launch benchmark now includes built-in tools for:
+- support
+- coding
+- research
+- ops
+See [tools.md](tools.md) for custom tool registration.
+## Runtime Limits
+Scenarios can enforce:
+- `max_steps`
+- `timeout_seconds`
+Example:
+```yaml
+runtime:
+  max_steps: 8
+  timeout_seconds: 60
+```
+These limits are enforced by the runner. Use them to keep runs bounded and comparisons meaningful.
+## Evaluators
+Use deterministic evaluators only.
+The current evaluator set includes:
+- `tool_call_assertion`
+- `forbidden_tool`
+- `final_answer_contains`
+- `exact_final_answer`
+- `step_count_max`
+Guidance:
+- use hard gates for non-negotiable behavior
+- use weighted evaluators for softer quality checks
+- prefer tool assertions or exact output checks over vague answer checks when possible
+## Authoring Conventions
+Use these defaults:
+- `id` format: `<suite>.<short-name>`
+- keep scenario jobs narrow and concrete
+- keep fixture-backed context in `task.context`
+- prefer deterministic fixture references over open-ended prompts
+- include `difficulty`, `description`, and `tags` for every launch scenario
+## Current Examples
+Useful scenario references in this repo:
+- support: `scenarios/support/refund-correct-order.yaml`
+- support with config tool: `scenarios/support/refund-via-config-tool.yaml`
+- coding: `scenarios/coding/fix-add-function.yaml`
+- research: `scenarios/research/remote-work-policy.yaml`
+- ops: `scenarios/ops/payments-api-alert.yaml`
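
Reviewer note on the Evaluators section in `docs/scenarios.md`: the docs list the evaluator types and show their YAML config, but not what each check does. The following is a rough sketch of what `tool_call_assertion` and `final_answer_contains` might verify, assuming the runner records tool calls as `{ tool, input }` pairs. The `ToolCall` type and both function names are hypothetical; only the config fields (`tool`, `match`, `required_substrings`) come from the example scenario.

```typescript
// Sketch only: ToolCall and the function names are hypothetical; the config
// fields (tool, match, required_substrings) mirror the YAML example.
type ToolCall = { tool: string; input: Record<string, unknown> };

// Passes when at least one recorded call used the tool with every matched field.
function toolCallAssertion(
  calls: ToolCall[],
  config: { tool: string; match?: Record<string, unknown> },
): boolean {
  const expected = Object.entries(config.match ?? {});
  return calls.some(
    (call) =>
      call.tool === config.tool &&
      expected.every(([key, value]) => call.input[key] === value),
  );
}

// Passes when the final answer includes every required substring.
function finalAnswerContains(
  answer: string,
  config: { required_substrings: string[] },
): boolean {
  return config.required_substrings.every((s) => answer.includes(s));
}
```

Both checks are pure functions of the recorded run, which is what makes them deterministic. How the package combines `hard_gate` and `weighted` results into a verdict is not shown in this diff, so the sketch leaves it out.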

package/docs/tools.md ADDED

@@ -0,0 +1,102 @@
+# Custom Tools
+Custom tools are registered in `agentlab.config.yaml` and loaded from repo-local JS or TS modules.
+This is the main extension point when built-in tools are not enough.
+## What A Tool Registration Needs
+Each tool entry must define:
+- `name`
+- `modulePath`
+- `exportName`
+- `description`
+- `inputSchema`
+Example:
+```yaml
+tools:
+  - name: support.find_duplicate_charge
+    modulePath: user_tools/findDuplicateCharge.ts
+    exportName: findDuplicateCharge
+    description: Find the duplicated charge order id for a given customer.
+    inputSchema:
+      type: object
+      additionalProperties: false
+      properties:
+        customer_id:
+          type: string
+          description: Customer id to inspect for duplicated charges.
+      required:
+        - customer_id
+```
+## Tool Module Shape
+The exported function should be async and should return JSON-serializable output.
+Minimal example:
+```ts
+export async function myTool(input: unknown): Promise<{ ok: boolean }> {
+  return { ok: true };
+}
+```
+The existing working example is:
+- `user_tools/findDuplicateCharge.ts`
+## Important Constraints
+- `modulePath` must stay within the repo
+- the module must exist at load time
+- the named export must exist
+- tool input should be validated defensively inside the tool
+- tool output should be deterministic and JSON-serializable
+For launch usage, treat tools as fixture-backed local functions, not live integrations.
+## Recommended Pattern
+Use this approach:
+1. read fixture data from `fixtures/`
+2. validate the input shape
+3. return a small structured result
+4. throw a clear error for missing fixture state or invalid input
+The current `findDuplicateCharge` tool shows that pattern.
+## Wiring A Tool Into A Scenario
+1. register the tool in `agentlab.config.yaml`
+2. add the tool name to the scenario allowlist
+3. add an evaluator that confirms the tool was used correctly if the behavior is important
+Example scenario:
+- `scenarios/support/refund-via-config-tool.yaml`
+## Best Practices
+- keep tool names stable and descriptive
+- keep tools scenario-agnostic where possible
+- prefer read-only or sandboxed behavior
+- do not mutate global machine state
+- do not call live external systems in benchmark paths
+- keep schemas narrow so agent tool calls are easy to validate and compare
+## Common Errors
+Typical config failures:
+- duplicate tool names
+- repo-external module paths
+- missing module files
+- missing exports
+- invalid `inputSchema` shape
+See [troubleshooting.md](troubleshooting.md) for failure examples and fixes.
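
Reviewer note on the Recommended Pattern in `docs/tools.md`: the four steps (read fixture data, validate input, return a small structured result, throw a clear error) can be sketched as below. This is not the packaged `findDuplicateCharge` tool, whose source is not part of this diff. The `Order` shape, the in-memory `FIXTURE`, the ids, and the "repeated amount" rule are all assumptions made for illustration, and the fixture is inlined rather than read from `fixtures/` so the example runs on its own.

```typescript
// Sketch only: the Order shape, the in-memory FIXTURE, and the function names
// are hypothetical. A real tool would load this data from fixtures/ instead.
type Order = { order_id: string; customer_id: string; amount_cents: number };

const FIXTURE: Order[] = [
  { order_id: "ord_1001", customer_id: "cus_1", amount_cents: 2500 },
  { order_id: "ord_1002", customer_id: "cus_1", amount_cents: 4000 },
  { order_id: "ord_1003", customer_id: "cus_1", amount_cents: 4000 },
];

// Synchronous core: validate the input, then return a small structured result.
function lookupDuplicate(input: unknown, orders: Order[]): { order_id: string } {
  if (typeof input !== "object" || input === null) {
    throw new Error("input must be an object");
  }
  const customerId = (input as { customer_id?: unknown }).customer_id;
  if (typeof customerId !== "string") {
    throw new Error("customer_id must be a string");
  }
  const seen = new Set<number>();
  for (const order of orders) {
    if (order.customer_id !== customerId) continue;
    // The second order with an already-seen amount is treated as the duplicate.
    if (seen.has(order.amount_cents)) return { order_id: order.order_id };
    seen.add(order.amount_cents);
  }
  throw new Error(`no duplicated charge found for ${customerId}`);
}

// Async wrapper matching the documented shape: async, JSON-serializable output.
async function findDuplicateChargeSketch(input: unknown): Promise<{ order_id: string }> {
  return lookupDuplicate(input, FIXTURE);
}
```

Splitting the lookup from the async wrapper keeps the deterministic part easy to unit test, while the exported function still matches the async signature the docs ask for.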

package/docs/troubleshooting.md ADDED

@@ -0,0 +1,158 @@
+# Troubleshooting
+This page covers the main failure modes users hit during install, first run, and comparison.
+## `agentlab: command not found`
+You are probably in one of these states:
+- the package is not installed globally
+- you have not run `npm link` from the repo
+- your shell path does not include npm global bins
+Fast fixes:
+```bash
+npm install
+npm run build
+npm link
+agentlab --help
+```
+Or skip linking and use:
+```bash
+npm run start -- --help
+```
+## `OPENAI_API_KEY is required`
+You used an OpenAI-backed agent without exporting the API key.
+Fix:
+```bash
+export OPENAI_API_KEY=...
+agentlab run support.refund-correct-order --agent openai-cheap
+```
+## `No scenarios found for suite ...`
+The suite id must match a suite under `scenarios/`.
+List valid options:
+```bash
+agentlab list scenarios
+```
+Current built-in suites in this repo include:
+- `support`
+- `coding`
+- `research`
+- `ops`
+## `Run '<id>' not found`
+`show` and run-to-run `compare` require run ids from completed runs.
+Get a fresh run id by executing a scenario:
+```bash
+agentlab run support.refund-correct-order --agent mock-default
+```
+Then use:
+```bash
+agentlab show <run-id>
+agentlab compare <baseline-run-id> <candidate-run-id>
+```
+## `Missing baseline or candidate suite batch id`
+`compare --suite` does not use run ids. It uses suite batch ids printed by `run --suite`.
+Example:
+```bash
+agentlab run --suite support --agent mock-default
+agentlab run --suite support --agent mock-default
+agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
+```
+## Cross-suite suite comparison errors
+Suite batch comparison is strict. Compare batches from the same suite only.
+This is valid:
+```bash
+agentlab compare --suite suite_...support_batch_a suite_...support_batch_b
+```
+This is not valid:
+- a `support` batch compared against an `ops` batch
+- mixed or malformed suite batch selections
+If you are unsure which batch came from which suite, rerun the suite and record the printed batch ids.
+## `agentlab ui` fails to load assets
+Installed packages should already include the built UI assets.
+If you are running from a repo checkout, build first:
+```bash
+npm install
+npm run build
+agentlab ui
+```
+If the problem persists, verify that these files exist:
+- `dist/ui-assets/client.js`
+- `dist/ui-assets/client.css`
+## Config tool or agent not found
+Typical reasons:
+- `agentlab.config.yaml` is missing
+- the configured `name` does not match the CLI `--agent` value
+- `modulePath` points outside the repo
+- the configured export or command does not exist
+Working references in this repo:
+- tool config: `agentlab.config.yaml`
+- custom tool: `user_tools/findDuplicateCharge.ts`
+- external agents: `custom_agents/node_agent.mjs`, `custom_agents/python_agent.py`
+## Global install behaves differently from repo mode
+That usually means the current working directory is wrong.
+The CLI operates on the current working directory and expects:
+- `scenarios/`
+- `fixtures/`
+- optional `agentlab.config.yaml`
+Run it from the project root you want to evaluate.
+## Release Verification
+Before publishing or cutting a release, run:
+```bash
+npm run check
+npm test
+npm run build
+npm run smoke:cli
+npm pack --dry-run
+```
+For the full pre-launch checklist, see [release-checklist.md](release-checklist.md).
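
Reviewer note on "`modulePath` points outside the repo" in `docs/troubleshooting.md` (also listed as a constraint in `docs/tools.md`): the docs state the rule but not how it is checked. One common way to test path containment in Node is to resolve the path against the root and inspect the relative result. `isInsideRepo` below is a hypothetical helper, not the package's loader, and it does not follow symlinks.

```typescript
import * as path from "node:path";

// Sketch only: isInsideRepo is a hypothetical helper, not the package's loader.
// Returns true when modulePath, resolved against repoRoot, stays under repoRoot.
function isInsideRepo(repoRoot: string, modulePath: string): boolean {
  const root = path.resolve(repoRoot);
  const target = path.resolve(root, modulePath);
  const rel = path.relative(root, target);
  // "" means the root itself; a leading ".." segment or an absolute result
  // (a different drive on Windows) means the target escaped the root.
  const escapes =
    rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  return !escapes;
}
```

Comparing on the relative path rather than a string prefix avoids false positives such as `/repo-other` passing a check against `/repo`.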

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "agent-regression-lab",
-  "version": "0.1.0",
+  "version": "0.2.0",
   "private": false,
   "description": "Local-first scenario-based evaluation harness for AI agents.",
   "license": "MIT",
@@ -25,13 +25,16 @@
   },
   "files": [
     "dist",
-    "README.md"
+    "dist/ui-assets",
+    "README.md",
+    "docs"
   ],
   "engines": {
     "node": ">=22"
   },
   "scripts": {
-    "build": "tsc -p tsconfig.json",
+    "build": "tsc -p tsconfig.json && npm run build:ui",
+    "build:ui": "esbuild src/ui/client.tsx --bundle --format=esm --platform=browser --outdir=dist/ui-assets --loader:.css=css --log-level=warning",
     "check": "tsc -p tsconfig.json --noEmit",
     "test": "tsx --test tests/**/*.test.ts",
     "smoke:cli": "npm run build && node dist/index.js --help && node dist/index.js version",