agent-regression-lab (npm) 0.3.0 → 0.5.0

Files changed (24)

package/README.md +25 -4
package/bin/agentlab.js +2 -0
package/dist/config.js +13 -9
package/dist/index.js +14 -0
package/dist/init.js +88 -0
package/dist/tools.js +18 -2
package/dist/ui/App.js +49 -7
package/dist/ui-assets/client.css +1108 -116
package/dist/ui-assets/client.js +863 -426
package/docs/coding-agents.md +74 -0
package/docs/superpowers/plans/2026-04-13-phase-2-lite-phase-3-plan.md +160 -0
package/docs/superpowers/plans/2026-04-13-phase-one-npm-tools-plan.md +502 -0
package/docs/superpowers/plans/2026-04-16-regression-atlas-ui-redesign.md +1010 -0
package/docs/superpowers/specs/2026-04-13-phase-2-lite-phase-3-design.md +164 -0
package/docs/superpowers/specs/2026-04-16-regression-atlas-ui-redesign-design.md +417 -0
package/docs/tools.md +34 -3
package/docs/troubleshooting.md +55 -0
package/examples/coding-tools/README.md +21 -0
package/examples/coding-tools/index.js +11 -0
package/examples/coding-tools/package.json +8 -0
package/examples/support-tools/README.md +21 -0
package/examples/support-tools/index.js +8 -0
package/examples/support-tools/package.json +8 -0
package/package.json +6 -4

package/docs/superpowers/specs/2026-04-13-phase-2-lite-phase-3-design.md ADDED

@@ -0,0 +1,164 @@
+# Phase 2 Lite And Phase 3 Design
+## Goal
+Compress the original Phase 2 into a minimal integration-story pass, then move immediately into Phase 3 UI/demo polish.
+The intent is to preserve product legibility for new users without spending weeks on broad framework coverage before the product is visually demo-ready.
+## Why This Change
+The current product is technically credible, but the main remaining gap is not core capability. It is demonstration quality and onboarding clarity.
+Fully skipping Phase 2 would create a prettier product with weaker adoption paths:
+- users would see polished UI but still ask how it fits their workflow
+- the product would remain support-agent-coded in perception
+- the README and launch story would still lack recognizable entry points
+Keeping a trimmed Phase 2 solves that without delaying Phase 3 materially.
+## Recommended Roadmap Change
+Use this ordering:
+1. `Phase 2-lite`
+2. `Phase 3`
+Do not treat Phase 2-lite as a broad integration campaign. Treat it as the minimum viable integration story required to make Phase 3 polish meaningful.
+## Phase 2 Lite Scope
+Phase 2-lite should deliver only the pieces that make the product legible to new technical users.
+### Keep
+- `arl-test` as the canonical HTTP/live-agent example
+- one CI example using `agentlab run --suite-def pre_merge`
+- one coding-agent example or guide
+- 2-3 README entry points such as:
+  - start here for HTTP agents
+  - start here for coding agents
+  - start here for CI/pre-merge regression
+### Skip For Now
+- broad framework integration coverage
+- multiple framework-specific guides
+- large scenario-pack system work
+- marketplace/community work
+- many hero examples across every ecosystem
+## Phase 2 Lite Deliverables
+### 1. Canonical Integration Paths
+The product should have three obvious ways in:
+- HTTP/live service path
+  - anchored by `arl-test`
+- coding-agent path
+  - enough to prove ARL is not just for support agents
+- CI path
+  - GitHub Actions example using suite definitions
+These should be recognizable and copy-pasteable.
+### 2. README Entry Points
+The README should not just describe the architecture. It should route users by workflow.
+Recommended entry sections:
+- “If your agent runs as an HTTP service”
+- “If you are validating coding-agent changes”
+- “If you want pre-merge regression checks in CI”
+Each section should point to one canonical example, not many.
+### 3. Keep Scope Narrow
+Phase 2-lite should avoid product expansion.
+It should mainly be:
+- examples
+- README routing
+- one CI workflow example
+- one extra concrete use-case path beyond HTTP support
+## Phase 3 Scope
+After Phase 2-lite, Phase 3 becomes the main workstream.
+### Primary Goal
+Make the product demoable, screenshotable, and easier to understand visually.
+### Core Work
+- comparison view redesign
+- clearer red/green regression presentation
+- better trace visualization
+- stronger run history/dashboard view
+- visual polish that feels intentional rather than debug-console minimal
+- README screenshots or GIFs that show the regression story quickly
+### Design Constraint
+Phase 3 should improve clarity, not add ornamental UI.
+Every UI change should help users answer one of these questions faster:
+- what changed?
+- what failed?
+- where did it fail?
+- did the candidate regress?
+- should I trust this run?
+## Success Criteria
+### After Phase 2 Lite
+A new technical user can quickly identify:
+- how to use ARL with an HTTP agent
+- how to use ARL in CI
+- that ARL can also support coding-agent regression workflows
+### After Phase 3
+The product should be visually strong enough that:
+- screenshots are worth sharing
+- demos feel polished
+- mentors and early users understand the product faster
+- the UI helps explain value instead of requiring explanation around it
+## Non-Goals
+This roadmap change does not mean:
+- hosted platform work
+- broad plugin/framework ecosystem support
+- marketplace or virality mechanics
+- replacing core CLI authoring with UI-first configuration
+Those remain later-phase work.
+## Recommended Execution Order
+1. update internal roadmap/task tracking to reflect `Phase 2-lite`
+2. implement the minimal integration-story assets
+3. switch immediately to Phase 3 UI/demo polish
+## Decision
+Use a compressed integration phase, not a skipped integration phase.
+That is the best tradeoff between:
+- speed
+- product clarity
+- demo quality
+- launch readiness

package/docs/superpowers/specs/2026-04-16-regression-atlas-ui-redesign-design.md ADDED

@@ -0,0 +1,417 @@
+# Regression Atlas UI Redesign
+**Date:** April 16, 2026
+**Status:** Approved for spec review
+**Approach:** Scenario-graph-led forensic UI for Agent Regression Lab
+---
+## 1. Overview
+Redesign the local UI so it feels like the product itself instead of a generic admin shell. The current UI is structurally sound for inspection, but it is emotionally flat: table-first, low-drama, and visually interchangeable with internal tooling. The redesign should make the product feel like a purpose-built forensic environment for agent regression work.
+The new direction is `Regression Atlas`: a cinematic, industrial-lab interface organized around a scenario graph instead of a run list. The graph becomes the primary orienting surface. Runs, suite batches, traces, tool calls, evaluator results, and comparisons all attach back to scenarios as evidence artifacts. The visual language should feel measured, severe, and memorable without becoming decorative noise.
+**Primary goals:**
+- Make the UI feel unmistakably like Agent Regression Lab
+- Center the product around scenario topology rather than flat lists
+- Preserve efficient access to run inspection and compare workflows
+- Introduce unconventional but legible structures and components
+- Give the product a durable visual identity that can scale beyond alpha
+**Success criteria:**
+- The home screen is graph-led and immediately communicates the regression surface
+- Users can still reach run details and compare flows quickly without hunting
+- Distinctive components and layout choices create product-specific character
+- The interface is visually bold, but still usable for repeated technical inspection
+- Desktop feels cinematic; mobile remains navigable and coherent
+---
+## 2. Product Narrative
+The interface should feel like a machine for mapping behavioral stability. Users are not browsing rows. They are reading a terrain of workflows, incidents, and behavior changes. The product story becomes:
+1. identify the workflow area under scrutiny
+2. inspect scenario activity and recent instability
+3. open a scenario as a case surface
+4. move through evidence, traces, and comparisons with minimal context switching
+This narrative is important because it informs the layout. The graph is not a decorative hero banner. It is the product's main mental model.
+---
+## 3. Design Principles
+### Scenario First
+The UI should orient users around scenarios and suites before individual runs. Runs are evidence attached to a scenario node, not the top-level object.
+### Forensic, Not Futuristic
+The visual system should feel industrial and analytical, not sci-fi. Use instrument-like framing, measured geometry, etched dividers, and calibrated accents instead of glowing fantasy surfaces.
+### Cinematic Density
+The interface should give important objects room to breathe. Large layout gestures, dramatic negative space, and anchored focal regions should replace dense dashboard packing.
+### Deliberate Unconventionality
+Each major component should have a structural reason to exist beyond styling. The UI should not rely on generic cards, standard KPI strips, or bland tabs where a more product-specific pattern would clarify the product.
+### Operational Fallbacks
+Even with a graph-led experience, table and list views remain easy to reach. Familiar views are secondary tools, not removed capabilities.
+---
+## 4. Information Architecture
+### Primary Modes
+Replace the current route feel of "runs, detail, compare" with four persistent product modes:
+- `Atlas`: scenario graph home and main navigation surface
+- `Cases`: focused scenario or run inspection view
+- `Compare`: run-to-run and suite-to-suite staging plus comparison results
+- `Archive`: tabular history, filters, and fallback list workflows
+These modes can still map to the existing routes initially, but the UI should present them as coherent product territories.
+### Persistent Layout Frame
+The redesign should use a three-zone desktop shell:
+- `Left rail`: compact mode navigation, project identity, filter chips, and view toggles
+- `Center stage`: atlas canvas or primary evidence composition
+- `Right drawer`: contextual evidence rail that updates with selection
+An additional `bottom staging strip` should appear when users pin runs or suite batches for compare. This is a signature product behavior, not a temporary toast.
+### Route Strategy
+Existing routes can remain for implementation simplicity, but each one should visually belong to the new structure:
+- `/` becomes `Atlas`
+- `/runs/:id` becomes `Cases`
+- `/compare` becomes `Compare`
+- `/compare-suite` becomes `Compare`
+The `Archive` mode can initially be a view inside `/` or a derived state rather than a new route if needed.
+---
+## 5. Core Screens
+### Atlas Home
+This is the new landing experience.
+**Structure:**
+- oversized atlas canvas occupies most of the viewport
+- suite fields appear as bounded territories or chambers
+- scenario nodes float inside those fields with status, volatility, and freshness signals
+- a slim incident ribbon or suite activity rail appears near the top
+- selected node opens the right evidence drawer
+**Behavior:**
+- hovering a node reveals recent runs, latest verdict, and compare affordances
+- clicking a node locks selection and loads evidence in the drawer
+- clicking a suite field filters focus to that family and reshapes the graph
+- filters in the left rail update graph state without collapsing it into a flat table
+**Purpose:**
+The home screen should answer "where is risk clustering?" before it answers "what happened in this single run?"
+### Case Surface
+This replaces the plain run detail page.
+**Structure:**
+- top region: scenario identity, verdict totem, runtime metadata, and relationship breadcrumbs
+- center region: trace fracture view and final output
+- right region: evaluator evidence, tool fingerprints, and error fragments
+- lower region: related runs, neighboring scenarios, and quick compare actions
+**Purpose:**
+Make a run feel like a technical case file anchored to its scenario, not a long list of sections.
+### Compare Chamber
+This replaces the standard compare page with a more dramatic inspection composition.
+**Structure:**
+- comparison header framed like an instrument readout
+- baseline and candidate displayed as mirrored evidence columns
+- centerline shows verdict delta, runtime drift, step drift, and classification notes
+- tool and evaluator diffs appear as forensic comparison plates rather than generic cards
+**Purpose:**
+Turn compare into the product's most convincing high-stakes workflow, since comparison is central to the product story.
+### Archive View
+This keeps the current utility but changes its role.
+**Structure:**
+- table lives in a controlled archival mode with stronger typography and filtering
+- grouped rows, pinned compare actions, and scenario-first sorting
+- visual ties back to atlas selections
+**Purpose:**
+Preserve fast scanning and operational familiarity without letting the table define the product's identity.
+---
+## 6. Signature Components
+### Scenario Atlas
+The atlas is the primary hero object. It should not look like a stock node graph or mind map.
+**Rules:**
+- suites render as contained regions with distinct edge treatments
+- scenarios render as shaped nodes, not identical circles
+- node appearance reflects scenario role, status mix, and recent activity
+- active paths between nodes should imply workflow adjacency or shared incident lineage
+- graph motion should be slow, deliberate, and subtle, like a calibration field
+**Visual cues:**
+- pulse rings for recent runs
+- scored scars or notches for historical failures
+- thin route lines for scenario adjacency
+- region labels etched into the background plane instead of floating as badges
+### Verdict Totem
+Replace small pills as the main status expression with a larger instrument-style marker.
+**Rules:**
+- verdict totems combine shape, contrast, and minimal text
+- pass, fail, error, and neutral should differ in silhouette, not only color
+- totems appear in detail and compare views as anchors for visual scanning
+### Evidence Drawer
+The right-side drawer is always contextual and layered.
+**Contents can include:**
+- latest run facts
+- evaluator stack
+- tool activity summary
+- error fragment
+- compare pin action
+- related scenario paths
+This drawer should feel like a lab sidecar, not a modal sidebar.
+### Comparison Staging Strip
+Pinned runs and suite batches live in a visible bottom strip before users enter compare mode.
+**Rules:**
+- items can be added from atlas nodes, archive rows, and case surfaces
+- the strip visually suggests assembly, pairing, and readiness
+- once two compatible artifacts are pinned, compare becomes a primary action
+### Trace Fracture View
+Trace events should be rendered as segmented evidence seams.
+**Rules:**
+- steps read as a vertical or angled fracture line
+- each event opens into a chamber with payload details
+- tool calls, evaluator outputs, and errors use differentiated chamber treatments
+This preserves current trace data while making the inspection experience feel authored.
+---
+## 7. Visual System
+### Material Direction
+Use an industrial-lab language:
+- graphite, iron, mineral white, steel blue-gray
+- oxidized red-orange for primary accent
+- signal amber for active attention
+- acid-lime used sparingly for positive calibration moments
+Avoid the current warm parchment palette. The redesign should feel sharper, cooler, and more engineered.
+### Typography
+Use a paired system:
+- headline and large labels: condensed or narrow grotesk with authority
+- metadata, codes, ids, and micro-labels: technical mono
+Type should do more structural work than color. Region labels, mode labels, and evidence titles should feel engraved and deliberate.
+### Geometry
+Prefer:
+- clipped corners
+- asymmetrical panel cuts
+- inset borders
+- region frames
+- calibrated grid alignments
+Avoid:
+- default rounded dashboard cards
+- generic pill-heavy surfaces
+- equal-weight repeated rectangles
+### Background System
+The background should be an active atmospheric layer:
+- subtle field gradients
+- technical grid ghosts
+- radial inspection glows behind important objects
+- occasional sweep lines or scanning overlays
+These layers should support depth, not distract from text legibility.
+---
+## 8. Motion
+Motion should feel like instrumentation coming online.
+**Principles:**
+- slower entry motion with intention
+- staggered reveals for regions and evidence layers
+- atlas pulses and route-line drift should be ambient, not constant
+- drawer transitions should feel mechanical and precise
+- compare transitions should emphasize lock-in and alignment
+Avoid generic hover bounce, playful springiness, and ornamental motion noise.
+---
+## 9. Responsive Behavior
+### Desktop
+Desktop gets the full three-zone experience with atlas center stage, left rail, right drawer, and bottom staging strip.
+### Tablet
+Tablet keeps atlas first, but the evidence drawer becomes collapsible and the staging strip becomes a compact horizontal tray.
+### Mobile
+Mobile should not attempt to preserve the full atlas as-is.
+Instead:
+- atlas becomes a vertically stacked scenario field navigator
+- selected scenario opens a focused case sheet
+- compare staging becomes a docked drawer
+- archive table becomes stacked list rows with strong grouping
+The mobile goal is continuity of identity, not literal desktop parity.
+---
+## 10. Mapping To Existing Product Data
+The redesign should be driven by data the product already has.
+### Atlas Inputs
+The current run list can derive:
+- scenario ids
+- suite grouping
+- status distribution
+- latest provider
+- freshness by started time
+This is enough for an initial atlas without inventing new backend APIs.
+### Case Surface Inputs
+The current run detail payload already supports:
+- run metadata
+- evaluator results
+- tool calls
+- trace events
+- error detail
+The redesign should remap this into stronger presentation before asking for major data model changes.
+### Compare Inputs
+The current compare payload already supports:
+- classification
+- verdict delta
+- termination delta
+- output changed
+- notes
+- evaluator diffs
+- tool diffs
+This is sufficient to create a more authored compare chamber.
+---
+## 11. Risks And Guardrails
+### Risk: Style Over Utility
+If the graph becomes visually impressive but slower than the current list for common workflows, the redesign fails. The archive and compare-entry affordances must remain obvious and fast.
+### Risk: Graph Complexity Without Real Meaning
+If node placement and connections feel arbitrary, the atlas becomes decorative. The visual logic for grouping, adjacency, and activity must be explicit and stable.
+### Risk: Overusing Accent Colors
+A forensic interface loses credibility when every object screams. Reserve strong accents for verdicts, active selection, and meaningful change.
+### Risk: Mobile Collapse
+The cinematic desktop concept must degrade into a simplified but still authored mobile flow. Mobile should be designed intentionally, not compressed from desktop at the end.
+---
+## 12. Testing Strategy
+The redesign implementation should be validated at three levels:
+- structural tests for route rendering and core state transitions
+- visual smoke validation for atlas, case, compare, and archive layouts
+- interaction checks for selection, compare staging, and responsive behavior
+Specific implementation tests belong in the implementation plan, but the design assumes test coverage for both data mapping and layout resilience.
+---
+## 13. Initial Implementation Scope
+To keep the redesign shippable, the first implementation pass should include:
+- new shell and mode framing
+- atlas home driven by existing run list data
+- redesigned case surface for run detail
+- redesigned compare chamber for run and suite compare
+- archive fallback view
+- core visual system, typography, and motion primitives
+Out of scope for the first pass:
+- editable graph topology
+- advanced graph simulation
+- user-customizable atlas layouts
+- backend schema changes purely for design flourish
+---
+## 14. Final Recommendation
+Implement the UI redesign as `Regression Atlas`: a bold forensic interface where the scenario graph is the product's center of gravity, compare workflows feel like deliberate technical analysis, and every surface reinforces the identity of a local-first regression lab.
+The redesign should be memorable because it is structurally committed, not because it is visually loud. The product should feel like an instrument teams trust when behavior changes matter.
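
Reviewer note on section 10 ("Atlas Inputs") of the spec above: the claim that the existing run list is enough to drive an initial atlas can be sketched as a single grouping pass. This is a minimal illustration, not the package's implementation. The `RunSummary` and `AtlasNode` shapes and every field name in them are assumptions made for the example; the spec only names the derived quantities (scenario ids, suite grouping, status distribution, latest provider, freshness by started time).

```typescript
// Hypothetical run-list row. Field names are assumptions, not the package's types.
interface RunSummary {
  scenarioId: string;
  suite: string;
  status: "pass" | "fail" | "error";
  provider: string;
  startedAt: string; // ISO 8601 timestamp
}

// Hypothetical per-scenario node holding the quantities the spec lists.
interface AtlasNode {
  scenarioId: string;
  suite: string;
  statusCounts: Record<RunSummary["status"], number>;
  latestProvider: string;
  latestStartedAt: string;
}

function deriveAtlasNodes(runs: RunSummary[]): AtlasNode[] {
  const nodes = new Map<string, AtlasNode>();
  for (const run of runs) {
    let node = nodes.get(run.scenarioId);
    if (!node) {
      // First run seen for this scenario: start an empty status distribution.
      node = {
        scenarioId: run.scenarioId,
        suite: run.suite,
        statusCounts: { pass: 0, fail: 0, error: 0 },
        latestProvider: run.provider,
        latestStartedAt: run.startedAt,
      };
      nodes.set(run.scenarioId, node);
    }
    node.statusCounts[run.status] += 1;
    // Freshness: keep the provider and timestamp of the most recent run.
    if (Date.parse(run.startedAt) >= Date.parse(node.latestStartedAt)) {
      node.latestProvider = run.provider;
      node.latestStartedAt = run.startedAt;
    }
  }
  return Array.from(nodes.values());
}
```

Grouping by scenario first keeps runs as evidence attached to a node, which matches the "Scenario First" principle; suite regions could then be built by grouping the returned nodes on `suite`.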

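Reviewer note on section 4 ("Route Strategy") of the same spec: the route-to-mode mapping it describes is small enough to write down directly. The sketch below is an assumption-laden illustration; `modeForPath` and the `Mode` type are hypothetical names, and `Archive` is left out because the spec says it may initially be a view inside `/` rather than a route of its own.

```typescript
// Product modes that the spec ties to existing routes.
type Mode = "Atlas" | "Cases" | "Compare";

// Hypothetical helper: resolve a URL path to the mode it should present as.
function modeForPath(pathname: string): Mode | undefined {
  if (pathname === "/") return "Atlas";
  // `/runs/:id` with exactly one non-empty id segment.
  if (/^\/runs\/[^/]+$/.test(pathname)) return "Cases";
  if (pathname === "/compare" || pathname === "/compare-suite") return "Compare";
  return undefined; // unknown route: the caller decides the fallback
}
```

Both compare routes resolve to the same mode, so run-to-run and suite-to-suite comparison can share one shell treatment while keeping their existing URLs.
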
package/docs/tools.md CHANGED

@@ -1,6 +1,6 @@
 # Custom Tools
-Custom tools are registered in `agentlab.config.yaml` and loaded from repo-local JS or TS modules.
+Custom tools are registered in `agentlab.config.yaml` and can be loaded from repo-local JS/TS modules or installed npm packages.
 This is the main extension point when built-in tools are not enough.
@@ -9,12 +9,14 @@ This is the main extension point when built-in tools are not enough.
 Each tool entry must define:
 - `name`
-- `modulePath`
+- exactly one source:
+  - `modulePath`, or
+  - `package`
 - `exportName`
 - `description`
 - `inputSchema`
-Example:
+Repo-local example:
 ```yaml
 tools:
@@ -33,6 +35,25 @@ tools:
         - customer_id
 ```
+Installed package example:
+```yaml
+tools:
+  - name: support.find_duplicate_charge
+    package: "@agentlab/example-support-tools"
+    exportName: findDuplicateCharge
+    description: Find the duplicated charge order id for a given customer.
+    inputSchema:
+      type: object
+      additionalProperties: false
+      properties:
+        customer_id:
+          type: string
+          description: Customer id to inspect for duplicated charges.
+      required:
+        - customer_id
+```
 ## Tool Module Shape
 The exported function should be async and should return JSON-serializable output.
@@ -48,11 +69,15 @@ export async function myTool(input: unknown): Promise<{ ok: boolean }> {
 The existing working example is:
 - `user_tools/findDuplicateCharge.ts`
+- `examples/support-tools`
+- `examples/coding-tools`
 ## Important Constraints
+- each tool must define exactly one of `modulePath` or `package`
 - `modulePath` must stay within the repo
 - the module must exist at load time
+- installed packages must be resolvable from the current project
 - the named export must exist
 - tool input should be validated defensively inside the tool
 - tool output should be deterministic and JSON-serializable
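
Reviewer note on the new "exactly one of `modulePath` or `package`" constraint: a check of this kind could look like the sketch below. It is not the loader's actual code. `ToolEntry` and `checkToolSource` are hypothetical names, the interface only mirrors the fields the docs list, and the error message wording is invented for the example.

```typescript
// Hypothetical shape of one `tools` entry, mirroring the documented fields.
interface ToolEntry {
  name: string;
  modulePath?: string;
  package?: string;
  exportName: string;
  description: string;
  inputSchema: object;
}

// Returns an error message when the entry has zero or two sources, else null.
function checkToolSource(entry: ToolEntry): string | null {
  const hasModulePath = typeof entry.modulePath === "string" && entry.modulePath.length > 0;
  const hasPackage = typeof entry.package === "string" && entry.package.length > 0;
  // Equal flags mean either both sources are set or neither is.
  if (hasModulePath === hasPackage) {
    return `tool "${entry.name}" must define exactly one of modulePath or package`;
  }
  return null;
}
```

Comparing the two flags for equality covers both failure cases (no source, two sources) in one branch, which keeps the rule aligned with the wording in the docs.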
@@ -100,3 +125,9 @@ Typical config failures:
 - invalid `inputSchema` shape
 See [troubleshooting.md](troubleshooting.md) for failure examples and fixes.
+For installed-package workflows, a good local path is:
+```bash
+npm install @agentlab/example-support-tools
+```
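
Reviewer note on the tool module constraints in this file (async function, defensive input validation, deterministic JSON-serializable output): the sketch below shows one way a tool matching the `customer_id` schema could satisfy them. It is an illustration only and does not reproduce `examples/support-tools`. The fixture data, the `{ order_id }` return shape, and the error messages are all assumptions.

```typescript
// Illustrative fixture data; not taken from the package's examples.
const DUPLICATE_CHARGES: Record<string, string> = {
  cus_100: "ord_1002",
};

// Hypothetical tool body. The return shape is an assumption for this sketch.
async function findDuplicateCharge(input: unknown): Promise<{ order_id: string | null }> {
  // Validate defensively: the input arrives as `unknown`.
  if (typeof input !== "object" || input === null) {
    throw new Error("input must be an object");
  }
  const customerId = (input as { customer_id?: unknown }).customer_id;
  if (typeof customerId !== "string" || customerId.length === 0) {
    throw new Error("customer_id must be a non-empty string");
  }
  // Own-property lookup keeps the result deterministic for any string key.
  const found = Object.prototype.hasOwnProperty.call(DUPLICATE_CHARGES, customerId);
  return { order_id: found ? DUPLICATE_CHARGES[customerId] : null };
}
```

In a real tool module or package, the function would also need to be exported under the name given in `exportName`, since the docs require that the named export exist.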