agent-regression-lab 0.4.0 → 0.5.0

# Regression Atlas UI Redesign

**Date:** April 16, 2026
**Status:** Approved for spec review
**Approach:** Scenario-graph-led forensic UI for Agent Regression Lab

---

## 1. Overview

Redesign the local UI so it feels like the product itself instead of a generic admin shell. The current UI is structurally sound for inspection, but it is emotionally flat: table-first, low-drama, and visually interchangeable with internal tooling. The redesign should make the product feel like a purpose-built forensic environment for agent regression work.

The new direction is `Regression Atlas`: a cinematic, industrial-lab interface organized around a scenario graph instead of a run list. The graph becomes the primary orienting surface. Runs, suite batches, traces, tool calls, evaluator results, and comparisons all attach back to scenarios as evidence artifacts. The visual language should feel measured, severe, and memorable without becoming decorative noise.

**Primary goals:**
- Make the UI feel unmistakably like Agent Regression Lab
- Center the product around scenario topology rather than flat lists
- Preserve efficient access to run inspection and compare workflows
- Introduce unconventional but legible structures and components
- Give the product a durable visual identity that can scale beyond alpha

**Success criteria:**
- The home screen is graph-led and immediately communicates the regression surface
- Users can still reach run details and compare flows quickly without hunting
- Distinctive components and layout choices create product-specific character
- The interface is visually bold, but still usable for repeated technical inspection
- Desktop feels cinematic; mobile remains navigable and coherent

---

## 2. Product Narrative

The interface should feel like a machine for mapping behavioral stability. Users are not browsing rows. They are reading a terrain of workflows, incidents, and behavior changes. The product story becomes:

1. identify the workflow area under scrutiny
2. inspect scenario activity and recent instability
3. open a scenario as a case surface
4. move through evidence, traces, and comparisons with minimal context switching

This narrative is important because it informs the layout. The graph is not a decorative hero banner. It is the product's main mental model.

---

## 3. Design Principles

### Scenario First

The UI should orient users around scenarios and suites before individual runs. Runs are evidence attached to a scenario node, not the top-level object.

### Forensic, Not Futuristic

The visual system should feel industrial and analytical, not sci-fi. Use instrument-like framing, measured geometry, etched dividers, and calibrated accents instead of glowing fantasy surfaces.

### Cinematic Density

The interface should give important objects room to breathe. Large layout gestures, dramatic negative space, and anchored focal regions should replace dense dashboard packing.

### Deliberate Unconventionality

Each major component should have a structural reason to exist beyond styling. The UI should not rely on generic cards, standard KPI strips, or bland tabs where a more product-specific pattern would communicate more clearly.

### Operational Fallbacks

Even with a graph-led experience, table and list views remain easy to reach. Familiar views are secondary tools, not removed capabilities.

---

## 4. Information Architecture

### Primary Modes

Replace the current "runs, detail, compare" route structure with four persistent product modes:

- `Atlas`: scenario graph home and main navigation surface
- `Cases`: focused scenario or run inspection view
- `Compare`: run-to-run and suite-to-suite staging plus comparison results
- `Archive`: tabular history, filters, and fallback list workflows

These modes can still map to the existing routes initially, but the UI should present them as coherent product territories.

### Persistent Layout Frame

The redesign should use a three-zone desktop shell:

- `Left rail`: compact mode navigation, project identity, filter chips, and view toggles
- `Center stage`: atlas canvas or primary evidence composition
- `Right drawer`: contextual evidence rail that updates with selection

An additional `bottom staging strip` should appear when users pin runs or suite batches for compare. This is a signature product behavior, not a temporary toast.

### Route Strategy

Existing routes can remain for implementation simplicity, but each one should visually belong to the new structure:

- `/` becomes `Atlas`
- `/runs/:id` becomes `Cases`
- `/compare` becomes `Compare`
- `/compare-suite` becomes `Compare`

The `Archive` mode can initially be a view inside `/` or a derived state rather than a new route if needed.
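
As a sketch, this route-to-mode mapping can be expressed as a small resolver. Only the route paths come from this document; the `Mode` type, the `resolveMode` helper, and the `archiveView` flag (standing in for however the derived Archive state is eventually exposed) are hypothetical names for illustration.

```typescript
// Hypothetical sketch: resolving the current route to one of the
// four product modes. Route paths are from the design doc; the
// type and function names are illustrative only.
type Mode = "atlas" | "cases" | "compare" | "archive";

function resolveMode(pathname: string, archiveView = false): Mode {
  // Archive starts life as a derived view of "/", not a separate route.
  if (archiveView) return "archive";
  if (pathname.startsWith("/runs/")) return "cases";
  if (pathname === "/compare" || pathname === "/compare-suite") return "compare";
  return "atlas"; // "/" and anything unrecognized falls back to the Atlas home
}
```

Keeping this as a pure function means the mode framing can be adopted without touching the existing router.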

---

## 5. Core Screens

### Atlas Home

This is the new landing experience.

**Structure:**
- oversized atlas canvas occupies most of the viewport
- suite fields appear as bounded territories or chambers
- scenario nodes float inside those fields with status, volatility, and freshness signals
- a slim incident ribbon or suite activity rail appears near the top
- selecting a node opens the right evidence drawer

**Behavior:**
- hovering a node reveals recent runs, latest verdict, and compare affordances
- clicking a node locks selection and loads evidence in the drawer
- clicking a suite field filters focus to that family and reshapes the graph
- filters in the left rail update graph state without collapsing it into a flat table

**Purpose:**
The home screen should answer "where is risk clustering?" before it answers "what happened in this single run?"

### Case Surface

This replaces the plain run detail page.

**Structure:**
- top region: scenario identity, verdict totem, runtime metadata, and relationship breadcrumbs
- center region: trace fracture view and final output
- right region: evaluator evidence, tool fingerprints, and error fragments
- lower region: related runs, neighboring scenarios, and quick compare actions

**Purpose:**
Make a run feel like a technical case file anchored to its scenario, not a long list of sections.

### Compare Chamber

This replaces the standard compare page with a more dramatic inspection composition.

**Structure:**
- comparison header framed like an instrument readout
- baseline and candidate displayed as mirrored evidence columns
- centerline shows verdict delta, runtime drift, step drift, and classification notes
- tool and evaluator diffs appear as forensic comparison plates rather than generic cards

**Purpose:**
Turn compare into the product's most convincing high-stakes workflow, since comparison is central to the product story.

### Archive View

This keeps the current utility but changes its role.

**Structure:**
- table lives in a controlled archival mode with stronger typography and filtering
- grouped rows, pinned compare actions, and scenario-first sorting
- visual ties back to atlas selections

**Purpose:**
Preserve fast scanning and operational familiarity without letting the table define the product's identity.

---

## 6. Signature Components

### Scenario Atlas

The atlas is the primary hero object. It should not look like a stock node graph or mind map.

**Rules:**
- suites render as contained regions with distinct edge treatments
- scenarios render as shaped nodes, not identical circles
- node appearance reflects scenario role, status mix, and recent activity
- active paths between nodes should imply workflow adjacency or shared incident lineage
- graph motion should be slow, deliberate, and subtle, like a calibration field

**Visual cues:**
- pulse rings for recent runs
- scored scars or notches for historical failures
- thin route lines for scenario adjacency
- region labels etched into the background plane instead of floating as badges

### Verdict Totem

Replace small status pills with a larger instrument-style marker as the product's main expression of status.

**Rules:**
- verdict totems combine shape, contrast, and minimal text
- pass, fail, error, and neutral should differ in silhouette, not only color
- totems appear in detail and compare views as anchors for visual scanning

### Evidence Drawer

The right-side drawer is always contextual and layered.

**Contents can include:**
- latest run facts
- evaluator stack
- tool activity summary
- error fragment
- compare pin action
- related scenario paths

This drawer should feel like a lab sidecar, not a modal sidebar.

### Comparison Staging Strip

Pinned runs and suite batches live in a visible bottom strip before users enter compare mode.

**Rules:**
- items can be added from atlas nodes, archive rows, and case surfaces
- the strip visually suggests assembly, pairing, and readiness
- once two compatible artifacts are pinned, compare becomes a primary action
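
The readiness rule in the last bullet can be sketched as a pure function. This assumes pinned artifacts are either runs or suite batches; `StagedArtifact` and `canCompare` are hypothetical names, not existing product APIs.

```typescript
// Hypothetical sketch of the staging strip's readiness rule:
// compare unlocks only when exactly two compatible artifacts are pinned.
type StagedArtifact = { id: string; kind: "run" | "suite-batch" };

function canCompare(pins: StagedArtifact[]): boolean {
  if (pins.length !== 2) return false;
  // Runs pair with runs, suite batches with suite batches,
  // and an artifact cannot be compared against itself.
  return pins[0].kind === pins[1].kind && pins[0].id !== pins[1].id;
}
```

Because readiness is derived rather than stored, the strip can recompute it on every pin or unpin without extra state.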

### Trace Fracture View

Trace events should be rendered as segmented evidence seams.

**Rules:**
- steps read as a vertical or angled fracture line
- each event opens into a chamber with payload details
- tool calls, evaluator outputs, and errors use differentiated chamber treatments

This preserves current trace data while making the inspection experience feel authored.
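
A minimal sketch of the chamber treatments, assuming trace events carry a kind discriminator; the type and function names here are illustrative stand-ins, not the product's actual trace schema.

```typescript
// Hypothetical sketch: choosing a chamber treatment per trace event.
// Tool calls, evaluator outputs, and errors get differentiated
// treatments; plain steps fall back to a neutral chamber.
type TraceEvent = {
  kind: "tool-call" | "evaluator" | "error" | "step";
  payload?: unknown;
};
type Chamber = "tool" | "evaluator" | "error" | "plain";

function chamberFor(event: TraceEvent): Chamber {
  switch (event.kind) {
    case "tool-call":
      return "tool";
    case "evaluator":
      return "evaluator";
    case "error":
      return "error";
    default:
      return "plain";
  }
}
```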

---

## 7. Visual System

### Material Direction

Use an industrial-lab language:

- graphite, iron, mineral white, steel blue-gray
- oxidized red-orange for primary accent
- signal amber for active attention
- acid-lime used sparingly for positive calibration moments

Avoid the current warm parchment palette. The redesign should feel sharper, cooler, and more engineered.

### Typography

Use a paired system:

- headline and large labels: condensed or narrow grotesk with authority
- metadata, codes, ids, and micro-labels: technical mono

Type should do more structural work than color. Region labels, mode labels, and evidence titles should feel engraved and deliberate.

### Geometry

Prefer:

- clipped corners
- asymmetrical panel cuts
- inset borders
- region frames
- calibrated grid alignments

Avoid:

- default rounded dashboard cards
- generic pill-heavy surfaces
- equal-weight repeated rectangles

### Background System

The background should be an active atmospheric layer:

- subtle field gradients
- technical grid ghosts
- radial inspection glows behind important objects
- occasional sweep lines or scanning overlays

These layers should support depth, not distract from text legibility.

---

## 8. Motion

Motion should feel like instrumentation coming online.

**Principles:**
- slower entry motion with intention
- staggered reveals for regions and evidence layers
- atlas pulses and route-line drift should be ambient, not constant
- drawer transitions should feel mechanical and precise
- compare transitions should emphasize lock-in and alignment

Avoid generic hover bounce, playful springiness, and ornamental motion noise.

---

## 9. Responsive Behavior

### Desktop

Desktop gets the full three-zone experience with atlas center stage, left rail, right drawer, and bottom staging strip.

### Tablet

Tablet keeps atlas first, but the evidence drawer becomes collapsible and the staging strip becomes a compact horizontal tray.

### Mobile

Mobile should not attempt to preserve the full atlas as-is.

Instead:
- atlas becomes a vertically stacked scenario field navigator
- selected scenario opens a focused case sheet
- compare staging becomes a docked drawer
- archive table becomes stacked list rows with strong grouping

The mobile goal is continuity of identity, not literal desktop parity.

---

## 10. Mapping To Existing Product Data

The redesign should be driven by data the product already has.

### Atlas Inputs

The current run list already provides enough to derive:
- scenario ids
- suite grouping
- status distribution
- latest provider
- freshness by started time

This is enough for an initial atlas without inventing new backend APIs.
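
As a sketch of that derivation, assuming a run record carrying the five fields above: the `Run` and `AtlasNode` shapes and `deriveAtlasNodes` are hypothetical stand-ins for whatever the real run list payload uses.

```typescript
// Hypothetical sketch: folding the existing run list into per-scenario
// atlas nodes (suite grouping, status distribution, latest provider,
// and freshness by started time).
type Run = {
  scenarioId: string;
  suite: string;
  status: "pass" | "fail" | "error";
  provider: string;
  startedAt: string; // ISO timestamp, so lexicographic order is chronological
};

type AtlasNode = {
  scenarioId: string;
  suite: string;
  statusCounts: Record<string, number>;
  latestProvider: string;
  lastStartedAt: string;
};

function deriveAtlasNodes(runs: Run[]): AtlasNode[] {
  const byScenario = new Map<string, AtlasNode>();
  for (const run of runs) {
    const node = byScenario.get(run.scenarioId) ?? {
      scenarioId: run.scenarioId,
      suite: run.suite,
      statusCounts: {},
      latestProvider: run.provider,
      lastStartedAt: run.startedAt,
    };
    node.statusCounts[run.status] = (node.statusCounts[run.status] ?? 0) + 1;
    if (run.startedAt >= node.lastStartedAt) {
      node.lastStartedAt = run.startedAt; // freshness by started time
      node.latestProvider = run.provider;
    }
    byScenario.set(run.scenarioId, node);
  }
  return [...byScenario.values()];
}
```

A fold like this runs client-side over the payload the run list endpoint already returns, which is why no new backend API is needed for the first atlas.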

### Case Surface Inputs

The current run detail payload already supports:
- run metadata
- evaluator results
- tool calls
- trace events
- error detail

The redesign should remap this into stronger presentation before asking for major data model changes.

### Compare Inputs

The current compare payload already supports:
- classification
- verdict delta
- termination delta
- output changed
- notes
- evaluator diffs
- tool diffs

This is sufficient to create a more authored compare chamber.
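
To illustrate, the fields above are already enough to drive the chamber's centerline readout. The `ComparePayload` shape and `centerlineSummary` helper below are hypothetical names mirroring the listed fields, not the product's actual API.

```typescript
// Hypothetical sketch: composing the compare chamber's centerline
// readout from the fields the compare payload already carries.
type ComparePayload = {
  classification: string;
  verdictDelta: string | null;
  terminationDelta: string | null;
  outputChanged: boolean;
  notes: string[];
  evaluatorDiffs: unknown[];
  toolDiffs: unknown[];
};

function centerlineSummary(c: ComparePayload): string {
  const parts = [c.classification];
  if (c.verdictDelta) parts.push(`verdict: ${c.verdictDelta}`);
  if (c.terminationDelta) parts.push(`termination: ${c.terminationDelta}`);
  parts.push(c.outputChanged ? "output changed" : "output stable");
  return parts.join(" | ");
}
```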

---

## 11. Risks And Guardrails

### Risk: Style Over Utility

If the graph becomes visually impressive but slower than the current list for common workflows, the redesign fails. The archive and compare-entry affordances must remain obvious and fast.

### Risk: Graph Complexity Without Real Meaning

If node placement and connections feel arbitrary, the atlas becomes decorative. The visual logic for grouping, adjacency, and activity must be explicit and stable.

### Risk: Overusing Accent Colors

A forensic interface loses credibility when every object screams. Reserve strong accents for verdicts, active selection, and meaningful change.

### Risk: Mobile Collapse

The cinematic desktop concept must degrade into a simplified but still authored mobile flow. Mobile should be designed intentionally, not compressed from desktop at the end.

---

## 12. Testing Strategy

The redesign implementation should be validated at three levels:

- structural tests for route rendering and core state transitions
- visual smoke validation for atlas, case, compare, and archive layouts
- interaction checks for selection, compare staging, and responsive behavior

Specific implementation tests belong in the implementation plan, but the design assumes test coverage for both data mapping and layout resilience.

---

## 13. Initial Implementation Scope

To keep the redesign shippable, the first implementation pass should include:

- new shell and mode framing
- atlas home driven by existing run list data
- redesigned case surface for run detail
- redesigned compare chamber for run and suite compare
- archive fallback view
- core visual system, typography, and motion primitives

Out of scope for the first pass:

- editable graph topology
- advanced graph simulation
- user-customizable atlas layouts
- backend schema changes purely for design flourish

---

## 14. Final Recommendation

Implement the UI redesign as `Regression Atlas`: a bold forensic interface where the scenario graph is the product's center of gravity, compare workflows feel like deliberate technical analysis, and every surface reinforces the identity of a local-first regression lab.

The redesign should be memorable because it is structurally committed, not because it is visually loud. The product should feel like an instrument teams trust when behavior changes matter.
package/package.json CHANGED

```diff
@@ -1,6 +1,6 @@
 {
   "name": "agent-regression-lab",
-  "version": "0.4.0",
+  "version": "0.5.0",
   "private": false,
   "description": "Local-first scenario-based evaluation harness for AI agents.",
   "license": "MIT",
```