agent-regression-lab 0.4.0 → 0.5.0
@@ -0,0 +1,417 @@

# Regression Atlas UI Redesign

**Date:** April 16, 2026
**Status:** Approved for spec review
**Approach:** Scenario-graph-led forensic UI for Agent Regression Lab

---

## 1. Overview

Redesign the local UI so it feels like the product itself instead of a generic admin shell. The current UI is structurally sound for inspection, but it is emotionally flat: table-first, low-drama, and visually interchangeable with internal tooling. The redesign should make the product feel like a purpose-built forensic environment for agent regression work.

The new direction is `Regression Atlas`: a cinematic, industrial-lab interface organized around a scenario graph instead of a run list. The graph becomes the primary orienting surface. Runs, suite batches, traces, tool calls, evaluator results, and comparisons all attach back to scenarios as evidence artifacts. The visual language should feel measured, severe, and memorable without becoming decorative noise.

**Primary goals:**
- Make the UI feel unmistakably like Agent Regression Lab
- Center the product around scenario topology rather than flat lists
- Preserve efficient access to run inspection and compare workflows
- Introduce unconventional but legible structures and components
- Give the product a durable visual identity that can scale beyond alpha

**Success criteria:**
- The home screen is graph-led and immediately communicates the regression surface
- Users can still reach run details and compare flows quickly without hunting
- Distinctive components and layout choices create product-specific character
- The interface is visually bold, but still usable for repeated technical inspection
- Desktop feels cinematic; mobile remains navigable and coherent

---

## 2. Product Narrative

The interface should feel like a machine for mapping behavioral stability. Users are not browsing rows. They are reading a terrain of workflows, incidents, and behavior changes. The product story becomes:

1. identify the workflow area under scrutiny
2. inspect scenario activity and recent instability
3. open a scenario as a case surface
4. move through evidence, traces, and comparisons with minimal context switching

This narrative is important because it informs the layout. The graph is not a decorative hero banner. It is the product's main mental model.

---

## 3. Design Principles

### Scenario First

The UI should orient users around scenarios and suites before individual runs. Runs are evidence attached to a scenario node, not the top-level object.

### Forensic, Not Futuristic

The visual system should feel industrial and analytical, not sci-fi. Use instrument-like framing, measured geometry, etched dividers, and calibrated accents instead of glowing fantasy surfaces.

### Cinematic Density

The interface should give important objects room to breathe. Large layout gestures, dramatic negative space, and anchored focal regions should replace dense dashboard packing.

### Deliberate Unconventionality

Each major component should have a structural reason to exist beyond styling. The UI should not rely on generic cards, standard KPI strips, or bland tabs where a more product-specific pattern would clarify the product.

### Operational Fallbacks

Even with a graph-led experience, table and list views remain easy to reach. Familiar views are secondary tools, not removed capabilities.

---

## 4. Information Architecture

### Primary Modes

Replace the current route feel of "runs, detail, compare" with four persistent product modes:

- `Atlas`: scenario graph home and main navigation surface
- `Cases`: focused scenario or run inspection view
- `Compare`: run-to-run and suite-to-suite staging plus comparison results
- `Archive`: tabular history, filters, and fallback list workflows

These modes can still map to the existing routes initially, but the UI should present them as coherent product territories.

### Persistent Layout Frame

The redesign should use a three-zone desktop shell:

- `Left rail`: compact mode navigation, project identity, filter chips, and view toggles
- `Center stage`: atlas canvas or primary evidence composition
- `Right drawer`: contextual evidence rail that updates with selection

An additional `bottom staging strip` should appear when users pin runs or suite batches for compare. This is a signature product behavior, not a temporary toast.

### Route Strategy

Existing routes can remain for implementation simplicity, but each one should visually belong to the new structure:

- `/` becomes `Atlas`
- `/runs/:id` becomes `Cases`
- `/compare` becomes `Compare`
- `/compare-suite` becomes `Compare`

The `Archive` mode can initially be a view inside `/` or a derived state rather than a new route if needed.
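
The route-to-mode mapping listed under Route Strategy can be sketched as a small resolver. This is a minimal illustration, not the package's router: `ProductMode`, `modeForPath`, and the `archiveView` flag are hypothetical names, and the flag models `Archive` as a derived state of `/` rather than a separate route.

```typescript
// Hypothetical names: ProductMode and modeForPath are illustrative only.
type ProductMode = "Atlas" | "Cases" | "Compare" | "Archive";

// Resolve a pathname to the product mode that frames it.
// `archiveView` models Archive as a derived state of "/" (an assumption).
function modeForPath(pathname: string, archiveView: boolean = false): ProductMode | null {
  if (pathname === "/") {
    return archiveView ? "Archive" : "Atlas";
  }
  if (/^\/runs\/[^/]+$/.test(pathname)) {
    return "Cases";
  }
  if (pathname === "/compare" || pathname === "/compare-suite") {
    return "Compare";
  }
  return null; // unknown route: the caller decides how to handle it
}
```

Keeping the mapping in one function lets the left rail highlight the active mode without each route component knowing about the others.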

---

## 5. Core Screens

### Atlas Home

This is the new landing experience.

**Structure:**
- oversized atlas canvas occupies most of the viewport
- suite fields appear as bounded territories or chambers
- scenario nodes float inside those fields with status, volatility, and freshness signals
- a slim incident ribbon or suite activity rail appears near the top
- selected node opens the right evidence drawer

**Behavior:**
- hovering a node reveals recent runs, latest verdict, and compare affordances
- clicking a node locks selection and loads evidence in the drawer
- clicking a suite field filters focus to that family and reshapes the graph
- filters in the left rail update graph state without collapsing it into a flat table

**Purpose:**
The home screen should answer "where is risk clustering?" before it answers "what happened in this single run?"
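
The Atlas Home behavior rules (hover previews, click locks selection, suite click narrows focus) could be modeled as a small state reducer. The sketch below is one possible shape under stated assumptions: `AtlasState`, `AtlasAction`, and `reduceAtlas` are hypothetical names, and toggling suite focus off on a second click is an assumption the spec does not state.

```typescript
// Hypothetical state and action shapes; none of these names come from the package.
interface AtlasState {
  hoveredScenarioId: string | null;  // drives the hover preview
  selectedScenarioId: string | null; // locked selection shown in the evidence drawer
  focusedSuite: string | null;       // suite field the graph is narrowed to
}

type AtlasAction =
  | { kind: "hoverNode"; scenarioId: string | null }
  | { kind: "clickNode"; scenarioId: string }
  | { kind: "clickSuite"; suite: string };

const initialAtlasState: AtlasState = {
  hoveredScenarioId: null,
  selectedScenarioId: null,
  focusedSuite: null,
};

function reduceAtlas(state: AtlasState, action: AtlasAction): AtlasState {
  if (action.kind === "hoverNode") {
    // Hover only previews; it never replaces a locked selection.
    return { ...state, hoveredScenarioId: action.scenarioId };
  }
  if (action.kind === "clickNode") {
    return { ...state, selectedScenarioId: action.scenarioId };
  }
  // Remaining case: clickSuite.
  // Assumption: clicking the already focused suite clears the focus.
  const nextFocus = state.focusedSuite === action.suite ? null : action.suite;
  return { ...state, focusedSuite: nextFocus };
}
```

Because hover and selection live in separate fields, the drawer can stay pinned to the clicked scenario while the pointer moves across neighboring nodes.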

### Case Surface

This replaces the plain run detail page.

**Structure:**
- top region: scenario identity, verdict totem, runtime metadata, and relationship breadcrumbs
- center region: trace fracture view and final output
- right region: evaluator evidence, tool fingerprints, and error fragments
- lower region: related runs, neighboring scenarios, and quick compare actions

**Purpose:**
Make a run feel like a technical case file anchored to its scenario, not a long list of sections.

### Compare Chamber

This replaces the standard compare page with a more dramatic inspection composition.

**Structure:**
- comparison header framed like an instrument readout
- baseline and candidate displayed as mirrored evidence columns
- centerline shows verdict delta, runtime drift, step drift, and classification notes
- tool and evaluator diffs appear as forensic comparison plates rather than generic cards

**Purpose:**
Turn compare into the product's most convincing high-stakes workflow, since comparison is central to the product story.

### Archive View

This keeps the current utility but changes its role.

**Structure:**
- table lives in a controlled archival mode with stronger typography and filtering
- grouped rows, pinned compare actions, and scenario-first sorting
- visual ties back to atlas selections

**Purpose:**
Preserve fast scanning and operational familiarity without letting the table define the product's identity.

---

## 6. Signature Components

### Scenario Atlas

The atlas is the primary hero object. It should not look like a stock node graph or mind map.

**Rules:**
- suites render as contained regions with distinct edge treatments
- scenarios render as shaped nodes, not identical circles
- node appearance reflects scenario role, status mix, and recent activity
- active paths between nodes should imply workflow adjacency or shared incident lineage
- graph motion should be slow, deliberate, and subtle, like a calibration field

**Visual cues:**
- pulse rings for recent runs
- scored scars or notches for historical failures
- thin route lines for scenario adjacency
- region labels etched into the background plane instead of floating as badges

### Verdict Totem

Replace small pills as the main status expression with a larger instrument-style marker.

**Rules:**
- verdict totems combine shape, contrast, and minimal text
- pass, fail, error, and neutral should differ in silhouette, not only color
- totems appear in detail and compare views as anchors for visual scanning
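
The silhouette rule for verdict totems lends itself to a simple lookup plus a guard that catches two verdicts sharing a shape. The sketch below is illustrative only: the silhouette names (`ring`, `slab`, `wedge`, `bar`) and labels are placeholders invented for the example, since the spec requires distinct shapes but does not name them.

```typescript
// Hypothetical names and shapes; the spec only requires that verdicts differ in silhouette.
type RunVerdict = "pass" | "fail" | "error" | "neutral";

interface TotemSpec {
  silhouette: "ring" | "slab" | "wedge" | "bar"; // outline shape, assumed for illustration
  label: string;                                  // minimal text shown on the totem
}

const TOTEMS: Record<RunVerdict, TotemSpec> = {
  pass: { silhouette: "ring", label: "PASS" },
  fail: { silhouette: "slab", label: "FAIL" },
  error: { silhouette: "wedge", label: "ERR" },
  neutral: { silhouette: "bar", label: "N/A" },
};

// Returns true when no two verdicts share a silhouette.
function hasDistinctSilhouettes(totems: Record<RunVerdict, TotemSpec>): boolean {
  const seen: string[] = [];
  const verdicts = Object.keys(totems) as RunVerdict[];
  for (const verdict of verdicts) {
    const shape = totems[verdict].silhouette;
    if (seen.indexOf(shape) !== -1) {
      return false;
    }
    seen.push(shape);
  }
  return true;
}
```

A check like `hasDistinctSilhouettes(TOTEMS)` could run in a unit test so that a later restyle cannot silently collapse two verdicts into the same outline.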

### Evidence Drawer

The right-side drawer is always contextual and layered.

**Contents can include:**
- latest run facts
- evaluator stack
- tool activity summary
- error fragment
- compare pin action
- related scenario paths

This drawer should feel like a lab sidecar, not a modal sidebar.

### Comparison Staging Strip

Pinned runs and suite batches live in a visible bottom strip before users enter compare mode.

**Rules:**
- items can be added from atlas nodes, archive rows, and case surfaces
- the strip visually suggests assembly, pairing, and readiness
- once two compatible artifacts are pinned, compare becomes a primary action
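
The staging strip rules can be sketched as two pure functions: one that pins an artifact and one that decides whether compare is ready. This is a minimal sketch under assumptions the spec leaves open: "compatible" is taken to mean "same artifact kind", the strip holds at most two items and drops the oldest pin when full, and runs are assumed to go to `/compare` while suite batches go to `/compare-suite`. All type and function names are hypothetical.

```typescript
// Hypothetical shapes; "compatible" is assumed to mean "same artifact kind".
type ArtifactKind = "run" | "suiteBatch";

interface PinnedArtifact {
  kind: ArtifactKind;
  id: string;
}

// Add an artifact to the strip, ignoring duplicates and keeping at most two items.
function pinArtifact(strip: PinnedArtifact[], item: PinnedArtifact): PinnedArtifact[] {
  const alreadyPinned = strip.some((p) => p.kind === item.kind && p.id === item.id);
  if (alreadyPinned) {
    return strip;
  }
  // Assumption: when the strip is full, the oldest pin is dropped.
  return strip.concat([item]).slice(-2);
}

// Return the compare route once two compatible artifacts are pinned, else null.
function compareRouteFor(strip: PinnedArtifact[]): string | null {
  const first = strip[0];
  const second = strip[1];
  if (strip.length !== 2 || !first || !second || first.kind !== second.kind) {
    return null;
  }
  return first.kind === "run" ? "/compare" : "/compare-suite";
}
```

A non-null result from `compareRouteFor` is the signal for the strip to promote compare to a primary action.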

### Trace Fracture View

Trace events should be rendered as segmented evidence seams.

**Rules:**
- steps read as a vertical or angled fracture line
- each event opens into a chamber with payload details
- tool calls, evaluator outputs, and errors use differentiated chamber treatments

This preserves current trace data while making the inspection experience feel authored.

---

## 7. Visual System

### Material Direction

Use an industrial-lab language:

- graphite, iron, mineral white, steel blue-gray
- oxidized red-orange for primary accent
- signal amber for active attention
- acid-lime used sparingly for positive calibration moments

Avoid the current warm parchment palette. The redesign should feel sharper, cooler, and more engineered.

### Typography

Use a paired system:

- headline and large labels: condensed or narrow grotesk with authority
- metadata, codes, ids, and micro-labels: technical mono

Type should do more structural work than color. Region labels, mode labels, and evidence titles should feel engraved and deliberate.

### Geometry

Prefer:

- clipped corners
- asymmetrical panel cuts
- inset borders
- region frames
- calibrated grid alignments

Avoid:

- default rounded dashboard cards
- generic pill-heavy surfaces
- equal-weight repeated rectangles

### Background System

The background should be an active atmospheric layer:

- subtle field gradients
- technical grid ghosts
- radial inspection glows behind important objects
- occasional sweep lines or scanning overlays

These layers should support depth, not distract from text legibility.

---

## 8. Motion

Motion should feel like instrumentation coming online.

**Principles:**
- slower entry motion with intention
- staggered reveals for regions and evidence layers
- atlas pulses and route-line drift should be ambient, not constant
- drawer transitions should feel mechanical and precise
- compare transitions should emphasize lock-in and alignment

Avoid generic hover bounce, playful springiness, and ornamental motion noise.

---

## 9. Responsive Behavior

### Desktop

Desktop gets the full three-zone experience (left rail, atlas center stage, right drawer) plus the bottom staging strip.

### Tablet

Tablet keeps atlas first, but the evidence drawer becomes collapsible and the staging strip becomes a compact horizontal tray.

### Mobile

Mobile should not attempt to preserve the full atlas as-is.

Instead:
- atlas becomes a vertically stacked scenario field navigator
- selected scenario opens a focused case sheet
- compare staging becomes a docked drawer
- archive table becomes stacked list rows with strong grouping

The mobile goal is continuity of identity, not literal desktop parity.
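
The three responsive tiers can be expressed as a single function from viewport width to a layout plan. The sketch below is illustrative: the breakpoint values (768 and 1200 CSS pixels) are assumptions, since the spec names the tiers but not their widths, and `LayoutPlan` and `layoutForWidth` are hypothetical names.

```typescript
// Hypothetical breakpoints in CSS pixels; the spec does not define them.
const TABLET_MIN_WIDTH = 768;
const DESKTOP_MIN_WIDTH = 1200;

interface LayoutPlan {
  tier: "mobile" | "tablet" | "desktop";
  atlas: "graph" | "stackedNavigator";
  evidence: "drawer" | "collapsibleDrawer" | "caseSheet";
  staging: "bottomStrip" | "compactTray" | "dockedDrawer";
}

// Pick the layout plan for a given viewport width.
function layoutForWidth(widthPx: number): LayoutPlan {
  if (widthPx >= DESKTOP_MIN_WIDTH) {
    return { tier: "desktop", atlas: "graph", evidence: "drawer", staging: "bottomStrip" };
  }
  if (widthPx >= TABLET_MIN_WIDTH) {
    return { tier: "tablet", atlas: "graph", evidence: "collapsibleDrawer", staging: "compactTray" };
  }
  return { tier: "mobile", atlas: "stackedNavigator", evidence: "caseSheet", staging: "dockedDrawer" };
}
```

Returning a plan object rather than a bare tier name keeps each zone's mobile substitute explicit, which matches the goal of designing mobile intentionally.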

---

## 10. Mapping To Existing Product Data

The redesign should be driven by data the product already has.

### Atlas Inputs

The current run list can derive:
- scenario ids
- suite grouping
- status distribution
- latest provider
- freshness by started time

This is enough for an initial atlas without inventing new backend APIs.
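
To make the Atlas Inputs derivation concrete, here is a minimal sketch that folds a flat run list into one node per scenario. The `RunSummary` field names are assumptions for illustration and may not match the package's actual run list payload; `ScenarioNode` and `buildScenarioNodes` are likewise hypothetical.

```typescript
// Hypothetical run-list row; field names are assumptions, not the real payload.
interface RunSummary {
  scenarioId: string;
  suite: string;
  status: string;    // e.g. "pass", "fail", "error"
  provider: string;
  startedAt: string; // ISO 8601 timestamp
}

interface ScenarioNode {
  scenarioId: string;
  suite: string;
  statusCounts: Record<string, number>; // status distribution
  latestProvider: string;
  latestStartedAt: string;              // freshness signal
}

// Group runs by scenario and keep the facts the atlas needs per node.
function buildScenarioNodes(runs: RunSummary[]): ScenarioNode[] {
  const nodes: ScenarioNode[] = [];
  const byScenario: Record<string, ScenarioNode> = {};
  for (const run of runs) {
    let node = byScenario[run.scenarioId];
    if (!node) {
      node = {
        scenarioId: run.scenarioId,
        suite: run.suite,
        statusCounts: {},
        latestProvider: run.provider,
        latestStartedAt: run.startedAt,
      };
      byScenario[run.scenarioId] = node;
      nodes.push(node);
    }
    node.statusCounts[run.status] = (node.statusCounts[run.status] || 0) + 1;
    // Keep provider and timestamp from the most recently started run.
    if (Date.parse(run.startedAt) >= Date.parse(node.latestStartedAt)) {
      node.latestProvider = run.provider;
      node.latestStartedAt = run.startedAt;
    }
  }
  return nodes;
}
```

Suite fields then fall out of grouping the returned nodes by `suite`, so this first pass needs no graph-specific backend endpoint.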

### Case Surface Inputs

The current run detail payload already supports:
- run metadata
- evaluator results
- tool calls
- trace events
- error detail

The redesign should remap this into stronger presentation before asking for major data model changes.

### Compare Inputs

The current compare payload already supports:
- classification
- verdict delta
- termination delta
- output changed
- notes
- evaluator diffs
- tool diffs

This is sufficient to create a more authored compare chamber.
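
As a rough illustration of how the compare fields listed above could feed the chamber's centerline, the sketch below declares a payload type and builds a short readout from it. The field names mirror the list, but the concrete types (strings, a boolean, arrays of diff records) are assumptions rather than the package's actual schema, and `ComparePayloadSketch` and `centerlineReadout` are hypothetical names.

```typescript
// Hypothetical payload shape: names follow the list above, types are assumed.
interface DiffEntrySketch {
  subject: string;   // evaluator or tool name
  baseline: string;
  candidate: string;
}

interface ComparePayloadSketch {
  classification: string;
  verdictDelta: string;
  terminationDelta: string;
  outputChanged: boolean;
  notes: string[];
  evaluatorDiffs: DiffEntrySketch[];
  toolDiffs: DiffEntrySketch[];
}

// Build the short centerline readout for the compare chamber.
function centerlineReadout(payload: ComparePayloadSketch): string[] {
  const lines = [
    "classification: " + payload.classification,
    "verdict: " + payload.verdictDelta,
    "termination: " + payload.terminationDelta,
    "output: " + (payload.outputChanged ? "changed" : "unchanged"),
    "evaluator diffs: " + payload.evaluatorDiffs.length,
    "tool diffs: " + payload.toolDiffs.length,
  ];
  return lines.concat(payload.notes.map((note) => "note: " + note));
}
```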

---

## 11. Risks And Guardrails

### Risk: Style Over Utility

If the graph becomes visually impressive but slower than the current list for common workflows, the redesign fails. The archive and compare-entry affordances must remain obvious and fast.

### Risk: Graph Complexity Without Real Meaning

If node placement and connections feel arbitrary, the atlas becomes decorative. The visual logic for grouping, adjacency, and activity must be explicit and stable.

### Risk: Overusing Accent Colors

A forensic interface loses credibility when every object screams. Reserve strong accents for verdicts, active selection, and meaningful change.

### Risk: Mobile Collapse

The cinematic desktop concept must degrade into a simplified but still authored mobile flow. Mobile should be designed intentionally, not compressed from desktop at the end.

---

## 12. Testing Strategy

The redesign implementation should be validated at three levels:

- structural tests for route rendering and core state transitions
- visual smoke validation for atlas, case, compare, and archive layouts
- interaction checks for selection, compare staging, and responsive behavior

Specific implementation tests belong in the implementation plan, but the design assumes test coverage for both data mapping and layout resilience.

---

## 13. Initial Implementation Scope

To keep the redesign shippable, the first implementation pass should include:

- new shell and mode framing
- atlas home driven by existing run list data
- redesigned case surface for run detail
- redesigned compare chamber for run and suite compare
- archive fallback view
- core visual system, typography, and motion primitives

Out of scope for the first pass:

- editable graph topology
- advanced graph simulation
- user-customizable atlas layouts
- backend schema changes purely for design flourish

---

## 14. Final Recommendation

Implement the UI redesign as `Regression Atlas`: a bold forensic interface where the scenario graph is the product's center of gravity, compare workflows feel like deliberate technical analysis, and every surface reinforces the identity of a local-first regression lab.

The redesign should be memorable because it is structurally committed, not because it is visually loud. The product should feel like an instrument teams trust when behavior changes matter.