agent-regression-lab 0.3.0 → 0.5.0

@@ -0,0 +1,74 @@
+ # Coding Agents
+
+ ARL supports coding-agent regression workflows through deterministic task scenarios.
+
+ Use this path when the runner should remain authoritative for:
+
+ - file inspection tools
+ - patch application tools
+ - step limits
+ - regression scoring
+
+ ## Start With The Built-In Coding Scenarios
+
+ This repo already includes two coding scenarios:
+
+ - `coding.fix-add-function`
+ - `coding.update-greeting`
+
+ Run one directly:
+
+ ```bash
+ agentlab run coding.fix-add-function --agent mock-default
+ ```
+
+ These scenarios use fixture-backed repo tools, which makes them useful for:
+
+ - prompt changes
+ - model comparisons
+ - patch-discipline checks
+ - pre-merge behavioral regression checks
+
+ ## Why This Matters
+
+ Coding agents often regress in subtle ways:
+
+ - they inspect too much of the repo
+ - they patch the wrong file
+ - they over-edit instead of making a narrow change
+ - they stop naming the changed file clearly
+
+ ARL helps by making those expectations explicit in scenario evaluators.
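
The regression patterns listed above can each be expressed as an explicit check. The following is a minimal sketch under stated assumptions, not ARL's actual evaluator API: the `RunRecord`, `Expectations`, and `CheckResult` shapes and the `evaluatePatchDiscipline` function are hypothetical names introduced only for illustration.

```typescript
// Hypothetical shapes: ARL's real trace and evaluator types may differ.
interface RunRecord {
  inspectedFiles: string[]; // files the agent read
  patchedFiles: string[];   // files the agent modified
  changedLines: number;     // total lines added or removed
  finalMessage: string;     // the agent's closing summary
}

interface Expectations {
  targetFile: string;
  maxInspectedFiles: number;
  maxChangedLines: number;
}

interface CheckResult {
  name: string;
  passed: boolean;
}

// One check per regression pattern: over-inspection, wrong file,
// over-editing, and not naming the changed file.
function evaluatePatchDiscipline(run: RunRecord, expected: Expectations): CheckResult[] {
  return [
    {
      name: "inspection-budget",
      passed: run.inspectedFiles.length <= expected.maxInspectedFiles,
    },
    {
      name: "patched-target-only",
      passed: run.patchedFiles.length === 1 && run.patchedFiles[0] === expected.targetFile,
    },
    {
      name: "narrow-edit",
      passed: run.changedLines <= expected.maxChangedLines,
    },
    {
      name: "names-changed-file",
      passed: run.finalMessage.indexOf(expected.targetFile) !== -1,
    },
  ];
}
```

Keeping each expectation as a separate named check means a failing run reports which behavior drifted, rather than a single pass/fail bit.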
+
+ ## Minimal Workflow
+
+ 1. run one coding scenario locally
+ 2. inspect the run output and trace
+ 3. run it again against a changed prompt/model/agent variant
+ 4. compare the two runs
+
+ Example:
+
+ ```bash
+ # Baseline run
+ agentlab run coding.fix-add-function --agent mock-default
+ # Candidate run, after changing the prompt, model, or agent variant
+ agentlab run coding.fix-add-function --agent mock-default
+ # Compare the two; substitute the IDs of the runs above
+ agentlab compare <baseline-run-id> <candidate-run-id>
+ ```
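
Conceptually, comparing a baseline run with a candidate run reduces to a small classification. The sketch below illustrates that idea only; it is not the logic behind `agentlab compare`. `RunSummary`, its fields, and `classifyComparison` are hypothetical, and the rule that a status flip outweighs a score change is an assumption made for the example.

```typescript
// Hypothetical summary of a single run; field names are assumptions.
interface RunSummary {
  runId: string;
  status: "pass" | "fail" | "error";
  score: number; // higher is better
  steps: number; // steps the agent used
}

type Classification = "regression" | "improvement" | "unchanged";

interface Comparison {
  classification: Classification;
  scoreDelta: number; // candidate minus baseline
  stepDelta: number;  // candidate minus baseline
}

function classifyComparison(baseline: RunSummary, candidate: RunSummary): Comparison {
  const scoreDelta = candidate.score - baseline.score;
  const stepDelta = candidate.steps - baseline.steps;
  const wasPassing = baseline.status === "pass";
  const isPassing = candidate.status === "pass";

  let classification: Classification = "unchanged";
  if (wasPassing !== isPassing) {
    // Assumption: a pass/non-pass flip outweighs any score movement.
    classification = isPassing ? "improvement" : "regression";
  } else if (scoreDelta !== 0) {
    classification = scoreDelta > 0 ? "improvement" : "regression";
  }
  return { classification, scoreDelta, stepDelta };
}
```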
+
+ ## When To Use Task Scenarios Versus HTTP
+
+ Use task scenarios for coding agents when:
+
+ - you want deterministic fixture-backed tools
+ - you want ARL to own the tool loop
+ - you want reproducible patch-evaluator behavior
+
+ Use HTTP/conversation scenarios only when the coding agent already exists as a running service and owns its orchestration internally.
+
+ ## Next Step
+
+ If you want coding-agent checks in team workflows, pair these scenarios with suite definitions and CI:
+
+ ```bash
+ agentlab run --suite-def pre_merge --agent mock-default
+ ```
@@ -0,0 +1,160 @@
+ # Phase 2 Lite And Phase 3 Implementation Plan
+
+ > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+ **Goal:** Deliver a minimal integration story for new users, then improve the UI enough that ARL is easier to demo, screenshot, and understand visually.
+
+ **Architecture:** Keep Phase 2-lite focused on assets that clarify adoption: README routing, one coding-agent path, and one CI path. Keep Phase 3 focused on UI clarity instead of new product surface area by improving the runs dashboard, comparison screens, and trace presentation inside the existing React UI.
+
+ **Tech Stack:** TypeScript, React, node:test, esbuild, Markdown, GitHub Actions YAML
+
+ ---
+
+ ## File Map
+
+ **Roadmap and product docs**
+ - Modify: `.claude/active-tasks.md`
+ - Modify: `.claude/project.md`
+ - Modify: `README.md`
+
+ **Phase 2-lite assets**
+ - Create: `docs/coding-agents.md`
+ - Create: `.github/workflows/agentlab-pre-merge.yml`
+
+ **UI**
+ - Modify: `src/ui/App.tsx`
+ - Modify: `src/ui/styles.css`
+
+ **Tests**
+ - Modify: `tests/launch/ui-smoke.test.ts`
+
+ ---
+
+ ### Task 1: Reframe Roadmap To Phase 2-lite Then Phase 3
+
+ **Files:**
+ - Modify: `.claude/active-tasks.md`
+ - Modify: `.claude/project.md`
+
+ - [ ] Update active task tracking so the next phase is `Phase 2-lite`, not the original full Phase 2.
+ - [ ] Update project memory so Phase 2-lite is the minimal integration-story pass and Phase 3 is the main visual/demo workstream.
+ - [ ] Keep the scope explicit:
+   - HTTP example via `arl-test`
+   - CI example
+   - coding-agent example
+   - then UI polish
+
+ Verification:
+
+ ```bash
+ rg -n "Phase 2-lite|Phase 3" .claude/active-tasks.md .claude/project.md
+ ```
+
+ ---
+
+ ### Task 2: Add Phase 2-lite Integration Assets
+
+ **Files:**
+ - Modify: `README.md`
+ - Create: `docs/coding-agents.md`
+ - Create: `.github/workflows/agentlab-pre-merge.yml`
+
+ - [ ] Add README routing sections:
+   - if your agent runs as an HTTP service
+   - if you are validating coding-agent changes
+   - if you want pre-merge regression checks in CI
+ - [ ] Add one coding-agent guide using the existing coding scenarios and current tool-loop model.
+ - [ ] Add one GitHub Actions example that runs:
+
+ ```bash
+ npm ci
+ npm run build
+ node dist/index.js run --suite-def pre_merge --agent mock-default
+ ```
+
+ - [ ] Keep this section narrow and copy-pasteable. No broad framework matrix.
+
+ Verification:
+
+ ```bash
+ rg -n "HTTP service|coding-agent|pre-merge|GitHub Actions" README.md docs/coding-agents.md .github/workflows/agentlab-pre-merge.yml
+ ```
+
+ ---
+
+ ### Task 3: Improve Runs Dashboard And Comparison UX
+
+ **Files:**
+ - Modify: `src/ui/App.tsx`
+ - Modify: `src/ui/styles.css`
+ - Modify: `tests/launch/ui-smoke.test.ts`
+
+ - [ ] Add a stronger runs dashboard summary at the top of the list page:
+   - total runs shown
+   - pass/fail/error counts
+   - most recent suite/context hint
+ - [ ] Redesign the compare page to make regressions visually obvious:
+   - top classification banner
+   - clearer delta cards
+   - evaluator/tool diff blocks with stronger hierarchy
+   - more obvious baseline vs candidate sections
+ - [ ] Make the suite compare page easier to scan:
+   - headline regression/improvement counts
+   - clearer scenario groupings
+
+ Verification:
+
+ ```bash
+ npx tsx --test tests/launch/ui-smoke.test.ts
+ ```
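
For the dashboard summary in Task 3 (total runs shown, pass/fail/error counts, most recent suite hint), the underlying aggregation could look roughly like this. It is a sketch only: `RunListItem`, `DashboardSummary`, and `summarizeRuns` are hypothetical names, and the real data model in `src/ui/App.tsx` may differ.

```typescript
// Hypothetical run-list row; the real UI data model may differ.
type RunStatus = "pass" | "fail" | "error";

interface RunListItem {
  id: string;
  status: RunStatus;
  suite?: string;
  startedAt: string; // assumed to be a UTC ISO 8601 timestamp
}

interface DashboardSummary {
  total: number;
  counts: Record<RunStatus, number>;
  mostRecentSuite: string | undefined;
}

function summarizeRuns(runs: RunListItem[]): DashboardSummary {
  const counts: Record<RunStatus, number> = { pass: 0, fail: 0, error: 0 };
  let latest: RunListItem | undefined;
  for (const run of runs) {
    counts[run.status] += 1;
    // UTC ISO 8601 strings in one consistent format sort lexicographically.
    if (latest === undefined || run.startedAt > latest.startedAt) {
      latest = run;
    }
  }
  return {
    total: runs.length,
    counts,
    mostRecentSuite: latest === undefined ? undefined : latest.suite,
  };
}
```

Computing the summary as a pure function of the visible run list keeps it easy to cover in a smoke test without rendering the UI.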
+
+ ---
+
+ ### Task 4: Improve Trace And Detail Presentation
+
+ **Files:**
+ - Modify: `src/ui/App.tsx`
+ - Modify: `src/ui/styles.css`
+ - Modify: `tests/launch/ui-smoke.test.ts`
+
+ - [ ] Replace the plain trace list with a more intentional timeline treatment:
+   - event badges or type labels
+   - stronger step grouping
+   - clearer source metadata
+ - [ ] Keep failure-first behavior intact.
+ - [ ] Preserve readability on narrow screens.
+
+ Verification:
+
+ ```bash
+ npx tsx --test tests/launch/ui-smoke.test.ts
+ ```
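
The step grouping and type badges described in Task 4 could be derived from a flat trace along these lines. This is an illustrative sketch: the `TraceEvent` schema, `StepGroup`, and `groupByStep` are hypothetical and do not reflect ARL's actual trace format.

```typescript
// Hypothetical trace event; ARL's real trace schema may differ.
interface TraceEvent {
  step: number;
  type: string;    // for example "tool_call"; rendered as a badge
  source: string;  // which component emitted the event
  message: string;
}

interface StepGroup {
  step: number;
  events: TraceEvent[];
  badges: string[]; // distinct event types, in first-seen order
}

function groupByStep(events: TraceEvent[]): StepGroup[] {
  const groups = new Map<number, StepGroup>();
  for (const event of events) {
    let group = groups.get(event.step);
    if (group === undefined) {
      group = { step: event.step, events: [], badges: [] };
      groups.set(event.step, group);
    }
    group.events.push(event);
    if (group.badges.indexOf(event.type) === -1) {
      group.badges.push(event.type);
    }
  }
  // Sort by step number so the timeline renders in order
  // even if events arrived out of sequence.
  return Array.from(groups.values()).sort((a, b) => a.step - b.step);
}
```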
+
+ ---
+
+ ### Task 5: Full Verification
+
+ **Files:**
+ - Modify only if verification exposes issues
+
+ - [ ] Run focused UI/docs-related verification:
+
+ ```bash
+ npx tsx --test tests/launch/ui-smoke.test.ts tests/cliPackaging.test.ts
+ ```
+
+ - [ ] Run full suite:
+
+ ```bash
+ npm test
+ ```
+
+ - [ ] Run release gates:
+
+ ```bash
+ npm run check
+ npm run build
+ npm run smoke:cli
+ npm_config_cache=/tmp/agentlab-npm-cache npm pack --dry-run
+ ```
+