agent-regression-lab 0.3.0 → 0.5.0
- package/README.md +25 -4
- package/bin/agentlab.js +2 -0
- package/dist/config.js +13 -9
- package/dist/index.js +14 -0
- package/dist/init.js +88 -0
- package/dist/tools.js +18 -2
- package/dist/ui/App.js +49 -7
- package/dist/ui-assets/client.css +1108 -116
- package/dist/ui-assets/client.js +863 -426
- package/docs/coding-agents.md +74 -0
- package/docs/superpowers/plans/2026-04-13-phase-2-lite-phase-3-plan.md +160 -0
- package/docs/superpowers/plans/2026-04-13-phase-one-npm-tools-plan.md +502 -0
- package/docs/superpowers/plans/2026-04-16-regression-atlas-ui-redesign.md +1010 -0
- package/docs/superpowers/specs/2026-04-13-phase-2-lite-phase-3-design.md +164 -0
- package/docs/superpowers/specs/2026-04-16-regression-atlas-ui-redesign-design.md +417 -0
- package/docs/tools.md +34 -3
- package/docs/troubleshooting.md +55 -0
- package/examples/coding-tools/README.md +21 -0
- package/examples/coding-tools/index.js +11 -0
- package/examples/coding-tools/package.json +8 -0
- package/examples/support-tools/README.md +21 -0
- package/examples/support-tools/index.js +8 -0
- package/examples/support-tools/package.json +8 -0
- package/package.json +6 -4
@@ -0,0 +1,74 @@
+# Coding Agents
+
+ARL supports coding-agent regression workflows through deterministic task scenarios.
+
+Use this path when the runner should remain authoritative for:
+
+- file inspection tools
+- patch application tools
+- step limits
+- regression scoring
+
+## Start With The Built-In Coding Scenarios
+
+This repo already includes two coding scenarios:
+
+- `coding.fix-add-function`
+- `coding.update-greeting`
+
+Run one directly:
+
+```bash
+agentlab run coding.fix-add-function --agent mock-default
+```
+
+These scenarios use fixture-backed repo tools, which makes them useful for:
+
+- prompt changes
+- model comparisons
+- patch-discipline checks
+- pre-merge behavioral regression checks
+
+## Why This Matters
+
+Coding agents often regress in subtle ways:
+
+- they inspect too much of the repo
+- they patch the wrong file
+- they over-edit instead of making a narrow change
+- they stop naming the changed file clearly
+
+ARL helps by making those expectations explicit in scenario evaluators.
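To make "explicit expectations" concrete, here is a minimal sketch of a patch-discipline check written against a recorded list of tool calls. It is illustrative only: the `ToolCall` shape, the tool names `read_file` and `apply_patch`, and the `evaluatePatchDiscipline` function are hypothetical and are not taken from ARL's evaluator API, which this diff does not show.

```typescript
// Hypothetical trace event shape; ARL's real trace schema may differ.
interface ToolCall {
  tool: "read_file" | "apply_patch";
  path: string;
}

// Hypothetical expectation block for one coding scenario.
interface PatchExpectation {
  allowedPatchPaths: string[]; // files the agent is expected to change
  maxFilesInspected: number; // upper bound on distinct files read
}

interface EvaluationResult {
  passed: boolean;
  failures: string[];
}

// Check the regressions listed above: over-inspection, wrong-file patches,
// and runs that never apply a patch at all.
function evaluatePatchDiscipline(
  calls: ToolCall[],
  expected: PatchExpectation,
): EvaluationResult {
  const failures: string[] = [];
  const inspected = Array.from(
    new Set(calls.filter((c) => c.tool === "read_file").map((c) => c.path)),
  );
  const patched = Array.from(
    new Set(calls.filter((c) => c.tool === "apply_patch").map((c) => c.path)),
  );

  if (inspected.length > expected.maxFilesInspected) {
    failures.push(
      `inspected ${inspected.length} files, limit is ${expected.maxFilesInspected}`,
    );
  }
  for (const path of patched) {
    if (!expected.allowedPatchPaths.includes(path)) {
      failures.push(`patched unexpected file: ${path}`);
    }
  }
  if (patched.length === 0) {
    failures.push("no patch was applied");
  }
  return { passed: failures.length === 0, failures };
}
```

Because the check is a pure function over the recorded calls, the same trace always produces the same verdict, which is the property deterministic regression scoring depends on.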
+
+## Minimal Workflow
+
+1. run one coding scenario locally
+2. inspect the run output and trace
+3. run it again against a changed prompt/model/agent variant
+4. compare the two runs
+
+Example:
+
+```bash
+agentlab run coding.fix-add-function --agent mock-default
+agentlab run coding.fix-add-function --agent mock-default
+agentlab compare <baseline-run-id> <candidate-run-id>
+```
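The comparison step in the workflow above can be pictured as classifying a candidate run against its baseline. The sketch below is only an illustration of that idea: `RunSummary`, `Classification`, and `classifyRun` are hypothetical names, and the actual output of `agentlab compare` is not shown in this diff.

```typescript
// Hypothetical run summary; ARL's stored run format may differ.
type RunStatus = "pass" | "fail" | "error";

interface RunSummary {
  runId: string;
  status: RunStatus;
}

type Classification = "regression" | "improvement" | "unchanged";

// Classify a candidate run relative to its baseline by pass/non-pass status.
function classifyRun(
  baseline: RunSummary,
  candidate: RunSummary,
): Classification {
  const baselinePassed = baseline.status === "pass";
  const candidatePassed = candidate.status === "pass";
  if (baselinePassed && !candidatePassed) return "regression";
  if (!baselinePassed && candidatePassed) return "improvement";
  return "unchanged";
}

const baselineRun: RunSummary = { runId: "run-a", status: "pass" };
const candidateRun: RunSummary = { runId: "run-b", status: "fail" };
console.log(classifyRun(baselineRun, candidateRun)); // prints "regression"
```

This sketch treats `fail` and `error` alike as non-passing; a real comparison would likely also look at evaluator and tool-call differences.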
+
+## When To Use Task Scenarios Versus HTTP
+
+Use task scenarios for coding agents when:
+
+- you want deterministic fixture-backed tools
+- you want ARL to own the tool loop
+- you want reproducible patch-evaluator behavior
+
+Use HTTP/conversation scenarios only when the coding agent already exists as a running service and owns its own orchestration internally.
+
+## Next Step
+
+If you want coding-agent checks in team workflows, pair these scenarios with suite definitions and CI:
+
+```bash
+agentlab run --suite-def pre_merge --agent mock-default
+```
@@ -0,0 +1,160 @@
+# Phase 2 Lite And Phase 3 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Deliver a minimal integration story for new users, then improve the UI enough that ARL is easier to demo, screenshot, and understand visually.
+
+**Architecture:** Keep Phase 2-lite focused on assets that clarify adoption: README routing, one coding-agent path, and one CI path. Keep Phase 3 focused on UI clarity instead of new product surface area by improving the runs dashboard, comparison screens, and trace presentation inside the existing React UI.
+
+**Tech Stack:** TypeScript, React, node:test, esbuild, Markdown, GitHub Actions YAML
+
+---
+
+## File Map
+
+**Roadmap and product docs**
+- Modify: `.claude/active-tasks.md`
+- Modify: `.claude/project.md`
+- Modify: `README.md`
+
+**Phase 2-lite assets**
+- Create: `docs/coding-agents.md`
+- Create: `.github/workflows/agentlab-pre-merge.yml`
+
+**UI**
+- Modify: `src/ui/App.tsx`
+- Modify: `src/ui/styles.css`
+
+**Tests**
+- Modify: `tests/launch/ui-smoke.test.ts`
+
+---
+
+### Task 1: Reframe Roadmap To Phase 2-lite Then Phase 3
+
+**Files:**
+- Modify: `.claude/active-tasks.md`
+- Modify: `.claude/project.md`
+
+- [ ] Update active task tracking so the current next phase is `Phase 2-lite`, not the original full Phase 2.
+- [ ] Update project memory so Phase 2-lite is the minimal integration-story pass and Phase 3 is the main visual/demo workstream.
+- [ ] Keep the scope explicit:
+  - HTTP example via `arl-test`
+  - CI example
+  - coding-agent example
+  - then UI polish
+
+Verification:
+
+```bash
+rg -n "Phase 2-lite|Phase 3" .claude/active-tasks.md .claude/project.md
+```
+
+---
+
+### Task 2: Add Phase 2-lite Integration Assets
+
+**Files:**
+- Modify: `README.md`
+- Create: `docs/coding-agents.md`
+- Create: `.github/workflows/agentlab-pre-merge.yml`
+
+- [ ] Add README routing sections:
+  - if your agent runs as an HTTP service
+  - if you are validating coding-agent changes
+  - if you want pre-merge regression checks in CI
+- [ ] Add one coding-agent guide using the existing coding scenarios and current tool-loop model.
+- [ ] Add one GitHub Actions example that runs:
+
+```bash
+npm ci
+npm run build
+node dist/index.js run --suite-def pre_merge --agent mock-default
+```
+
+- [ ] Keep this section narrow and copy-pasteable. No broad framework matrix.
+
+Verification:
+
+```bash
+rg -n "HTTP service|coding-agent|pre-merge|GitHub Actions" README.md docs/coding-agents.md .github/workflows/agentlab-pre-merge.yml
+```
+
+---
+
+### Task 3: Improve Runs Dashboard And Comparison UX
+
+**Files:**
+- Modify: `src/ui/App.tsx`
+- Modify: `src/ui/styles.css`
+- Modify: `tests/launch/ui-smoke.test.ts`
+
+- [ ] Add a stronger runs dashboard summary at the top of the list page:
+  - total runs shown
+  - pass/fail/error counts
+  - most recent suite/context hint
+- [ ] Redesign the compare page to make regressions visually obvious:
+  - top classification banner
+  - clearer delta cards
+  - evaluator/tool diff blocks with stronger hierarchy
+  - more obvious baseline vs candidate sections
+- [ ] Make the suite compare page easier to scan:
+  - headline regression/improvement counts
+  - clearer scenario groupings
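The dashboard summary described in the first checklist item above (total runs, pass/fail/error counts, most recent suite hint) could be derived from the listed runs roughly as follows. This is a sketch under stated assumptions: `RunRow`, `DashboardSummary`, and `summarizeRuns` are hypothetical names, the rows are assumed to arrive newest-first, and none of this is taken from the actual `src/ui/App.tsx`.

```typescript
// Hypothetical row shape for the runs list; not ARL's real data model.
type RunStatus = "pass" | "fail" | "error";

interface RunRow {
  id: string;
  status: RunStatus;
  suite?: string; // present only when the run belonged to a suite
}

interface DashboardSummary {
  total: number;
  counts: Record<RunStatus, number>;
  latestSuite: string | undefined;
}

// Assumes `runs` is ordered newest-first, so the first row that has a
// suite name supplies the "most recent suite" hint.
function summarizeRuns(runs: RunRow[]): DashboardSummary {
  const counts: Record<RunStatus, number> = { pass: 0, fail: 0, error: 0 };
  for (const run of runs) {
    counts[run.status] += 1;
  }
  const latestWithSuite = runs.find((run) => run.suite !== undefined);
  return {
    total: runs.length,
    counts,
    latestSuite: latestWithSuite?.suite,
  };
}
```

Keeping the summary a pure function of the rows already on screen means the header cannot disagree with the list below it, and it can be covered by a plain unit test without rendering the UI.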
+
+Verification:
+
+```bash
+npx tsx --test tests/launch/ui-smoke.test.ts
+```
+
+---
+
+### Task 4: Improve Trace And Detail Presentation
+
+**Files:**
+- Modify: `src/ui/App.tsx`
+- Modify: `src/ui/styles.css`
+- Modify: `tests/launch/ui-smoke.test.ts`
+
+- [ ] Replace the plain trace list with a more intentional timeline treatment:
+  - event badges or type labels
+  - stronger step grouping
+  - clearer source metadata
+- [ ] Keep failure-first behavior intact.
+- [ ] Preserve readability on narrow screens.
+
+Verification:
+
+```bash
+npx tsx --test tests/launch/ui-smoke.test.ts
+```
+
+---
+
+### Task 5: Full Verification
+
+**Files:**
+- Modify only if verification exposes issues
+
+- [ ] Run focused UI/docs-related verification:
+
+```bash
+npx tsx --test tests/launch/ui-smoke.test.ts tests/cliPackaging.test.ts
+```
+
+- [ ] Run full suite:
+
+```bash
+npm test
+```
+
+- [ ] Run release gates:
+
+```bash
+npm run check
+npm run build
+npm run smoke:cli
+npm_config_cache=/tmp/agentlab-npm-cache npm pack --dry-run
+```
+