agent-bober 0.4.3 → 0.5.1
- package/README.md +30 -0
- package/agents/bober-evaluator.md +277 -8
- package/agents/bober-generator.md +155 -0
- package/agents/bober-planner.md +70 -0
- package/dist/cli/commands/init.js +1 -0
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/evaluators/builtin/playwright.d.ts +11 -0
- package/dist/evaluators/builtin/playwright.d.ts.map +1 -1
- package/dist/evaluators/builtin/playwright.js +259 -12
- package/dist/evaluators/builtin/playwright.js.map +1 -1
- package/package.json +1 -1
- package/skills/bober.eval/SKILL.md +145 -148
- package/skills/bober.playwright/SKILL.md +429 -0
- package/skills/bober.playwright/references/playwright-patterns.md +377 -0
- package/skills/bober.run/SKILL.md +425 -118
- package/skills/bober.sprint/SKILL.md +147 -57
- package/templates/presets/nextjs/bober.config.json +2 -1
@@ -1,6 +1,6 @@
 ---
 name: bober.eval
-description:
+description: Spawn an evaluator subagent to independently assess the current sprint state against its contract, producing structured pass/fail feedback.
 argument-hint: "[contract-id]"
 handoffs:
 - label: "Rework Sprint"
@@ -11,13 +11,15 @@ handoffs:
 prompt: "Move to the next sprint"
 ---
 
-# bober.eval — Standalone Evaluation
+# bober.eval — Standalone Evaluation Orchestrator
 
-You are
+You are the **orchestrator** for a standalone evaluation run. You do NOT evaluate the code yourself. You spawn the evaluator as a subagent using the **Agent tool**, then process and save its results.
+
+The evaluator subagent runs in its own isolated context window. It receives ONLY the information you explicitly pass in its prompt.
 
 ## When to Use This Skill
 
-- **During development:** To check
+- **During development:** To check progress against criteria before running the full sprint loop
 - **After manual changes:** When you have fixed something the Generator produced and want to re-evaluate
 - **For debugging:** To understand exactly what is passing and failing in a sprint
 - **As a standalone QA check:** To evaluate any codebase state against a sprint contract
@@ -36,171 +38,167 @@ You are running the **bober.eval** skill. Your job is to independently evaluate
 
 Read the contract and its parent PlanSpec.
 
-## Step 2:
+## Step 2: Gather Context
 
 Read `bober.config.json` and extract:
 - `evaluator.strategies`: The configured evaluation strategies
 - `evaluator.model`: The model to use (informational)
 - `commands`: The project commands for build, test, lint, typecheck
 
-
-
-Before running evaluation strategies, verify the environment:
-
-1. **Check if dependencies are installed:**
-```bash
-# Check for installed dependencies (varies by stack)
-# Node.js: ls node_modules/.package-lock.json 2>/dev/null
-# Rust/Anchor: check target/ directory
-# Solidity/Hardhat: ls node_modules/.package-lock.json 2>/dev/null
-# Solidity/Foundry: check lib/ directory
-# Python: check venv or .venv
-```
-If dependencies are not installed, run the configured install command first.
-
-2. **Check the current git branch:**
-```bash
-git branch --show-current
-```
-Note the branch for the evaluation report.
-
-3. **Check for uncommitted changes:**
-```bash
-git status --porcelain
-```
-Note any uncommitted changes in the report. The evaluation should still proceed, but this is important context.
-
-## Step 4: Execute Evaluation Strategies
-
-Run each strategy configured in `evaluator.strategies` from the config. Execute them in this order for fastest feedback on failures:
+Read `.bober/principles.md` if it exists.
 
-
+Check the current git branch:
 ```bash
-
-# e.g., npm run build, anchor build, forge build, cargo build, etc.
+git branch --show-current
 ```
-- Record the full output
-- If the build fails, most other checks are unreliable -- still run them but note this
 
-
+Check for uncommitted changes:
 ```bash
-
-# e.g., npx tsc --noEmit, cargo clippy, solhint, mypy, etc.
+git status --porcelain
 ```
-- Record every type error with file path and line number
-- Count total errors
 
-
-```bash
-# Use commands.lint from config (varies by stack)
-# e.g., npm run lint, solhint, clippy, ruff, etc.
-```
-- Record every lint error (ignore warnings unless they indicate real problems)
-- Count total errors
+Determine the iteration number: if prior eval results exist for this contract in `.bober/eval-results/`, use the next iteration number. Otherwise, use 1.
 
-
-
-
-
+## Step 3: Spawn the Evaluator Subagent
+
+Use the **Agent tool** to spawn an evaluator subagent.
+
+```
+Agent tool call:
+description: "Evaluate: <sprint title>"
+prompt: <the full prompt below>
 ```
-- Record which tests passed and which failed
-- For failures, record the test name, expected vs actual output, and file location
-- Check if any pre-existing tests broke (regression)
 
-
-
-# Only run if configured and installed
-npx playwright test 2>&1
+**Build the evaluator prompt with ALL of these sections:**
+
 ```
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+You are the Bober Evaluator subagent. You have been spawned to independently evaluate a sprint.
+
+## Sprint Contract
+<paste the full SprintContract JSON from .bober/contracts/<contractId>.json>
+
+## Project Configuration
+Commands:
+<paste the commands section from bober.config.json>
+
+Evaluator config:
+<paste the evaluator section from bober.config.json>
+
+## Project Principles
+<paste full text of .bober/principles.md or "No principles file found.">
+
+## Context
+- Contract ID: <contractId>
+- Spec ID: <specId>
+- Iteration: <N>
+- Branch: <current git branch>
+- Uncommitted changes: <yes/no, with list if yes>
+
+## Generator's Completion Report (if available)
+<paste the most recent generator report from .bober/handoffs/gen-report-<contractId>-*.json, or "No generator report available — evaluate based on current codebase state.">
+
+## Instructions
+1. Read the SprintContract at .bober/contracts/<contractId>.json
+2. Read bober.config.json for configured eval strategies and commands
+3. Run each configured evaluation strategy:
+- Build/compile verification (commands.build)
+- Type checking (commands.typecheck)
+- Linting (commands.lint)
+- Unit tests (commands.test)
+- Playwright E2E (if configured)
+- API checks (if configured)
+- Custom strategies (if configured)
+4. Verify EVERY success criterion in the contract one by one
+5. Check for regressions (pre-existing tests, build stability, unexpected file changes)
+6. Check adherence to project principles
+7. Produce a structured EvalResult
+
+IMPORTANT: You do NOT have Write or Edit tools. Output the EvalResult JSON in your response, and the orchestrator will save it to disk.
+
+## Your Response
+Respond with EXACTLY this JSON structure (no other text):
 {
-"
-"
-"
-"
-"
-"
-"
+"evalId": "eval-<contractId>-<iteration>",
+"contractId": "<contract ID>",
+"specId": "<spec ID>",
+"timestamp": "<ISO-8601>",
+"iteration": <N>,
+"overallResult": "pass | fail",
+"score": {
+"criteriaTotal": <N>,
+"criteriaPassed": <N>,
+"criteriaFailed": <N>,
+"criteriaSkipped": <N>,
+"requiredPassed": <N>,
+"requiredFailed": <N>,
+"requiredTotal": <N>
+},
+"strategyResults": [
+{
+"strategy": "<type>",
+"required": true/false,
+"result": "pass | fail | skipped",
+"output": "<relevant output excerpt>",
+"details": "<explanation if failed>"
+}
+],
+"criteriaResults": [
+{
+"criterionId": "sc-X-Y",
+"description": "<criterion description>",
+"required": true/false,
+"result": "pass | fail | skipped",
+"evidence": "<specific evidence>",
+"feedback": "<failure details if failed>"
+}
+],
+"regressions": [
+{
+"description": "<what regressed>",
+"evidence": "<how detected>",
+"severity": "critical | major | minor"
+}
+],
+"generatorFeedback": [
+{
+"priority": "critical | high | medium | low",
+"category": "bug | missing-feature | regression | quality | performance",
+"file": "<file path>",
+"line": "<line number>",
+"description": "<precise description>",
+"expected": "<what should happen>",
+"reproduction": "<steps to reproduce>"
+}
+],
+"summary": "<2-3 sentence summary>"
 }
 ```
 
-## Step
-
-Go through EVERY success criterion in the contract, one by one.
-
-For each criterion:
-
-1. **Read the criterion and its verification method**
-2. **Gather evidence:**
-- For `build`/`typecheck`/`lint`/`unit-test`/`playwright`: Use the strategy results from Step 4
-- For `manual`: Read the relevant source files. Trace the code path. Verify the described behavior exists in the code.
-- For `api-check`: Test the specific endpoint described in the criterion
-- For `custom`: Run the custom command
-3. **Make a judgment: pass, fail, or skipped**
-4. **Record evidence supporting the judgment**
-
-**Judgment rules:**
-- `pass`: You have concrete evidence the criterion is met
-- `fail`: You have concrete evidence the criterion is NOT met, or you cannot find evidence that it IS met
-- `skipped`: The verification method cannot be executed (e.g., Playwright not installed)
-
-**A criterion marked `required: true` MUST have a definitive pass or fail. It cannot be skipped.**
-
-## Step 6: Check for Regressions
+## Step 4: Process the Evaluator's Response
 
-
+**After the evaluator subagent returns:**
 
-1.
-2.
-3.
-
-
-
-Generate the structured evaluation result following the schema in `skills/bober.eval/references/feedback-format.md`.
-
-**Overall result determination:**
-- **PASS:** ALL required strategies passed AND ALL required criteria passed AND no critical regressions
-- **FAIL:** ANY required strategy failed OR ANY required criterion failed OR critical regression found
-
-Save the EvalResult to `.bober/eval-results/eval-<contractId>-<iteration>.json`.
-
-If this is the first evaluation for this contract, iteration = 1. Otherwise, read the contract's `iterationHistory` to determine the next iteration number.
+1. Parse the evaluator's response to extract the EvalResult JSON.
+2. Save the EvalResult to `.bober/eval-results/eval-<contractId>-<iteration>.json` (the evaluator cannot write files itself).
+3. Append to `.bober/history.jsonl`:
+```json
+{"event":"eval-completed","contractId":"...","evalId":"...","result":"pass|fail","timestamp":"..."}
+```
 
-
-```json
-{"event":"eval-completed","contractId":"...","evalId":"...","result":"pass|fail","timestamp":"..."}
-```
+If the subagent crashed or returned a malformed response, report the error clearly and suggest the user retry.
 
-## Step
+## Step 5: Output Report
 
 Present results in a clear, human-readable format:
 
 ```
-
+=== Evaluation Report: <sprint title> ===
 
-
-
-
-
-
+Contract: <contractId>
+Iteration: <N>
+Result: PASS / FAIL
+Branch: <current branch>
+Uncommitted changes: yes/no
 
 ### Strategy Results
 | Strategy | Required | Result |
@@ -223,15 +221,14 @@ Present results in a clear, human-readable format:
 **sc-1-3: API returns 201 on valid registration**
 - What failed: POST /api/auth/register returns 500 instead of 201
 - Where: src/routes/auth.ts:42
-- Evidence:
-- Expected: 201 with
-- Root cause: The database migration has not been run. The users table does not exist.
+- Evidence: <command output>
+- Expected: 201 with { id, email } response body
 
 ### Regressions (if any)
 - <description>
 
 ### Summary
-<2-3 sentence summary>
+<2-3 sentence summary from the evaluator>
 ```
 
 ## Next Steps
@@ -240,9 +237,9 @@ After completing this phase, suggest the following next steps to the user:
 - `/bober-sprint` — Rework the failed sprint with evaluator feedback, or move to the next sprint
 - `/bober-sprint` — Execute the next sprint if evaluation passed
 
-##
+## Error Handling
 
--
--
--
--
+- **Subagent crash/timeout:** If the Agent tool call fails, report the error. Do not attempt to evaluate inline — the whole point is subagent isolation.
+- **Subagent returns malformed response:** Try to extract any useful information from the response text. Report what you can and suggest retrying.
+- **Missing contract:** Tell the user to run `/bober-plan` first.
+- **Build broken:** The evaluator will detect and report this. You just relay the results.
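
Note on the added Step 2 and Step 4 text: the bookkeeping the new SKILL.md assigns to the orchestrator (pick the next iteration number, extract the EvalResult JSON from the subagent's reply, save it, append a history event) can be sketched in Python as below. This is only an illustration under stated assumptions. The function names are hypothetical and are not part of agent-bober; only the file paths and JSON field names come from the diff.

```python
import json
import re
from datetime import datetime, timezone
from pathlib import Path


def next_iteration(results_dir: Path, contract_id: str) -> int:
    """Return 1 when no prior results exist for the contract, else highest + 1."""
    if not results_dir.is_dir():
        return 1
    pattern = re.compile(rf"^eval-{re.escape(contract_id)}-(\d+)\.json$")
    seen = []
    for path in results_dir.iterdir():
        match = pattern.match(path.name)
        if match:
            seen.append(int(match.group(1)))
    return max(seen, default=0) + 1


def extract_eval_result(response_text: str) -> dict:
    """Parse the outermost JSON object in the reply, tolerating stray prose."""
    start, end = response_text.find("{"), response_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in evaluator response")
    # json.JSONDecodeError subclasses ValueError, so callers catch one type.
    result = json.loads(response_text[start:end + 1])
    if result.get("overallResult") not in ("pass", "fail"):
        raise ValueError("overallResult must be 'pass' or 'fail'")
    return result


def save_eval_result(root: Path, result: dict) -> Path:
    """Write the EvalResult file and append an eval-completed history event."""
    results_dir = root / ".bober" / "eval-results"
    results_dir.mkdir(parents=True, exist_ok=True)
    out = results_dir / f"eval-{result['contractId']}-{result['iteration']}.json"
    out.write_text(json.dumps(result, indent=2))
    event = {
        "event": "eval-completed",
        "contractId": result["contractId"],
        "evalId": result.get("evalId"),
        "result": result["overallResult"],
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with (root / ".bober" / "history.jsonl").open("a") as fh:
        fh.write(json.dumps(event) + "\n")
    return out
```

A caller would run `next_iteration` before spawning the subagent, then pass the reply text to `extract_eval_result` and the parsed result to `save_eval_result`. A `ValueError` corresponds to the "malformed response" case, where the added text says to report the error and suggest a retry.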