agent-bober 0.4.2 → 0.5.0

@@ -1,6 +1,6 @@
  ---
  name: bober.eval
- description: Run an independent evaluation of the current sprint state against its contract, producing structured pass/fail feedback.
+ description: Spawn an evaluator subagent to independently assess the current sprint state against its contract, producing structured pass/fail feedback.
  argument-hint: "[contract-id]"
  handoffs:
  - label: "Rework Sprint"
@@ -11,13 +11,15 @@ handoffs:
  prompt: "Move to the next sprint"
  ---

- # bober.eval — Standalone Evaluation Skill
+ # bober.eval — Standalone Evaluation Orchestrator

- You are running the **bober.eval** skill. Your job is to independently evaluate the current state of a sprint implementation against its contract and produce structured feedback. This skill can be run at any time, independently of the sprint execution loop.
+ You are the **orchestrator** for a standalone evaluation run. You do NOT evaluate the code yourself. You spawn the evaluator as a subagent using the **Agent tool**, then process and save its results.
+
+ The evaluator subagent runs in its own isolated context window. It receives ONLY the information you explicitly pass in its prompt.

  ## When to Use This Skill

- - **During development:** To check your progress against criteria before running the full sprint loop
+ - **During development:** To check progress against criteria before running the full sprint loop
  - **After manual changes:** When you have fixed something the Generator produced and want to re-evaluate
  - **For debugging:** To understand exactly what is passing and failing in a sprint
  - **As a standalone QA check:** To evaluate any codebase state against a sprint contract
@@ -36,171 +38,167 @@ You are running the **bober.eval** skill. Your job is to independently evaluate

  Read the contract and its parent PlanSpec.

- ## Step 2: Load Configuration
+ ## Step 2: Gather Context

  Read `bober.config.json` and extract:
  - `evaluator.strategies`: The configured evaluation strategies
  - `evaluator.model`: The model to use (informational)
  - `commands`: The project commands for build, test, lint, typecheck

- ## Step 3: Pre-Flight Checks
-
- Before running evaluation strategies, verify the environment:
-
- 1. **Check if dependencies are installed:**
- ```bash
- # Check for installed dependencies (varies by stack)
- # Node.js: ls node_modules/.package-lock.json 2>/dev/null
- # Rust/Anchor: check target/ directory
- # Solidity/Hardhat: ls node_modules/.package-lock.json 2>/dev/null
- # Solidity/Foundry: check lib/ directory
- # Python: check venv or .venv
- ```
- If dependencies are not installed, run the configured install command first.
-
- 2. **Check the current git branch:**
- ```bash
- git branch --show-current
- ```
- Note the branch for the evaluation report.
-
- 3. **Check for uncommitted changes:**
- ```bash
- git status --porcelain
- ```
- Note any uncommitted changes in the report. The evaluation should still proceed, but this is important context.
-
- ## Step 4: Execute Evaluation Strategies
-
- Run each strategy configured in `evaluator.strategies` from the config. Execute them in this order for fastest feedback on failures:
+ Read `.bober/principles.md` if it exists.

- ### Priority 1: Build/Compile Verification
+ Check the current git branch:
  ```bash
- # Use commands.build from config (varies by stack)
- # e.g., npm run build, anchor build, forge build, cargo build, etc.
+ git branch --show-current
  ```
- - Record the full output
- - If the build fails, most other checks are unreliable -- still run them but note this

- ### Priority 2: Type Checking / Static Analysis
+ Check for uncommitted changes:
  ```bash
- # Use commands.typecheck from config (varies by stack)
- # e.g., npx tsc --noEmit, cargo clippy, solhint, mypy, etc.
+ git status --porcelain
  ```
- - Record every type error with file path and line number
- - Count total errors

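As a rough illustration of how the porcelain output could feed the "Uncommitted changes" context line, here is a minimal Python sketch. The function name `summarize_uncommitted` and the exact `yes (...)` wording are assumptions made for this example, not part of agent-bober.

```python
def summarize_uncommitted(porcelain_output: str) -> str:
    """Turn raw `git status --porcelain` output into a yes/no context line.

    Hypothetical helper: pass the output unmodified, because the first
    status column may be a space and stripping would shift the path.
    """
    # Each entry is "XY <path>": two status characters, one space, then the path.
    paths = [line[3:] for line in porcelain_output.splitlines() if line.strip()]
    if not paths:
        return "no"
    return "yes (" + ", ".join(paths) + ")"
```

For a clean working tree the helper returns `no`; otherwise it lists each changed path so the evaluator sees the same context the orchestrator saw.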
- ### Priority 3: Linting
- ```bash
- # Use commands.lint from config (varies by stack)
- # e.g., npm run lint, solhint, clippy, ruff, etc.
- ```
- - Record every lint error (ignore warnings unless they indicate real problems)
- - Count total errors
+ Determine the iteration number: if prior eval results exist for this contract in `.bober/eval-results/`, use the next iteration number. Otherwise, use 1.

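The iteration rule above could be sketched in Python as follows. This assumes result files follow the `eval-<contractId>-<iteration>.json` naming used elsewhere in this file; the function name `next_iteration` is hypothetical.

```python
import re
from pathlib import Path


def next_iteration(contract_id: str, results_dir: str = ".bober/eval-results") -> int:
    """Return 1 if no prior results exist for the contract, else highest iteration + 1."""
    directory = Path(results_dir)
    if not directory.is_dir():
        return 1
    # Match only files for this exact contract ID, capturing the iteration number.
    pattern = re.compile(r"^eval-" + re.escape(contract_id) + r"-(\d+)\.json$")
    seen = [int(m.group(1)) for p in directory.iterdir() if (m := pattern.match(p.name))]
    return max(seen, default=0) + 1
```

Taking the maximum rather than counting files keeps the numbering monotonic even if an earlier result file was deleted.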
- ### Priority 4: Unit Tests
- ```bash
- # Use commands.test from config (varies by stack)
- # e.g., npm test, anchor test, forge test, pytest, etc.
+ ## Step 3: Spawn the Evaluator Subagent
+
+ Use the **Agent tool** to spawn an evaluator subagent.
+
+ ```
+ Agent tool call:
+ description: "Evaluate: <sprint title>"
+ prompt: <the full prompt below>
  ```
- - Record which tests passed and which failed
- - For failures, record the test name, expected vs actual output, and file location
- - Check if any pre-existing tests broke (regression)

- ### Priority 5: E2E Tests (Playwright)
- ```bash
- # Only run if configured and installed
- npx playwright test 2>&1
+ **Build the evaluator prompt with ALL of these sections:**
+
  ```
- - If Playwright is not installed, mark as `skipped` (not `failed`)
- - Record which tests passed and failed
- - Note if screenshots are available
-
- ### Priority 6: API Checks
- - If the contract has API-related success criteria, start the dev server and test endpoints:
- ```bash
- # Start dev server in background
- # Test endpoints with curl
- curl -s -w "\n%{http_code}" http://localhost:<port>/api/<endpoint>
- ```
- - Record response status codes and body shapes
-
- ### Priority 7: Custom Strategies
- - For each strategy with `type: "custom"`, execute the command from the strategy's `config` field
- - Record the output and exit code
-
- **For each strategy, record:**
- ```json
+ You are the Bober Evaluator subagent. You have been spawned to independently evaluate a sprint.
+
+ ## Sprint Contract
+ <paste the full SprintContract JSON from .bober/contracts/<contractId>.json>
+
+ ## Project Configuration
+ Commands:
+ <paste the commands section from bober.config.json>
+
+ Evaluator config:
+ <paste the evaluator section from bober.config.json>
+
+ ## Project Principles
+ <paste full text of .bober/principles.md or "No principles file found.">
+
+ ## Context
+ - Contract ID: <contractId>
+ - Spec ID: <specId>
+ - Iteration: <N>
+ - Branch: <current git branch>
+ - Uncommitted changes: <yes/no, with list if yes>
+
+ ## Generator's Completion Report (if available)
+ <paste the most recent generator report from .bober/handoffs/gen-report-<contractId>-*.json, or "No generator report available — evaluate based on current codebase state.">
+
+ ## Instructions
+ 1. Read the SprintContract at .bober/contracts/<contractId>.json
+ 2. Read bober.config.json for configured eval strategies and commands
+ 3. Run each configured evaluation strategy:
+ - Build/compile verification (commands.build)
+ - Type checking (commands.typecheck)
+ - Linting (commands.lint)
+ - Unit tests (commands.test)
+ - Playwright E2E (if configured)
+ - API checks (if configured)
+ - Custom strategies (if configured)
+ 4. Verify EVERY success criterion in the contract one by one
+ 5. Check for regressions (pre-existing tests, build stability, unexpected file changes)
+ 6. Check adherence to project principles
+ 7. Produce a structured EvalResult
+
+ IMPORTANT: You do NOT have Write or Edit tools. Output the EvalResult JSON in your response, and the orchestrator will save it to disk.
+
+ ## Your Response
+ Respond with EXACTLY this JSON structure (no other text):
  {
- "strategy": "<type>",
- "required": true,
- "result": "pass | fail | skipped",
- "exitCode": 0,
- "output": "<relevant output>",
- "errorCount": 0,
- "details": "<explanation>"
+ "evalId": "eval-<contractId>-<iteration>",
+ "contractId": "<contract ID>",
+ "specId": "<spec ID>",
+ "timestamp": "<ISO-8601>",
+ "iteration": <N>,
+ "overallResult": "pass | fail",
+ "score": {
+ "criteriaTotal": <N>,
+ "criteriaPassed": <N>,
+ "criteriaFailed": <N>,
+ "criteriaSkipped": <N>,
+ "requiredPassed": <N>,
+ "requiredFailed": <N>,
+ "requiredTotal": <N>
+ },
+ "strategyResults": [
+ {
+ "strategy": "<type>",
+ "required": true/false,
+ "result": "pass | fail | skipped",
+ "output": "<relevant output excerpt>",
+ "details": "<explanation if failed>"
+ }
+ ],
+ "criteriaResults": [
+ {
+ "criterionId": "sc-X-Y",
+ "description": "<criterion description>",
+ "required": true/false,
+ "result": "pass | fail | skipped",
+ "evidence": "<specific evidence>",
+ "feedback": "<failure details if failed>"
+ }
+ ],
+ "regressions": [
+ {
+ "description": "<what regressed>",
+ "evidence": "<how detected>",
+ "severity": "critical | major | minor"
+ }
+ ],
+ "generatorFeedback": [
+ {
+ "priority": "critical | high | medium | low",
+ "category": "bug | missing-feature | regression | quality | performance",
+ "file": "<file path>",
+ "line": "<line number>",
+ "description": "<precise description>",
+ "expected": "<what should happen>",
+ "reproduction": "<steps to reproduce>"
+ }
+ ],
+ "summary": "<2-3 sentence summary>"
  }
  ```

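Because the subagent receives only what the prompt contains, it may help to assemble the prompt programmatically so no section is forgotten. The Python sketch below is one possible way to do that. The function name, its parameters, and the `fixed_sections` argument (standing in for the verbatim "Instructions" and "Your Response" text) are assumptions for illustration, and the generator-report fallback wording is paraphrased.

```python
import json
from typing import Optional


def build_evaluator_prompt(
    contract: dict,
    config: dict,
    contract_id: str,
    spec_id: str,
    iteration: int,
    branch: str,
    uncommitted: str,
    principles: Optional[str] = None,
    generator_report: Optional[str] = None,
    fixed_sections: str = "",
) -> str:
    """Assemble the evaluator prompt; the subagent sees only what is included here."""
    sections = [
        "You are the Bober Evaluator subagent. "
        "You have been spawned to independently evaluate a sprint.",
        "## Sprint Contract\n" + json.dumps(contract, indent=2),
        # `commands` and `evaluator` are the two config sections the template pastes in.
        "## Project Configuration\n"
        "Commands:\n" + json.dumps(config.get("commands", {}), indent=2) + "\n\n"
        "Evaluator config:\n" + json.dumps(config.get("evaluator", {}), indent=2),
        "## Project Principles\n" + (principles or "No principles file found."),
        "## Context\n"
        f"- Contract ID: {contract_id}\n"
        f"- Spec ID: {spec_id}\n"
        f"- Iteration: {iteration}\n"
        f"- Branch: {branch}\n"
        f"- Uncommitted changes: {uncommitted}",
        "## Generator's Completion Report (if available)\n"
        + (generator_report
           or "No generator report available; evaluate based on current codebase state."),
    ]
    if fixed_sections:
        # Verbatim "Instructions" and "Your Response" text from the template.
        sections.append(fixed_sections)
    return "\n\n".join(sections)
```

Passing the contract and spec IDs explicitly avoids guessing at field names inside the contract JSON.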
- ## Step 5: Verify Success Criteria
-
- Go through EVERY success criterion in the contract, one by one.
-
- For each criterion:
-
- 1. **Read the criterion and its verification method**
- 2. **Gather evidence:**
- - For `build`/`typecheck`/`lint`/`unit-test`/`playwright`: Use the strategy results from Step 4
- - For `manual`: Read the relevant source files. Trace the code path. Verify the described behavior exists in the code.
- - For `api-check`: Test the specific endpoint described in the criterion
- - For `custom`: Run the custom command
- 3. **Make a judgment: pass, fail, or skipped**
- 4. **Record evidence supporting the judgment**
-
- **Judgment rules:**
- - `pass`: You have concrete evidence the criterion is met
- - `fail`: You have concrete evidence the criterion is NOT met, or you cannot find evidence that it IS met
- - `skipped`: The verification method cannot be executed (e.g., Playwright not installed)
-
- **A criterion marked `required: true` MUST have a definitive pass or fail. It cannot be skipped.**
-
- ## Step 6: Check for Regressions
+ ## Step 4: Process the Evaluator's Response

- Beyond the contract criteria, check for broader regressions:
+ **After the evaluator subagent returns:**

- 1. **Pre-existing test count:** If you can determine how many tests existed before the sprint, compare to the current count. Fewer passing tests = regression.
- 2. **Build stability:** Does the full project build, not just the new code?
- 3. **Unexpected file changes:** Use `git diff --stat` to see all changed files. Flag any files changed that are NOT in the contract's `estimatedFiles`.
-
- ## Step 7: Produce the EvalResult
-
- Generate the structured evaluation result following the schema in `skills/bober.eval/references/feedback-format.md`.
-
- **Overall result determination:**
- - **PASS:** ALL required strategies passed AND ALL required criteria passed AND no critical regressions
- - **FAIL:** ANY required strategy failed OR ANY required criterion failed OR critical regression found
-
- Save the EvalResult to `.bober/eval-results/eval-<contractId>-<iteration>.json`.
-
- If this is the first evaluation for this contract, iteration = 1. Otherwise, read the contract's `iterationHistory` to determine the next iteration number.
+ 1. Parse the evaluator's response to extract the EvalResult JSON.
+ 2. Save the EvalResult to `.bober/eval-results/eval-<contractId>-<iteration>.json` (the evaluator cannot write files itself).
+ 3. Append to `.bober/history.jsonl`:
+ ```json
+ {"event":"eval-completed","contractId":"...","evalId":"...","result":"pass|fail","timestamp":"..."}
+ ```

- Append to `.bober/history.jsonl`:
- ```json
- {"event":"eval-completed","contractId":"...","evalId":"...","result":"pass|fail","timestamp":"..."}
- ```
+ If the subagent crashed or returned a malformed response, report the error clearly and suggest the user retry.

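The parse, save, and append steps could look roughly like the Python sketch below. The helper names are hypothetical, and the choice of which fields to treat as mandatory (`evalId`, `contractId`, `overallResult`) is an assumption based on what the history event needs, not a documented schema rule.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def extract_eval_result(response_text: str) -> dict:
    """Parse the outermost {...} span of the response; raise ValueError if malformed."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in evaluator response")
    # json.JSONDecodeError is a subclass of ValueError, so callers catch one type.
    result = json.loads(response_text[start:end + 1])
    for key in ("evalId", "contractId", "overallResult"):  # assumed minimum set
        if key not in result:
            raise ValueError(f"EvalResult is missing field: {key}")
    return result


def save_eval_result(result: dict, iteration: int, root: str = ".bober") -> Path:
    """Write the EvalResult file and append an eval-completed event to history.jsonl."""
    base = Path(root)
    out_dir = base / "eval-results"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"eval-{result['contractId']}-{iteration}.json"
    out_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    event = {
        "event": "eval-completed",
        "contractId": result["contractId"],
        "evalId": result["evalId"],
        "result": result["overallResult"],
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with (base / "history.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return out_path
```

Slicing from the first `{` to the last `}` tolerates a response that wraps the JSON in a code fence or a stray sentence, while anything that still fails to parse surfaces as a single `ValueError` the orchestrator can report before suggesting a retry.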
- ## Step 8: Output Report
+ ## Step 5: Output Report

  Present results in a clear, human-readable format:

  ```
- ## Evaluation Report: <sprint title>
+ === Evaluation Report: <sprint title> ===

- **Contract:** <contractId>
- **Iteration:** <N>
- **Result:** PASS / FAIL
- **Branch:** <current branch>
- **Uncommitted changes:** yes/no
+ Contract: <contractId>
+ Iteration: <N>
+ Result: PASS / FAIL
+ Branch: <current branch>
+ Uncommitted changes: yes/no

  ### Strategy Results
  | Strategy | Required | Result |
@@ -223,15 +221,14 @@ Present results in a clear, human-readable format:
  **sc-1-3: API returns 201 on valid registration**
  - What failed: POST /api/auth/register returns 500 instead of 201
  - Where: src/routes/auth.ts:42
- - Evidence: `curl -X POST http://localhost:3000/api/auth/register -H "Content-Type: application/json" -d '{"email":"test@test.com","password":"password123"}' returned 500 with error "relation users does not exist"`
- - Expected: 201 with `{ id, email }` response body
- - Root cause: The database migration has not been run. The users table does not exist.
+ - Evidence: <command output>
+ - Expected: 201 with { id, email } response body

  ### Regressions (if any)
  - <description>

  ### Summary
- <2-3 sentence summary>
+ <2-3 sentence summary from the evaluator>
  ```

  ## Next Steps
@@ -240,9 +237,9 @@ After completing this phase, suggest the following next steps to the user:
  - `/bober-sprint` — Rework the failed sprint with evaluator feedback, or move to the next sprint
  - `/bober-sprint` — Execute the next sprint if evaluation passed

- ## Anti-Leniency Reminders
+ ## Error Handling

- - If a criterion says "the form displays an error message" and you can only verify the validation logic exists in code but cannot confirm the message renders, mark it as **fail** with a note about what you could not verify.
- - If the build has warnings that look like potential runtime errors (e.g., unused imports of things that should be used), flag them even if the build technically passes.
- - If a test passes but the test itself is trivial (e.g., `expect(true).toBe(true)`), note this in the report. A passing trivial test does not satisfy a functional criterion.
- - If the Generator's self-report says something works but you find evidence it does not, trust your evidence over the report.
+ - **Subagent crash/timeout:** If the Agent tool call fails, report the error. Do not attempt to evaluate inline; the whole point is subagent isolation.
+ - **Subagent returns malformed response:** Try to extract any useful information from the response text. Report what you can and suggest retrying.
+ - **Missing contract:** Tell the user to run `/bober-plan` first.
+ - **Build broken:** The evaluator will detect and report this. You just relay the results.