agent-bober 0.4.3 → 0.5.1 (npm)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (17)

package/README.md +30 -0
package/agents/bober-evaluator.md +277 -8
package/agents/bober-generator.md +155 -0
package/agents/bober-planner.md +70 -0
package/dist/cli/commands/init.js +1 -0
package/dist/cli/commands/init.js.map +1 -1
package/dist/evaluators/builtin/playwright.d.ts +11 -0
package/dist/evaluators/builtin/playwright.d.ts.map +1 -1
package/dist/evaluators/builtin/playwright.js +259 -12
package/dist/evaluators/builtin/playwright.js.map +1 -1
package/package.json +1 -1
package/skills/bober.eval/SKILL.md +145 -148
package/skills/bober.playwright/SKILL.md +429 -0
package/skills/bober.playwright/references/playwright-patterns.md +377 -0
package/skills/bober.run/SKILL.md +425 -118
package/skills/bober.sprint/SKILL.md +147 -57
package/templates/presets/nextjs/bober.config.json +2 -1

package/skills/bober.eval/SKILL.md

@@ -1,6 +1,6 @@
 ---
 name: bober.eval
-description: Run an independent evaluation of the current sprint state against its contract, producing structured pass/fail feedback.
+description: Spawn an evaluator subagent to independently assess the current sprint state against its contract, producing structured pass/fail feedback.
 argument-hint: "[contract-id]"
 handoffs:
   - label: "Rework Sprint"
@@ -11,13 +11,15 @@ handoffs:
     prompt: "Move to the next sprint"
 ---
-# bober.eval — Standalone Evaluation Skill
+# bober.eval — Standalone Evaluation Orchestrator
-You are running the **bober.eval** skill. Your job is to independently evaluate the current state of a sprint implementation against its contract and produce structured feedback. This skill can be run at any time, independently of the sprint execution loop.
+You are the **orchestrator** for a standalone evaluation run. You do NOT evaluate the code yourself. You spawn the evaluator as a subagent using the **Agent tool**, then process and save its results.
+The evaluator subagent runs in its own isolated context window. It receives ONLY the information you explicitly pass in its prompt.
 ## When to Use This Skill
-- **During development:** To check your progress against criteria before running the full sprint loop
+- **During development:** To check progress against criteria before running the full sprint loop
 - **After manual changes:** When you have fixed something the Generator produced and want to re-evaluate
 - **For debugging:** To understand exactly what is passing and failing in a sprint
 - **As a standalone QA check:** To evaluate any codebase state against a sprint contract
@@ -36,171 +38,167 @@ You are running the **bober.eval** skill. Your job is to independently evaluate
 Read the contract and its parent PlanSpec.
-## Step 2: Load Configuration
+## Step 2: Gather Context
 Read `bober.config.json` and extract:
 - `evaluator.strategies`: The configured evaluation strategies
 - `evaluator.model`: The model to use (informational)
 - `commands`: The project commands for build, test, lint, typecheck
-## Step 3: Pre-Flight Checks
-Before running evaluation strategies, verify the environment:
-1. **Check if dependencies are installed:**
-   ```bash
-   # Check for installed dependencies (varies by stack)
-   # Node.js: ls node_modules/.package-lock.json 2>/dev/null
-   # Rust/Anchor: check target/ directory
-   # Solidity/Hardhat: ls node_modules/.package-lock.json 2>/dev/null
-   # Solidity/Foundry: check lib/ directory
-   # Python: check venv or .venv
-   ```
-   If dependencies are not installed, run the configured install command first.
-2. **Check the current git branch:**
-   ```bash
-   git branch --show-current
-   ```
-   Note the branch for the evaluation report.
-3. **Check for uncommitted changes:**
-   ```bash
-   git status --porcelain
-   ```
-   Note any uncommitted changes in the report. The evaluation should still proceed, but this is important context.
-## Step 4: Execute Evaluation Strategies
-Run each strategy configured in `evaluator.strategies` from the config. Execute them in this order for fastest feedback on failures:
+Read `.bober/principles.md` if it exists.
-### Priority 1: Build/Compile Verification
+Check the current git branch:
 ```bash
-# Use commands.build from config (varies by stack)
-# e.g., npm run build, anchor build, forge build, cargo build, etc.
+git branch --show-current
 ```
-- Record the full output
-- If the build fails, most other checks are unreliable -- still run them but note this
-### Priority 2: Type Checking / Static Analysis
+Check for uncommitted changes:
 ```bash
-# Use commands.typecheck from config (varies by stack)
-# e.g., npx tsc --noEmit, cargo clippy, solhint, mypy, etc.
+git status --porcelain
 ```
-- Record every type error with file path and line number
-- Count total errors
-### Priority 3: Linting
-```bash
-# Use commands.lint from config (varies by stack)
-# e.g., npm run lint, solhint, clippy, ruff, etc.
-```
-- Record every lint error (ignore warnings unless they indicate real problems)
-- Count total errors
+Determine the iteration number: if prior eval results exist for this contract in `.bober/eval-results/`, use the next iteration number. Otherwise, use 1.
-### Priority 4: Unit Tests
-```bash
-# Use commands.test from config (varies by stack)
-# e.g., npm test, anchor test, forge test, pytest, etc.
+## Step 3: Spawn the Evaluator Subagent
+Use the **Agent tool** to spawn an evaluator subagent.
+```
+Agent tool call:
+  description: "Evaluate: <sprint title>"
+  prompt: <the full prompt below>
 ```
-- Record which tests passed and which failed
-- For failures, record the test name, expected vs actual output, and file location
-- Check if any pre-existing tests broke (regression)
-### Priority 5: E2E Tests (Playwright)
-```bash
-# Only run if configured and installed
-npx playwright test 2>&1
+**Build the evaluator prompt with ALL of these sections:**
 ```
-- If Playwright is not installed, mark as `skipped` (not `failed`)
-- Record which tests passed and failed
-- Note if screenshots are available
-### Priority 6: API Checks
-- If the contract has API-related success criteria, start the dev server and test endpoints:
-  ```bash
-  # Start dev server in background
-  # Test endpoints with curl
-  curl -s -w "\n%{http_code}" http://localhost:<port>/api/<endpoint>
-  ```
-- Record response status codes and body shapes
-### Priority 7: Custom Strategies
-- For each strategy with `type: "custom"`, execute the command from the strategy's `config` field
-- Record the output and exit code
-**For each strategy, record:**
-```json
+You are the Bober Evaluator subagent. You have been spawned to independently evaluate a sprint.
+## Sprint Contract
+<paste the full SprintContract JSON from .bober/contracts/<contractId>.json>
+## Project Configuration
+Commands:
+<paste the commands section from bober.config.json>
+Evaluator config:
+<paste the evaluator section from bober.config.json>
+## Project Principles
+<paste full text of .bober/principles.md or "No principles file found.">
+## Context
+- Contract ID: <contractId>
+- Spec ID: <specId>
+- Iteration: <N>
+- Branch: <current git branch>
+- Uncommitted changes: <yes/no, with list if yes>
+## Generator's Completion Report (if available)
+<paste the most recent generator report from .bober/handoffs/gen-report-<contractId>-*.json, or "No generator report available — evaluate based on current codebase state.">
+## Instructions
+1. Read the SprintContract at .bober/contracts/<contractId>.json
+2. Read bober.config.json for configured eval strategies and commands
+3. Run each configured evaluation strategy:
+   - Build/compile verification (commands.build)
+   - Type checking (commands.typecheck)
+   - Linting (commands.lint)
+   - Unit tests (commands.test)
+   - Playwright E2E (if configured)
+   - API checks (if configured)
+   - Custom strategies (if configured)
+4. Verify EVERY success criterion in the contract one by one
+5. Check for regressions (pre-existing tests, build stability, unexpected file changes)
+6. Check adherence to project principles
+7. Produce a structured EvalResult
+IMPORTANT: You do NOT have Write or Edit tools. Output the EvalResult JSON in your response, and the orchestrator will save it to disk.
+## Your Response
+Respond with EXACTLY this JSON structure (no other text):
 {
-  "strategy": "<type>",
-  "required": true,
-  "result": "pass | fail | skipped",
-  "exitCode": 0,
-  "output": "<relevant output>",
-  "errorCount": 0,
-  "details": "<explanation>"
+  "evalId": "eval-<contractId>-<iteration>",
+  "contractId": "<contract ID>",
+  "specId": "<spec ID>",
+  "timestamp": "<ISO-8601>",
+  "iteration": <N>,
+  "overallResult": "pass | fail",
+  "score": {
+    "criteriaTotal": <N>,
+    "criteriaPassed": <N>,
+    "criteriaFailed": <N>,
+    "criteriaSkipped": <N>,
+    "requiredPassed": <N>,
+    "requiredFailed": <N>,
+    "requiredTotal": <N>
+  },
+  "strategyResults": [
+    {
+      "strategy": "<type>",
+      "required": true/false,
+      "result": "pass | fail | skipped",
+      "output": "<relevant output excerpt>",
+      "details": "<explanation if failed>"
+    }
+  ],
+  "criteriaResults": [
+    {
+      "criterionId": "sc-X-Y",
+      "description": "<criterion description>",
+      "required": true/false,
+      "result": "pass | fail | skipped",
+      "evidence": "<specific evidence>",
+      "feedback": "<failure details if failed>"
+    }
+  ],
+  "regressions": [
+    {
+      "description": "<what regressed>",
+      "evidence": "<how detected>",
+      "severity": "critical | major | minor"
+    }
+  ],
+  "generatorFeedback": [
+    {
+      "priority": "critical | high | medium | low",
+      "category": "bug | missing-feature | regression | quality | performance",
+      "file": "<file path>",
+      "line": "<line number>",
+      "description": "<precise description>",
+      "expected": "<what should happen>",
+      "reproduction": "<steps to reproduce>"
+    }
+  ],
+  "summary": "<2-3 sentence summary>"
 }
 ```
-## Step 5: Verify Success Criteria
-Go through EVERY success criterion in the contract, one by one.
-For each criterion:
-1. **Read the criterion and its verification method**
-2. **Gather evidence:**
-   - For `build`/`typecheck`/`lint`/`unit-test`/`playwright`: Use the strategy results from Step 4
-   - For `manual`: Read the relevant source files. Trace the code path. Verify the described behavior exists in the code.
-   - For `api-check`: Test the specific endpoint described in the criterion
-   - For `custom`: Run the custom command
-3. **Make a judgment: pass, fail, or skipped**
-4. **Record evidence supporting the judgment**
-**Judgment rules:**
-- `pass`: You have concrete evidence the criterion is met
-- `fail`: You have concrete evidence the criterion is NOT met, or you cannot find evidence that it IS met
-- `skipped`: The verification method cannot be executed (e.g., Playwright not installed)
-**A criterion marked `required: true` MUST have a definitive pass or fail. It cannot be skipped.**
-## Step 6: Check for Regressions
+## Step 4: Process the Evaluator's Response
-Beyond the contract criteria, check for broader regressions:
+**After the evaluator subagent returns:**
-1. **Pre-existing test count:** If you can determine how many tests existed before the sprint, compare to the current count. Fewer passing tests = regression.
-2. **Build stability:** Does the full project build, not just the new code?
-3. **Unexpected file changes:** Use `git diff --stat` to see all changed files. Flag any files changed that are NOT in the contract's `estimatedFiles`.
-## Step 7: Produce the EvalResult
-Generate the structured evaluation result following the schema in `skills/bober.eval/references/feedback-format.md`.
-**Overall result determination:**
-- **PASS:** ALL required strategies passed AND ALL required criteria passed AND no critical regressions
-- **FAIL:** ANY required strategy failed OR ANY required criterion failed OR critical regression found
-Save the EvalResult to `.bober/eval-results/eval-<contractId>-<iteration>.json`.
-If this is the first evaluation for this contract, iteration = 1. Otherwise, read the contract's `iterationHistory` to determine the next iteration number.
+1. Parse the evaluator's response to extract the EvalResult JSON.
+2. Save the EvalResult to `.bober/eval-results/eval-<contractId>-<iteration>.json` (the evaluator cannot write files itself).
+3. Append to `.bober/history.jsonl`:
+   ```json
+   {"event":"eval-completed","contractId":"...","evalId":"...","result":"pass|fail","timestamp":"..."}
+   ```
-Append to `.bober/history.jsonl`:
-```json
-{"event":"eval-completed","contractId":"...","evalId":"...","result":"pass|fail","timestamp":"..."}
-```
+If the subagent crashed or returned a malformed response, report the error clearly and suggest the user retry.
-## Step 8: Output Report
+## Step 5: Output Report
 Present results in a clear, human-readable format:
 ```
-## Evaluation Report: <sprint title>
+=== Evaluation Report: <sprint title> ===
-**Contract:** <contractId>
-**Iteration:** <N>
-**Result:** PASS / FAIL
-**Branch:** <current branch>
-**Uncommitted changes:** yes/no
+Contract: <contractId>
+Iteration: <N>
+Result: PASS / FAIL
+Branch: <current branch>
+Uncommitted changes: yes/no
 ### Strategy Results
 | Strategy | Required | Result |
@@ -223,15 +221,14 @@ Present results in a clear, human-readable format:
 **sc-1-3: API returns 201 on valid registration**
 - What failed: POST /api/auth/register returns 500 instead of 201
 - Where: src/routes/auth.ts:42
-- Evidence: `curl -X POST http://localhost:3000/api/auth/register -H "Content-Type: application/json" -d '{"email":"test@test.com","password":"password123"}' returned 500 with error "relation users does not exist"`
-- Expected: 201 with `{ id, email }` response body
-- Root cause: The database migration has not been run. The users table does not exist.
+- Evidence: <command output>
+- Expected: 201 with { id, email } response body
 ### Regressions (if any)
 - <description>
 ### Summary
-<2-3 sentence summary>
+<2-3 sentence summary from the evaluator>
 ```
 ## Next Steps
@@ -240,9 +237,9 @@ After completing this phase, suggest the following next steps to the user:
 - `/bober-sprint` — Rework the failed sprint with evaluator feedback, or move to the next sprint
 - `/bober-sprint` — Execute the next sprint if evaluation passed
-## Anti-Leniency Reminders
+## Error Handling
-- If a criterion says "the form displays an error message" and you can only verify the validation logic exists in code but cannot confirm the message renders, mark it as **fail** with a note about what you could not verify.
-- If the build has warnings that look like potential runtime errors (e.g., unused imports of things that should be used), flag them even if the build technically passes.
-- If a test passes but the test itself is trivial (e.g., `expect(true).toBe(true)`), note this in the report. A passing trivial test does not satisfy a functional criterion.
-- If the Generator's self-report says something works but you find evidence it does not, trust your evidence over the report.
+- **Subagent crash/timeout:** If the Agent tool call fails, report the error. Do not attempt to evaluate inline — the whole point is subagent isolation.
+- **Subagent returns malformed response:** Try to extract any useful information from the response text. Report what you can and suggest retrying.
+- **Missing contract:** Tell the user to run `/bober-plan` first.
+- **Build broken:** The evaluator will detect and report this. You just relay the results.
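
The orchestrator bookkeeping that the 0.5.1 version of this skill describes in Step 2 (choosing the iteration number) and Step 4 (parsing the subagent's response and appending to `.bober/history.jsonl`) could be sketched roughly as below. This is a minimal illustration, not code from agent-bober: the names `EvalResultLike`, `nextIteration`, `extractEvalResult`, and `historyLine` are hypothetical, file I/O is left out so the helpers stay pure, and slicing from the first `{` to the last `}` is an assumed way of tolerating stray text around the JSON.

```typescript
// Hypothetical helpers; none of these names come from agent-bober itself.

// Only the EvalResult fields that the history entry needs.
interface EvalResultLike {
  evalId: string;
  contractId: string;
  overallResult: "pass" | "fail";
  timestamp: string;
}

// Escape a string so it can be used literally inside a RegExp.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Step 2: given the file names found in .bober/eval-results/, return the
// next iteration number for this contract (1 when no prior results exist).
function nextIteration(fileNames: string[], contractId: string): number {
  const pattern = new RegExp(`^eval-${escapeRegExp(contractId)}-(\\d+)\\.json$`);
  let highest = 0;
  for (const name of fileNames) {
    const match = pattern.exec(name);
    if (match) {
      highest = Math.max(highest, Number(match[1]));
    }
  }
  return highest + 1;
}

// Step 4: pull the EvalResult JSON out of the subagent's response text.
// Returns null for a malformed response so the caller can report the error
// and suggest a retry instead of evaluating inline.
function extractEvalResult(response: string): EvalResultLike | null {
  const start = response.indexOf("{");
  const end = response.lastIndexOf("}");
  if (start === -1 || end <= start) {
    return null;
  }
  try {
    const parsed: unknown = JSON.parse(response.slice(start, end + 1));
    if (typeof parsed !== "object" || parsed === null) {
      return null;
    }
    const candidate = parsed as Partial<EvalResultLike>;
    const valid =
      typeof candidate.evalId === "string" &&
      typeof candidate.contractId === "string" &&
      typeof candidate.timestamp === "string" &&
      (candidate.overallResult === "pass" || candidate.overallResult === "fail");
    return valid ? (candidate as EvalResultLike) : null;
  } catch {
    return null;
  }
}

// Step 4: build the single JSON line to append to .bober/history.jsonl.
function historyLine(result: EvalResultLike): string {
  return JSON.stringify({
    event: "eval-completed",
    contractId: result.contractId,
    evalId: result.evalId,
    result: result.overallResult,
    timestamp: result.timestamp,
  });
}
```

A caller would list `.bober/eval-results/`, pass the file names to `nextIteration`, and, once the subagent returns, write the parsed result to `eval-<contractId>-<iteration>.json` before appending `historyLine(result)` to the history file. Keeping these helpers free of file access makes the malformed-response path easy to exercise in isolation.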