agent-bober 0.5.1 → 0.5.3
@@ -88,9 +88,59 @@ Read these documents in order:
 
 Build a checklist from the contract's `successCriteria` array. This is your evaluation framework. Every criterion gets tested independently.
 
-### Step 2:
+### Step 2: Live Page Evaluation (for frontend/UI projects)
 
-
+**Before running ANY automated strategy**, if this sprint involves UI/frontend changes, you MUST interact with the live page. This is NOT optional. This is the FIRST thing you do.
+
+**2a. Start the dev server:**
+```bash
+npm run dev &
+DEV_PID=$!
+sleep 8
+```
+
+**2b. Screenshot and study the page:**
+```bash
+npx playwright screenshot http://localhost:3000 /tmp/bober-eval-home.png --full-page 2>&1
+```
+Screenshot additional routes relevant to this sprint. READ every screenshot — you are multimodal, you can see images.
+
+**2c. Score against the four design criteria.**
+
+Study each screenshot carefully, then score each criterion 0-100. Design Quality and Originality are weighted HIGHER than Craft and Functionality.
+
+**Design Quality** (Weight: High) — Does the design feel like a coherent whole? Do colors, typography, layout, and spacing combine into a distinct identity? Or does it look like random parts assembled together?
+- Failing: mismatched card styles, no visual hierarchy, arbitrary colors, assembled-from-parts feeling
+- Passing: consistent visual language, clear mood, intentional color palette, unified system
+
+**Originality** (Weight: High) — Are there deliberate creative choices? Or is this default templates and AI-generated patterns?
+- Automatic fail: unmodified Tailwind/Bootstrap defaults, purple/blue gradients over white cards, generic centered hero + CTA, stock component layouts
+- Passing: custom color choices, distinctive layout decisions, typography personality, visual elements a human designer would recognize as intentional
+
+**Craft** (Weight: Medium) — Technical execution: type hierarchy (distinct h1/h2/h3/body sizes), spacing consistency (using a scale, not random pixels), color contrast (WCAG AA), visual consistency across components.
+
+**Functionality** (Weight: Medium) — Can users find primary actions? Are interactive elements obvious? Are loading/error/empty states handled?
+
+**Scoring:**
+- Generic but functional: 40-55 (FAIL for UI-focused sprints)
+- Has originality but minor issues: 65-80 (PASS with notes)
+- Cohesive, original, well-crafted, functional: 80-95 (PASS)
+- Reserve 95-100 for exceptional work — almost never award this
+
+**If the combined weighted score is below 65, the sprint FAILS** with specific feedback on what to improve. Tell the generator: refine the current direction if scores trend well, or pivot to a different aesthetic if the approach isn't working.
+
+**2d. Check for visual bugs:**
+- Blank areas or broken layouts
+- Text overflow or overlapping elements
+- Missing images or broken SVGs
+- Sections not matching success criteria descriptions
+- Mobile responsiveness (if criteria require it, screenshot at 375px too)
+
+**Do NOT kill the dev server** — Playwright tests need it in Step 3.
+
+### Step 3: Run Configured Evaluation Strategies
+
+Read `evaluator.strategies` from `bober.config.json`. Execute each configured strategy in order. **The dev server should still be running from Step 2.**
 
 **For each strategy, record:**
 - Strategy type
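The new step 2c asks for a "combined weighted score" with a pass threshold of 65 but does not state numeric weights. A minimal sketch of how such a score could be computed follows. The 3/3/2/2 weights and the function names `weighted_design_score` and `design_verdict` are assumptions for illustration only; the prompt itself only says that Design Quality and Originality weigh more than Craft and Functionality.

```bash
#!/usr/bin/env bash
# Sketch of the combined weighted design score described in step 2c.
# ASSUMPTION: the 3/3/2/2 weights are illustrative, not taken from the package.

# weighted_design_score DESIGN ORIGINALITY CRAFT FUNCTIONALITY  (each 0-100)
weighted_design_score() {
  local design=$1 originality=$2 craft=$3 functionality=$4
  local w_high=3 w_medium=2
  local total_weight=$(( 2 * w_high + 2 * w_medium ))
  # Integer arithmetic: the result is rounded down.
  echo $(( (w_high * (design + originality) + w_medium * (craft + functionality)) / total_weight ))
}

# design_verdict applies the "below 65 fails" rule from the prompt.
design_verdict() {
  local score
  score=$(weighted_design_score "$@")
  if (( score < 65 )); then
    echo "FAIL ($score)"
  else
    echo "PASS ($score)"
  fi
}

design_verdict 50 45 80 85   # generic but functional: prints "FAIL (61)"
design_verdict 85 80 75 80   # cohesive and original:  prints "PASS (80)"
```

With these assumed weights, strong Craft and Functionality scores cannot rescue a generic design, which matches the intent stated in the scoring bands.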
@@ -99,6 +149,11 @@ Read `evaluator.strategies` from `bober.config.json`. Execute each configured st
 - Pass/fail determination
 - Whether this strategy is `required` (blocking) or optional
 
+**After all strategies are done, kill the dev server:**
+```bash
+kill $DEV_PID 2>/dev/null
+```
+
 **Strategy execution:**
 
 #### `typecheck`
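The dev server lifecycle in this release relies on a fixed `sleep 8` after startup and a manual `kill $DEV_PID` once the strategies finish. A possible alternative, sketched here under assumptions, is to poll until the server answers and to register the cleanup up front. The helper name `wait_until` is hypothetical, the port 3000 is taken from the screenshot command in Step 2b, and the `trap` only helps if every step runs in the same shell session.

```bash
#!/usr/bin/env bash
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: retries COMMAND about once per second until it
# succeeds or TIMEOUT_SECONDS have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout=$1; shift
  local waited=0
  until "$@" >/dev/null 2>&1; do
    if (( waited >= timeout )); then
      return 1
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 0
}

# Usage sketch (commented out so that sourcing this file starts nothing):
# npm run dev &
# DEV_PID=$!
# trap 'kill "$DEV_PID" 2>/dev/null' EXIT   # cleanup even if a strategy aborts
# wait_until 30 curl -sf http://localhost:3000 || echo "dev server did not come up"
```

Polling avoids both failure modes of a fixed sleep: screenshots taken before a slow server is ready, and time wasted when the server is up quickly.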
@@ -187,7 +242,7 @@ This strategy requires careful execution:
 - Execute the custom command specified
 - Interpret output based on the strategy's config
 
-### Step
+### Step 4: Verify Success Criteria
 
 Go through EVERY success criterion in the contract, one by one. For each:
 
@@ -202,7 +257,7 @@ Go through EVERY success criterion in the contract, one by one. For each:
 - A criterion with `required: false` is recorded but does not block the sprint
 - If a criterion's `verificationMethod` cannot be executed (e.g., Playwright not set up), mark it as `"skipped"` with a clear reason. If it was `required`, escalate this as a configuration issue.
 
-### Step
+### Step 5: Check Principles Adherence
 
 If `.bober/principles.md` exists, verify the Generator's output adheres to the project principles:
 
@@ -212,7 +267,7 @@ If `.bober/principles.md` exists, verify the Generator's output adheres to the p
 
 Principle violations should be reported in the `generatorFeedback` array with `category: "quality"` and a reference to the specific principle that was violated.
 
-### Step
+### Step 6: Check for Regressions
 
 Beyond the contract's criteria, check for regressions:
 
@@ -220,7 +275,7 @@ Beyond the contract's criteria, check for regressions:
 2. **Does the build still work?** Even if the contract is about backend code, verify the full build.
 3. **Were any existing files modified in unexpected ways?** Use `git diff` to review all changes. Flag any changes to files NOT mentioned in the contract's `estimatedFiles`.
 
-### Step
+### Step 7: Produce Structured EvalResult
 
 Generate the following JSON structure:
 
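The regression check in item 3 above (flag changed files that are not in the contract's `estimatedFiles`) can be sketched as a small filter. The helper name `unexpected_files`, the list file `/tmp/estimated-files.txt`, and the contract path in the usage comment are assumptions for illustration; the diff does not show where the contract lives or how the evaluator reads it.

```bash
#!/usr/bin/env bash
# unexpected_files EXPECTED_LIST_FILE
# Hypothetical helper: reads changed paths on stdin (one per line) and prints
# those that do not appear in EXPECTED_LIST_FILE.
unexpected_files() {
  local expected=$1
  # -F fixed strings, -x whole-line match, -v invert the match.
  # grep exits 1 when it prints nothing, so mask that with "|| true".
  grep -Fxv -f "$expected" || true
}

# Usage sketch (contract path is an assumption, jq must be installed):
# jq -r '.estimatedFiles[]' path/to/contract.json > /tmp/estimated-files.txt
# git diff --name-only HEAD | unexpected_files /tmp/estimated-files.txt
```

Any path the filter prints is a candidate for the "modified in unexpected ways" flag; an empty result means every changed file was anticipated by the contract.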
@@ -282,7 +337,7 @@ Generate the following JSON structure:
 }
 ```
 
-### Step
+### Step 8: Save and Report
 
 1. **Save the EvalResult** to `.bober/eval-results/<evalId>.json`
 - IMPORTANT: You do not have Write tools. Output the EvalResult JSON and the orchestrator will save it.
@@ -471,53 +526,6 @@ If `playwright` is in the configured evaluation strategies:
 ```
 New interactive elements without `data-testid` = quality failure with feedback to add them.
 
-## Design & UI Evaluation Criteria
-
-When the sprint involves UI/frontend work, evaluate against these four criteria in addition to functional correctness. These are weighted: Design Quality and Originality are MORE important than Craft and Functionality.
-
-### 1. Design Quality (Weight: High)
-Does the design feel like a coherent whole rather than a collection of parts? Strong work means colors, typography, layout, imagery, and detail combine to create a distinct mood and identity.
-
-**Failing signals:**
-- Multiple visual "languages" on the same page (mismatched card styles, inconsistent button treatments)
-- No clear visual hierarchy — everything competes for attention
-- Colors that feel arbitrary rather than curated
-- Layout that feels assembled from parts rather than designed as a system
-
-### 2. Originality (Weight: High)
-Is there evidence of custom decisions, or is this template layouts, library defaults, and AI-generated patterns? A human designer should recognize deliberate creative choices.
-
-**Automatic failures:**
-- Unmodified Tailwind/Bootstrap/Material UI defaults with no customization
-- Purple/blue gradients over white cards (the #1 telltale AI pattern)
-- Generic hero sections with centered text and a CTA button
-- Stock component library layouts with only color changes
-- Any pattern you've seen five times before — if it's generic, it fails
-
-### 3. Craft (Weight: Medium)
-Technical execution: typography hierarchy, spacing consistency, color harmony, contrast ratios. This is a competence check.
-
-**Check specifically:**
-- Is there a clear type scale (distinct sizes for h1/h2/h3/body/caption)?
-- Is spacing consistent (using a scale like 4/8/16/24/32/48, not random pixels)?
-- Do colors have sufficient contrast for accessibility (WCAG AA minimum)?
-- Are interactive elements visually consistent (all buttons look like they belong together)?
-
-### 4. Functionality (Weight: Medium)
-Can users understand what the interface does, find primary actions, and complete tasks without guessing?
-
-**Check specifically:**
-- Are primary actions visually prominent?
-- Do interactive elements have clear hover/focus/active states?
-- Are loading, error, and empty states handled?
-- Is the layout responsive (or at least not broken) at common viewport widths?
-
-### Scoring UI Work
-- A design that is technically correct but visually generic scores LOW (40-55)
-- A design with originality and craft but minor functional issues scores MEDIUM-HIGH (65-80)
-- A design that is cohesive, original, well-crafted, AND functional scores HIGH (80-95)
-- Reserve 95-100 for genuinely exceptional work — you should almost never award this
-
 ## Code Quality Evaluation
 
 Beyond functional correctness, evaluate code quality ruthlessly:
@@ -399,3 +399,7 @@ When implementing user interfaces, your work will be graded on four criteria. Yo
 4. **Functionality:** Users must understand what the interface does, find primary actions, and complete tasks without guessing. Interactive elements must have clear affordances. Loading states, error states, and empty states must all be handled.
 
 Do NOT produce "safe" designs that technically satisfy requirements but lack any personality. The evaluator is specifically instructed to penalize bland, generic output. Take aesthetic risks. Make deliberate choices about color, typography, layout, and motion.
+
+**On rework iterations:** When you receive evaluator feedback on design, make a strategic decision:
+- If design scores are trending upward (65+), **refine** the current direction — improve what's working
+- If design scores are low or stagnant (<65), **pivot** to a fundamentally different aesthetic — new color palette, different layout approach, different visual personality. Don't polish something that isn't working.
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
 "name": "agent-bober",
-"version": "0.5.
+"version": "0.5.3",
 "description": "Generator-Evaluator multi-agent harness for building applications autonomously with Claude. Implements planner, sprint, and evaluator patterns.",
 "type": "module",
 "main": "dist/index.js",
@@ -7,7 +7,7 @@
 { "type": "lint", "required": true },
 { "type": "build", "required": true },
 { "type": "unit-test", "required": true },
-{ "type": "playwright", "required":
+{ "type": "playwright", "required": true }
 ], "maxIterations": 3 },
 "sprint": { "maxSprints": 10, "requireContracts": true, "sprintSize": "medium" },
 "pipeline": { "maxIterations": 20, "requireApproval": false, "contextReset": "always" },