agentboot 0.1.0 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (66)
  1. package/README.md +8 -7
  2. package/agentboot.config.json +4 -1
  3. package/package.json +2 -2
  4. package/scripts/cli.ts +42 -14
  5. package/scripts/compile.ts +30 -7
  6. package/scripts/dev-sync.ts +1 -1
  7. package/scripts/lib/config.ts +17 -1
  8. package/scripts/validate.ts +12 -7
  9. package/.github/ISSUE_TEMPLATE/persona-request.md +0 -62
  10. package/.github/ISSUE_TEMPLATE/quality-feedback.md +0 -67
  11. package/.github/workflows/cla.yml +0 -25
  12. package/.github/workflows/validate.yml +0 -49
  13. package/.idea/agentboot.iml +0 -9
  14. package/.idea/misc.xml +0 -6
  15. package/.idea/modules.xml +0 -8
  16. package/.idea/vcs.xml +0 -6
  17. package/CLAUDE.md +0 -230
  18. package/CONTRIBUTING.md +0 -168
  19. package/PERSONAS.md +0 -156
  20. package/core/instructions/baseline.instructions.md +0 -133
  21. package/core/instructions/security.instructions.md +0 -186
  22. package/core/personas/code-reviewer/SKILL.md +0 -175
  23. package/core/personas/security-reviewer/SKILL.md +0 -233
  24. package/core/personas/test-data-expert/SKILL.md +0 -234
  25. package/core/personas/test-generator/SKILL.md +0 -262
  26. package/core/traits/audit-trail.md +0 -182
  27. package/core/traits/confidence-signaling.md +0 -172
  28. package/core/traits/critical-thinking.md +0 -129
  29. package/core/traits/schema-awareness.md +0 -132
  30. package/core/traits/source-citation.md +0 -174
  31. package/core/traits/structured-output.md +0 -199
  32. package/docs/ci-cd-automation.md +0 -548
  33. package/docs/claude-code-reference/README.md +0 -21
  34. package/docs/claude-code-reference/agentboot-coverage.md +0 -484
  35. package/docs/claude-code-reference/feature-inventory.md +0 -906
  36. package/docs/cli-commands-audit.md +0 -112
  37. package/docs/cli-design.md +0 -924
  38. package/docs/concepts.md +0 -1117
  39. package/docs/config-schema-audit.md +0 -121
  40. package/docs/configuration.md +0 -645
  41. package/docs/delivery-methods.md +0 -758
  42. package/docs/developer-onboarding.md +0 -342
  43. package/docs/extending.md +0 -448
  44. package/docs/getting-started.md +0 -298
  45. package/docs/knowledge-layer.md +0 -464
  46. package/docs/marketplace.md +0 -822
  47. package/docs/org-connection.md +0 -570
  48. package/docs/plans/architecture.md +0 -2429
  49. package/docs/plans/design.md +0 -2018
  50. package/docs/plans/prd.md +0 -1862
  51. package/docs/plans/stack-rank.md +0 -261
  52. package/docs/plans/technical-spec.md +0 -2755
  53. package/docs/privacy-and-safety.md +0 -807
  54. package/docs/prompt-optimization.md +0 -1071
  55. package/docs/test-plan.md +0 -972
  56. package/docs/third-party-ecosystem.md +0 -496
  57. package/domains/compliance-template/README.md +0 -173
  58. package/domains/compliance-template/traits/compliance-aware.md +0 -228
  59. package/examples/enterprise/agentboot.config.json +0 -184
  60. package/examples/minimal/agentboot.config.json +0 -46
  61. package/tests/REGRESSION-PLAN.md +0 -705
  62. package/tests/TEST-PLAN.md +0 -111
  63. package/tests/cli.test.ts +0 -705
  64. package/tests/pipeline.test.ts +0 -608
  65. package/tests/validate.test.ts +0 -278
  66. package/tsconfig.json +0 -62
package/docs/test-plan.md DELETED
@@ -1,972 +0,0 @@
- # AgentBoot Test Plan
-
- How to test a system whose outputs are non-deterministic, whose users are both
- humans and AI agents, and whose value is measured in behavioral quality — not
- binary pass/fail.
-
- ---
-
- ## Two Test Boundaries
-
- There are two completely separate things to test, owned by two different parties:
-
- ```
- ┌──────────────────────────────────┐ ┌─────────────────────────────────────┐
- │ AgentBoot Core                   │ │ Acme-Boot (Org's Personas Repo)     │
- │ (this repo)                      │ │ (acme-corp/acme-personas)           │
- │                                  │ │                                     │
- │ Owner: AgentBoot maintainers     │ │ Owner: Acme's platform team         │
- │ Cost: AgentBoot's budget         │ │ Cost: Acme's budget                 │
- │                                  │ │                                     │
- │ What's tested:                   │ │ What's tested:                      │
- │ ├── compile.ts works             │ │ ├── Acme's custom personas behave   │
- │ ├── validate.ts catches errors   │ │ ├── Acme's traits compose right     │
- │ ├── sync.ts distributes right    │ │ ├── Acme's gotchas are accurate     │
- │ ├── lint rules are correct       │ │ ├── Acme's hooks enforce policy     │
- │ ├── CLI commands work            │ │ ├── Acme's domain layer works       │
- │ ├── Core personas are sane       │ │ └── Acme's org config is valid      │
- │ ├── Core traits compose          │ │                                     │
- │ └── Plugin export is valid       │ │ Uses: agentboot test, agentboot     │
- │                                  │ │       lint, agentboot validate      │
- │ Tests: vitest, CI on every PR    │ │ Tests: same tools, Acme's CI        │
- │ Budget: our CI costs             │ │ Budget: Acme's API key + CI costs   │
- └──────────────────────────────────┘ └─────────────────────────────────────┘
- ```
-
- **AgentBoot core** tests whether the build system, CLI, lint rules, and core
- personas work correctly. This is our responsibility, our cost, our CI.
-
- **Acme-Boot** tests whether the org's custom personas, traits, gotchas, hooks,
- and domain layers work correctly. This is the org's responsibility, their cost,
- their CI — using tools that AgentBoot provides.
-
- AgentBoot ships the testing tools (`agentboot test`, `agentboot lint`,
- `agentboot validate`). The org uses them on their content. AgentBoot tests
- that the tools themselves work. The org tests that their content works.
-
- ---
-
- ## The Testing Challenge
-
- AgentBoot core has three fundamentally different layers to test:
-
- | Layer | Nature | Testing Approach |
- |-------|--------|-----------------|
- | **Build system** (compile, validate, sync) | Deterministic code | Traditional unit/integration tests |
- | **Persona output** (SKILL.md, CLAUDE.md, agents, rules) | Static files | Schema validation, lint, structural tests |
- | **Persona behavior** (what the persona actually DOES when invoked) | Non-deterministic LLM output | Behavioral assertions, LLM-as-judge, snapshot regression |
-
- The first two are standard software testing. The third is the novel problem.
-
- ---
-
- ## Test Pyramid
-
- ```
-                  ╱╲
-                 ╱  ╲
-                ╱ E2E╲              Human review of persona output
-               ╱(rare)╲             in real repos. Manual, expensive.
-              ╱────────╲
-             ╱Behavioral╲           LLM invocation with known inputs.
-            ╱ (moderate) ╲          Assert on output patterns. ~$0.50/test.
-           ╱──────────────╲
-          ╱  Integration   ╲        Build pipeline produces correct output.
-         ╱   (frequent)     ╲       File structure, content, format. Free.
-        ╱────────────────────╲
-       ╱    Unit / Schema     ╲     Config validation, frontmatter parsing,
-      ╱    (very frequent)     ╲    trait composition, lint rules. Free.
-     ╱──────────────────────────╲
- ```
-
- Run the bottom layers on every commit. Run behavioral tests on every PR. Run E2E
- reviews manually on major persona changes.
-
- ---
-
- ## Layer 1: Unit & Schema Tests (Free, Fast, Every Commit)
-
- ### What to Test
-
- **Config validation:**
- ```typescript
- // tests/config.test.ts
- describe('agentboot.config.json', () => {
-   it('validates against JSON schema', () => { ... })
-   it('rejects unknown fields', () => { ... })
-   it('requires org field', () => { ... })
-   it('validates group/team references match', () => { ... })
-   it('validates persona IDs exist in core/personas/', () => { ... })
-   it('validates trait IDs exist in core/traits/', () => { ... })
- })
- ```
-
- **Frontmatter parsing:**
- ```typescript
- // tests/frontmatter.test.ts
- describe('SKILL.md frontmatter', () => {
-   it('parses all persona SKILL.md files without error', () => { ... })
-   it('requires name field', () => { ... })
-   it('requires description field', () => { ... })
-   it('validates trait references resolve', () => { ... })
-   it('validates weight values (HIGH/MEDIUM/LOW or 0.0-1.0)', () => { ... })
- })
- ```
-
- **Trait composition:**
- ```typescript
- // tests/composition.test.ts
- describe('trait composition', () => {
-   it('inlines trait content at injection markers', () => { ... })
-   it('resolves HIGH/MEDIUM/LOW to numeric weights', () => { ... })
-   it('errors on missing trait reference', () => { ... })
-   it('errors on circular trait dependency', () => { ... })
-   it('composes multiple traits in declared order', () => { ... })
- })
- ```
-
- **Lint rules:**
- ```typescript
- // tests/lint.test.ts
- describe('lint rules', () => {
-   it('detects vague language ("be thorough", "try to")', () => { ... })
-   it('detects prompts exceeding token budget', () => { ... })
-   it('detects credentials in prompt text', () => { ... })
-   it('detects conflicting instructions across traits', () => { ... })
-   it('detects unused traits', () => { ... })
-   it('passes clean persona files', () => { ... })
- })
- ```
-
- **Sync logic:**
- ```typescript
- // tests/sync.test.ts
- describe('sync', () => {
-   it('writes CC-native output to claude-code platform repos', () => { ... })
-   it('writes cross-platform output to copilot platform repos', () => { ... })
-   it('merges org + group + team scopes correctly', () => { ... })
-   it('team overrides group on optional behaviors', () => { ... })
-   it('org wins on mandatory behaviors', () => { ... })
-   it('writes .agentboot-manifest.json tracking managed files', () => { ... })
-   it('generates PERSONAS.md registry', () => { ... })
- })
- ```
-
- ### Tooling
-
- - **Test runner:** vitest (already in package.json)
- - **Assertion:** vitest built-in + custom matchers for frontmatter, token counting
- - **Fixtures:** `tests/fixtures/` with valid and invalid persona files
- - **CI:** Runs on every commit and PR. Must pass to merge.
-
- ---
-
- ## Layer 2: Integration Tests (Free, Moderate Speed, Every PR)
-
- ### What to Test
-
- **Full build pipeline:**
- ```typescript
- // tests/integration/build-pipeline.test.ts
- describe('full build', () => {
-   it('validate → compile → sync produces expected output', () => {
-     // Given: a test agentboot.config.json + personas + traits
-     // When: run the full pipeline
-     // Then: dist/ contains expected files with expected content
-   })
-
-   it('CC-native output has correct agent CLAUDE.md frontmatter', () => {
-     // Check: name, description, model, permissionMode, maxTurns,
-     // disallowedTools, skills, hooks, memory
-   })
-
-   it('CC-native output uses @imports not inlined traits', () => {
-     // Check: CLAUDE.md contains @.claude/traits/critical-thinking.md
-     // NOT the full trait content
-   })
-
-   it('cross-platform output has standalone inlined SKILL.md', () => {
-     // Check: SKILL.md contains full trait content, no @imports
-   })
-
-   it('settings.json has hook entries from domain config', () => { ... })
-   it('.mcp.json has server entries from domain config', () => { ... })
-   it('rules have paths: frontmatter (not globs:)', () => { ... })
- })
- ```
-
- **Plugin export:**
- ```typescript
- // tests/integration/plugin-export.test.ts
- describe('plugin export', () => {
-   it('produces valid plugin structure', () => {
-     // .claude-plugin/plugin.json exists with correct name, version
-     // agents/, skills/, hooks/ at root level (not inside .claude-plugin/)
-     // marketplace.json valid if marketplace export
-   })
-
-   it('passes claude plugin validate', () => {
-     // Run: claude plugin validate ./dist/plugin
-     // Exit code 0
-   })
- })
- ```
-
- **Discover + ingest:**
- ```typescript
- // tests/integration/discover.test.ts
- describe('discover', () => {
-   it('finds CLAUDE.md files in test repo structure', () => { ... })
-   it('finds .cursorrules and copilot-instructions.md', () => { ... })
-   it('identifies near-duplicate content across repos', () => { ... })
-   it('generates migration plan with correct classifications', () => { ... })
-   it('does not modify source files (non-destructive)', () => { ... })
- })
- ```
-
- **Uninstall:**
- ```typescript
- // tests/integration/uninstall.test.ts
- describe('uninstall', () => {
-   it('removes only files listed in .agentboot-manifest.json', () => { ... })
-   it('preserves files not managed by AgentBoot', () => { ... })
-   it('warns on modified managed files', () => { ... })
-   it('restores pre-AgentBoot archive when requested', () => { ... })
-   it('handles mixed content in CLAUDE.md', () => { ... })
- })
- ```
-
- ### Tooling
-
- - **Test runner:** vitest
- - **Filesystem:** Use temp directories (vitest's `tmpdir` or `os.tmpdir()`)
- - **Git fixtures:** Init test repos with known content for discover/sync tests
- - **CI:** Runs on every PR. Must pass to merge.
-
- ---
-
- ## Layer 3: Behavioral Tests (LLM Call, ~$0.50/test, Every PR to Personas)
-
- This is where it gets interesting. Testing whether a persona *behaves* correctly
- requires actually invoking it.
-
- ### The Testing Model
-
- ```
- Known input            Persona           Output            Assert
- (crafted code     →    (invoked via  →   (structured   →   (pattern match
-  with known bugs)       claude -p)        findings)         against expected)
- ```
-
- ### Test File Format
-
- ```yaml
- # tests/behavioral/code-reviewer.test.yaml
-
- persona: code-reviewer
- model: haiku  # Use cheapest model for tests (behavior, not quality)
- max_turns: 5
- max_budget_usd: 0.50
-
- setup:
-   # Create test files that the persona will review
-   files:
-     - path: src/api/users.ts
-       content: |
-         export async function getUser(userId) {
-           const query = `SELECT * FROM users WHERE id = ${userId}`;
-           return db.execute(query);
-         }
-
- cases:
-   - name: catches-sql-injection
-     prompt: "Review the file src/api/users.ts"
-     expect:
-       findings_min: 1
-       severity_includes: [CRITICAL, ERROR]
-       text_matches:
-         - pattern: "SQL injection|parameterized|prepared statement"
-           in: findings
-       confidence_min: 0.7
-
-   - name: no-false-positives-on-safe-code
-     setup_override:
-       files:
-         - path: src/api/users.ts
-           content: |
-             export async function getUser(userId: number) {
-               return db.execute('SELECT * FROM users WHERE id = $1', [userId]);
-             }
-     prompt: "Review the file src/api/users.ts"
-     expect:
-       findings_max: 0
-       severity_excludes: [CRITICAL, ERROR]
-
-   - name: structured-output-format
-     prompt: "Review the file src/api/users.ts"
-     expect:
-       output_contains:
-         - "CRITICAL"  # or "ERROR" / "WARN" / "INFO" — severity labels
-         - "src/api/users.ts"  # File reference
-       output_structure:
-         has_sections: [findings, summary]
- ```
-
- ### Test Runner
-
- ```bash
- # Run all behavioral tests
- agentboot test --type behavioral
-
- # Run for one persona
- agentboot test --type behavioral --persona code-reviewer
-
- # Use a specific model (override test file)
- agentboot test --type behavioral --model sonnet
-
- # Cost cap for entire test suite
- agentboot test --type behavioral --max-budget 5.00
-
- # CI mode (exit codes, JSON summary)
- agentboot test --type behavioral --ci
- ```
-
- Under the hood, each test case runs:
-
- ```bash
- claude -p \
-   --agent code-reviewer \
-   --output-format json \
-   --max-turns 5 \
-   --max-budget-usd 0.50 \
-   --permission-mode bypassPermissions \
-   --no-session-persistence \
-   "$PROMPT"
- ```
-
- The runner parses the JSON output and evaluates the `expect` assertions.
-
- ### Assertion Types
-
- | Assertion | What it checks | Example |
- |---|---|---|
- | `findings_min: N` | At least N findings | Persona found the bug |
- | `findings_max: N` | At most N findings | No false positives |
- | `severity_includes: [X]` | At least one finding has severity X | SQL injection flagged as CRITICAL |
- | `severity_excludes: [X]` | No findings have severity X | Clean code doesn't trigger ERROR |
- | `text_matches: [{pattern}]` | Regex match in output | "SQL injection" mentioned |
- | `text_excludes: [{pattern}]` | Regex must NOT match | Didn't hallucinate a finding |
- | `confidence_min: N` | All findings have confidence ≥ N | Persona is sure about SQL injection |
- | `output_contains: [X]` | Output includes literal strings | File reference present |
- | `output_structure: {}` | Structural checks on output | Has findings and summary sections |
- | `json_schema: path` | Output matches JSON schema | Structured output validates |
- | `token_max: N` | Output stays within token budget | Persona isn't verbose |
- | `duration_max_ms: N` | Execution time limit | Persona doesn't run away |
-
- ### Non-Determinism Strategy
-
- LLM output is non-deterministic. The same input may produce different findings across
- runs. The testing strategy:
-
- 1. **Test for patterns, not exact output.** Don't assert "the finding text is exactly
-    X." Assert "the output contains a CRITICAL finding mentioning SQL injection."
-
- 2. **Test obvious cases.** Use inputs where any competent reviewer would find the
-    issue. A SQL query built by string interpolation is an obvious SQL
-    injection — every run should catch it.
-
- 3. **Allow flake tolerance.** Run each behavioral test 3 times. Pass if 2/3 pass.
-    This handles the rare case where the model misses something obvious. Configure:
-    ```yaml
-    flake_tolerance: 2 of 3  # Pass if 2 of 3 runs succeed
-    ```
-
- 4. **Use cheap models for behavioral tests.** Haiku is sufficient to test whether a
-    persona's prompt structure elicits the right behavior. If the prompt is good enough
-    to work on Haiku, it'll work better on Sonnet/Opus. If it fails on Haiku, the
-    prompt needs work regardless of model.
-
- 5. **Separate "does it work" from "how well does it work."** Behavioral tests check
-    "does the persona catch the SQL injection?" (binary). Quality evaluation ("did it
-    explain the fix well?") is a separate concern — see Layer 5.
-
- ---
-
- ## Layer 4: Snapshot / Regression Tests ($, Periodic)
-
- Compare persona output across versions to detect regressions.
-
- ### How It Works
-
- ```bash
- # Generate baseline snapshots
- agentboot test --type snapshot --update
-
- # Compare current output against baseline
- agentboot test --type snapshot
- ```
-
- The snapshot test:
- 1. Runs each persona against a fixed set of test inputs
- 2. Saves the structured output (findings, severities, count) as a snapshot
- 3. On subsequent runs, compares current output against the snapshot
- 4. Flags differences for human review
-
- ```
- $ agentboot test --type snapshot
-
- Snapshot Comparison: code-reviewer
- ──────────────────────────────────
-
- Test: sql-injection-detection
-   Baseline: 1 CRITICAL (SQL injection)
-   Current:  1 CRITICAL (SQL injection) + 1 WARN (missing type annotation)
-   Status:   CHANGED — new finding added
-   → Is the new WARN correct? [y = update snapshot / n = investigate]
-
- Test: clean-code-no-findings
-   Baseline: 0 findings
-   Current:  0 findings
-   Status:   MATCH ✓
-
- Test: auth-middleware-review
-   Baseline: 1 ERROR (missing auth check) + 2 WARN
-   Current:  0 findings
-   Status:   REGRESSION ⚠️ — previously caught ERROR now missed
-   → Investigate: trait change? prompt change? model change?
- ```
-
- ### When to Run
-
- - **After any persona prompt change** — did the edit improve or regress behavior?
- - **After trait updates** — did changing `critical-thinking` affect review quality?
- - **After model changes** — does the persona work as well on Sonnet as it did on Opus?
- - **Periodically (weekly)** — catch drift from model updates by the provider
-
- ### What Snapshots Contain
-
- Snapshots store structured summaries, not full output:
-
- ```json
- {
-   "persona": "code-reviewer",
-   "test_case": "sql-injection-detection",
-   "snapshot_date": "2026-03-19",
-   "model": "haiku",
-   "findings_count": { "CRITICAL": 1, "ERROR": 0, "WARN": 0, "INFO": 0 },
-   "finding_patterns": ["SQL injection", "parameterized"],
-   "total_tokens": 1200,
-   "duration_ms": 8500
- }
- ```
-
- Not the full prose output — just the structural signature. This makes comparison
- reliable across non-deterministic runs.
-
- ---
-
- ## Layer 5: LLM-as-Judge ($$, Major Changes Only)
-
- For qualitative evaluation that can't be reduced to pattern matching: "Is this review
- actually good? Is it thorough? Would a senior engineer agree with it?"
-
- ### How It Works
-
- A separate LLM call evaluates the persona's output:
-
- ```yaml
- # tests/eval/code-reviewer-quality.eval.yaml
-
- persona_under_test: code-reviewer
- judge_model: opus  # Use strongest model as judge
- max_budget_usd: 2.00
-
- cases:
-   - name: review-quality-auth-endpoint
-     input_file: tests/fixtures/auth-endpoint-with-bugs.ts
-     persona_prompt: "Review this file"
-     judge_prompt: |
-       You are a senior staff engineer evaluating the quality of an AI code review.
-
-       The code being reviewed:
-       {input}
-
-       The review produced:
-       {persona_output}
-
-       Evaluate on these dimensions (1-5 scale):
-       1. Completeness: Did it find the important issues?
-       2. Accuracy: Are the findings correct? Any false positives?
-       3. Specificity: Are suggestions actionable with file:line references?
-       4. Prioritization: Are severity levels appropriate?
-       5. Tone: Professional, constructive, not pedantic?
-
-       Known issues in the code (ground truth):
-       - SQL injection on line 12
-       - Missing rate limiting on POST endpoint
-       - Auth token not validated for expiry
-
-       Score each dimension 1-5. Explain your reasoning. Then give an overall
-       pass/fail: does this review meet the bar for a senior engineer's review?
-     expect:
-       judge_score_min:
-         completeness: 3
-         accuracy: 4
-       overall: "pass"
- ```
-
- ### When to Use
-
- - **Major persona prompt rewrites** — did the rewrite improve quality?
- - **New personas** — does the new persona meet the bar before shipping?
- - **Model migration** — switching from Opus to Sonnet — does quality hold?
- - **Quarterly quality audits** — periodic check on the full persona suite
-
- ### Cost Control
-
- LLM-as-judge is expensive (Opus as judge + persona invocation). Budget it:
-
- ```bash
- agentboot test --type eval --max-budget 20.00
-
- # Only run for specific personas
- agentboot test --type eval --persona security-reviewer
-
- # Skip if behavioral tests already passed (cascade)
- agentboot test --type eval --skip-if-behavioral-passed
- ```
-
- ---
-
- ## Layer 6: Human Review (Manual, Major Releases Only)
-
- The human is always in the loop for judgment calls that no automated test can make.
-
- ### When Humans Review
-
- | Trigger | What they review | Who |
- |---|---|---|
- | New persona ships | Full output on 3-5 real PRs | Platform team + domain expert |
- | Major trait change | Before/after comparison on real code | Platform team |
- | Quarterly audit | Random sample of 20 persona outputs | Platform team |
- | Quality escalation | Specific finding that a developer disputed | Persona author |
-
- ### How to Make It Efficient
-
- **The review tool:** `agentboot review` generates a side-by-side comparison:
-
- ```bash
- agentboot review --persona code-reviewer --sample 5
-
- # Human Review: code-reviewer (v1.3.0)
- # ──────────────────────────────────────
- #
- # Reviewing 5 randomly sampled outputs from the last 7 days.
- #
- # Sample 1/5: PR #234 (api-service)
- # ├── Findings: 1 ERROR, 3 WARN, 2 INFO
- # ├── [Show findings]
- # ├── [Show code context]
- # │
- # ├── Was this review accurate? [Yes] [Partially] [No]
- # ├── Were severity levels correct? [Yes] [Partially] [No]
- # ├── Would you add anything? [No] [Yes: ___]
- # └── Would you remove anything? [No] [Yes: ___]
- #
- # After all 5 samples:
- # ├── Overall quality score: ___/5
- # ├── Recommendation: [Ship as-is] [Needs tuning] [Needs rewrite]
- # └── Notes: ___
- ```
-
- This takes 10-15 minutes per persona. Not zero effort, but structured and focused.
- The reviewer isn't reading through raw sessions — they're evaluating curated samples
- with guided questions.
-
- **The cadence:** Platform team spends 1 hour/month reviewing persona quality.
- That's 4 personas × 15 minutes. The structured review tool makes this sustainable.
-
- ---
-
- ## Test Infrastructure
-
- ### CI Pipeline
-
- ```yaml
- # .github/workflows/agentboot-tests.yml
- name: AgentBoot Tests
- on:
-   push:
-     branches: [main]
-   pull_request:
-
- jobs:
-   unit-and-schema:
-     runs-on: ubuntu-latest
-     steps:
-       - uses: actions/checkout@v4
-       - uses: actions/setup-node@v4
-       - run: npm ci
-       - run: agentboot validate --strict
-       - run: agentboot lint --severity error
-       - run: npm run test  # vitest unit + integration
-
-   behavioral:
-     if: github.event_name == 'pull_request'
-     needs: unit-and-schema
-     runs-on: ubuntu-latest
-     steps:
-       - uses: actions/checkout@v4
-       - uses: actions/setup-node@v4
-       - run: npm ci
-       - run: agentboot test --type behavioral --ci --max-budget 5.00
-         env:
-           ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
-
-   snapshot:
-     if: contains(github.event.pull_request.labels.*.name, 'persona-change')
-     needs: behavioral
-     runs-on: ubuntu-latest
-     steps:
-       - uses: actions/checkout@v4
-       - uses: actions/setup-node@v4
-       - run: npm ci
-       - run: agentboot test --type snapshot --ci
-         env:
-           ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
- ```
-
- ### Test Triggers
-
- | Layer | Trigger | Cost | Time |
- |-------|---------|------|------|
- | Unit / Schema | Every commit | Free | <10s |
- | Integration | Every commit | Free | <30s |
- | Behavioral | Every PR | ~$5 | ~2min |
- | Snapshot | PRs labeled `persona-change` | ~$5 | ~2min |
- | LLM-as-judge | Major changes (manual trigger) | ~$20 | ~5min |
- | Human review | Monthly / major release | Staff time | ~1hr |
-
- ### Cost Budget
-
- Monthly testing cost for a personas repo with 4 personas:
- - Unit/integration: **$0** (no API calls)
- - Behavioral: ~20 PRs/month × $5 = **$100/month**
- - Snapshot: ~5 persona changes/month × $5 = **$25/month**
- - LLM-as-judge: ~2 major changes/month × $20 = **$40/month**
- - **Total: ~$165/month** for automated testing
-
- That's less than one developer-hour of manual review — and it runs on every PR.
-
- ---
-
- ## Testing the Tests
-
- ### How Do You Know Your Behavioral Tests Are Good?
-
- **Mutation testing for personas.** Deliberately introduce known bugs into the
- persona prompt and verify that tests catch the regression:
-
- ```bash
- agentboot test --type mutation --persona code-reviewer
-
- # Mutation Testing: code-reviewer
- # ────────────────────────────────
- #
- # Mutation 1: Remove "SQL injection" from review checklist
- #   Expected: catches-sql-injection test FAILS
- #   Actual:   catches-sql-injection test FAILED ✓ (mutation caught)
- #
- # Mutation 2: Change severity threshold (ERROR → INFO)
- #   Expected: severity_includes assertion FAILS
- #   Actual:   severity_includes assertion FAILED ✓ (mutation caught)
- #
- # Mutation 3: Remove output format specification
- #   Expected: structured-output-format test FAILS
- #   Actual:   structured-output-format test PASSED ✗ (mutation NOT caught)
- #   → Your test doesn't verify output structure strictly enough
- #
- # Mutation score: 2/3 (67%)
- # → Consider adding stricter output structure assertions
- ```
-
- This is the "who tests the tests?" answer: mutations verify that tests actually
- detect the regressions they're supposed to detect.
-
- ---
-
- ## Agents Testing Agents
-
- ### The Philosophy
-
- AgentBoot personas are AI agents. Testing them with AI (behavioral tests, LLM-as-judge)
- is "agents testing agents." This is the right approach because:
-
- 1. **The output space is too large for handwritten assertions.** A code review can
-    produce thousands of different valid outputs. Pattern matching covers the obvious
-    cases; LLM-as-judge evaluates the nuanced ones.
-
- 2. **The evaluation criteria are subjective.** "Is this review thorough?" requires
-    judgment. LLM-as-judge applies consistent judgment criteria at scale.
-
- 3. **The cost is proportional to the value.** Testing a persona costs ~$0.50-$2.00.
-    A bad persona wasting $100/day in developer time and false positives costs far more.
-
- ### The Safeguard: Humans Always in the Loop
-
- AI-generated test results are **advisory, not authoritative.** The pipeline:
-
- ```
- Automated tests run → Results posted to PR → Human reviews before merge
- ```
-
- If behavioral tests pass and snapshot is stable, the human review is fast ("looks
- good, ship it"). If something fails, the human investigates. The automation removes
- burden, not judgment.
-
- **What humans decide that automation cannot:**
- - Is this new finding a genuine improvement or a new false positive?
- - Does this persona's tone match the org's culture?
- - Is this severity calibration appropriate for our risk tolerance?
- - Should we ship this persona change even though a snapshot changed? (sometimes yes)
-
- The test suite produces evidence. Humans make decisions. This is the "humans always
- in the loop" principle applied to testing.
-
- ---
737
-
738
- ## What Acme Tests (Org's Responsibility)
739
-
740
- When Acme's platform team creates their personas repo from AgentBoot, they inherit
741
- the testing tools but run them on their own content with their own CI and API keys.
742
-
743
- ### Acme's Test Layers
744
-
745
- | Layer | What Acme tests | Tool | Cost to Acme |
746
- |-------|----------------|------|-------------|
747
- | Schema/Lint | Their agentboot.config.json, custom persona frontmatter, custom traits | `agentboot validate`, `agentboot lint` | Free |
748
- | Build | Their personas compile without errors, sync produces expected output | `agentboot build --validate-only` | Free |
749
- | Behavioral | Their custom personas find the bugs they should find | `agentboot test --type behavioral` | ~$5/PR (Acme's API key) |
750
- | Snapshot | Their persona changes don't regress | `agentboot test --type snapshot` | ~$5/change (Acme's API key) |
751
- | Human review | Their personas produce quality output | `agentboot review` | Staff time |
752
-
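For orientation, a behavioral test for one of Acme's custom personas might look
roughly like this. The exact YAML schema belongs to `agentboot test`; the field
names and fixture paths below are illustrative, not authoritative:

```yaml
# tests/behavioral/code-reviewer.test.yaml (illustrative field names)
persona: code-reviewer
cases:
  - name: catches-hardcoded-secret
    input: fixtures/hardcoded-secret.ts   # fixture with a planted, known bug
    expect:
      mentions: ["hardcoded", "secret"]   # the review must flag the planted issue
      min_severity: high
  - name: clean-code-stays-clean
    input: fixtures/clean-module.ts       # fixture with no planted bugs
    expect:
      max_findings: 0                     # guards against false positives
```

The two cases mirror the two failure modes the layers above guard against: missed
bugs and false positives.
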
### What Acme's CI Looks Like

```yaml
# In acme-corp/acme-personas/.github/workflows/tests.yml
name: Acme Persona Tests
on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: agentboot validate --strict
      - run: agentboot lint --severity error

  behavioral:
    if: github.event_name == 'pull_request'
    needs: validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: agentboot test --type behavioral --ci --max-budget 10.00
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ACME_ANTHROPIC_KEY }}  # Acme's key, Acme's cost
```

AgentBoot provides the workflow template. Acme fills in their API key and adjusts
the budget. The testing tools are the same; the content and cost are separate.

---
## "Is This My Bug or AgentBoot's Bug?"

When something goes wrong, Acme needs to know: is the problem in their persona
content (their fix) or in AgentBoot's build system (our fix)?

### The Diagnostic: `agentboot doctor --diagnose`

```bash
$ agentboot doctor --diagnose

Diagnosing: code-reviewer persona producing empty output
─────────────────────────────────────────────────────────

Step 1: Core validation
  ✓ AgentBoot core version 1.2.0 (latest)
  ✓ Core traits compile without errors
  ✓ Core code-reviewer persona compiles without errors
  ✓ Core code-reviewer passes behavioral tests (3/3)
  → Core is healthy. If the problem is in the core code-reviewer,
    it's not reproducing with the default config.

Step 2: Org layer validation
  ✓ agentboot.config.json valid
  ✓ All custom traits compile
  ✗ Custom extension for code-reviewer has an error:
      extensions/code-reviewer.md references trait "acme-standards"
      which doesn't exist in core/traits/ or Acme's custom traits.
  → LIKELY CAUSE: missing trait reference in Acme's extension

Step 3: Compiled output check
  ✗ Compiled code-reviewer SKILL.md is 0 bytes
  → Build failed silently due to the missing trait reference

═══════════════════════════════════════════════════════

Diagnosis: ACME CONTENT ISSUE
  The missing trait reference in extensions/code-reviewer.md causes
  the build to produce empty output.

  Fix: Either create core/traits/acme-standards.md or remove the
  reference from extensions/code-reviewer.md

  If you believe this is an AgentBoot bug (the build should NOT produce
  empty output on a missing trait — it should error), file an issue:
  → agentboot issue "Build produces empty output instead of error on missing trait"
```

### The Isolation Test

The doctor runs a **layered isolation test** to pinpoint where the problem is:

```
Layer 1: AgentBoot core only (no org content)
  → Does the core persona work with zero customization?
  → If NO: AgentBoot bug. File an issue.
  → If YES: continue.

Layer 2: Core + org config (no custom personas/traits)
  → Does the core persona work with Acme's agentboot.config.json?
  → If NO: Config issue. Check config.
  → If YES: continue.

Layer 3: Core + org config + org traits
  → Do Acme's custom traits compose without errors?
  → If NO: Trait issue. Check Acme's traits.
  → If YES: continue.

Layer 4: Core + org config + org traits + org personas
  → Do Acme's custom personas compile and lint?
  → If NO: Persona issue. Check Acme's persona definitions.
  → If YES: continue.

Layer 5: Core + org config + org traits + org personas + org extensions
  → Does the full stack work?
  → If NO: Extension issue. Check Acme's extensions.
  → If YES: problem is elsewhere (model, API, environment).
```

Each layer adds one piece. The layer where it breaks is the layer that has the bug.
If Layer 1 breaks, it's AgentBoot's problem. If Layer 3 breaks, it's Acme's traits.

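The layered walk can be sketched as a loop that enables one layer at a time and
stops at the first failure. This is a minimal illustration, not AgentBoot's
implementation; `buildWith` is a hypothetical stand-in for whatever actually
compiles the stack:

```typescript
type Layer = { name: string; owner: "AgentBoot" | "Org" };

// Each layer adds one piece of org content on top of the last.
const layers: Layer[] = [
  { name: "core only", owner: "AgentBoot" },
  { name: "core + org config", owner: "Org" },
  { name: "core + org config + org traits", owner: "Org" },
  { name: "... + org personas", owner: "Org" },
  { name: "... + org extensions", owner: "Org" },
];

// buildWith is hypothetical: returns true if the build succeeds with
// everything up to and including this layer enabled.
function isolate(buildWith: (layerIndex: number) => boolean): string {
  for (let i = 0; i < layers.length; i++) {
    if (!buildWith(i)) {
      return `Breaks at layer ${i + 1} (${layers[i].name}): likely owner ${layers[i].owner}`;
    }
  }
  return "All layers pass: problem is elsewhere (model, API, environment)";
}

// Example: a stack that starts failing once org traits are added
console.log(isolate((i) => i < 2));
// "Breaks at layer 3 (core + org config + org traits): likely owner Org"
```

The early return is the point: the first failing layer, not the last, names the
owner, because every earlier layer has already been shown to work.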
### `agentboot issue` — Streamlined Bug Reporting

When the diagnosis points to an AgentBoot bug, one command files it:

```bash
agentboot issue "Build produces empty output instead of error on missing trait"

# Filing issue against agentboot/agentboot
#
# Title: Build produces empty output instead of error on missing trait
#
# Auto-attached:
# ├── AgentBoot version: 1.2.0
# ├── Node version: 22.1.0
# ├── OS: macOS 15.3
# ├── Diagnosis output: (attached)
# ├── agentboot.config.json: (attached, org-specific values redacted)
# ├── Relevant error logs: (attached)
# │
# ├── NOT attached (privacy):
# │   ├── Org persona content
# │   ├── Custom trait content
# │   ├── Developer prompts
# │   └── Session transcripts
#
# Open issue in browser? [Y/n]
```

The issue command:
- Attaches environment info and diagnosis output
- Redacts org-specific content (persona text, trait content, internal URLs)
- Includes the config structure (field names and types, not values)
- Never includes developer prompts or session data
- Opens in browser for the user to review before submitting

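"Field names and types, not values" can be implemented by walking the config
object and replacing every leaf with its type name. A minimal sketch of that
redaction step (a hypothetical helper, not the command's actual code):

```typescript
// Replace every leaf value with its type name, preserving structure.
function redactConfig(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redactConfig);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(
        ([k, v]) => [k, redactConfig(v)] as [string, unknown]
      )
    );
  }
  return typeof value; // "string", "number", "boolean", ...
}

// Hypothetical org config for illustration
const config = {
  org: "acme-corp",
  personas: ["code-reviewer"],
  budget: { maxPerRun: 10.0 },
};

console.log(JSON.stringify(redactConfig(config), null, 2));
// Field names survive; values become type names, e.g. "org": "string"
```

A maintainer can see that `budget.maxPerRun` exists and is a number, which is
usually enough to reproduce a config-shape bug without seeing Acme's values.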
### When It's Ambiguous

Sometimes the bug is in the boundary — AgentBoot's build system should have caught
an error in Acme's content but didn't. Example: Acme writes a persona with a
circular trait reference. The build system should error; instead it loops forever.

This is an **AgentBoot bug** (the build system should validate and reject) even
though the root cause is in Acme's content (the circular reference). The fix goes
into AgentBoot core (add circular reference detection to validate.ts), and Acme
fixes their content.

The diagnostic output makes this clear:

```
Diagnosis: AGENTBOOT BUG (validation gap)
  Acme's content has a circular trait reference (A → B → A).
  AgentBoot's validator should catch this but doesn't.

  Workaround: Remove the circular reference in Acme's trait.
  Fix: AgentBoot should add circular reference detection.
  → agentboot issue "Validator doesn't catch circular trait references"
```

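The fix the diagnostic asks for is a standard graph cycle check over the trait
dependency map. A minimal sketch of what such a check in validate.ts could look
like; the `Record<string, string[]>` dependency-map shape is an assumption for
illustration, not AgentBoot's real data model:

```typescript
// deps maps each trait to the traits it references.
// Returns the cycle as a path (e.g. ["A", "B", "A"]) or null if acyclic.
function findCycle(deps: Record<string, string[]>): string[] | null {
  const visiting = new Set<string>(); // on the current DFS path
  const done = new Set<string>();     // fully explored, known cycle-free

  function dfs(trait: string, path: string[]): string[] | null {
    if (visiting.has(trait)) return [...path.slice(path.indexOf(trait)), trait];
    if (done.has(trait)) return null;
    visiting.add(trait);
    for (const dep of deps[trait] ?? []) {
      const cycle = dfs(dep, [...path, trait]);
      if (cycle) return cycle;
    }
    visiting.delete(trait);
    done.add(trait);
    return null;
  }

  for (const trait of Object.keys(deps)) {
    const cycle = dfs(trait, []);
    if (cycle) return cycle;
  }
  return null;
}

// The case from the diagnostic above: A → B → A
console.log(findCycle({ A: ["B"], B: ["A"] })); // [ 'A', 'B', 'A' ]
```

Because the validator reports the cycle instead of following it, the build fails
fast with an actionable message rather than looping forever.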
### The General Rule

| Symptom | Likely Owner |
|---------|--------------|
| Build system crashes | AgentBoot |
| Build produces wrong file structure | AgentBoot |
| Validator doesn't catch invalid content | AgentBoot |
| Lint rule has false positives/negatives | AgentBoot |
| CLI command doesn't work | AgentBoot |
| Core persona produces bad output | AgentBoot (prompt quality) or Anthropic (model regression) |
| Custom persona produces bad output | Org's persona content |
| Custom trait doesn't compose correctly | Org's trait (unless build system is wrong) |
| Custom extension is ignored | Org's extension path/format (unless sync is broken) |
| Sync writes wrong files | AgentBoot |
| Sync writes right files but persona behaves wrong | Org's persona content |
| Gotcha doesn't activate on matching files | Check `paths:` patterns (org) then check rule loading (AgentBoot) |
| Hook doesn't fire | Check hook config (org) then check hook system (AgentBoot) |
| Plugin doesn't install | Check plugin structure (org) then check export (AgentBoot) |

---
945
-
946
- ## What AgentBoot Needs to Build
947
-
948
- | Component | Phase | Cost |
949
- |-----------|-------|------|
950
- | Unit tests (config, frontmatter, composition, lint) | V1 | Free |
951
- | Integration tests (build pipeline, sync, plugin export) | V1 | Free |
952
- | Test fixtures (valid/invalid personas, known-buggy code) | V1 | Free |
953
- | `agentboot test --type deterministic` runner | V1 | Free |
954
- | CI workflow template | V1 | Free |
955
- | Behavioral test format (YAML) + runner | V1.5 | ~$5/run |
956
- | `agentboot test --type behavioral` with `claude -p` | V1.5 | ~$5/run |
957
- | Snapshot test format + runner | V1.5 | ~$5/run |
958
- | Flake tolerance (2-of-3 runs) | V1.5 | 3x cost |
959
- | LLM-as-judge eval format + runner | V2 | ~$20/run |
960
- | `agentboot review` (human review tool) | V2 | Staff time |
961
- | Mutation testing for personas | V2+ | ~$15/run |
962
- | GitHub Actions reusable workflow for tests | V1.5 | Free |
963
- | `agentboot doctor --diagnose` (layered isolation) | V1 | Free |
964
- | `agentboot issue` (streamlined bug reporting) | V1.5 | Free |
965
- | Org CI workflow template (acme runs on their content) | V1 | Free |
966
-
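One row worth unpacking: flake tolerance. Behavioral tests against a model are
nondeterministic, so a case should only count as failing when it fails a majority
of runs. A minimal sketch of the 2-of-3 voting rule (a hypothetical helper, which
also shows why the roadmap prices it at 3x cost):

```typescript
// Run a nondeterministic test up to three times; pass on a 2-of-3 majority.
// Early exit keeps the average cost under the full 3x worst case.
function passesTwoOfThree(run: () => boolean): boolean {
  let passes = 0;
  let fails = 0;
  for (let i = 0; i < 3; i++) {
    if (run()) passes++; else fails++;
    if (passes === 2) return true;
    if (fails === 2) return false;
  }
  return passes >= 2; // unreachable with 3 runs; kept for clarity
}

// Example: a flaky case that fails its first run but passes the next two
let attempts = 0;
console.log(passesTwoOfThree(() => ++attempts > 1)); // true
```

In the real runner each `run()` would be a `claude -p` invocation, which is where
the 3x API cost comes from.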
967
- ---
968
-
969
- *See also:*
970
- - [`docs/prompt-optimization.md`](prompt-optimization.md#6-prompt-testing-agentboot-test) — test types and YAML format
971
- - [`docs/ci-cd-automation.md`](ci-cd-automation.md) — `claude -p` flags for CI
972
- - [`docs/claude-code-reference/feature-inventory.md`](claude-code-reference/feature-inventory.md) — CLI flags