agentboot 0.1.0
- package/.github/ISSUE_TEMPLATE/persona-request.md +62 -0
- package/.github/ISSUE_TEMPLATE/quality-feedback.md +67 -0
- package/.github/workflows/cla.yml +25 -0
- package/.github/workflows/validate.yml +49 -0
- package/.idea/agentboot.iml +9 -0
- package/.idea/misc.xml +6 -0
- package/.idea/modules.xml +8 -0
- package/.idea/vcs.xml +6 -0
- package/CLA.md +98 -0
- package/CLAUDE.md +230 -0
- package/CONTRIBUTING.md +168 -0
- package/LICENSE +191 -0
- package/NOTICE +4 -0
- package/PERSONAS.md +156 -0
- package/README.md +172 -0
- package/agentboot.config.json +207 -0
- package/bin/agentboot.js +17 -0
- package/core/gotchas/README.md +35 -0
- package/core/instructions/baseline.instructions.md +133 -0
- package/core/instructions/security.instructions.md +186 -0
- package/core/personas/code-reviewer/SKILL.md +175 -0
- package/core/personas/code-reviewer/persona.config.json +11 -0
- package/core/personas/security-reviewer/SKILL.md +233 -0
- package/core/personas/security-reviewer/persona.config.json +11 -0
- package/core/personas/test-data-expert/SKILL.md +234 -0
- package/core/personas/test-data-expert/persona.config.json +10 -0
- package/core/personas/test-generator/SKILL.md +262 -0
- package/core/personas/test-generator/persona.config.json +10 -0
- package/core/traits/audit-trail.md +182 -0
- package/core/traits/confidence-signaling.md +172 -0
- package/core/traits/critical-thinking.md +129 -0
- package/core/traits/schema-awareness.md +132 -0
- package/core/traits/source-citation.md +174 -0
- package/core/traits/structured-output.md +199 -0
- package/docs/ci-cd-automation.md +548 -0
- package/docs/claude-code-reference/README.md +21 -0
- package/docs/claude-code-reference/agentboot-coverage.md +484 -0
- package/docs/claude-code-reference/feature-inventory.md +906 -0
- package/docs/cli-commands-audit.md +112 -0
- package/docs/cli-design.md +924 -0
- package/docs/concepts.md +1117 -0
- package/docs/config-schema-audit.md +121 -0
- package/docs/configuration.md +645 -0
- package/docs/delivery-methods.md +758 -0
- package/docs/developer-onboarding.md +342 -0
- package/docs/extending.md +448 -0
- package/docs/getting-started.md +298 -0
- package/docs/knowledge-layer.md +464 -0
- package/docs/marketplace.md +822 -0
- package/docs/org-connection.md +570 -0
- package/docs/plans/architecture.md +2429 -0
- package/docs/plans/design.md +2018 -0
- package/docs/plans/prd.md +1862 -0
- package/docs/plans/stack-rank.md +261 -0
- package/docs/plans/technical-spec.md +2755 -0
- package/docs/privacy-and-safety.md +807 -0
- package/docs/prompt-optimization.md +1071 -0
- package/docs/test-plan.md +972 -0
- package/docs/third-party-ecosystem.md +496 -0
- package/domains/compliance-template/README.md +173 -0
- package/domains/compliance-template/traits/compliance-aware.md +228 -0
- package/examples/enterprise/agentboot.config.json +184 -0
- package/examples/minimal/agentboot.config.json +46 -0
- package/package.json +63 -0
- package/repos.json +1 -0
- package/scripts/cli.ts +1069 -0
- package/scripts/compile.ts +1000 -0
- package/scripts/dev-sync.ts +149 -0
- package/scripts/lib/config.ts +137 -0
- package/scripts/lib/frontmatter.ts +61 -0
- package/scripts/sync.ts +687 -0
- package/scripts/validate.ts +421 -0
- package/tests/REGRESSION-PLAN.md +705 -0
- package/tests/TEST-PLAN.md +111 -0
- package/tests/cli.test.ts +705 -0
- package/tests/pipeline.test.ts +608 -0
- package/tests/validate.test.ts +278 -0
- package/tsconfig.json +62 -0
@@ -0,0 +1,972 @@
# AgentBoot Test Plan

How to test a system whose outputs are non-deterministic, whose users are both
humans and AI agents, and whose value is measured in behavioral quality — not
binary pass/fail.

---

## Two Test Boundaries

There are two completely separate things to test, owned by two different parties:

```
┌────────────────────────────────┐    ┌───────────────────────────────────┐
│ AgentBoot Core                 │    │ Acme-Boot (Org's Personas Repo)   │
│ (this repo)                    │    │ (acme-corp/acme-personas)         │
│                                │    │                                   │
│ Owner: AgentBoot maintainers   │    │ Owner: Acme's platform team       │
│ Cost: AgentBoot's budget       │    │ Cost: Acme's budget               │
│                                │    │                                   │
│ What's tested:                 │    │ What's tested:                    │
│ ├── compile.ts works           │    │ ├── Acme's custom personas behave │
│ ├── validate.ts catches errors │    │ ├── Acme's traits compose right   │
│ ├── sync.ts distributes right  │    │ ├── Acme's gotchas are accurate   │
│ ├── lint rules are correct     │    │ ├── Acme's hooks enforce policy   │
│ ├── CLI commands work          │    │ ├── Acme's domain layer works     │
│ ├── Core personas are sane     │    │ └── Acme's org config is valid    │
│ ├── Core traits compose        │    │                                   │
│ └── Plugin export is valid     │    │ Uses: agentboot test, agentboot   │
│                                │    │       lint, agentboot validate    │
│ Tests: vitest, CI on every PR  │    │ Tests: same tools, Acme's CI      │
│ Budget: our CI costs           │    │ Budget: Acme's API key + CI costs │
└────────────────────────────────┘    └───────────────────────────────────┘
```

**AgentBoot core** tests whether the build system, CLI, lint rules, and core
personas work correctly. This is our responsibility, our cost, our CI.

**Acme-boot** tests whether the org's custom personas, traits, gotchas, hooks,
and domain layers work correctly. This is the org's responsibility, their cost,
their CI — using tools that AgentBoot provides.

AgentBoot ships the testing tools (`agentboot test`, `agentboot lint`,
`agentboot validate`). The org uses them on their content. AgentBoot tests
that the tools themselves work. The org tests that their content works.

---

## The Testing Challenge

AgentBoot core has three fundamentally different layers to test:

| Layer | Nature | Testing Approach |
|-------|--------|------------------|
| **Build system** (compile, validate, sync) | Deterministic code | Traditional unit/integration tests |
| **Persona output** (SKILL.md, CLAUDE.md, agents, rules) | Static files | Schema validation, lint, structural tests |
| **Persona behavior** (what the persona actually DOES when invoked) | Non-deterministic LLM output | Behavioral assertions, LLM-as-judge, snapshot regression |

The first two are standard software testing. The third is the novel problem.

---

## Test Pyramid

```
               ╱╲
              ╱  ╲
             ╱ E2E ╲         Human review of persona output
            ╱ (rare) ╲       in real repos. Manual, expensive.
           ╱──────────╲
          ╱ Behavioral ╲     LLM invocation with known inputs.
         ╱  (moderate)  ╲    Assert on output patterns. ~$0.50/test.
        ╱────────────────╲
       ╱    Integration   ╲  Build pipeline produces correct output.
      ╱     (frequent)     ╲ File structure, content, format. Free.
     ╱──────────────────────╲
    ╱      Unit / Schema     ╲   Config validation, frontmatter parsing,
   ╱      (very frequent)     ╲  trait composition, lint rules. Free.
  ╱────────────────────────────╲
```

Run the bottom layers on every commit. Run behavioral tests on every PR. Run E2E
reviews manually on major persona changes.

---

## Layer 1: Unit & Schema Tests (Free, Fast, Every Commit)

### What to Test

**Config validation:**
```typescript
// tests/config.test.ts
describe('agentboot.config.json', () => {
  it('validates against JSON schema', () => { ... })
  it('rejects unknown fields', () => { ... })
  it('requires org field', () => { ... })
  it('validates group/team references match', () => { ... })
  it('validates persona IDs exist in core/personas/', () => { ... })
  it('validates trait IDs exist in core/traits/', () => { ... })
})
```

**Frontmatter parsing:**
```typescript
// tests/frontmatter.test.ts
describe('SKILL.md frontmatter', () => {
  it('parses all persona SKILL.md files without error', () => { ... })
  it('requires name field', () => { ... })
  it('requires description field', () => { ... })
  it('validates trait references resolve', () => { ... })
  it('validates weight values (HIGH/MEDIUM/LOW or 0.0-1.0)', () => { ... })
})
```

**Trait composition:**
```typescript
// tests/composition.test.ts
describe('trait composition', () => {
  it('inlines trait content at injection markers', () => { ... })
  it('resolves HIGH/MEDIUM/LOW to numeric weights', () => { ... })
  it('errors on missing trait reference', () => { ... })
  it('errors on circular trait dependency', () => { ... })
  it('composes multiple traits in declared order', () => { ... })
})
```

**Lint rules:**
```typescript
// tests/lint.test.ts
describe('lint rules', () => {
  it('detects vague language ("be thorough", "try to")', () => { ... })
  it('detects prompts exceeding token budget', () => { ... })
  it('detects credentials in prompt text', () => { ... })
  it('detects conflicting instructions across traits', () => { ... })
  it('detects unused traits', () => { ... })
  it('passes clean persona files', () => { ... })
})
```

**Sync logic:**
```typescript
// tests/sync.test.ts
describe('sync', () => {
  it('writes CC-native output to claude-code platform repos', () => { ... })
  it('writes cross-platform output to copilot platform repos', () => { ... })
  it('merges org + group + team scopes correctly', () => { ... })
  it('team overrides group on optional behaviors', () => { ... })
  it('org wins on mandatory behaviors', () => { ... })
  it('writes .agentboot-manifest.json tracking managed files', () => { ... })
  it('generates PERSONAS.md registry', () => { ... })
})
```

### Tooling

- **Test runner:** vitest (already in package.json)
- **Assertions:** vitest built-ins + custom matchers for frontmatter, token counting
- **Fixtures:** `tests/fixtures/` with valid and invalid persona files
- **CI:** Runs on every commit and PR. Must pass to merge.
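
The token-counting side of those custom matchers can be sketched as follows. This is a hypothetical helper, not the shipped code; the 4-characters-per-token ratio is a rough heuristic standing in for a real tokenizer:

```typescript
// Hypothetical token-budget helper that a custom vitest matcher
// (e.g. expect(prompt).toBeWithinTokenBudget(2000)) could delegate to.
// The 4-chars-per-token ratio is an assumed heuristic, not a real tokenizer.
function approxTokenCount(text: string): number {
  return Math.ceil(text.length / 4);
}

function withinTokenBudget(text: string, budget: number): boolean {
  return approxTokenCount(text) <= budget;
}
```

A lint rule for "prompts exceeding token budget" only needs this level of precision: it flags gross overruns for a human to inspect, so an approximate count is acceptable.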

---

## Layer 2: Integration Tests (Free, Moderate Speed, Every PR)

### What to Test

**Full build pipeline:**
```typescript
// tests/integration/build-pipeline.test.ts
describe('full build', () => {
  it('validate → compile → sync produces expected output', () => {
    // Given: a test agentboot.config.json + personas + traits
    // When: run the full pipeline
    // Then: dist/ contains expected files with expected content
  })

  it('CC-native output has correct agent CLAUDE.md frontmatter', () => {
    // Check: name, description, model, permissionMode, maxTurns,
    // disallowedTools, skills, hooks, memory
  })

  it('CC-native output uses @imports not inlined traits', () => {
    // Check: CLAUDE.md contains @.claude/traits/critical-thinking.md
    // NOT the full trait content
  })

  it('cross-platform output has standalone inlined SKILL.md', () => {
    // Check: SKILL.md contains full trait content, no @imports
  })

  it('settings.json has hook entries from domain config', () => { ... })
  it('.mcp.json has server entries from domain config', () => { ... })
  it('rules have paths: frontmatter (not globs:)', () => { ... })
})
```

**Plugin export:**
```typescript
// tests/integration/plugin-export.test.ts
describe('plugin export', () => {
  it('produces valid plugin structure', () => {
    // .claude-plugin/plugin.json exists with correct name, version
    // agents/, skills/, hooks/ at root level (not inside .claude-plugin/)
    // marketplace.json valid if marketplace export
  })

  it('passes claude plugin validate', () => {
    // Run: claude plugin validate ./dist/plugin
    // Exit code 0
  })
})
```

**Discover + ingest:**
```typescript
// tests/integration/discover.test.ts
describe('discover', () => {
  it('finds CLAUDE.md files in test repo structure', () => { ... })
  it('finds .cursorrules and copilot-instructions.md', () => { ... })
  it('identifies near-duplicate content across repos', () => { ... })
  it('generates migration plan with correct classifications', () => { ... })
  it('does not modify source files (non-destructive)', () => { ... })
})
```

**Uninstall:**
```typescript
// tests/integration/uninstall.test.ts
describe('uninstall', () => {
  it('removes only files listed in .agentboot-manifest.json', () => { ... })
  it('preserves files not managed by AgentBoot', () => { ... })
  it('warns on modified managed files', () => { ... })
  it('restores pre-AgentBoot archive when requested', () => { ... })
  it('handles mixed content in CLAUDE.md', () => { ... })
})
```

### Tooling

- **Test runner:** vitest
- **Filesystem:** Use temp directories (`fs.mkdtemp` on `os.tmpdir()`)
- **Git fixtures:** Init test repos with known content for discover/sync tests
- **CI:** Runs on every PR. Must pass to merge.

---

## Layer 3: Behavioral Tests (LLM Call, ~$0.50/test, Every PR to Personas)

This is where it gets interesting. Testing whether a persona *behaves* correctly
requires actually invoking it.

### The Testing Model

```
Known input           Persona            Output           Assert
(crafted code    →    (invoked via   →   (structured  →   (pattern match
with known bugs)      claude -p)         findings)        against expected)
```

### Test File Format

```yaml
# tests/behavioral/code-reviewer.test.yaml

persona: code-reviewer
model: haiku            # Use cheapest model for tests (behavior, not quality)
max_turns: 5
max_budget_usd: 0.50

setup:
  # Create test files that the persona will review
  files:
    - path: src/api/users.ts
      content: |
        export async function getUser(userId) {
          const query = `SELECT * FROM users WHERE id = ${userId}`;
          return db.execute(query);
        }

cases:
  - name: catches-sql-injection
    prompt: "Review the file src/api/users.ts"
    expect:
      findings_min: 1
      severity_includes: [CRITICAL, ERROR]
      text_matches:
        - pattern: "SQL injection|parameterized|prepared statement"
          in: findings
      confidence_min: 0.7

  - name: no-false-positives-on-safe-code
    setup_override:
      files:
        - path: src/api/users.ts
          content: |
            export async function getUser(userId: number) {
              return db.execute('SELECT * FROM users WHERE id = $1', [userId]);
            }
    prompt: "Review the file src/api/users.ts"
    expect:
      findings_max: 0
      severity_excludes: [CRITICAL, ERROR]

  - name: structured-output-format
    prompt: "Review the file src/api/users.ts"
    expect:
      output_contains:
        - "CRITICAL" or "ERROR" or "WARN" or "INFO"   # Severity labels
        - "src/api/users.ts"                          # File reference
      output_structure:
        has_sections: [findings, summary]
```

### Test Runner

```bash
# Run all behavioral tests
agentboot test --type behavioral

# Run for one persona
agentboot test --type behavioral --persona code-reviewer

# Use a specific model (override test file)
agentboot test --type behavioral --model sonnet

# Cost cap for entire test suite
agentboot test --type behavioral --max-budget 5.00

# CI mode (exit codes, JSON summary)
agentboot test --type behavioral --ci
```

Under the hood, each test case runs:

```bash
claude -p \
  --agent code-reviewer \
  --output-format json \
  --max-turns 5 \
  --max-budget-usd 0.50 \
  --permission-mode bypassPermissions \
  --no-session-persistence \
  "$PROMPT"
```

The runner parses the JSON output and evaluates the `expect` assertions.
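
That evaluation step might look like the sketch below. The `Finding` and `Expectation` shapes are assumptions mirroring the test file format above; this is illustrative, not the shipped runner:

```typescript
// Illustrative sketch of evaluating a few `expect` assertions against the
// findings parsed from the persona's JSON output. Shapes are assumed.
interface Finding { severity: string; text: string; confidence: number; }
interface Expectation {
  findings_min?: number;
  findings_max?: number;
  severity_includes?: string[];
  text_matches?: { pattern: string }[];
}

function evaluateExpectations(findings: Finding[], exp: Expectation): string[] {
  const failures: string[] = [];
  if (exp.findings_min !== undefined && findings.length < exp.findings_min) {
    failures.push(`expected at least ${exp.findings_min} findings, got ${findings.length}`);
  }
  if (exp.findings_max !== undefined && findings.length > exp.findings_max) {
    failures.push(`expected at most ${exp.findings_max} findings, got ${findings.length}`);
  }
  if (exp.severity_includes &&
      !findings.some(f => exp.severity_includes!.includes(f.severity))) {
    failures.push(`no finding with severity in [${exp.severity_includes.join(", ")}]`);
  }
  for (const { pattern } of exp.text_matches ?? []) {
    if (!findings.some(f => new RegExp(pattern, "i").test(f.text))) {
      failures.push(`no finding matching /${pattern}/`);
    }
  }
  return failures; // empty array means the case passes
}
```

The key property: every assertion is a predicate over structure (counts, severities, regex hits), never an equality check on prose, which is what makes the suite robust to non-deterministic wording.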

### Assertion Types

| Assertion | What it checks | Example |
|---|---|---|
| `findings_min: N` | At least N findings | Persona found the bug |
| `findings_max: N` | At most N findings | No false positives |
| `severity_includes: [X]` | At least one finding has severity X | SQL injection flagged as CRITICAL |
| `severity_excludes: [X]` | No findings have severity X | Clean code doesn't trigger ERROR |
| `text_matches: [{pattern}]` | Regex match in output | "SQL injection" mentioned |
| `text_excludes: [{pattern}]` | Regex must NOT match | Didn't hallucinate a finding |
| `confidence_min: N` | All findings have confidence ≥ N | Persona is sure about the SQL injection |
| `output_contains: [X]` | Output includes literal strings | File reference present |
| `output_structure: {}` | Structural checks on output | Has findings and summary sections |
| `json_schema: path` | Output matches JSON schema | Structured output validates |
| `token_max: N` | Output stays within token budget | Persona isn't verbose |
| `duration_max_ms: N` | Execution time limit | Persona doesn't run away |

### Non-Determinism Strategy

LLM output is non-deterministic. The same input may produce different findings
across runs. The testing strategy:

1. **Test for patterns, not exact output.** Don't assert "the finding text is
   exactly X." Assert "the output contains a CRITICAL finding mentioning SQL
   injection."

2. **Test obvious cases.** Use inputs where any competent reviewer would find the
   issue. A SQL query built by string interpolation is an obvious SQL
   injection — every run should catch it.

3. **Allow flake tolerance.** Run each behavioral test 3 times. Pass if 2/3 pass.
   This handles the rare case where the model misses something obvious. Configure:
   ```yaml
   flake_tolerance: 2 of 3   # Pass if 2 of 3 runs succeed
   ```
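
The tolerance rule itself reduces to a small loop. A minimal sketch, with a synchronous `runOnce` standing in for what would really be an awaited LLM invocation:

```typescript
// Minimal sketch of the "2 of 3" flake-tolerance rule. runOnce stands in for
// one behavioral test run; the real runner would await an LLM call here.
function passesWithTolerance(
  runOnce: () => boolean,
  runs = 3,
  required = 2,
): boolean {
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    if (runOnce()) passed++;
  }
  return passed >= required;
}
```

Note the cost implication: tolerance triples the per-test spend in the worst case, which is why the per-case `max_budget_usd` and the suite-level `--max-budget` cap both matter.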

4. **Use cheap models for behavioral tests.** Haiku is sufficient to test whether a
   persona's prompt structure elicits the right behavior. If the prompt is good enough
   to work on Haiku, it'll work better on Sonnet/Opus. If it fails on Haiku, the
   prompt needs work regardless of model.

5. **Separate "does it work" from "how well does it work."** Behavioral tests check
   "does the persona catch the SQL injection?" (binary). Quality evaluation ("did it
   explain the fix well?") is a separate concern — see Layer 5.

---

## Layer 4: Snapshot / Regression Tests ($, Periodic)

Compare persona output across versions to detect regressions.

### How It Works

```bash
# Generate baseline snapshots
agentboot test --type snapshot --update

# Compare current output against baseline
agentboot test --type snapshot
```

The snapshot test:
1. Runs each persona against a fixed set of test inputs
2. Saves the structured output (findings, severities, count) as a snapshot
3. On subsequent runs, compares current output against the snapshot
4. Flags differences for human review

```
$ agentboot test --type snapshot

Snapshot Comparison: code-reviewer
──────────────────────────────────

Test: sql-injection-detection
  Baseline: 1 CRITICAL (SQL injection)
  Current:  1 CRITICAL (SQL injection) + 1 WARN (missing type annotation)
  Status:   CHANGED — new finding added
  → Is the new WARN correct? [y = update snapshot / n = investigate]

Test: clean-code-no-findings
  Baseline: 0 findings
  Current:  0 findings
  Status:   MATCH ✓

Test: auth-middleware-review
  Baseline: 1 ERROR (missing auth check) + 2 WARN
  Current:  0 findings
  Status:   REGRESSION ⚠️ — previously caught ERROR now missed
  → Investigate: trait change? prompt change? model change?
```

### When to Run

- **After any persona prompt change** — did the edit improve or regress behavior?
- **After trait updates** — did changing `critical-thinking` affect review quality?
- **After model changes** — does the persona work as well on Sonnet as it did on Opus?
- **Periodically (weekly)** — catch drift from model updates by the provider

### What Snapshots Contain

Snapshots store structured summaries, not full output:

```json
{
  "persona": "code-reviewer",
  "test_case": "sql-injection-detection",
  "snapshot_date": "2026-03-19",
  "model": "haiku",
  "findings_count": { "CRITICAL": 1, "ERROR": 0, "WARN": 0, "INFO": 0 },
  "finding_patterns": ["SQL injection", "parameterized"],
  "total_tokens": 1200,
  "duration_ms": 8500
}
```

Not the full prose output — just the structural signature. This makes comparison
reliable across non-deterministic runs.
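
Comparing two signatures can then be purely structural. A sketch, with the status semantics assumed from the example output above (a drop in any per-severity count is a regression; any other count change needs human review):

```typescript
// Sketch of structural snapshot comparison. The shape and the
// MATCH/CHANGED/REGRESSION semantics are assumptions based on the
// example output above, not the shipped comparator.
interface SnapshotSignature { findings_count: Record<string, number>; }

type SnapshotStatus = "MATCH" | "CHANGED" | "REGRESSION";

function compareSnapshots(
  baseline: SnapshotSignature,
  current: SnapshotSignature,
): SnapshotStatus {
  const severities = ["CRITICAL", "ERROR", "WARN", "INFO"];
  // A previously caught finding that disappears is a regression.
  if (severities.some(s =>
    (current.findings_count[s] ?? 0) < (baseline.findings_count[s] ?? 0))) {
    return "REGRESSION";
  }
  // Any other count difference is flagged for human review.
  if (severities.some(s =>
    (current.findings_count[s] ?? 0) !== (baseline.findings_count[s] ?? 0))) {
    return "CHANGED";
  }
  return "MATCH";
}
```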

---

## Layer 5: LLM-as-Judge ($$, Major Changes Only)

For qualitative evaluation that can't be reduced to pattern matching: "Is this review
actually good? Is it thorough? Would a senior engineer agree with it?"

### How It Works

A separate LLM call evaluates the persona's output:

```yaml
# tests/eval/code-reviewer-quality.eval.yaml

persona_under_test: code-reviewer
judge_model: opus        # Use strongest model as judge
max_budget_usd: 2.00

cases:
  - name: review-quality-auth-endpoint
    input_file: tests/fixtures/auth-endpoint-with-bugs.ts
    persona_prompt: "Review this file"
    judge_prompt: |
      You are a senior staff engineer evaluating the quality of an AI code review.

      The code being reviewed:
      {input}

      The review produced:
      {persona_output}

      Evaluate on these dimensions (1-5 scale):
      1. Completeness: Did it find the important issues?
      2. Accuracy: Are the findings correct? Any false positives?
      3. Specificity: Are suggestions actionable with file:line references?
      4. Prioritization: Are severity levels appropriate?
      5. Tone: Professional, constructive, not pedantic?

      Known issues in the code (ground truth):
      - SQL injection on line 12
      - Missing rate limiting on POST endpoint
      - Auth token not validated for expiry

      Score each dimension 1-5. Explain your reasoning. Then give an overall
      pass/fail: does this review meet the bar for a senior engineer's review?
    expect:
      judge_score_min:
        completeness: 3
        accuracy: 4
      overall: "pass"
```

### When to Use

- **Major persona prompt rewrites** — did the rewrite improve quality?
- **New personas** — does the new persona meet the bar before shipping?
- **Model migration** — switching from Opus to Sonnet — does quality hold?
- **Quarterly quality audits** — periodic check on the full persona suite

### Cost Control

LLM-as-judge is expensive (Opus as judge + persona invocation). Budget it:

```bash
agentboot test --type eval --max-budget 20.00

# Only run for specific personas
agentboot test --type eval --persona security-reviewer

# Skip if behavioral tests already passed (cascade)
agentboot test --type eval --skip-if-behavioral-passed
```

---

## Layer 6: Human Review (Manual, Major Releases Only)

The human is always in the loop for judgment calls that no automated test can make.

### When Humans Review

| Trigger | What they review | Who |
|---|---|---|
| New persona ships | Full output on 3-5 real PRs | Platform team + domain expert |
| Major trait change | Before/after comparison on real code | Platform team |
| Quarterly audit | Random sample of 20 persona outputs | Platform team |
| Quality escalation | Specific finding that a developer disputed | Persona author |

### How to Make It Efficient

**The review tool:** `agentboot review` generates a side-by-side comparison:

```bash
agentboot review --persona code-reviewer --sample 5

# Human Review: code-reviewer (v1.3.0)
# ──────────────────────────────────────
#
# Reviewing 5 randomly sampled outputs from the last 7 days.
#
# Sample 1/5: PR #234 (api-service)
# ├── Findings: 1 ERROR, 3 WARN, 2 INFO
# ├── [Show findings]
# ├── [Show code context]
# │
# ├── Was this review accurate? [Yes] [Partially] [No]
# ├── Were severity levels correct? [Yes] [Partially] [No]
# ├── Would you add anything? [No] [Yes: ___]
# └── Would you remove anything? [No] [Yes: ___]
#
# After all 5 samples:
# ├── Overall quality score: ___/5
# ├── Recommendation: [Ship as-is] [Needs tuning] [Needs rewrite]
# └── Notes: ___
```

This takes 10-15 minutes per persona. Not zero effort, but structured and focused.
The reviewer isn't reading through raw sessions — they're evaluating curated samples
with guided questions.

**The cadence:** The platform team spends 1 hour/month reviewing persona quality.
That's 4 personas × 15 minutes. The structured review tool makes this sustainable.

---

## Test Infrastructure

### CI Pipeline

```yaml
# .github/workflows/agentboot-tests.yml
name: AgentBoot Tests
on:
  push:
    branches: [main]
  pull_request:

jobs:
  unit-and-schema:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: agentboot validate --strict
      - run: agentboot lint --severity error
      - run: npm run test   # vitest unit + integration

  behavioral:
    if: github.event_name == 'pull_request'
    needs: unit-and-schema
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: agentboot test --type behavioral --ci --max-budget 5.00
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

  snapshot:
    if: contains(github.event.pull_request.labels.*.name, 'persona-change')
    needs: behavioral
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: agentboot test --type snapshot --ci
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

### Test Triggers

| Layer | Trigger | Cost | Time |
|-------|---------|------|------|
| Unit / Schema | Every commit | Free | <10s |
| Integration | Every commit | Free | <30s |
| Behavioral | Every PR | ~$5 | ~2min |
| Snapshot | PRs labeled `persona-change` | ~$5 | ~2min |
| LLM-as-judge | Major changes (manual trigger) | ~$20 | ~5min |
| Human review | Monthly / major release | Staff time | ~1hr |

### Cost Budget

Monthly testing cost for a personas repo with 4 personas:
- Unit/integration: **$0** (no API calls)
- Behavioral: ~20 PRs/month × $5 = **$100/month**
- Snapshot: ~5 persona changes/month × $5 = **$25/month**
- LLM-as-judge: ~2 major changes/month × $20 = **$40/month**
- **Total: ~$165/month** for automated testing

That's less than one developer-hour of manual review — and it runs on every PR.

---
|
|
662
|
+
|
|
663
|
+
## Testing the Tests

### How Do You Know Your Behavioral Tests Are Good?

**Mutation testing for personas.** Deliberately introduce known bugs into the
persona prompt and verify that tests catch the regression:

```bash
agentboot test --type mutation --persona code-reviewer

# Mutation Testing: code-reviewer
# ────────────────────────────────
#
# Mutation 1: Remove "SQL injection" from review checklist
# Expected: catches-sql-injection test FAILS
# Actual: catches-sql-injection test FAILED ✓ (mutation caught)
#
# Mutation 2: Change severity threshold (ERROR → INFO)
# Expected: severity_includes assertion FAILS
# Actual: severity_includes assertion FAILED ✓ (mutation caught)
#
# Mutation 3: Remove output format specification
# Expected: structured-output-format test FAILS
# Actual: structured-output-format test PASSED ✗ (mutation NOT caught)
# → Your test doesn't verify output structure strictly enough
#
# Mutation score: 2/3 (67%)
# → Consider adding stricter output structure assertions
```

This is the "who tests the tests?" answer: mutations verify that tests actually
detect the regressions they're supposed to detect.

---

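The mutation loop itself is simple to reason about: apply one mutation, run the tests that should catch it, and count the mutation as caught only if the targeted test fails. A minimal sketch — hypothetical, not AgentBoot's actual implementation; `runTests` stands in for the real prompt-editing and behavioral-test runner:

```typescript
// Hypothetical sketch of a persona mutation-testing loop.
type Mutation = { name: string; expectedFailingTest: string };
type TestResult = { name: string; passed: boolean };

function mutationScore(
  mutations: Mutation[],
  runTests: (mutationName: string) => TestResult[],
): { caught: number; total: number } {
  let caught = 0;
  for (const m of mutations) {
    const results = runTests(m.name);
    const target = results.find((r) => r.name === m.expectedFailingTest);
    // The mutation counts as "caught" only if the targeted test actually fails.
    if (target && !target.passed) caught++;
  }
  return { caught, total: mutations.length };
}

// Simulated runner: mutation 3 slips through because its test never fails.
const score = mutationScore(
  [
    { name: "remove-sql-injection", expectedFailingTest: "catches-sql-injection" },
    { name: "severity-error-to-info", expectedFailingTest: "severity-includes" },
    { name: "remove-output-format", expectedFailingTest: "structured-output-format" },
  ],
  (name) => [
    { name: "catches-sql-injection", passed: name !== "remove-sql-injection" },
    { name: "severity-includes", passed: name !== "severity-error-to-info" },
    { name: "structured-output-format", passed: true }, // too lax: never fails
  ],
);
console.log(`${score.caught}/${score.total}`); // → 2/3
```

A surviving mutation (like mutation 3 above) points at the assertion to strengthen, which is exactly what the CLI mockup reports.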
## Agents Testing Agents

### The Philosophy

AgentBoot personas are AI agents. Testing them with AI (behavioral tests, LLM-as-judge)
is "agents testing agents." This is the right approach because:

1. **The output space is too large for handwritten assertions.** A code review can
   produce thousands of different valid outputs. Pattern matching covers the obvious
   cases; LLM-as-judge evaluates the nuanced ones.

2. **The evaluation criteria are subjective.** "Is this review thorough?" requires
   judgment. LLM-as-judge applies consistent judgment criteria at scale.

3. **The cost is proportional to the value.** Testing a persona costs ~$0.50-$2.00.
   A bad persona wasting $100/day in developer time and false positives costs far more.

### The Safeguard: Humans Always in the Loop

AI-generated test results are **advisory, not authoritative.** The pipeline:

```
Automated tests run → Results posted to PR → Human reviews before merge
```

If behavioral tests pass and snapshot is stable, the human review is fast ("looks
good, ship it"). If something fails, the human investigates. The automation removes
burden, not judgment.

**What humans decide that automation cannot:**
- Is this new finding a genuine improvement or a new false positive?
- Does this persona's tone match the org's culture?
- Is this severity calibration appropriate for our risk tolerance?
- Should we ship this persona change even though a snapshot changed? (sometimes yes)

The test suite produces evidence. Humans make decisions. This is the "humans always
in the loop" principle applied to testing.

---

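The "advisory, not authoritative" split can be made concrete. In this hypothetical sketch (the `DeterministicCheck`/`JudgeVerdict` names are illustrative, not AgentBoot's API), only deterministic assertions can fail the build; judge scores are formatted for the human reviewer and never enforced:

```typescript
// Hypothetical sketch: deterministic checks gate CI; LLM-judge
// verdicts are attached as advisory context for the human reviewer.
type DeterministicCheck = { name: string; passed: boolean };
type JudgeVerdict = { criterion: string; score: number; rationale: string };

function ciOutcome(
  checks: DeterministicCheck[],
  verdicts: JudgeVerdict[],
): { gate: "pass" | "fail"; advisory: string[] } {
  // Only deterministic checks can fail the build.
  const gate = checks.every((c) => c.passed) ? "pass" : "fail";
  // Judge output is reported, never enforced.
  const advisory = verdicts.map(
    (v) => `${v.criterion}: ${v.score}/5 (${v.rationale})`,
  );
  return { gate, advisory };
}

const outcome = ciOutcome(
  [{ name: "severity-includes-error", passed: true }],
  [{ criterion: "thoroughness", score: 4, rationale: "covers auth paths" }],
);
console.log(outcome.gate); // → pass
```

The design point is that a low judge score shows up in the PR comment and slows the human down, but cannot block the merge on its own.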
## What Acme Tests (Org's Responsibility)

When Acme's platform team creates their personas repo from AgentBoot, they inherit
the testing tools but run them on their own content with their own CI and API keys.

### Acme's Test Layers

| Layer | What Acme tests | Tool | Cost to Acme |
|-------|----------------|------|-------------|
| Schema/Lint | Their agentboot.config.json, custom persona frontmatter, custom traits | `agentboot validate`, `agentboot lint` | Free |
| Build | Their personas compile without errors, sync produces expected output | `agentboot build --validate-only` | Free |
| Behavioral | Their custom personas find the bugs they should find | `agentboot test --type behavioral` | ~$5/PR (Acme's API key) |
| Snapshot | Their persona changes don't regress | `agentboot test --type snapshot` | ~$5/change (Acme's API key) |
| Human review | Their personas produce quality output | `agentboot review` | Staff time |

### What Acme's CI Looks Like

```yaml
# In acme-corp/acme-personas/.github/workflows/tests.yml
name: Acme Persona Tests
on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: agentboot validate --strict
      - run: agentboot lint --severity error

  behavioral:
    if: github.event_name == 'pull_request'
    needs: validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: agentboot test --type behavioral --ci --max-budget 10.00
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ACME_ANTHROPIC_KEY }} # Acme's key, Acme's cost
```

AgentBoot provides the workflow template. Acme fills in their API key and adjusts
the budget. The testing tools are the same; the content and cost are separate.

---

## "Is This My Bug or AgentBoot's Bug?"

When something goes wrong, Acme needs to know: is the problem in their persona
content (their fix) or in AgentBoot's build system (our fix)?

### The Diagnostic: `agentboot doctor --diagnose`

```bash
$ agentboot doctor --diagnose

Diagnosing: code-reviewer persona producing empty output
─────────────────────────────────────────────────────────

Step 1: Core validation
  ✓ AgentBoot core version 1.2.0 (latest)
  ✓ Core traits compile without errors
  ✓ Core code-reviewer persona compiles without errors
  ✓ Core code-reviewer passes behavioral tests (3/3)
  → Core is healthy. If the problem is in the core code-reviewer,
    it's not reproducing with the default config.

Step 2: Org layer validation
  ✓ agentboot.config.json valid
  ✓ All custom traits compile
  ✗ Custom extension for code-reviewer has an error:
    extensions/code-reviewer.md references trait "acme-standards"
    which doesn't exist in core/traits/ or Acme's custom traits.
  → LIKELY CAUSE: missing trait reference in Acme's extension

Step 3: Compiled output check
  ✗ Compiled code-reviewer SKILL.md is 0 bytes
  → Build failed silently due to the missing trait reference

═══════════════════════════════════════════════════════

Diagnosis: ACME CONTENT ISSUE
The missing trait reference in extensions/code-reviewer.md causes
the build to produce empty output.

Fix: Either create core/traits/acme-standards.md or remove the
reference from extensions/code-reviewer.md

If you believe this is an AgentBoot bug (the build should NOT produce
empty output on a missing trait — it should error), file an issue:
→ agentboot issue "Build produces empty output instead of error on missing trait"
```

### The Isolation Test

The doctor runs a **layered isolation test** to pinpoint where the problem is:

```
Layer 1: AgentBoot core only (no org content)
  → Does the core persona work with zero customization?
  → If NO: AgentBoot bug. File an issue.
  → If YES: continue.

Layer 2: Core + org config (no custom personas/traits)
  → Does the core persona work with Acme's agentboot.config.json?
  → If NO: Config issue. Check config.
  → If YES: continue.

Layer 3: Core + org config + org traits
  → Do Acme's custom traits compose without errors?
  → If NO: Trait issue. Check Acme's traits.
  → If YES: continue.

Layer 4: Core + org config + org traits + org personas
  → Do Acme's custom personas compile and lint?
  → If NO: Persona issue. Check Acme's persona definitions.
  → If YES: continue.

Layer 5: Core + org config + org traits + org personas + org extensions
  → Does the full stack work?
  → If NO: Extension issue. Check Acme's extensions.
  → If YES: problem is elsewhere (model, API, environment).
```

Each layer adds one piece. The layer where it breaks is the layer that has the bug.
If Layer 1 breaks, it's AgentBoot's problem. If Layer 3 breaks, it's Acme's traits.

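The isolation logic is a linear search for the first failing layer. A minimal sketch, with hypothetical names — the `check` callbacks stand in for the real build/validate calls at each layer:

```typescript
// Hypothetical sketch of layered isolation: run checks in order,
// and blame the first layer that fails.
type Layer = { name: string; owner: string; check: () => boolean };

function firstFailingLayer(layers: Layer[]): Layer | null {
  for (const layer of layers) {
    if (!layer.check()) return layer; // all earlier layers passed
  }
  return null; // full stack works; look at model/API/environment
}

// Simulated stack where Acme's custom traits are broken.
const diagnosis = firstFailingLayer([
  { name: "core only", owner: "AgentBoot", check: () => true },
  { name: "core + org config", owner: "Org config", check: () => true },
  { name: "core + config + org traits", owner: "Org traits", check: () => false },
  { name: "core + config + traits + personas", owner: "Org personas", check: () => true },
]);
console.log(diagnosis?.owner); // → Org traits
```

Because each layer strictly adds one piece on top of the previous ones, the first failure localizes the bug without re-testing combinations.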
### `agentboot issue` — Streamlined Bug Reporting

When the diagnosis points to an AgentBoot bug, one command files it:

```bash
agentboot issue "Build produces empty output instead of error on missing trait"

# Filing issue against agentboot/agentboot
#
# Title: Build produces empty output instead of error on missing trait
#
# Auto-attached:
# ├── AgentBoot version: 1.2.0
# ├── Node version: 22.1.0
# ├── OS: macOS 15.3
# ├── Diagnosis output: (attached)
# ├── agentboot.config.json: (attached, org-specific values redacted)
# ├── Relevant error logs: (attached)
# │
# ├── NOT attached (privacy):
# │   ├── Org persona content
# │   ├── Custom trait content
# │   ├── Developer prompts
# │   └── Session transcripts
#
# Open issue in browser? [Y/n]
```

The issue command:
- Attaches environment info and diagnosis output
- Redacts org-specific content (persona text, trait content, internal URLs)
- Includes the config structure (field names and types, not values)
- Never includes developer prompts or session data
- Opens in browser for the user to review before submitting

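"Field names and types, not values" can be sketched as a recursive walk over the config JSON. This is a hypothetical illustration, not AgentBoot's actual redaction code:

```typescript
// Hypothetical sketch: replace every leaf value with its type name
// so the config's shape survives but org-specific values do not.
type Json = string | number | boolean | null | Json[] | { [k: string]: Json };

function redact(value: Json): Json {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out: { [k: string]: Json } = {};
    for (const [key, v] of Object.entries(value)) out[key] = redact(v);
    return out; // field names survive
  }
  return `<${value === null ? "null" : typeof value}>`; // values do not
}

const redacted = redact({
  org: "acme-corp",
  personas: ["code-reviewer"],
  budget: { behavioral: 10.0 },
});
console.log(JSON.stringify(redacted));
// → {"org":"<string>","personas":["<string>"],"budget":{"behavioral":"<number>"}}
```

The maintainer can still see that `budget.behavioral` exists and is a number (enough to reproduce a schema bug) without learning Acme's actual budget.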
### When It's Ambiguous

Sometimes the bug is in the boundary — AgentBoot's build system should have caught
an error in Acme's content but didn't. Example: Acme writes a persona with a
circular trait reference. The build system should error; instead it loops forever.

This is an **AgentBoot bug** (the build system should validate and reject) even
though the root cause is in Acme's content (the circular reference). The fix goes
into AgentBoot core (add circular reference detection to validate.ts), and Acme
fixes their content.

The diagnostic output makes this clear:

```
Diagnosis: AGENTBOOT BUG (validation gap)
Acme's content has a circular trait reference (A → B → A).
AgentBoot's validator should catch this but doesn't.

Workaround: Remove the circular reference in Acme's trait.
Fix: AgentBoot should add circular reference detection.
→ agentboot issue "Validator doesn't catch circular trait references"
```

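Circular reference detection is a standard graph-cycle check. A sketch of what such a validator pass could look like (hypothetical; not the actual validate.ts), using depth-first search over a trait-name → referenced-names map:

```typescript
// Hypothetical sketch: detect cycles in trait references with DFS.
function hasCycle(refs: Map<string, string[]>): boolean {
  const visiting = new Set<string>(); // nodes on the current DFS path
  const done = new Set<string>();     // nodes fully explored

  function dfs(name: string): boolean {
    if (visiting.has(name)) return true; // back edge: cycle found
    if (done.has(name)) return false;
    visiting.add(name);
    for (const dep of refs.get(name) ?? []) {
      if (dfs(dep)) return true;
    }
    visiting.delete(name);
    done.add(name);
    return false;
  }

  return [...refs.keys()].some((name) => dfs(name));
}

// A → B → A should be rejected at validate time, not loop at build time.
const circular = new Map([
  ["trait-a", ["trait-b"]],
  ["trait-b", ["trait-a"]],
]);
console.log(hasCycle(circular)); // → true
```

Running this check during `agentboot validate` turns the infinite build loop into a clear error with the cycle path, which is the fix the diagnosis recommends.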
### The General Rule

| Symptom | Likely Owner |
|---------|-------------|
| Build system crashes | AgentBoot |
| Build produces wrong file structure | AgentBoot |
| Validator doesn't catch invalid content | AgentBoot |
| Lint rule has false positives/negatives | AgentBoot |
| CLI command doesn't work | AgentBoot |
| Core persona produces bad output | AgentBoot (prompt quality) or Anthropic (model regression) |
| Custom persona produces bad output | Org's persona content |
| Custom trait doesn't compose correctly | Org's trait (unless build system is wrong) |
| Custom extension is ignored | Org's extension path/format (unless sync is broken) |
| Sync writes wrong files | AgentBoot |
| Sync writes right files but persona behaves wrong | Org's persona content |
| Gotcha doesn't activate on matching files | Check `paths:` patterns (org) then check rule loading (AgentBoot) |
| Hook doesn't fire | Check hook config (org) then check hook system (AgentBoot) |
| Plugin doesn't install | Check plugin structure (org) then check export (AgentBoot) |

---

## What AgentBoot Needs to Build

| Component | Phase | Cost |
|-----------|-------|------|
| Unit tests (config, frontmatter, composition, lint) | V1 | Free |
| Integration tests (build pipeline, sync, plugin export) | V1 | Free |
| Test fixtures (valid/invalid personas, known-buggy code) | V1 | Free |
| `agentboot test --type deterministic` runner | V1 | Free |
| CI workflow template | V1 | Free |
| Behavioral test format (YAML) + runner | V1.5 | ~$5/run |
| `agentboot test --type behavioral` with `claude -p` | V1.5 | ~$5/run |
| Snapshot test format + runner | V1.5 | ~$5/run |
| Flake tolerance (2-of-3 runs) | V1.5 | 3x cost |
| LLM-as-judge eval format + runner | V2 | ~$20/run |
| `agentboot review` (human review tool) | V2 | Staff time |
| Mutation testing for personas | V2+ | ~$15/run |
| GitHub Actions reusable workflow for tests | V1.5 | Free |
| `agentboot doctor --diagnose` (layered isolation) | V1 | Free |
| `agentboot issue` (streamlined bug reporting) | V1.5 | Free |
| Org CI workflow template (acme runs on their content) | V1 | Free |

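Of these, flake tolerance is worth pinning down because it explains the "3x cost" note: a nondeterministic behavioral test runs three times and passes on a strict majority. A hypothetical sketch of that rule:

```typescript
// Hypothetical sketch of 2-of-3 flake tolerance: a nondeterministic
// test passes if a strict majority of its runs pass.
function passesWithFlakeTolerance(runOnce: () => boolean, runs = 3): boolean {
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    if (runOnce()) passed++;
  }
  return passed * 2 > runs; // strict majority: 2-of-3, 3-of-5, ...
}

// One flaky failure out of three still passes.
const outcomes = [true, false, true];
let i = 0;
const verdict = passesWithFlakeTolerance(() => outcomes[i++]);
console.log(verdict); // → true
```

A real runner would likely short-circuit once two runs have passed, saving one API call in the common case, which is why the worst-case cost is 3x rather than the average cost.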
---

*See also:*
- [`docs/prompt-optimization.md`](prompt-optimization.md#6-prompt-testing-agentboot-test) — test types and YAML format
- [`docs/ci-cd-automation.md`](ci-cd-automation.md) — `claude -p` flags for CI
- [`docs/claude-code-reference/feature-inventory.md`](claude-code-reference/feature-inventory.md) — CLI flags