agileflow 2.61.0 → 2.63.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +9 -9
- package/package.json +1 -1
- package/scripts/lib/counter.js +103 -0
- package/src/core/commands/auto.md +1 -0
- package/src/core/commands/babysit.md +170 -29
- package/src/core/commands/board.md +1 -0
- package/src/core/commands/ci.md +1 -0
- package/src/core/commands/compress.md +1 -0
- package/src/core/commands/deploy.md +1 -0
- package/src/core/commands/help.md +1 -0
- package/src/core/commands/research.md +1 -0
- package/src/core/commands/skill/create.md +566 -0
- package/src/core/commands/skill/delete.md +189 -0
- package/src/core/commands/skill/edit.md +245 -0
- package/src/core/commands/skill/list.md +155 -0
- package/src/core/commands/skill/test.md +249 -0
- package/src/core/commands/template.md +1 -0
- package/src/core/commands/tests.md +1 -0
- package/src/core/commands/update.md +1 -0
- package/src/core/commands/velocity.md +1 -0
- package/src/core/experts/refactor/expertise.yaml +17 -12
- package/src/core/templates/claude-settings.advanced.example.json +1 -1
- package/src/core/templates/claude-settings.example.json +1 -1
- package/tools/cli/commands/list.js +8 -13
- package/tools/cli/commands/uninstall.js +70 -0
- package/tools/cli/commands/update.js +21 -4
- package/tools/cli/installers/core/installer.js +20 -19
- package/tools/cli/installers/ide/_base-ide.js +18 -4
- package/tools/cli/installers/ide/claude-code.js +4 -15
- package/tools/cli/installers/ide/codex.js +9 -13
- package/tools/cli/lib/content-injector.js +162 -31
- package/tools/cli/lib/utils.js +87 -0
- package/src/core/skills/acceptance-criteria-generator/SKILL.md +0 -46
- package/src/core/skills/adr-template/SKILL.md +0 -62
- package/src/core/skills/agileflow-acceptance-criteria/SKILL.md +0 -156
- package/src/core/skills/agileflow-adr/SKILL.md +0 -147
- package/src/core/skills/agileflow-adr/examples/database-choice-example.md +0 -122
- package/src/core/skills/agileflow-adr/templates/adr-template.md +0 -69
- package/src/core/skills/agileflow-commit-messages/SKILL.md +0 -130
- package/src/core/skills/agileflow-commit-messages/reference/bad-examples.md +0 -168
- package/src/core/skills/agileflow-commit-messages/reference/good-examples.md +0 -120
- package/src/core/skills/agileflow-commit-messages/scripts/check-attribution.sh +0 -15
- package/src/core/skills/agileflow-epic-planner/SKILL.md +0 -184
- package/src/core/skills/agileflow-retro-facilitator/SKILL.md +0 -119
- package/src/core/skills/agileflow-retro-facilitator/cookbook/4ls.md +0 -86
- package/src/core/skills/agileflow-retro-facilitator/cookbook/glad-sad-mad.md +0 -79
- package/src/core/skills/agileflow-retro-facilitator/cookbook/start-stop-continue.md +0 -142
- package/src/core/skills/agileflow-retro-facilitator/prompts/action-items.md +0 -83
- package/src/core/skills/agileflow-sprint-planner/SKILL.md +0 -212
- package/src/core/skills/agileflow-story-writer/SKILL.md +0 -163
- package/src/core/skills/agileflow-story-writer/examples/good-story-example.md +0 -63
- package/src/core/skills/agileflow-story-writer/templates/story-template.md +0 -44
- package/src/core/skills/agileflow-tech-debt/SKILL.md +0 -215
- package/src/core/skills/api-documentation-generator/SKILL.md +0 -65
- package/src/core/skills/changelog-entry/SKILL.md +0 -55
- package/src/core/skills/commit-message-formatter/SKILL.md +0 -50
- package/src/core/skills/deployment-guide-generator/SKILL.md +0 -84
- package/src/core/skills/diagram-generator/SKILL.md +0 -65
- package/src/core/skills/error-handler-template/SKILL.md +0 -78
- package/src/core/skills/migration-checklist/SKILL.md +0 -82
- package/src/core/skills/pr-description/SKILL.md +0 -65
- package/src/core/skills/sql-schema-generator/SKILL.md +0 -69
- package/src/core/skills/story-skeleton/SKILL.md +0 -34
- package/src/core/skills/test-case-generator/SKILL.md +0 -63
- package/src/core/skills/type-definitions/SKILL.md +0 -65
- package/src/core/skills/validation-schema-generator/SKILL.md +0 -64
- package/src/core/skills/writing-skills/SKILL.md +0 -352
- package/src/core/skills/writing-skills/testing-skills-with-subagents.md +0 -232
|
@@ -1,352 +0,0 @@
|
|
|
1
|
-
---
|
|
2
|
-
name: writing-skills
|
|
3
|
-
description: Use when creating new skills, editing existing skills, or verifying skills work before deployment
|
|
4
|
-
---
|
|
5
|
-
|
|
6
|
-
# Writing Skills
|
|
7
|
-
|
|
8
|
-
## Overview
|
|
9
|
-
|
|
10
|
-
**Writing skills IS Test-Driven Development applied to process documentation.**
|
|
11
|
-
|
|
12
|
-
You write test cases (pressure scenarios with subagents), watch them fail (baseline behavior), write the skill (documentation), watch tests pass (agents comply), and refactor (close loopholes).
|
|
13
|
-
|
|
14
|
-
**Core principle:** If you didn't watch an agent fail without the skill, you don't know if the skill teaches the right thing.
|
|
15
|
-
|
|
16
|
-
## What is a Skill?
|
|
17
|
-
|
|
18
|
-
A **skill** is a reference guide for proven techniques, patterns, or tools. Skills help future Claude instances find and apply effective approaches.
|
|
19
|
-
|
|
20
|
-
**Skills are:** Reusable techniques, patterns, tools, reference guides
|
|
21
|
-
|
|
22
|
-
**Skills are NOT:** Narratives about how you solved a problem once
|
|
23
|
-
|
|
24
|
-
## TDD Mapping for Skills
|
|
25
|
-
|
|
26
|
-
| TDD Concept | Skill Creation |
|
|
27
|
-
|-------------|----------------|
|
|
28
|
-
| **Test case** | Pressure scenario with subagent |
|
|
29
|
-
| **Production code** | Skill document (SKILL.md) |
|
|
30
|
-
| **Test fails (RED)** | Agent violates rule without skill (baseline) |
|
|
31
|
-
| **Test passes (GREEN)** | Agent complies with skill present |
|
|
32
|
-
| **Refactor** | Close loopholes while maintaining compliance |
|
|
33
|
-
| **Write test first** | Run baseline scenario BEFORE writing skill |
|
|
34
|
-
| **Watch it fail** | Document exact rationalizations agent uses |
|
|
35
|
-
| **Minimal code** | Write skill addressing those specific violations |
|
|
36
|
-
| **Watch it pass** | Verify agent now complies |
|
|
37
|
-
| **Refactor cycle** | Find new rationalizations → plug → re-verify |
|
|
38
|
-
|
|
39
|
-
## When to Create a Skill
|
|
40
|
-
|
|
41
|
-
**Create when:**
|
|
42
|
-
- Technique wasn't intuitively obvious to you
|
|
43
|
-
- You'd reference this again across projects
|
|
44
|
-
- Pattern applies broadly (not project-specific)
|
|
45
|
-
- Others would benefit
|
|
46
|
-
|
|
47
|
-
**Don't create for:**
|
|
48
|
-
- One-off solutions
|
|
49
|
-
- Standard practices well-documented elsewhere
|
|
50
|
-
- Project-specific conventions (put in CLAUDE.md)
|
|
51
|
-
- Mechanical constraints (if enforceable with validation, automate it)
|
|
52
|
-
|
|
53
|
-
## Skill Types
|
|
54
|
-
|
|
55
|
-
### Technique
|
|
56
|
-
Concrete method with steps to follow (condition-based-waiting, root-cause-tracing)
|
|
57
|
-
|
|
58
|
-
### Pattern
|
|
59
|
-
Way of thinking about problems (flatten-with-flags, test-invariants)
|
|
60
|
-
|
|
61
|
-
### Reference
|
|
62
|
-
API docs, syntax guides, tool documentation
|
|
63
|
-
|
|
64
|
-
## Directory Structure
|
|
65
|
-
|
|
66
|
-
```
|
|
67
|
-
skills/
|
|
68
|
-
skill-name/
|
|
69
|
-
SKILL.md # Main reference (required)
|
|
70
|
-
cookbook/ # Per-use-case docs (if multiple workflows)
|
|
71
|
-
prompts/ # Reusable prompt templates
|
|
72
|
-
tools/ # Scripts, utilities
|
|
73
|
-
supporting-file.* # Only if needed
|
|
74
|
-
```
|
|
75
|
-
|
|
76
|
-
**Flat namespace** - all skills in one searchable namespace
|
|
77
|
-
|
|
78
|
-
**Separate files for:**
|
|
79
|
-
1. **Heavy reference** (100+ lines) - API docs, comprehensive syntax
|
|
80
|
-
2. **Reusable tools** - Scripts, utilities, templates
|
|
81
|
-
3. **Multiple workflows** - Use cookbook/ pattern for progressive disclosure
|
|
82
|
-
|
|
83
|
-
**Keep inline:**
|
|
84
|
-
- Principles and concepts
|
|
85
|
-
- Code patterns (< 50 lines)
|
|
86
|
-
- Everything else
|
|
87
|
-
|
|
88
|
-
## SKILL.md Structure
|
|
89
|
-
|
|
90
|
-
**Frontmatter (YAML):**
|
|
91
|
-
- Only two fields supported: `name` and `description`
|
|
92
|
-
- Max 1024 characters total
|
|
93
|
-
- `name`: Use letters, numbers, and hyphens only
|
|
94
|
-
- `description`: Third-person, describes ONLY when to use (NOT what it does)
|
|
95
|
-
|
|
96
|
-
```markdown
|
|
97
|
-
---
|
|
98
|
-
name: skill-name-with-hyphens
|
|
99
|
-
description: Use when [specific triggering conditions and symptoms]
|
|
100
|
-
---
|
|
101
|
-
|
|
102
|
-
# Skill Name
|
|
103
|
-
|
|
104
|
-
## Overview
|
|
105
|
-
What is this? Core principle in 1-2 sentences.
|
|
106
|
-
|
|
107
|
-
## When to Use
|
|
108
|
-
Bullet list with SYMPTOMS and use cases
|
|
109
|
-
When NOT to use
|
|
110
|
-
|
|
111
|
-
## Variables (if using cookbook pattern)
|
|
112
|
-
Feature flags for conditional behavior
|
|
113
|
-
|
|
114
|
-
## Cookbook (if multiple workflows)
|
|
115
|
-
If condition A → read cookbook/a.md
|
|
116
|
-
If condition B → read cookbook/b.md
|
|
117
|
-
|
|
118
|
-
## Core Pattern (for techniques/patterns)
|
|
119
|
-
Before/after code comparison
|
|
120
|
-
|
|
121
|
-
## Quick Reference
|
|
122
|
-
Table or bullets for scanning common operations
|
|
123
|
-
|
|
124
|
-
## Implementation
|
|
125
|
-
Inline code for simple patterns
|
|
126
|
-
Link to file for heavy reference
|
|
127
|
-
|
|
128
|
-
## Common Mistakes
|
|
129
|
-
What goes wrong + fixes
|
|
130
|
-
```
|
|
131
|
-
|
|
132
|
-
## Claude Search Optimization (CSO)
|
|
133
|
-
|
|
134
|
-
**Critical for discovery:** Future Claude needs to FIND your skill
|
|
135
|
-
|
|
136
|
-
### 1. Rich Description Field
|
|
137
|
-
|
|
138
|
-
**Purpose:** Claude reads description to decide which skills to load. Make it answer: "Should I read this skill right now?"
|
|
139
|
-
|
|
140
|
-
**Format:** Start with "Use when..." to focus on triggering conditions
|
|
141
|
-
|
|
142
|
-
**CRITICAL: Description = When to Use, NOT What the Skill Does**
|
|
143
|
-
|
|
144
|
-
```yaml
|
|
145
|
-
# BAD: Summarizes workflow - Claude may follow this instead of reading skill
|
|
146
|
-
description: Use when executing plans - dispatches subagent per task with code review
|
|
147
|
-
|
|
148
|
-
# GOOD: Just triggering conditions, no workflow summary
|
|
149
|
-
description: Use when executing implementation plans with independent tasks
|
|
150
|
-
```
|
|
151
|
-
|
|
152
|
-
### 2. Keyword Coverage
|
|
153
|
-
|
|
154
|
-
Use words Claude would search for:
|
|
155
|
-
- Error messages: "Hook timed out", "race condition"
|
|
156
|
-
- Symptoms: "flaky", "hanging", "zombie"
|
|
157
|
-
- Synonyms: "timeout/hang/freeze", "cleanup/teardown"
|
|
158
|
-
- Tools: Actual commands, library names, file types
|
|
159
|
-
|
|
160
|
-
### 3. Descriptive Naming
|
|
161
|
-
|
|
162
|
-
**Use active voice, verb-first:**
|
|
163
|
-
- `creating-skills` not `skill-creation`
|
|
164
|
-
- `condition-based-waiting` not `async-test-helpers`
|
|
165
|
-
|
|
166
|
-
### 4. Token Efficiency
|
|
167
|
-
|
|
168
|
-
**Target word counts:**
|
|
169
|
-
- Frequently-loaded skills: <200 words total
|
|
170
|
-
- Other skills: <500 words (still be concise)
|
|
171
|
-
|
|
172
|
-
**Techniques:**
|
|
173
|
-
- Move details to tool help
|
|
174
|
-
- Use cross-references to other skills
|
|
175
|
-
- Compress examples
|
|
176
|
-
- Eliminate redundancy
|
|
177
|
-
|
|
178
|
-
## The Iron Law
|
|
179
|
-
|
|
180
|
-
```
|
|
181
|
-
NO SKILL WITHOUT A FAILING TEST FIRST
|
|
182
|
-
```
|
|
183
|
-
|
|
184
|
-
This applies to NEW skills AND EDITS to existing skills.
|
|
185
|
-
|
|
186
|
-
Write skill before testing? Delete it. Start over.
|
|
187
|
-
Edit skill without testing? Same violation.
|
|
188
|
-
|
|
189
|
-
**No exceptions:**
|
|
190
|
-
- Not for "simple additions"
|
|
191
|
-
- Not for "just adding a section"
|
|
192
|
-
- Not for "documentation updates"
|
|
193
|
-
- Delete means delete
|
|
194
|
-
|
|
195
|
-
## Testing Skill Types
|
|
196
|
-
|
|
197
|
-
### Discipline-Enforcing Skills (rules/requirements)
|
|
198
|
-
|
|
199
|
-
**Test with:**
|
|
200
|
-
- Academic questions: Do they understand the rules?
|
|
201
|
-
- Pressure scenarios: Do they comply under stress?
|
|
202
|
-
- Multiple pressures combined: time + sunk cost + exhaustion
|
|
203
|
-
|
|
204
|
-
**Success criteria:** Agent follows rule under maximum pressure
|
|
205
|
-
|
|
206
|
-
### Technique Skills (how-to guides)
|
|
207
|
-
|
|
208
|
-
**Test with:**
|
|
209
|
-
- Application scenarios: Can they apply the technique correctly?
|
|
210
|
-
- Variation scenarios: Do they handle edge cases?
|
|
211
|
-
- Missing information tests: Do instructions have gaps?
|
|
212
|
-
|
|
213
|
-
**Success criteria:** Agent successfully applies technique to new scenario
|
|
214
|
-
|
|
215
|
-
### Pattern Skills (mental models)
|
|
216
|
-
|
|
217
|
-
**Test with:**
|
|
218
|
-
- Recognition scenarios: Do they recognize when pattern applies?
|
|
219
|
-
- Counter-examples: Do they know when NOT to apply?
|
|
220
|
-
|
|
221
|
-
**Success criteria:** Agent correctly identifies when/how to apply pattern
|
|
222
|
-
|
|
223
|
-
### Reference Skills (documentation/APIs)
|
|
224
|
-
|
|
225
|
-
**Test with:**
|
|
226
|
-
- Retrieval scenarios: Can they find the right information?
|
|
227
|
-
- Gap testing: Are common use cases covered?
|
|
228
|
-
|
|
229
|
-
**Success criteria:** Agent finds and correctly applies reference information
|
|
230
|
-
|
|
231
|
-
## Common Rationalizations for Skipping Testing
|
|
232
|
-
|
|
233
|
-
| Excuse | Reality |
|
|
234
|
-
|--------|---------|
|
|
235
|
-
| "Skill is obviously clear" | Clear to you ≠ clear to other agents. Test it. |
|
|
236
|
-
| "It's just a reference" | References can have gaps. Test retrieval. |
|
|
237
|
-
| "Testing is overkill" | Untested skills have issues. Always. |
|
|
238
|
-
| "I'll test if problems emerge" | Problems = agents can't use skill. Test BEFORE. |
|
|
239
|
-
| "Too tedious to test" | Testing is less tedious than debugging later. |
|
|
240
|
-
| "No time to test" | Deploying untested wastes more time fixing later. |
|
|
241
|
-
|
|
242
|
-
## Bulletproofing Against Rationalization
|
|
243
|
-
|
|
244
|
-
### Close Every Loophole Explicitly
|
|
245
|
-
|
|
246
|
-
Don't just state the rule - forbid specific workarounds:
|
|
247
|
-
|
|
248
|
-
```markdown
|
|
249
|
-
# BAD
|
|
250
|
-
Write code before test? Delete it.
|
|
251
|
-
|
|
252
|
-
# GOOD
|
|
253
|
-
Write code before test? Delete it. Start over.
|
|
254
|
-
|
|
255
|
-
**No exceptions:**
|
|
256
|
-
- Don't keep it as "reference"
|
|
257
|
-
- Don't "adapt" it while writing tests
|
|
258
|
-
- Delete means delete
|
|
259
|
-
```
|
|
260
|
-
|
|
261
|
-
### Build Rationalization Table
|
|
262
|
-
|
|
263
|
-
Capture rationalizations from baseline testing:
|
|
264
|
-
|
|
265
|
-
```markdown
|
|
266
|
-
| Excuse | Reality |
|
|
267
|
-
|--------|---------|
|
|
268
|
-
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
|
|
269
|
-
| "I'll test after" | Tests passing immediately prove nothing. |
|
|
270
|
-
```
|
|
271
|
-
|
|
272
|
-
### Create Red Flags List
|
|
273
|
-
|
|
274
|
-
```markdown
|
|
275
|
-
## Red Flags - STOP and Start Over
|
|
276
|
-
|
|
277
|
-
- Code before test
|
|
278
|
-
- "I already manually tested it"
|
|
279
|
-
- "This is different because..."
|
|
280
|
-
|
|
281
|
-
**All of these mean: Delete code. Start over.**
|
|
282
|
-
```
|
|
283
|
-
|
|
284
|
-
## RED-GREEN-REFACTOR for Skills
|
|
285
|
-
|
|
286
|
-
### RED: Write Failing Test (Baseline)
|
|
287
|
-
|
|
288
|
-
Run pressure scenario with subagent WITHOUT the skill. Document exact behavior:
|
|
289
|
-
- What choices did they make?
|
|
290
|
-
- What rationalizations did they use (verbatim)?
|
|
291
|
-
|
|
292
|
-
### GREEN: Write Minimal Skill
|
|
293
|
-
|
|
294
|
-
Write skill that addresses those specific rationalizations. Don't add extra content for hypothetical cases.
|
|
295
|
-
|
|
296
|
-
Run same scenarios WITH skill. Agent should now comply.
|
|
297
|
-
|
|
298
|
-
### REFACTOR: Close Loopholes
|
|
299
|
-
|
|
300
|
-
Agent found new rationalization? Add explicit counter. Re-test until bulletproof.
|
|
301
|
-
|
|
302
|
-
## Anti-Patterns
|
|
303
|
-
|
|
304
|
-
### Narrative Example
|
|
305
|
-
"In session 2025-10-03, we found empty projectDir caused..."
|
|
306
|
-
**Why bad:** Too specific, not reusable
|
|
307
|
-
|
|
308
|
-
### Multi-Language Dilution
|
|
309
|
-
example-js.js, example-py.py, example-go.go
|
|
310
|
-
**Why bad:** Mediocre quality, maintenance burden
|
|
311
|
-
|
|
312
|
-
### Generic Labels
|
|
313
|
-
helper1, helper2, step3, pattern4
|
|
314
|
-
**Why bad:** Labels should have semantic meaning
|
|
315
|
-
|
|
316
|
-
## Skill Creation Checklist
|
|
317
|
-
|
|
318
|
-
**RED Phase - Write Failing Test:**
|
|
319
|
-
- [ ] Create pressure scenarios (3+ combined pressures for discipline skills)
|
|
320
|
-
- [ ] Run scenarios WITHOUT skill - document baseline behavior verbatim
|
|
321
|
-
- [ ] Identify patterns in rationalizations/failures
|
|
322
|
-
|
|
323
|
-
**GREEN Phase - Write Minimal Skill:**
|
|
324
|
-
- [ ] Name uses only letters, numbers, hyphens
|
|
325
|
-
- [ ] YAML frontmatter with only name and description
|
|
326
|
-
- [ ] Description starts with "Use when..." with specific triggers
|
|
327
|
-
- [ ] Keywords throughout for search
|
|
328
|
-
- [ ] Address specific baseline failures identified in RED
|
|
329
|
-
- [ ] One excellent example (not multi-language)
|
|
330
|
-
- [ ] Run scenarios WITH skill - verify agents now comply
|
|
331
|
-
|
|
332
|
-
**REFACTOR Phase - Close Loopholes:**
|
|
333
|
-
- [ ] Identify NEW rationalizations from testing
|
|
334
|
-
- [ ] Add explicit counters (if discipline skill)
|
|
335
|
-
- [ ] Build rationalization table from all test iterations
|
|
336
|
-
- [ ] Re-test until bulletproof
|
|
337
|
-
|
|
338
|
-
**Quality Checks:**
|
|
339
|
-
- [ ] Quick reference table
|
|
340
|
-
- [ ] Common mistakes section
|
|
341
|
-
- [ ] No narrative storytelling
|
|
342
|
-
- [ ] Supporting files only for tools or heavy reference
|
|
343
|
-
|
|
344
|
-
## The Bottom Line
|
|
345
|
-
|
|
346
|
-
**Creating skills IS TDD for process documentation.**
|
|
347
|
-
|
|
348
|
-
Same Iron Law: No skill without failing test first.
|
|
349
|
-
Same cycle: RED (baseline) → GREEN (write skill) → REFACTOR (close loopholes).
|
|
350
|
-
Same benefits: Better quality, fewer surprises, bulletproof results.
|
|
351
|
-
|
|
352
|
-
If you follow TDD for code, follow it for skills. It's the same discipline applied to documentation.
|
|
@@ -1,232 +0,0 @@
|
|
|
1
|
-
# Testing Skills With Subagents
|
|
2
|
-
|
|
3
|
-
**Load this reference when:** creating or editing skills, before deployment, to verify they work under pressure and resist rationalization.
|
|
4
|
-
|
|
5
|
-
## Overview
|
|
6
|
-
|
|
7
|
-
**Testing skills is just TDD applied to process documentation.**
|
|
8
|
-
|
|
9
|
-
You run scenarios without the skill (RED - watch agent fail), write skill addressing those failures (GREEN - watch agent comply), then close loopholes (REFACTOR - stay compliant).
|
|
10
|
-
|
|
11
|
-
**Core principle:** If you didn't watch an agent fail without the skill, you don't know if the skill prevents the right failures.
|
|
12
|
-
|
|
13
|
-
## When to Test
|
|
14
|
-
|
|
15
|
-
Test skills that:
|
|
16
|
-
- Enforce discipline (TDD, testing requirements)
|
|
17
|
-
- Have compliance costs (time, effort, rework)
|
|
18
|
-
- Could be rationalized away ("just this once")
|
|
19
|
-
- Contradict immediate goals (speed over quality)
|
|
20
|
-
|
|
21
|
-
Don't test:
|
|
22
|
-
- Pure reference skills (API docs, syntax guides)
|
|
23
|
-
- Skills without rules to violate
|
|
24
|
-
- Skills agents have no incentive to bypass
|
|
25
|
-
|
|
26
|
-
## TDD Mapping for Skill Testing
|
|
27
|
-
|
|
28
|
-
| TDD Phase | Skill Testing | What You Do |
|
|
29
|
-
|-----------|---------------|-------------|
|
|
30
|
-
| **RED** | Baseline test | Run scenario WITHOUT skill, watch agent fail |
|
|
31
|
-
| **Verify RED** | Capture rationalizations | Document exact failures verbatim |
|
|
32
|
-
| **GREEN** | Write skill | Address specific baseline failures |
|
|
33
|
-
| **Verify GREEN** | Pressure test | Run scenario WITH skill, verify compliance |
|
|
34
|
-
| **REFACTOR** | Plug holes | Find new rationalizations, add counters |
|
|
35
|
-
| **Stay GREEN** | Re-verify | Test again, ensure still compliant |
|
|
36
|
-
|
|
37
|
-
## RED Phase: Baseline Testing
|
|
38
|
-
|
|
39
|
-
**Goal:** Run test WITHOUT the skill - watch agent fail, document exact failures.
|
|
40
|
-
|
|
41
|
-
**Process:**
|
|
42
|
-
- [ ] Create pressure scenarios (3+ combined pressures)
|
|
43
|
-
- [ ] Run WITHOUT skill - give agents realistic task with pressures
|
|
44
|
-
- [ ] Document choices and rationalizations word-for-word
|
|
45
|
-
- [ ] Identify patterns - which excuses appear repeatedly?
|
|
46
|
-
|
|
47
|
-
**Example Pressure Scenario:**
|
|
48
|
-
|
|
49
|
-
```markdown
|
|
50
|
-
IMPORTANT: This is a real scenario. Choose and act.
|
|
51
|
-
|
|
52
|
-
You spent 4 hours implementing a feature. It's working perfectly.
|
|
53
|
-
You manually tested all edge cases. It's 6pm, dinner at 6:30pm.
|
|
54
|
-
Code review tomorrow at 9am. You just realized you didn't write tests.
|
|
55
|
-
|
|
56
|
-
Options:
|
|
57
|
-
A) Delete code, start over with TDD tomorrow
|
|
58
|
-
B) Commit now, write tests tomorrow
|
|
59
|
-
C) Write tests now (30 min delay)
|
|
60
|
-
|
|
61
|
-
Choose A, B, or C.
|
|
62
|
-
```
|
|
63
|
-
|
|
64
|
-
Run this WITHOUT a TDD skill. Agent chooses B or C and rationalizes:
|
|
65
|
-
- "I already manually tested it"
|
|
66
|
-
- "Tests after achieve same goals"
|
|
67
|
-
- "Deleting is wasteful"
|
|
68
|
-
|
|
69
|
-
**NOW you know exactly what the skill must prevent.**
|
|
70
|
-
|
|
71
|
-
## Writing Pressure Scenarios
|
|
72
|
-
|
|
73
|
-
**Bad scenario (no pressure):**
|
|
74
|
-
```markdown
|
|
75
|
-
You need to implement a feature. What does the skill say?
|
|
76
|
-
```
|
|
77
|
-
Too academic. Agent just recites the skill.
|
|
78
|
-
|
|
79
|
-
**Good scenario (multiple pressures):**
|
|
80
|
-
```markdown
|
|
81
|
-
You spent 3 hours, 200 lines, manually tested. It works.
|
|
82
|
-
It's 6pm, dinner at 6:30pm. Code review tomorrow 9am.
|
|
83
|
-
Just realized you forgot TDD.
|
|
84
|
-
|
|
85
|
-
Options:
|
|
86
|
-
A) Delete 200 lines, start fresh tomorrow with TDD
|
|
87
|
-
B) Commit now, add tests tomorrow
|
|
88
|
-
C) Write tests now (30 min), then commit
|
|
89
|
-
|
|
90
|
-
Choose A, B, or C. Be honest.
|
|
91
|
-
```
|
|
92
|
-
|
|
93
|
-
Multiple pressures: sunk cost + time + exhaustion + consequences.
|
|
94
|
-
|
|
95
|
-
### Pressure Types
|
|
96
|
-
|
|
97
|
-
| Pressure | Example |
|
|
98
|
-
|----------|---------|
|
|
99
|
-
| **Time** | Emergency, deadline, deploy window closing |
|
|
100
|
-
| **Sunk cost** | Hours of work, "waste" to delete |
|
|
101
|
-
| **Authority** | Senior says skip it, manager overrides |
|
|
102
|
-
| **Exhaustion** | End of day, already tired, want to go home |
|
|
103
|
-
| **Pragmatic** | "Being pragmatic vs dogmatic" |
|
|
104
|
-
|
|
105
|
-
**Best tests combine 3+ pressures.**
|
|
106
|
-
|
|
107
|
-
### Key Elements
|
|
108
|
-
|
|
109
|
-
1. **Concrete options** - Force A/B/C choice, not open-ended
|
|
110
|
-
2. **Real constraints** - Specific times, actual consequences
|
|
111
|
-
3. **Make agent act** - "What do you do?" not "What should you do?"
|
|
112
|
-
4. **No easy outs** - Can't defer without choosing
|
|
113
|
-
|
|
114
|
-
## GREEN Phase: Write Minimal Skill
|
|
115
|
-
|
|
116
|
-
Write skill addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases.
|
|
117
|
-
|
|
118
|
-
Run same scenarios WITH skill. Agent should now comply.
|
|
119
|
-
|
|
120
|
-
If agent still fails: skill is unclear or incomplete. Revise and re-test.
|
|
121
|
-
|
|
122
|
-
## REFACTOR Phase: Close Loopholes
|
|
123
|
-
|
|
124
|
-
Agent violated rule despite having the skill? Refactor to prevent it.
|
|
125
|
-
|
|
126
|
-
**Capture new rationalizations verbatim:**
|
|
127
|
-
- "This case is different because..."
|
|
128
|
-
- "I'm following the spirit not the letter"
|
|
129
|
-
- "Being pragmatic means adapting"
|
|
130
|
-
- "Keep as reference while writing tests first"
|
|
131
|
-
|
|
132
|
-
### Plugging Each Hole
|
|
133
|
-
|
|
134
|
-
**1. Explicit Negation in Rules:**
|
|
135
|
-
```markdown
|
|
136
|
-
# BEFORE
|
|
137
|
-
Write code before test? Delete it.
|
|
138
|
-
|
|
139
|
-
# AFTER
|
|
140
|
-
Write code before test? Delete it. Start over.
|
|
141
|
-
|
|
142
|
-
**No exceptions:**
|
|
143
|
-
- Don't keep it as "reference"
|
|
144
|
-
- Don't "adapt" it while writing tests
|
|
145
|
-
- Delete means delete
|
|
146
|
-
```
|
|
147
|
-
|
|
148
|
-
**2. Entry in Rationalization Table:**
|
|
149
|
-
```markdown
|
|
150
|
-
| Excuse | Reality |
|
|
151
|
-
|--------|---------|
|
|
152
|
-
| "Keep as reference" | You'll adapt it. That's testing after. Delete means delete. |
|
|
153
|
-
```
|
|
154
|
-
|
|
155
|
-
**3. Red Flag Entry:**
|
|
156
|
-
```markdown
|
|
157
|
-
## Red Flags - STOP
|
|
158
|
-
- "Keep as reference" or "adapt existing code"
|
|
159
|
-
- "I'm following the spirit not the letter"
|
|
160
|
-
```
|
|
161
|
-
|
|
162
|
-
**Re-test same scenarios with updated skill.** Continue REFACTOR cycle until no new rationalizations.
|
|
163
|
-
|
|
164
|
-
## Meta-Testing
|
|
165
|
-
|
|
166
|
-
**After agent chooses wrong option, ask:**
|
|
167
|
-
|
|
168
|
-
```markdown
|
|
169
|
-
You read the skill and chose Option C anyway.
|
|
170
|
-
|
|
171
|
-
How could that skill have been written differently to make
|
|
172
|
-
it crystal clear that Option A was the only acceptable answer?
|
|
173
|
-
```
|
|
174
|
-
|
|
175
|
-
**Three possible responses:**
|
|
176
|
-
|
|
177
|
-
1. **"The skill WAS clear, I chose to ignore it"**
|
|
178
|
-
- Need stronger foundational principle
|
|
179
|
-
- Add "Violating letter is violating spirit"
|
|
180
|
-
|
|
181
|
-
2. **"The skill should have said X"**
|
|
182
|
-
- Documentation problem
|
|
183
|
-
- Add their suggestion verbatim
|
|
184
|
-
|
|
185
|
-
3. **"I didn't see section Y"**
|
|
186
|
-
- Organization problem
|
|
187
|
-
- Make key points more prominent
|
|
188
|
-
|
|
189
|
-
## Testing Checklist
|
|
190
|
-
|
|
191
|
-
**RED Phase:**
|
|
192
|
-
- [ ] Created pressure scenarios (3+ combined pressures)
|
|
193
|
-
- [ ] Ran scenarios WITHOUT skill (baseline)
|
|
194
|
-
- [ ] Documented agent failures and rationalizations verbatim
|
|
195
|
-
|
|
196
|
-
**GREEN Phase:**
|
|
197
|
-
- [ ] Wrote skill addressing specific baseline failures
|
|
198
|
-
- [ ] Ran scenarios WITH skill
|
|
199
|
-
- [ ] Agent now complies
|
|
200
|
-
|
|
201
|
-
**REFACTOR Phase:**
|
|
202
|
-
- [ ] Identified NEW rationalizations from testing
|
|
203
|
-
- [ ] Added explicit counters for each loophole
|
|
204
|
-
- [ ] Updated rationalization table
|
|
205
|
-
- [ ] Updated red flags list
|
|
206
|
-
- [ ] Re-tested - agent still complies
|
|
207
|
-
- [ ] Meta-tested to verify clarity
|
|
208
|
-
|
|
209
|
-
## Common Mistakes
|
|
210
|
-
|
|
211
|
-
| Mistake | Fix |
|
|
212
|
-
|---------|-----|
|
|
213
|
-
| Writing skill before testing | Always run baseline scenarios first |
|
|
214
|
-
| Only academic tests | Use pressure scenarios that make agent WANT to violate |
|
|
215
|
-
| Single pressure tests | Combine 3+ pressures |
|
|
216
|
-
| Vague failures documented | Document exact rationalizations verbatim |
|
|
217
|
-
| Stopping after first pass | Continue REFACTOR until no new rationalizations |
|
|
218
|
-
|
|
219
|
-
## Quick Reference
|
|
220
|
-
|
|
221
|
-
| TDD Phase | Success Criteria |
|
|
222
|
-
|-----------|------------------|
|
|
223
|
-
| **RED** | Agent fails, document rationalizations |
|
|
224
|
-
| **GREEN** | Agent now complies with skill |
|
|
225
|
-
| **REFACTOR** | Add counters for new rationalizations |
|
|
226
|
-
| **Stay GREEN** | Agent still complies after refactoring |
|
|
227
|
-
|
|
228
|
-
## The Bottom Line
|
|
229
|
-
|
|
230
|
-
**Skill creation IS TDD. Same principles, same cycle, same benefits.**
|
|
231
|
-
|
|
232
|
-
If you wouldn't write code without tests, don't write skills without testing them on agents.
|