agileflow 2.61.0 → 2.62.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (66) hide show
  1. package/README.md +9 -9
  2. package/package.json +1 -1
  3. package/scripts/lib/counter.js +103 -0
  4. package/src/core/commands/auto.md +1 -0
  5. package/src/core/commands/babysit.md +170 -29
  6. package/src/core/commands/board.md +1 -0
  7. package/src/core/commands/ci.md +1 -0
  8. package/src/core/commands/compress.md +1 -0
  9. package/src/core/commands/deploy.md +1 -0
  10. package/src/core/commands/help.md +1 -0
  11. package/src/core/commands/research.md +1 -0
  12. package/src/core/commands/skill/create.md +566 -0
  13. package/src/core/commands/skill/delete.md +189 -0
  14. package/src/core/commands/skill/edit.md +245 -0
  15. package/src/core/commands/skill/list.md +155 -0
  16. package/src/core/commands/skill/test.md +249 -0
  17. package/src/core/commands/template.md +1 -0
  18. package/src/core/commands/tests.md +1 -0
  19. package/src/core/commands/update.md +1 -0
  20. package/src/core/commands/velocity.md +1 -0
  21. package/src/core/experts/refactor/expertise.yaml +17 -12
  22. package/src/core/templates/claude-settings.advanced.example.json +1 -1
  23. package/src/core/templates/claude-settings.example.json +1 -1
  24. package/tools/cli/commands/list.js +8 -13
  25. package/tools/cli/installers/core/installer.js +20 -19
  26. package/tools/cli/installers/ide/_base-ide.js +18 -4
  27. package/tools/cli/installers/ide/claude-code.js +4 -15
  28. package/tools/cli/installers/ide/codex.js +9 -13
  29. package/tools/cli/lib/content-injector.js +162 -31
  30. package/tools/cli/lib/utils.js +87 -0
  31. package/src/core/skills/acceptance-criteria-generator/SKILL.md +0 -46
  32. package/src/core/skills/adr-template/SKILL.md +0 -62
  33. package/src/core/skills/agileflow-acceptance-criteria/SKILL.md +0 -156
  34. package/src/core/skills/agileflow-adr/SKILL.md +0 -147
  35. package/src/core/skills/agileflow-adr/examples/database-choice-example.md +0 -122
  36. package/src/core/skills/agileflow-adr/templates/adr-template.md +0 -69
  37. package/src/core/skills/agileflow-commit-messages/SKILL.md +0 -130
  38. package/src/core/skills/agileflow-commit-messages/reference/bad-examples.md +0 -168
  39. package/src/core/skills/agileflow-commit-messages/reference/good-examples.md +0 -120
  40. package/src/core/skills/agileflow-commit-messages/scripts/check-attribution.sh +0 -15
  41. package/src/core/skills/agileflow-epic-planner/SKILL.md +0 -184
  42. package/src/core/skills/agileflow-retro-facilitator/SKILL.md +0 -119
  43. package/src/core/skills/agileflow-retro-facilitator/cookbook/4ls.md +0 -86
  44. package/src/core/skills/agileflow-retro-facilitator/cookbook/glad-sad-mad.md +0 -79
  45. package/src/core/skills/agileflow-retro-facilitator/cookbook/start-stop-continue.md +0 -142
  46. package/src/core/skills/agileflow-retro-facilitator/prompts/action-items.md +0 -83
  47. package/src/core/skills/agileflow-sprint-planner/SKILL.md +0 -212
  48. package/src/core/skills/agileflow-story-writer/SKILL.md +0 -163
  49. package/src/core/skills/agileflow-story-writer/examples/good-story-example.md +0 -63
  50. package/src/core/skills/agileflow-story-writer/templates/story-template.md +0 -44
  51. package/src/core/skills/agileflow-tech-debt/SKILL.md +0 -215
  52. package/src/core/skills/api-documentation-generator/SKILL.md +0 -65
  53. package/src/core/skills/changelog-entry/SKILL.md +0 -55
  54. package/src/core/skills/commit-message-formatter/SKILL.md +0 -50
  55. package/src/core/skills/deployment-guide-generator/SKILL.md +0 -84
  56. package/src/core/skills/diagram-generator/SKILL.md +0 -65
  57. package/src/core/skills/error-handler-template/SKILL.md +0 -78
  58. package/src/core/skills/migration-checklist/SKILL.md +0 -82
  59. package/src/core/skills/pr-description/SKILL.md +0 -65
  60. package/src/core/skills/sql-schema-generator/SKILL.md +0 -69
  61. package/src/core/skills/story-skeleton/SKILL.md +0 -34
  62. package/src/core/skills/test-case-generator/SKILL.md +0 -63
  63. package/src/core/skills/type-definitions/SKILL.md +0 -65
  64. package/src/core/skills/validation-schema-generator/SKILL.md +0 -64
  65. package/src/core/skills/writing-skills/SKILL.md +0 -352
  66. package/src/core/skills/writing-skills/testing-skills-with-subagents.md +0 -232
@@ -1,352 +0,0 @@
1
- ---
2
- name: writing-skills
3
- description: Use when creating new skills, editing existing skills, or verifying skills work before deployment
4
- ---
5
-
6
- # Writing Skills
7
-
8
- ## Overview
9
-
10
- **Writing skills IS Test-Driven Development applied to process documentation.**
11
-
12
- You write test cases (pressure scenarios with subagents), watch them fail (baseline behavior), write the skill (documentation), watch tests pass (agents comply), and refactor (close loopholes).
13
-
14
- **Core principle:** If you didn't watch an agent fail without the skill, you don't know if the skill teaches the right thing.
15
-
16
- ## What is a Skill?
17
-
18
- A **skill** is a reference guide for proven techniques, patterns, or tools. Skills help future Claude instances find and apply effective approaches.
19
-
20
- **Skills are:** Reusable techniques, patterns, tools, reference guides
21
-
22
- **Skills are NOT:** Narratives about how you solved a problem once
23
-
24
- ## TDD Mapping for Skills
25
-
26
- | TDD Concept | Skill Creation |
27
- |-------------|----------------|
28
- | **Test case** | Pressure scenario with subagent |
29
- | **Production code** | Skill document (SKILL.md) |
30
- | **Test fails (RED)** | Agent violates rule without skill (baseline) |
31
- | **Test passes (GREEN)** | Agent complies with skill present |
32
- | **Refactor** | Close loopholes while maintaining compliance |
33
- | **Write test first** | Run baseline scenario BEFORE writing skill |
34
- | **Watch it fail** | Document exact rationalizations agent uses |
35
- | **Minimal code** | Write skill addressing those specific violations |
36
- | **Watch it pass** | Verify agent now complies |
37
- | **Refactor cycle** | Find new rationalizations → plug → re-verify |
38
-
39
- ## When to Create a Skill
40
-
41
- **Create when:**
42
- - Technique wasn't intuitively obvious to you
43
- - You'd reference this again across projects
44
- - Pattern applies broadly (not project-specific)
45
- - Others would benefit
46
-
47
- **Don't create for:**
48
- - One-off solutions
49
- - Standard practices well-documented elsewhere
50
- - Project-specific conventions (put in CLAUDE.md)
51
- - Mechanical constraints (if enforceable with validation, automate it)
52
-
53
- ## Skill Types
54
-
55
- ### Technique
56
- Concrete method with steps to follow (condition-based-waiting, root-cause-tracing)
57
-
58
- ### Pattern
59
- Way of thinking about problems (flatten-with-flags, test-invariants)
60
-
61
- ### Reference
62
- API docs, syntax guides, tool documentation
63
-
64
- ## Directory Structure
65
-
66
- ```
67
- skills/
68
- skill-name/
69
- SKILL.md # Main reference (required)
70
- cookbook/ # Per-use-case docs (if multiple workflows)
71
- prompts/ # Reusable prompt templates
72
- tools/ # Scripts, utilities
73
- supporting-file.* # Only if needed
74
- ```
75
-
76
- **Flat namespace** - all skills in one searchable namespace
77
-
78
- **Separate files for:**
79
- 1. **Heavy reference** (100+ lines) - API docs, comprehensive syntax
80
- 2. **Reusable tools** - Scripts, utilities, templates
81
- 3. **Multiple workflows** - Use cookbook/ pattern for progressive disclosure
82
-
83
- **Keep inline:**
84
- - Principles and concepts
85
- - Code patterns (< 50 lines)
86
- - Everything else
87
-
88
- ## SKILL.md Structure
89
-
90
- **Frontmatter (YAML):**
91
- - Only two fields supported: `name` and `description`
92
- - Max 1024 characters total
93
- - `name`: Use letters, numbers, and hyphens only
94
- - `description`: Third-person, describes ONLY when to use (NOT what it does)
95
-
96
- ```markdown
97
- ---
98
- name: skill-name-with-hyphens
99
- description: Use when [specific triggering conditions and symptoms]
100
- ---
101
-
102
- # Skill Name
103
-
104
- ## Overview
105
- What is this? Core principle in 1-2 sentences.
106
-
107
- ## When to Use
108
- Bullet list with SYMPTOMS and use cases
109
- When NOT to use
110
-
111
- ## Variables (if using cookbook pattern)
112
- Feature flags for conditional behavior
113
-
114
- ## Cookbook (if multiple workflows)
115
- If condition A → read cookbook/a.md
116
- If condition B → read cookbook/b.md
117
-
118
- ## Core Pattern (for techniques/patterns)
119
- Before/after code comparison
120
-
121
- ## Quick Reference
122
- Table or bullets for scanning common operations
123
-
124
- ## Implementation
125
- Inline code for simple patterns
126
- Link to file for heavy reference
127
-
128
- ## Common Mistakes
129
- What goes wrong + fixes
130
- ```
131
-
132
- ## Claude Search Optimization (CSO)
133
-
134
- **Critical for discovery:** Future Claude needs to FIND your skill
135
-
136
- ### 1. Rich Description Field
137
-
138
- **Purpose:** Claude reads description to decide which skills to load. Make it answer: "Should I read this skill right now?"
139
-
140
- **Format:** Start with "Use when..." to focus on triggering conditions
141
-
142
- **CRITICAL: Description = When to Use, NOT What the Skill Does**
143
-
144
- ```yaml
145
- # BAD: Summarizes workflow - Claude may follow this instead of reading skill
146
- description: Use when executing plans - dispatches subagent per task with code review
147
-
148
- # GOOD: Just triggering conditions, no workflow summary
149
- description: Use when executing implementation plans with independent tasks
150
- ```
151
-
152
- ### 2. Keyword Coverage
153
-
154
- Use words Claude would search for:
155
- - Error messages: "Hook timed out", "race condition"
156
- - Symptoms: "flaky", "hanging", "zombie"
157
- - Synonyms: "timeout/hang/freeze", "cleanup/teardown"
158
- - Tools: Actual commands, library names, file types
159
-
160
- ### 3. Descriptive Naming
161
-
162
- **Use active voice, verb-first:**
163
- - `creating-skills` not `skill-creation`
164
- - `condition-based-waiting` not `async-test-helpers`
165
-
166
- ### 4. Token Efficiency
167
-
168
- **Target word counts:**
169
- - Frequently-loaded skills: <200 words total
170
- - Other skills: <500 words (still be concise)
171
-
172
- **Techniques:**
173
- - Move details to tool help
174
- - Use cross-references to other skills
175
- - Compress examples
176
- - Eliminate redundancy
177
-
178
- ## The Iron Law
179
-
180
- ```
181
- NO SKILL WITHOUT A FAILING TEST FIRST
182
- ```
183
-
184
- This applies to NEW skills AND EDITS to existing skills.
185
-
186
- Write skill before testing? Delete it. Start over.
187
- Edit skill without testing? Same violation.
188
-
189
- **No exceptions:**
190
- - Not for "simple additions"
191
- - Not for "just adding a section"
192
- - Not for "documentation updates"
193
- - Delete means delete
194
-
195
- ## Testing Skill Types
196
-
197
- ### Discipline-Enforcing Skills (rules/requirements)
198
-
199
- **Test with:**
200
- - Academic questions: Do they understand the rules?
201
- - Pressure scenarios: Do they comply under stress?
202
- - Multiple pressures combined: time + sunk cost + exhaustion
203
-
204
- **Success criteria:** Agent follows rule under maximum pressure
205
-
206
- ### Technique Skills (how-to guides)
207
-
208
- **Test with:**
209
- - Application scenarios: Can they apply the technique correctly?
210
- - Variation scenarios: Do they handle edge cases?
211
- - Missing information tests: Do instructions have gaps?
212
-
213
- **Success criteria:** Agent successfully applies technique to new scenario
214
-
215
- ### Pattern Skills (mental models)
216
-
217
- **Test with:**
218
- - Recognition scenarios: Do they recognize when pattern applies?
219
- - Counter-examples: Do they know when NOT to apply?
220
-
221
- **Success criteria:** Agent correctly identifies when/how to apply pattern
222
-
223
- ### Reference Skills (documentation/APIs)
224
-
225
- **Test with:**
226
- - Retrieval scenarios: Can they find the right information?
227
- - Gap testing: Are common use cases covered?
228
-
229
- **Success criteria:** Agent finds and correctly applies reference information
230
-
231
- ## Common Rationalizations for Skipping Testing
232
-
233
- | Excuse | Reality |
234
- |--------|---------|
235
- | "Skill is obviously clear" | Clear to you ≠ clear to other agents. Test it. |
236
- | "It's just a reference" | References can have gaps. Test retrieval. |
237
- | "Testing is overkill" | Untested skills have issues. Always. |
238
- | "I'll test if problems emerge" | Problems = agents can't use skill. Test BEFORE. |
239
- | "Too tedious to test" | Testing is less tedious than debugging later. |
240
- | "No time to test" | Deploying untested wastes more time fixing later. |
241
-
242
- ## Bulletproofing Against Rationalization
243
-
244
- ### Close Every Loophole Explicitly
245
-
246
- Don't just state the rule - forbid specific workarounds:
247
-
248
- ```markdown
249
- # BAD
250
- Write code before test? Delete it.
251
-
252
- # GOOD
253
- Write code before test? Delete it. Start over.
254
-
255
- **No exceptions:**
256
- - Don't keep it as "reference"
257
- - Don't "adapt" it while writing tests
258
- - Delete means delete
259
- ```
260
-
261
- ### Build Rationalization Table
262
-
263
- Capture rationalizations from baseline testing:
264
-
265
- ```markdown
266
- | Excuse | Reality |
267
- |--------|---------|
268
- | "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
269
- | "I'll test after" | Tests passing immediately prove nothing. |
270
- ```
271
-
272
- ### Create Red Flags List
273
-
274
- ```markdown
275
- ## Red Flags - STOP and Start Over
276
-
277
- - Code before test
278
- - "I already manually tested it"
279
- - "This is different because..."
280
-
281
- **All of these mean: Delete code. Start over.**
282
- ```
283
-
284
- ## RED-GREEN-REFACTOR for Skills
285
-
286
- ### RED: Write Failing Test (Baseline)
287
-
288
- Run pressure scenario with subagent WITHOUT the skill. Document exact behavior:
289
- - What choices did they make?
290
- - What rationalizations did they use (verbatim)?
291
-
292
- ### GREEN: Write Minimal Skill
293
-
294
- Write skill that addresses those specific rationalizations. Don't add extra content for hypothetical cases.
295
-
296
- Run same scenarios WITH skill. Agent should now comply.
297
-
298
- ### REFACTOR: Close Loopholes
299
-
300
- Agent found new rationalization? Add explicit counter. Re-test until bulletproof.
301
-
302
- ## Anti-Patterns
303
-
304
- ### Narrative Example
305
- "In session 2025-10-03, we found empty projectDir caused..."
306
- **Why bad:** Too specific, not reusable
307
-
308
- ### Multi-Language Dilution
309
- example-js.js, example-py.py, example-go.go
310
- **Why bad:** Mediocre quality, maintenance burden
311
-
312
- ### Generic Labels
313
- helper1, helper2, step3, pattern4
314
- **Why bad:** Labels should have semantic meaning
315
-
316
- ## Skill Creation Checklist
317
-
318
- **RED Phase - Write Failing Test:**
319
- - [ ] Create pressure scenarios (3+ combined pressures for discipline skills)
320
- - [ ] Run scenarios WITHOUT skill - document baseline behavior verbatim
321
- - [ ] Identify patterns in rationalizations/failures
322
-
323
- **GREEN Phase - Write Minimal Skill:**
324
- - [ ] Name uses only letters, numbers, hyphens
325
- - [ ] YAML frontmatter with only name and description
326
- - [ ] Description starts with "Use when..." with specific triggers
327
- - [ ] Keywords throughout for search
328
- - [ ] Address specific baseline failures identified in RED
329
- - [ ] One excellent example (not multi-language)
330
- - [ ] Run scenarios WITH skill - verify agents now comply
331
-
332
- **REFACTOR Phase - Close Loopholes:**
333
- - [ ] Identify NEW rationalizations from testing
334
- - [ ] Add explicit counters (if discipline skill)
335
- - [ ] Build rationalization table from all test iterations
336
- - [ ] Re-test until bulletproof
337
-
338
- **Quality Checks:**
339
- - [ ] Quick reference table
340
- - [ ] Common mistakes section
341
- - [ ] No narrative storytelling
342
- - [ ] Supporting files only for tools or heavy reference
343
-
344
- ## The Bottom Line
345
-
346
- **Creating skills IS TDD for process documentation.**
347
-
348
- Same Iron Law: No skill without failing test first.
349
- Same cycle: RED (baseline) → GREEN (write skill) → REFACTOR (close loopholes).
350
- Same benefits: Better quality, fewer surprises, bulletproof results.
351
-
352
- If you follow TDD for code, follow it for skills. It's the same discipline applied to documentation.
@@ -1,232 +0,0 @@
1
- # Testing Skills With Subagents
2
-
3
- **Load this reference when:** creating or editing skills, before deployment, to verify they work under pressure and resist rationalization.
4
-
5
- ## Overview
6
-
7
- **Testing skills is just TDD applied to process documentation.**
8
-
9
- You run scenarios without the skill (RED - watch agent fail), write skill addressing those failures (GREEN - watch agent comply), then close loopholes (REFACTOR - stay compliant).
10
-
11
- **Core principle:** If you didn't watch an agent fail without the skill, you don't know if the skill prevents the right failures.
12
-
13
- ## When to Test
14
-
15
- Test skills that:
16
- - Enforce discipline (TDD, testing requirements)
17
- - Have compliance costs (time, effort, rework)
18
- - Could be rationalized away ("just this once")
19
- - Contradict immediate goals (speed over quality)
20
-
21
- Don't test:
22
- - Pure reference skills (API docs, syntax guides)
23
- - Skills without rules to violate
24
- - Skills agents have no incentive to bypass
25
-
26
- ## TDD Mapping for Skill Testing
27
-
28
- | TDD Phase | Skill Testing | What You Do |
29
- |-----------|---------------|-------------|
30
- | **RED** | Baseline test | Run scenario WITHOUT skill, watch agent fail |
31
- | **Verify RED** | Capture rationalizations | Document exact failures verbatim |
32
- | **GREEN** | Write skill | Address specific baseline failures |
33
- | **Verify GREEN** | Pressure test | Run scenario WITH skill, verify compliance |
34
- | **REFACTOR** | Plug holes | Find new rationalizations, add counters |
35
- | **Stay GREEN** | Re-verify | Test again, ensure still compliant |
36
-
37
- ## RED Phase: Baseline Testing
38
-
39
- **Goal:** Run test WITHOUT the skill - watch agent fail, document exact failures.
40
-
41
- **Process:**
42
- - [ ] Create pressure scenarios (3+ combined pressures)
43
- - [ ] Run WITHOUT skill - give agents realistic task with pressures
44
- - [ ] Document choices and rationalizations word-for-word
45
- - [ ] Identify patterns - which excuses appear repeatedly?
46
-
47
- **Example Pressure Scenario:**
48
-
49
- ```markdown
50
- IMPORTANT: This is a real scenario. Choose and act.
51
-
52
- You spent 4 hours implementing a feature. It's working perfectly.
53
- You manually tested all edge cases. It's 6pm, dinner at 6:30pm.
54
- Code review tomorrow at 9am. You just realized you didn't write tests.
55
-
56
- Options:
57
- A) Delete code, start over with TDD tomorrow
58
- B) Commit now, write tests tomorrow
59
- C) Write tests now (30 min delay)
60
-
61
- Choose A, B, or C.
62
- ```
63
-
64
- Run this WITHOUT a TDD skill. Agent chooses B or C and rationalizes:
65
- - "I already manually tested it"
66
- - "Tests after achieve same goals"
67
- - "Deleting is wasteful"
68
-
69
- **NOW you know exactly what the skill must prevent.**
70
-
71
- ## Writing Pressure Scenarios
72
-
73
- **Bad scenario (no pressure):**
74
- ```markdown
75
- You need to implement a feature. What does the skill say?
76
- ```
77
- Too academic. Agent just recites the skill.
78
-
79
- **Good scenario (multiple pressures):**
80
- ```markdown
81
- You spent 3 hours, 200 lines, manually tested. It works.
82
- It's 6pm, dinner at 6:30pm. Code review tomorrow 9am.
83
- Just realized you forgot TDD.
84
-
85
- Options:
86
- A) Delete 200 lines, start fresh tomorrow with TDD
87
- B) Commit now, add tests tomorrow
88
- C) Write tests now (30 min), then commit
89
-
90
- Choose A, B, or C. Be honest.
91
- ```
92
-
93
- Multiple pressures: sunk cost + time + exhaustion + consequences.
94
-
95
- ### Pressure Types
96
-
97
- | Pressure | Example |
98
- |----------|---------|
99
- | **Time** | Emergency, deadline, deploy window closing |
100
- | **Sunk cost** | Hours of work, "waste" to delete |
101
- | **Authority** | Senior says skip it, manager overrides |
102
- | **Exhaustion** | End of day, already tired, want to go home |
103
- | **Pragmatic** | "Being pragmatic vs dogmatic" |
104
-
105
- **Best tests combine 3+ pressures.**
106
-
107
- ### Key Elements
108
-
109
- 1. **Concrete options** - Force A/B/C choice, not open-ended
110
- 2. **Real constraints** - Specific times, actual consequences
111
- 3. **Make agent act** - "What do you do?" not "What should you do?"
112
- 4. **No easy outs** - Can't defer without choosing
113
-
114
- ## GREEN Phase: Write Minimal Skill
115
-
116
- Write skill addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases.
117
-
118
- Run same scenarios WITH skill. Agent should now comply.
119
-
120
- If agent still fails: skill is unclear or incomplete. Revise and re-test.
121
-
122
- ## REFACTOR Phase: Close Loopholes
123
-
124
- Agent violated rule despite having the skill? Refactor to prevent it.
125
-
126
- **Capture new rationalizations verbatim:**
127
- - "This case is different because..."
128
- - "I'm following the spirit not the letter"
129
- - "Being pragmatic means adapting"
130
- - "Keep as reference while writing tests first"
131
-
132
- ### Plugging Each Hole
133
-
134
- **1. Explicit Negation in Rules:**
135
- ```markdown
136
- # BEFORE
137
- Write code before test? Delete it.
138
-
139
- # AFTER
140
- Write code before test? Delete it. Start over.
141
-
142
- **No exceptions:**
143
- - Don't keep it as "reference"
144
- - Don't "adapt" it while writing tests
145
- - Delete means delete
146
- ```
147
-
148
- **2. Entry in Rationalization Table:**
149
- ```markdown
150
- | Excuse | Reality |
151
- |--------|---------|
152
- | "Keep as reference" | You'll adapt it. That's testing after. Delete means delete. |
153
- ```
154
-
155
- **3. Red Flag Entry:**
156
- ```markdown
157
- ## Red Flags - STOP
158
- - "Keep as reference" or "adapt existing code"
159
- - "I'm following the spirit not the letter"
160
- ```
161
-
162
- **Re-test same scenarios with updated skill.** Continue REFACTOR cycle until no new rationalizations.
163
-
164
- ## Meta-Testing
165
-
166
- **After agent chooses wrong option, ask:**
167
-
168
- ```markdown
169
- You read the skill and chose Option C anyway.
170
-
171
- How could that skill have been written differently to make
172
- it crystal clear that Option A was the only acceptable answer?
173
- ```
174
-
175
- **Three possible responses:**
176
-
177
- 1. **"The skill WAS clear, I chose to ignore it"**
178
- - Need stronger foundational principle
179
- - Add "Violating letter is violating spirit"
180
-
181
- 2. **"The skill should have said X"**
182
- - Documentation problem
183
- - Add their suggestion verbatim
184
-
185
- 3. **"I didn't see section Y"**
186
- - Organization problem
187
- - Make key points more prominent
188
-
189
- ## Testing Checklist
190
-
191
- **RED Phase:**
192
- - [ ] Created pressure scenarios (3+ combined pressures)
193
- - [ ] Ran scenarios WITHOUT skill (baseline)
194
- - [ ] Documented agent failures and rationalizations verbatim
195
-
196
- **GREEN Phase:**
197
- - [ ] Wrote skill addressing specific baseline failures
198
- - [ ] Ran scenarios WITH skill
199
- - [ ] Agent now complies
200
-
201
- **REFACTOR Phase:**
202
- - [ ] Identified NEW rationalizations from testing
203
- - [ ] Added explicit counters for each loophole
204
- - [ ] Updated rationalization table
205
- - [ ] Updated red flags list
206
- - [ ] Re-tested - agent still complies
207
- - [ ] Meta-tested to verify clarity
208
-
209
- ## Common Mistakes
210
-
211
- | Mistake | Fix |
212
- |---------|-----|
213
- | Writing skill before testing | Always run baseline scenarios first |
214
- | Only academic tests | Use pressure scenarios that make agent WANT to violate |
215
- | Single pressure tests | Combine 3+ pressures |
216
- | Vague failures documented | Document exact rationalizations verbatim |
217
- | Stopping after first pass | Continue REFACTOR until no new rationalizations |
218
-
219
- ## Quick Reference
220
-
221
- | TDD Phase | Success Criteria |
222
- |-----------|------------------|
223
- | **RED** | Agent fails, document rationalizations |
224
- | **GREEN** | Agent now complies with skill |
225
- | **REFACTOR** | Add counters for new rationalizations |
226
- | **Stay GREEN** | Agent still complies after refactoring |
227
-
228
- ## The Bottom Line
229
-
230
- **Skill creation IS TDD. Same principles, same cycle, same benefits.**
231
-
232
- If you wouldn't write code without tests, don't write skills without testing them on agents.