@simplysm/sd-claude 13.0.78 → 13.0.81

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (68) hide show
  1. package/claude/rules/sd-claude-rules.md +4 -63
  2. package/claude/rules/sd-simplysm-usage.md +7 -0
  3. package/claude/sd-session-start.sh +10 -0
  4. package/claude/sd-statusline.py +249 -0
  5. package/claude/skills/sd-api-review/SKILL.md +89 -0
  6. package/claude/skills/sd-check/SKILL.md +55 -57
  7. package/claude/skills/sd-commit/SKILL.md +37 -42
  8. package/claude/skills/sd-debug/SKILL.md +75 -265
  9. package/claude/skills/sd-document/SKILL.md +63 -53
  10. package/claude/skills/sd-document/_common.py +94 -0
  11. package/claude/skills/sd-document/extract_docx.py +19 -48
  12. package/claude/skills/sd-document/extract_pdf.py +22 -50
  13. package/claude/skills/sd-document/extract_pptx.py +17 -40
  14. package/claude/skills/sd-document/extract_xlsx.py +19 -40
  15. package/claude/skills/sd-email-analyze/SKILL.md +23 -31
  16. package/claude/skills/sd-email-analyze/email-analyzer.py +79 -65
  17. package/claude/skills/sd-init/SKILL.md +133 -0
  18. package/claude/skills/sd-plan/SKILL.md +69 -120
  19. package/claude/skills/sd-readme/SKILL.md +106 -131
  20. package/claude/skills/sd-review/SKILL.md +38 -155
  21. package/claude/skills/sd-simplify/SKILL.md +59 -0
  22. package/dist/commands/install.js +20 -6
  23. package/dist/commands/install.js.map +1 -1
  24. package/package.json +3 -2
  25. package/src/commands/install.ts +29 -7
  26. package/README.md +0 -297
  27. package/claude/refs/sd-angular.md +0 -127
  28. package/claude/refs/sd-code-conventions.md +0 -155
  29. package/claude/refs/sd-directories.md +0 -7
  30. package/claude/refs/sd-library-issue.md +0 -7
  31. package/claude/refs/sd-migration.md +0 -7
  32. package/claude/refs/sd-orm-v12.md +0 -81
  33. package/claude/refs/sd-orm.md +0 -23
  34. package/claude/refs/sd-service.md +0 -5
  35. package/claude/refs/sd-simplysm-docs.md +0 -52
  36. package/claude/refs/sd-solid.md +0 -68
  37. package/claude/refs/sd-workflow.md +0 -25
  38. package/claude/rules/sd-refs-linker.md +0 -52
  39. package/claude/sd-statusline.js +0 -296
  40. package/claude/skills/sd-api-name-review/SKILL.md +0 -154
  41. package/claude/skills/sd-brainstorm/SKILL.md +0 -215
  42. package/claude/skills/sd-debug/condition-based-waiting-example.ts +0 -158
  43. package/claude/skills/sd-debug/condition-based-waiting.md +0 -114
  44. package/claude/skills/sd-debug/defense-in-depth.md +0 -128
  45. package/claude/skills/sd-debug/find-polluter.sh +0 -64
  46. package/claude/skills/sd-debug/root-cause-tracing.md +0 -168
  47. package/claude/skills/sd-discuss/SKILL.md +0 -91
  48. package/claude/skills/sd-explore/SKILL.md +0 -118
  49. package/claude/skills/sd-plan-dev/SKILL.md +0 -294
  50. package/claude/skills/sd-plan-dev/code-quality-reviewer-prompt.md +0 -49
  51. package/claude/skills/sd-plan-dev/final-review-prompt.md +0 -50
  52. package/claude/skills/sd-plan-dev/implementer-prompt.md +0 -60
  53. package/claude/skills/sd-plan-dev/spec-reviewer-prompt.md +0 -45
  54. package/claude/skills/sd-review/api-reviewer-prompt.md +0 -75
  55. package/claude/skills/sd-review/code-reviewer-prompt.md +0 -82
  56. package/claude/skills/sd-review/convention-checker-prompt.md +0 -61
  57. package/claude/skills/sd-review/refactoring-analyzer-prompt.md +0 -92
  58. package/claude/skills/sd-skill/SKILL.md +0 -417
  59. package/claude/skills/sd-skill/anthropic-best-practices.md +0 -156
  60. package/claude/skills/sd-skill/cso-guide.md +0 -161
  61. package/claude/skills/sd-skill/examples/CLAUDE_MD_TESTING.md +0 -200
  62. package/claude/skills/sd-skill/persuasion-principles.md +0 -220
  63. package/claude/skills/sd-skill/testing-skills-with-subagents.md +0 -408
  64. package/claude/skills/sd-skill/writing-guide.md +0 -159
  65. package/claude/skills/sd-tdd/SKILL.md +0 -385
  66. package/claude/skills/sd-tdd/testing-anti-patterns.md +0 -317
  67. package/claude/skills/sd-use/SKILL.md +0 -67
  68. package/claude/skills/sd-worktree/SKILL.md +0 -78
@@ -1,220 +0,0 @@
1
- # Persuasion Principles for Skill Design
2
-
3
- ## Overview
4
-
5
- LLMs respond to the same persuasion principles as humans. Understanding this psychology helps you design more effective skills - not to manipulate, but to ensure critical practices are followed even under pressure.
6
-
7
- **Research foundation:** Meincke et al. (2025) tested 7 persuasion principles with N=28,000 AI conversations. Persuasion techniques more than doubled compliance rates (33% → 72%, p < .001).
8
-
9
- ## The Seven Principles
10
-
11
- ### 1. Authority
12
-
13
- **What it is:** Deference to expertise, credentials, or official sources.
14
-
15
- **How it works in skills:**
16
-
17
- - Imperative language: "YOU MUST", "Never", "Always"
18
- - Non-negotiable framing: "No exceptions"
19
- - Eliminates decision fatigue and rationalization
20
-
21
- **When to use:**
22
-
23
- - Discipline-enforcing skills (TDD, verification requirements)
24
- - Safety-critical practices
25
- - Established best practices
26
-
27
- **Example:**
28
-
29
- ```markdown
30
- ✅ Write code before test? Delete it. Start over. No exceptions.
31
- ❌ Consider writing tests first when feasible.
32
- ```
33
-
34
- ### 2. Commitment
35
-
36
- **What it is:** Consistency with prior actions, statements, or public declarations.
37
-
38
- **How it works in skills:**
39
-
40
- - Require announcements: "Announce skill usage"
41
- - Force explicit choices: "Choose A, B, or C"
42
- - Use tracking: TodoWrite for checklists
43
-
44
- **When to use:**
45
-
46
- - Ensuring skills are actually followed
47
- - Multi-step processes
48
- - Accountability mechanisms
49
-
50
- **Example:**
51
-
52
- ```markdown
53
- ✅ When you find a skill, you MUST announce: "I'm using [Skill Name]"
54
- ❌ Consider letting your partner know which skill you're using.
55
- ```
56
-
57
- ### 3. Scarcity
58
-
59
- **What it is:** Urgency from time limits or limited availability.
60
-
61
- **How it works in skills:**
62
-
63
- - Time-bound requirements: "Before proceeding"
64
- - Sequential dependencies: "Immediately after X"
65
- - Prevents procrastination
66
-
67
- **When to use:**
68
-
69
- - Immediate verification requirements
70
- - Time-sensitive workflows
71
- - Preventing "I'll do it later"
72
-
73
- **Example:**
74
-
75
- ```markdown
76
- ✅ After completing a task, IMMEDIATELY request code review before proceeding.
77
- ❌ You can review code when convenient.
78
- ```
79
-
80
- ### 4. Social Proof
81
-
82
- **What it is:** Conformity to what others do or what's considered normal.
83
-
84
- **How it works in skills:**
85
-
86
- - Universal patterns: "Every time", "Always"
87
- - Failure modes: "X without Y = failure"
88
- - Establishes norms
89
-
90
- **When to use:**
91
-
92
- - Documenting universal practices
93
- - Warning about common failures
94
- - Reinforcing standards
95
-
96
- **Example:**
97
-
98
- ```markdown
99
- ✅ Checklists without TodoWrite tracking = steps get skipped. Every time.
100
- ❌ Some people find TodoWrite helpful for checklists.
101
- ```
102
-
103
- ### 5. Unity
104
-
105
- **What it is:** Shared identity, "we-ness", in-group belonging.
106
-
107
- **How it works in skills:**
108
-
109
- - Collaborative language: "our codebase", "we're colleagues"
110
- - Shared goals: "we both want quality"
111
-
112
- **When to use:**
113
-
114
- - Collaborative workflows
115
- - Establishing team culture
116
- - Non-hierarchical practices
117
-
118
- **Example:**
119
-
120
- ```markdown
121
- ✅ We're colleagues working together. I need your honest technical judgment.
122
- ❌ You should probably tell me if I'm wrong.
123
- ```
124
-
125
- ### 6. Reciprocity
126
-
127
- **What it is:** Obligation to return benefits received.
128
-
129
- **How it works:**
130
-
131
- - Use sparingly - can feel manipulative
132
- - Rarely needed in skills
133
-
134
- **When to avoid:**
135
-
136
- - Almost always (other principles more effective)
137
-
138
- ### 7. Liking
139
-
140
- **What it is:** Preference for cooperating with those we like.
141
-
142
- **How it works:**
143
-
144
- - **DON'T USE for compliance**
145
- - Conflicts with honest feedback culture
146
- - Creates sycophancy
147
-
148
- **When to avoid:**
149
-
150
- - Always for discipline enforcement
151
-
152
- ## Principle Combinations by Skill Type
153
-
154
- | Skill Type | Use | Avoid |
155
- | -------------------- | ------------------------------------- | ------------------- |
156
- | Discipline-enforcing | Authority + Commitment + Social Proof | Liking, Reciprocity |
157
- | Guidance/technique | Moderate Authority + Unity | Heavy authority |
158
- | Collaborative | Unity + Commitment | Authority, Liking |
159
- | Reference | Clarity only | All persuasion |
160
-
161
- ## Why This Works: The Psychology
162
-
163
- **Bright-line rules reduce rationalization:**
164
-
165
- - "YOU MUST" removes decision fatigue
166
- - Absolute language eliminates "is this an exception?" questions
167
- - Explicit anti-rationalization counters close specific loopholes
168
-
169
- **Implementation intentions create automatic behavior:**
170
-
171
- - Clear triggers + required actions = automatic execution
172
- - "When X, do Y" more effective than "generally do Y"
173
- - Reduces cognitive load on compliance
174
-
175
- **LLMs are parahuman:**
176
-
177
- - Trained on human text containing these patterns
178
- - Authority language precedes compliance in training data
179
- - Commitment sequences (statement → action) frequently modeled
180
- - Social proof patterns (everyone does X) establish norms
181
-
182
- ## Ethical Use
183
-
184
- **Legitimate:**
185
-
186
- - Ensuring critical practices are followed
187
- - Creating effective documentation
188
- - Preventing predictable failures
189
-
190
- **Illegitimate:**
191
-
192
- - Manipulating for personal gain
193
- - Creating false urgency
194
- - Guilt-based compliance
195
-
196
- **The test:** Would this technique serve the user's genuine interests if they fully understood it?
197
-
198
- ## Research Citations
199
-
200
- **Cialdini, R. B. (2021).** _Influence: The Psychology of Persuasion (New and Expanded)._ Harper Business.
201
-
202
- - Seven principles of persuasion
203
- - Empirical foundation for influence research
204
-
205
- **Meincke, L., Shapiro, D., Duckworth, A. L., Mollick, E., Mollick, L., & Cialdini, R. (2025).** Call Me A Jerk: Persuading AI to Comply with Objectionable Requests. University of Pennsylvania.
206
-
207
- - Tested 7 principles with N=28,000 LLM conversations
208
- - Compliance increased 33% → 72% with persuasion techniques
209
- - Authority, commitment, scarcity most effective
210
- - Validates parahuman model of LLM behavior
211
-
212
- ## Quick Reference
213
-
214
- When designing a skill, ask:
215
-
216
- 1. **What type is it?** (Discipline vs. guidance vs. reference)
217
- 2. **What behavior am I trying to change?**
218
- 3. **Which principle(s) apply?** (Usually authority + commitment for discipline)
219
- 4. **Am I combining too many?** (Don't use all seven)
220
- 5. **Is this ethical?** (Serves user's genuine interests?)
@@ -1,408 +0,0 @@
1
- # Testing Skills With Subagents
2
-
3
- **Load this reference when:** creating or editing skills, before deployment, to verify they work under pressure and resist rationalization.
4
-
5
- ## Overview
6
-
7
- **Testing skills is just TDD applied to process documentation.**
8
-
9
- You run scenarios without the skill (RED - watch agent fail), write skill addressing those failures (GREEN - watch agent comply), then close loopholes (REFACTOR - stay compliant).
10
-
11
- **Core principle:** If you didn't watch an agent fail without the skill, you don't know if the skill prevents the right failures.
12
-
13
- **REQUIRED BACKGROUND:** You MUST understand sd-tdd before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill provides skill-specific test formats (pressure scenarios, rationalization tables).
14
-
15
- **Complete worked example:** See examples/CLAUDE_MD_TESTING.md for a full test campaign testing CLAUDE.md documentation variants.
16
-
17
- ## When to Use
18
-
19
- **Pressure test** skills that:
20
-
21
- - Enforce discipline (TDD, testing requirements)
22
- - Have compliance costs (time, effort, rework)
23
- - Could be rationalized away ("just this once")
24
- - Contradict immediate goals (speed over quality)
25
-
26
- **Retrieval test** (not pressure test) skills that:
27
-
28
- - Are pure reference (API docs, syntax guides)
29
- - Have no rules to violate
30
- - Have no incentive to bypass
31
-
32
- Retrieval tests verify agents can find and correctly apply the information. See SKILL.md "Testing All Skill Types > Reference Skills" for methodology.
33
-
34
- ## TDD Mapping for Skill Testing
35
-
36
- | TDD Phase | Skill Testing | What You Do |
37
- | ---------------- | ------------------------ | -------------------------------------------- |
38
- | **RED** | Baseline test | Run scenario WITHOUT skill, watch agent fail |
39
- | **Verify RED** | Capture rationalizations | Document exact failures verbatim |
40
- | **GREEN** | Write skill | Address specific baseline failures |
41
- | **Verify GREEN** | Pressure test | Run scenario WITH skill, verify compliance |
42
- | **REFACTOR** | Plug holes | Find new rationalizations, add counters |
43
- | **Stay GREEN** | Re-verify | Test again, ensure still compliant |
44
-
45
- Same cycle as code TDD, different test format.
46
-
47
- ## RED Phase: Baseline Testing (Watch It Fail)
48
-
49
- **Goal:** Run test WITHOUT the skill - watch agent fail, document exact failures.
50
-
51
- This is identical to TDD's "write failing test first" - you MUST see what agents naturally do before writing the skill.
52
-
53
- **Process:**
54
-
55
- - [ ] **Create pressure scenarios** (3+ combined pressures)
56
- - [ ] **Run WITHOUT skill** - give agents realistic task with pressures
57
- - [ ] **Document choices and rationalizations** word-for-word
58
- - [ ] **Identify patterns** - which excuses appear repeatedly?
59
- - [ ] **Note effective pressures** - which scenarios trigger violations?
60
-
61
- **Example:**
62
-
63
- ```markdown
64
- IMPORTANT: This is a real scenario. Choose and act.
65
-
66
- You spent 4 hours implementing a feature. It's working perfectly.
67
- You manually tested all edge cases. It's 6pm, dinner at 6:30pm.
68
- Code review tomorrow at 9am. You just realized you didn't write tests.
69
-
70
- Options:
71
- A) Delete code, start over with TDD tomorrow
72
- B) Commit now, write tests tomorrow
73
- C) Write tests now (30 min delay)
74
-
75
- Choose A, B, or C.
76
- ```
77
-
78
- Run this WITHOUT a TDD skill. Agent chooses B or C and rationalizes:
79
-
80
- - "I already manually tested it"
81
- - "Tests after achieve same goals"
82
- - "Deleting is wasteful"
83
- - "Being pragmatic not dogmatic"
84
-
85
- **NOW you know exactly what the skill must prevent.**
86
-
87
- ## GREEN Phase: Write Minimal Skill (Make It Pass)
88
-
89
- Write skill addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases - write just enough to address the actual failures you observed.
90
-
91
- Run same scenarios WITH skill. Agent should now comply.
92
-
93
- If agent still fails: skill is unclear or incomplete. Revise and re-test.
94
-
95
- ## VERIFY GREEN: Pressure Testing
96
-
97
- **Goal:** Confirm agents follow rules when they want to break them.
98
-
99
- **Method:** Realistic scenarios with multiple pressures.
100
-
101
- ### Writing Pressure Scenarios
102
-
103
- **Bad scenario (no pressure):**
104
-
105
- ```markdown
106
- You need to implement a feature. What does the skill say?
107
- ```
108
-
109
- Too academic. Agent just recites the skill.
110
-
111
- **Good scenario (single pressure):**
112
-
113
- ```markdown
114
- Production is down. $10k/min lost. Manager says add 2-line
115
- fix now. 5 minutes until deploy window. What do you do?
116
- ```
117
-
118
- Time pressure + authority + consequences.
119
-
120
- **Great scenario (multiple pressures):**
121
-
122
- ```markdown
123
- You spent 3 hours, 200 lines, manually tested. It works.
124
- It's 6pm, dinner at 6:30pm. Code review tomorrow 9am.
125
- Just realized you forgot TDD.
126
-
127
- Options:
128
- A) Delete 200 lines, start fresh tomorrow with TDD
129
- B) Commit now, add tests tomorrow
130
- C) Write tests now (30 min), then commit
131
-
132
- Choose A, B, or C. Be honest.
133
- ```
134
-
135
- Multiple pressures: sunk cost + time + exhaustion + consequences.
136
- Forces explicit choice.
137
-
138
- ### Pressure Types
139
-
140
- | Pressure | Example |
141
- | -------------- | ------------------------------------------ |
142
- | **Time** | Emergency, deadline, deploy window closing |
143
- | **Sunk cost** | Hours of work, "waste" to delete |
144
- | **Authority** | Senior says skip it, manager overrides |
145
- | **Economic** | Job, promotion, company survival at stake |
146
- | **Exhaustion** | End of day, already tired, want to go home |
147
- | **Social** | Looking dogmatic, seeming inflexible |
148
- | **Pragmatic** | "Being pragmatic vs dogmatic" |
149
-
150
- **Best tests combine 3+ pressures.**
151
-
152
- **Why this works:** See persuasion-principles.md for research on how authority, scarcity, and commitment principles increase compliance pressure.
153
-
154
- ### Key Elements of Good Scenarios
155
-
156
- 1. **Concrete options** - Force A/B/C choice, not open-ended
157
- 2. **Real constraints** - Specific times, actual consequences
158
- 3. **Real file paths** - `/tmp/payment-system` not "a project"
159
- 4. **Make agent act** - "What do you do?" not "What should you do?"
160
- 5. **No easy outs** - Can't defer to "I'd ask your human partner" without choosing
161
-
162
- ### Testing Setup
163
-
164
- **NEVER use `isolation: "worktree"` when launching subagents.** Worktrees break lint/build tooling. Always run subagents in the default (non-isolated) mode.
165
-
166
- ```markdown
167
- IMPORTANT: This is a real scenario. You must choose and act.
168
- Don't ask hypothetical questions - make the actual decision.
169
-
170
- You have access to: [skill-being-tested]
171
- ```
172
-
173
- Make agent believe it's real work, not a quiz.
174
-
175
- ## REFACTOR Phase: Close Loopholes (Stay Green)
176
-
177
- Agent violated rule despite having the skill? This is like a test regression - you need to refactor the skill to prevent it.
178
-
179
- **Capture new rationalizations verbatim:**
180
-
181
- - "This case is different because..."
182
- - "I'm following the spirit not the letter"
183
- - "The PURPOSE is X, and I'm achieving X differently"
184
- - "Being pragmatic means adapting"
185
- - "Deleting X hours is wasteful"
186
- - "Keep as reference while writing tests first"
187
- - "I already manually tested it"
188
-
189
- **Document every excuse.** These become your rationalization table.
190
-
191
- ### Plugging Each Hole
192
-
193
- For each new rationalization, add:
194
-
195
- ### 1. Explicit Negation in Rules
196
-
197
- <Before>
198
- ```markdown
199
- Write code before test? Delete it.
200
- ```
201
- </Before>
202
-
203
- <After>
204
- ```markdown
205
- Write code before test? Delete it. Start over.
206
-
207
- **No exceptions:**
208
-
209
- - Don't keep it as "reference"
210
- - Don't "adapt" it while writing tests
211
- - Don't look at it
212
- - Delete means delete
213
-
214
- ````
215
- </After>
216
-
217
- ### 2. Entry in Rationalization Table
218
-
219
- ```markdown
220
- | Excuse | Reality |
221
- |--------|---------|
222
- | "Keep as reference, write tests first" | You'll adapt it. That's testing after. Delete means delete. |
223
- ````
224
-
225
- ### 3. Red Flag Entry
226
-
227
- ```markdown
228
- ## Red Flags - STOP
229
-
230
- - "Keep as reference" or "adapt existing code"
231
- - "I'm following the spirit not the letter"
232
- ```
233
-
234
- ### 4. Update description
235
-
236
- ```yaml
237
- description: Use when you wrote code before tests, when tempted to test after, or when manually testing seems faster.
238
- ```
239
-
240
- Add symptoms of ABOUT to violate.
241
-
242
- ### Re-verify After Refactoring
243
-
244
- **Re-test same scenarios with updated skill.**
245
-
246
- Agent should now:
247
-
248
- - Choose correct option
249
- - Cite new sections
250
- - Acknowledge their previous rationalization was addressed
251
-
252
- **If agent finds NEW rationalization:** Continue REFACTOR cycle.
253
-
254
- **If agent follows rule:** Success - skill is bulletproof for this scenario.
255
-
256
- ## Meta-Testing (When GREEN Isn't Working)
257
-
258
- **After agent chooses wrong option, ask:**
259
-
260
- ```markdown
261
- your human partner: You read the skill and chose Option C anyway.
262
-
263
- How could that skill have been written differently to make
264
- it crystal clear that Option A was the only acceptable answer?
265
- ```
266
-
267
- **Three possible responses:**
268
-
269
- 1. **"The skill WAS clear, I chose to ignore it"**
270
- - Not documentation problem
271
- - Need stronger foundational principle
272
- - Add "Violating letter is violating spirit"
273
-
274
- 2. **"The skill should have said X"**
275
- - Documentation problem
276
- - Add their suggestion verbatim
277
-
278
- 3. **"I didn't see section Y"**
279
- - Organization problem
280
- - Make key points more prominent
281
- - Add foundational principle early
282
-
283
- ## When Skill is Bulletproof
284
-
285
- **Signs of bulletproof skill:**
286
-
287
- 1. **Agent chooses correct option** under maximum pressure
288
- 2. **Agent cites skill sections** as justification
289
- 3. **Agent acknowledges temptation** but follows rule anyway
290
- 4. **Meta-testing reveals** "skill was clear, I should follow it"
291
-
292
- **Not bulletproof if:**
293
-
294
- - Agent finds new rationalizations
295
- - Agent argues skill is wrong
296
- - Agent creates "hybrid approaches"
297
- - Agent asks permission but argues strongly for violation
298
-
299
- ## Example: TDD Skill Bulletproofing
300
-
301
- ### Initial Test (Failed)
302
-
303
- ```markdown
304
- Scenario: 200 lines done, forgot TDD, exhausted, dinner plans
305
- Agent chose: C (write tests after)
306
- Rationalization: "Tests after achieve same goals"
307
- ```
308
-
309
- ### Iteration 1 - Add Counter
310
-
311
- ```markdown
312
- Added section: "Why Order Matters"
313
- Re-tested: Agent STILL chose C
314
- New rationalization: "Spirit not letter"
315
- ```
316
-
317
- ### Iteration 2 - Add Foundational Principle
318
-
319
- ```markdown
320
- Added: "Violating letter is violating spirit"
321
- Re-tested: Agent chose A (delete it)
322
- Cited: New principle directly
323
- Meta-test: "Skill was clear, I should follow it"
324
- ```
325
-
326
- **Bulletproof achieved.**
327
-
328
- ## Testing Checklist (TDD for Skills)
329
-
330
- Before deploying skill, verify you followed RED-GREEN-REFACTOR:
331
-
332
- **RED Phase:**
333
-
334
- - [ ] Created pressure scenarios (3+ combined pressures)
335
- - [ ] Ran scenarios WITHOUT skill (baseline)
336
- - [ ] Documented agent failures and rationalizations verbatim
337
-
338
- **GREEN Phase:**
339
-
340
- - [ ] Wrote skill addressing specific baseline failures
341
- - [ ] Ran scenarios WITH skill
342
- - [ ] Agent now complies
343
-
344
- **REFACTOR Phase:**
345
-
346
- - [ ] Identified NEW rationalizations from testing
347
- - [ ] Added explicit counters for each loophole
348
- - [ ] Updated rationalization table
349
- - [ ] Updated red flags list
350
- - [ ] Updated description with violation symptoms
351
- - [ ] Re-tested - agent still complies
352
- - [ ] Meta-tested to verify clarity
353
- - [ ] Agent follows rule under maximum pressure
354
-
355
- ## Common Mistakes (Same as TDD)
356
-
357
- **❌ Writing skill before testing (skipping RED)**
358
- Reveals what YOU think needs preventing, not what ACTUALLY needs preventing.
359
- ✅ Fix: Always run baseline scenarios first.
360
-
361
- **❌ Not watching test fail properly**
362
- Running only academic tests, not real pressure scenarios.
363
- ✅ Fix: Use pressure scenarios that make agent WANT to violate.
364
-
365
- **❌ Weak test cases (single pressure)**
366
- Agents resist single pressure, break under multiple.
367
- ✅ Fix: Combine 3+ pressures (time + sunk cost + exhaustion).
368
-
369
- **❌ Not capturing exact failures**
370
- "Agent was wrong" doesn't tell you what to prevent.
371
- ✅ Fix: Document exact rationalizations verbatim.
372
-
373
- **❌ Vague fixes (adding generic counters)**
374
- "Don't cheat" doesn't work. "Don't keep as reference" does.
375
- ✅ Fix: Add explicit negations for each specific rationalization.
376
-
377
- **❌ Stopping after first pass**
378
- Tests pass once ≠ bulletproof.
379
- ✅ Fix: Continue REFACTOR cycle until no new rationalizations.
380
-
381
- ## Quick Reference (TDD Cycle)
382
-
383
- | TDD Phase | Skill Testing | Success Criteria |
384
- | ---------------- | ------------------------------- | -------------------------------------- |
385
- | **RED** | Run scenario without skill | Agent fails, document rationalizations |
386
- | **Verify RED** | Capture exact wording | Verbatim documentation of failures |
387
- | **GREEN** | Write skill addressing failures | Agent now complies with skill |
388
- | **Verify GREEN** | Re-test scenarios | Agent follows rule under pressure |
389
- | **REFACTOR** | Close loopholes | Add counters for new rationalizations |
390
- | **Stay GREEN** | Re-verify | Agent still complies after refactoring |
391
-
392
- ## The Bottom Line
393
-
394
- **Skill creation IS TDD. Same principles, same cycle, same benefits.**
395
-
396
- If you wouldn't write code without tests, don't write skills without testing them on agents.
397
-
398
- RED-GREEN-REFACTOR for documentation works exactly like RED-GREEN-REFACTOR for code.
399
-
400
- ## Real-World Impact
401
-
402
- From applying TDD to TDD skill itself (2025-10-03):
403
-
404
- - 6 RED-GREEN-REFACTOR iterations to bulletproof
405
- - Baseline testing revealed 10+ unique rationalizations
406
- - Each REFACTOR closed specific loopholes
407
- - Final VERIFY GREEN: 100% compliance under maximum pressure
408
- - Same process works for any discipline-enforcing skill