safeword 0.2.4 → 0.2.6

Files changed (235)
  1. package/dist/check-3NGQ4NR5.js +129 -0
  2. package/dist/check-3NGQ4NR5.js.map +1 -0
  3. package/dist/chunk-2XWIUEQK.js +190 -0
  4. package/dist/chunk-2XWIUEQK.js.map +1 -0
  5. package/dist/chunk-GZRQL3SX.js +146 -0
  6. package/dist/chunk-GZRQL3SX.js.map +1 -0
  7. package/dist/chunk-ORQHKDT2.js +10 -0
  8. package/dist/chunk-ORQHKDT2.js.map +1 -0
  9. package/dist/chunk-W66Z3C5H.js +21 -0
  10. package/dist/chunk-W66Z3C5H.js.map +1 -0
  11. package/dist/cli.d.ts +1 -0
  12. package/dist/cli.js +34 -0
  13. package/dist/cli.js.map +1 -0
  14. package/dist/diff-Y6QTAW4O.js +166 -0
  15. package/dist/diff-Y6QTAW4O.js.map +1 -0
  16. package/dist/index.d.ts +11 -0
  17. package/dist/index.js +7 -0
  18. package/dist/index.js.map +1 -0
  19. package/dist/reset-3ACTIYYE.js +143 -0
  20. package/dist/reset-3ACTIYYE.js.map +1 -0
  21. package/dist/setup-AIL5RL45.js +276 -0
  22. package/dist/setup-AIL5RL45.js.map +1 -0
  23. package/dist/upgrade-6AR3DHUV.js +134 -0
  24. package/dist/upgrade-6AR3DHUV.js.map +1 -0
  25. package/package.json +44 -19
  26. package/{.safeword → templates}/hooks/agents-md-check.sh +0 -0
  27. package/{.safeword → templates}/hooks/post-tool.sh +0 -0
  28. package/{.safeword → templates}/hooks/pre-commit.sh +0 -0
  29. package/.claude/commands/arch-review.md +0 -32
  30. package/.claude/commands/lint.md +0 -6
  31. package/.claude/commands/quality-review.md +0 -13
  32. package/.claude/commands/setup-linting.md +0 -6
  33. package/.claude/hooks/auto-lint.sh +0 -6
  34. package/.claude/hooks/auto-quality-review.sh +0 -170
  35. package/.claude/hooks/check-linting-sync.sh +0 -17
  36. package/.claude/hooks/inject-timestamp.sh +0 -6
  37. package/.claude/hooks/question-protocol.sh +0 -12
  38. package/.claude/hooks/run-linters.sh +0 -8
  39. package/.claude/hooks/run-quality-review.sh +0 -76
  40. package/.claude/hooks/version-check.sh +0 -10
  41. package/.claude/mcp/README.md +0 -96
  42. package/.claude/mcp/arcade.sample.json +0 -9
  43. package/.claude/mcp/context7.sample.json +0 -7
  44. package/.claude/mcp/playwright.sample.json +0 -7
  45. package/.claude/settings.json +0 -62
  46. package/.claude/skills/quality-reviewer/SKILL.md +0 -190
  47. package/.claude/skills/safeword-quality-reviewer/SKILL.md +0 -13
  48. package/.env.arcade.example +0 -4
  49. package/.env.example +0 -11
  50. package/.gitmodules +0 -4
  51. package/.safeword/SAFEWORD.md +0 -33
  52. package/.safeword/eslint/eslint-base.mjs +0 -101
  53. package/.safeword/guides/architecture-guide.md +0 -404
  54. package/.safeword/guides/code-philosophy.md +0 -174
  55. package/.safeword/guides/context-files-guide.md +0 -405
  56. package/.safeword/guides/data-architecture-guide.md +0 -183
  57. package/.safeword/guides/design-doc-guide.md +0 -165
  58. package/.safeword/guides/learning-extraction.md +0 -515
  59. package/.safeword/guides/llm-instruction-design.md +0 -239
  60. package/.safeword/guides/llm-prompting.md +0 -95
  61. package/.safeword/guides/tdd-best-practices.md +0 -570
  62. package/.safeword/guides/test-definitions-guide.md +0 -243
  63. package/.safeword/guides/testing-methodology.md +0 -573
  64. package/.safeword/guides/user-story-guide.md +0 -237
  65. package/.safeword/guides/zombie-process-cleanup.md +0 -214
  66. package/.safeword/planning/002-user-story-quality-evaluation.md +0 -1840
  67. package/.safeword/planning/003-langsmith-eval-setup-prompt.md +0 -363
  68. package/.safeword/planning/004-llm-eval-test-cases.md +0 -3226
  69. package/.safeword/planning/005-architecture-enforcement-system.md +0 -169
  70. package/.safeword/planning/006-reactive-fix-prevention-research.md +0 -135
  71. package/.safeword/planning/011-cli-ux-vision.md +0 -330
  72. package/.safeword/planning/012-project-structure-cleanup.md +0 -154
  73. package/.safeword/planning/README.md +0 -39
  74. package/.safeword/planning/automation-plan-v2.md +0 -1225
  75. package/.safeword/planning/automation-plan-v3.md +0 -1291
  76. package/.safeword/planning/automation-plan.md +0 -3058
  77. package/.safeword/planning/design/005-cli-implementation.md +0 -343
  78. package/.safeword/planning/design/013-cli-self-contained-templates.md +0 -596
  79. package/.safeword/planning/design/013a-eslint-plugin-suite.md +0 -256
  80. package/.safeword/planning/design/013b-implementation-snippets.md +0 -385
  81. package/.safeword/planning/design/013c-config-isolation-strategy.md +0 -242
  82. package/.safeword/planning/design/code-philosophy-improvements.md +0 -60
  83. package/.safeword/planning/mcp-analysis.md +0 -545
  84. package/.safeword/planning/phase2-subagents-vs-skills-analysis.md +0 -451
  85. package/.safeword/planning/settings-improvements.md +0 -970
  86. package/.safeword/planning/test-definitions/005-cli-implementation.md +0 -1301
  87. package/.safeword/planning/test-definitions/cli-self-contained-templates.md +0 -205
  88. package/.safeword/planning/user-stories/001-guides-review-user-stories.md +0 -1381
  89. package/.safeword/planning/user-stories/003-reactive-fix-prevention.md +0 -132
  90. package/.safeword/planning/user-stories/004-technical-constraints.md +0 -86
  91. package/.safeword/planning/user-stories/005-cli-implementation.md +0 -311
  92. package/.safeword/planning/user-stories/cli-self-contained-templates.md +0 -172
  93. package/.safeword/planning/versioned-distribution.md +0 -740
  94. package/.safeword/prompts/arch-review.md +0 -43
  95. package/.safeword/prompts/quality-review.md +0 -11
  96. package/.safeword/scripts/arch-review.sh +0 -235
  97. package/.safeword/scripts/check-linting-sync.sh +0 -58
  98. package/.safeword/scripts/setup-linting.sh +0 -559
  99. package/.safeword/templates/architecture-template.md +0 -136
  100. package/.safeword/templates/ci/architecture-check.yml +0 -79
  101. package/.safeword/templates/design-doc-template.md +0 -127
  102. package/.safeword/templates/test-definitions-feature.md +0 -100
  103. package/.safeword/templates/ticket-template.md +0 -74
  104. package/.safeword/templates/user-stories-template.md +0 -82
  105. package/.safeword/tickets/001-guides-review-user-stories.md +0 -83
  106. package/.safeword/tickets/002-architecture-enforcement.md +0 -211
  107. package/.safeword/tickets/003-reactive-fix-prevention.md +0 -57
  108. package/.safeword/tickets/004-technical-constraints-in-user-stories.md +0 -39
  109. package/.safeword/tickets/005-cli-implementation.md +0 -248
  110. package/.safeword/tickets/006-flesh-out-skills.md +0 -43
  111. package/.safeword/tickets/007-flesh-out-questioning.md +0 -44
  112. package/.safeword/tickets/008-upgrade-questioning.md +0 -58
  113. package/.safeword/tickets/009-naming-conventions.md +0 -41
  114. package/.safeword/tickets/010-safeword-md-cleanup.md +0 -34
  115. package/.safeword/tickets/011-cursor-setup.md +0 -86
  116. package/.safeword/tickets/README.md +0 -73
  117. package/.safeword/version +0 -1
  118. package/AGENTS.md +0 -59
  119. package/CLAUDE.md +0 -12
  120. package/README.md +0 -347
  121. package/docs/001-cli-implementation-plan.md +0 -856
  122. package/docs/elite-dx-implementation-plan.md +0 -1034
  123. package/framework/README.md +0 -131
  124. package/framework/mcp/README.md +0 -96
  125. package/framework/mcp/arcade.sample.json +0 -8
  126. package/framework/mcp/context7.sample.json +0 -6
  127. package/framework/mcp/playwright.sample.json +0 -6
  128. package/framework/scripts/arch-review.sh +0 -235
  129. package/framework/scripts/check-linting-sync.sh +0 -58
  130. package/framework/scripts/load-env.sh +0 -49
  131. package/framework/scripts/setup-claude.sh +0 -223
  132. package/framework/scripts/setup-linting.sh +0 -559
  133. package/framework/scripts/setup-quality.sh +0 -477
  134. package/framework/scripts/setup-safeword.sh +0 -550
  135. package/framework/templates/ci/architecture-check.yml +0 -78
  136. package/learnings/ai-sdk-v5-breaking-changes.md +0 -178
  137. package/learnings/e2e-test-zombie-processes.md +0 -231
  138. package/learnings/milkdown-crepe-editor-property.md +0 -96
  139. package/learnings/prosemirror-fragment-traversal.md +0 -119
  140. package/packages/cli/AGENTS.md +0 -1
  141. package/packages/cli/ARCHITECTURE.md +0 -279
  142. package/packages/cli/package.json +0 -51
  143. package/packages/cli/src/cli.ts +0 -63
  144. package/packages/cli/src/commands/check.ts +0 -166
  145. package/packages/cli/src/commands/diff.ts +0 -209
  146. package/packages/cli/src/commands/reset.ts +0 -190
  147. package/packages/cli/src/commands/setup.ts +0 -325
  148. package/packages/cli/src/commands/upgrade.ts +0 -163
  149. package/packages/cli/src/index.ts +0 -3
  150. package/packages/cli/src/templates/config.ts +0 -58
  151. package/packages/cli/src/templates/content.ts +0 -18
  152. package/packages/cli/src/templates/index.ts +0 -12
  153. package/packages/cli/src/utils/agents-md.ts +0 -66
  154. package/packages/cli/src/utils/fs.ts +0 -179
  155. package/packages/cli/src/utils/git.ts +0 -124
  156. package/packages/cli/src/utils/hooks.ts +0 -29
  157. package/packages/cli/src/utils/output.ts +0 -60
  158. package/packages/cli/src/utils/project-detector.test.ts +0 -185
  159. package/packages/cli/src/utils/project-detector.ts +0 -44
  160. package/packages/cli/src/utils/version.ts +0 -28
  161. package/packages/cli/src/version.ts +0 -6
  162. package/packages/cli/templates/SAFEWORD.md +0 -776
  163. package/packages/cli/templates/doc-templates/architecture-template.md +0 -136
  164. package/packages/cli/templates/doc-templates/design-doc-template.md +0 -134
  165. package/packages/cli/templates/doc-templates/test-definitions-feature.md +0 -131
  166. package/packages/cli/templates/doc-templates/ticket-template.md +0 -82
  167. package/packages/cli/templates/doc-templates/user-stories-template.md +0 -92
  168. package/packages/cli/templates/guides/architecture-guide.md +0 -423
  169. package/packages/cli/templates/guides/code-philosophy.md +0 -195
  170. package/packages/cli/templates/guides/context-files-guide.md +0 -457
  171. package/packages/cli/templates/guides/data-architecture-guide.md +0 -200
  172. package/packages/cli/templates/guides/design-doc-guide.md +0 -171
  173. package/packages/cli/templates/guides/learning-extraction.md +0 -552
  174. package/packages/cli/templates/guides/llm-instruction-design.md +0 -248
  175. package/packages/cli/templates/guides/llm-prompting.md +0 -102
  176. package/packages/cli/templates/guides/tdd-best-practices.md +0 -615
  177. package/packages/cli/templates/guides/test-definitions-guide.md +0 -334
  178. package/packages/cli/templates/guides/testing-methodology.md +0 -618
  179. package/packages/cli/templates/guides/user-story-guide.md +0 -256
  180. package/packages/cli/templates/guides/zombie-process-cleanup.md +0 -219
  181. package/packages/cli/templates/hooks/agents-md-check.sh +0 -27
  182. package/packages/cli/templates/hooks/post-tool.sh +0 -4
  183. package/packages/cli/templates/hooks/pre-commit.sh +0 -10
  184. package/packages/cli/templates/prompts/arch-review.md +0 -43
  185. package/packages/cli/templates/prompts/quality-review.md +0 -10
  186. package/packages/cli/templates/skills/safeword-quality-reviewer/SKILL.md +0 -207
  187. package/packages/cli/tests/commands/check.test.ts +0 -129
  188. package/packages/cli/tests/commands/cli.test.ts +0 -89
  189. package/packages/cli/tests/commands/diff.test.ts +0 -115
  190. package/packages/cli/tests/commands/reset.test.ts +0 -310
  191. package/packages/cli/tests/commands/self-healing.test.ts +0 -170
  192. package/packages/cli/tests/commands/setup-blocking.test.ts +0 -71
  193. package/packages/cli/tests/commands/setup-core.test.ts +0 -135
  194. package/packages/cli/tests/commands/setup-git.test.ts +0 -139
  195. package/packages/cli/tests/commands/setup-hooks.test.ts +0 -334
  196. package/packages/cli/tests/commands/setup-linting.test.ts +0 -189
  197. package/packages/cli/tests/commands/setup-noninteractive.test.ts +0 -80
  198. package/packages/cli/tests/commands/setup-templates.test.ts +0 -181
  199. package/packages/cli/tests/commands/upgrade.test.ts +0 -215
  200. package/packages/cli/tests/helpers.ts +0 -243
  201. package/packages/cli/tests/npm-package.test.ts +0 -83
  202. package/packages/cli/tests/technical-constraints.test.ts +0 -96
  203. package/packages/cli/tsconfig.json +0 -25
  204. package/packages/cli/tsup.config.ts +0 -11
  205. package/packages/cli/vitest.config.ts +0 -23
  206. package/promptfoo.yaml +0 -3270
  207. package/{framework → templates}/SAFEWORD.md +0 -0
  208. package/{packages/cli/templates → templates}/commands/arch-review.md +0 -0
  209. package/{packages/cli/templates → templates}/commands/lint.md +0 -0
  210. package/{packages/cli/templates → templates}/commands/quality-review.md +0 -0
  211. package/{framework/templates → templates/doc-templates}/architecture-template.md +0 -0
  212. package/{framework/templates → templates/doc-templates}/design-doc-template.md +0 -0
  213. package/{framework/templates → templates/doc-templates}/test-definitions-feature.md +0 -0
  214. package/{framework/templates → templates/doc-templates}/ticket-template.md +0 -0
  215. package/{framework/templates → templates/doc-templates}/user-stories-template.md +0 -0
  216. package/{framework → templates}/guides/architecture-guide.md +0 -0
  217. package/{framework → templates}/guides/code-philosophy.md +0 -0
  218. package/{framework → templates}/guides/context-files-guide.md +0 -0
  219. package/{framework → templates}/guides/data-architecture-guide.md +0 -0
  220. package/{framework → templates}/guides/design-doc-guide.md +0 -0
  221. package/{framework → templates}/guides/learning-extraction.md +0 -0
  222. package/{framework → templates}/guides/llm-instruction-design.md +0 -0
  223. package/{framework → templates}/guides/llm-prompting.md +0 -0
  224. package/{framework → templates}/guides/tdd-best-practices.md +0 -0
  225. package/{framework → templates}/guides/test-definitions-guide.md +0 -0
  226. package/{framework → templates}/guides/testing-methodology.md +0 -0
  227. package/{framework → templates}/guides/user-story-guide.md +0 -0
  228. package/{framework → templates}/guides/zombie-process-cleanup.md +0 -0
  229. package/{packages/cli/templates → templates}/hooks/inject-timestamp.sh +0 -0
  230. package/{packages/cli/templates → templates}/lib/common.sh +0 -0
  231. package/{packages/cli/templates → templates}/lib/jq-fallback.sh +0 -0
  232. package/{packages/cli/templates → templates}/markdownlint.jsonc +0 -0
  233. package/{framework → templates}/prompts/arch-review.md +0 -0
  234. package/{framework → templates}/prompts/quality-review.md +0 -0
  235. package/{framework/skills/quality-reviewer → templates/skills/safeword-quality-reviewer}/SKILL.md +0 -0
@@ -1,3226 +0,0 @@
- # LLM Eval Test Cases
-
- **Purpose:** Catalog of all test cases for LangSmith evals, organized by guide and user story.
-
- **Related:**
-
- - Evaluation plan: `002-user-story-quality-evaluation.md`
- - LangSmith setup: `003-langsmith-eval-setup-prompt.md`
-
- ---
-
- ## Test Case Schema
-
- ```typescript
- interface TestCase {
-   id: string; // {guide-prefix}-{story-num}-{test-slug}
-   guide: string; // Source guide
-   story: string; // User story title
-   input: string; // User prompt to agent
-   context_files: string[]; // Files to load as context
-   expected: string; // What agent should do
-   rubric: {
-     excellent: string;
-     acceptable: string;
-     poor: string;
-   };
- }
- ```
-
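As an illustration of the schema above, the catalog's `arch-001-create-doc` entry can be written as a value of this interface. This is only a sketch and was not part of the removed file: the `isValidTestCaseId` helper and its regular expression are assumptions inferred from the IDs that appear in the catalog (for example `arch-001-create-doc`), and the `excellent` rubric string is abbreviated.

```typescript
// Schema repeated from the catalog so the example type-checks on its own.
interface TestCase {
  id: string;
  guide: string;
  story: string;
  input: string;
  context_files: string[];
  expected: string;
  rubric: { excellent: string; acceptable: string; poor: string };
}

// Hypothetical helper (not in the package): checks the
// `{prefix}-{three digits}-{slug}` shape seen in the catalog IDs.
function isValidTestCaseId(id: string): boolean {
  return /^[a-z]+-\d{3}-[a-z0-9]+(-[a-z0-9]+)*$/.test(id);
}

// The arch-001 entry expressed as data; field values are taken from the catalog.
const arch001: TestCase = {
  id: "arch-001-create-doc",
  guide: "architecture-guide.md",
  story: "Single Comprehensive Architecture Doc",
  input: "Create an architecture doc for a new React + Supabase project",
  context_files: [
    "framework/SAFEWORD.md",
    "framework/guides/architecture-guide.md",
  ],
  expected: "Output contains all 11 required sections",
  rubric: {
    excellent: "All 11 sections with What/Why/Trade-off in decisions",
    acceptable: "9+ sections present",
    poor: "<9 sections or missing Key Decisions",
  },
};
```

Keeping each case as plain data like this would let an eval runner iterate over the catalog without parsing the markdown, though the package does not state how the cases were actually consumed.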
- ---
-
- ## architecture-guide.md
-
- ### arch-001-create-doc (Story 1: Single Comprehensive Architecture Doc)
-
- **Input:**
-
- > Create an architecture doc for a new React + Supabase project
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/architecture-guide.md`
-
- **Expected:** Output contains all 11 required sections
-
- **Rubric:**
-
- - EXCELLENT: All 11 sections (Header, TOC, Overview, Data Principles, Data Model, Components, Data Flows, Key Decisions, Best Practices, Migration, Code References) with What/Why/Trade-off in decisions
- - ACCEPTABLE: 9+ sections present
- - POOR: <9 sections or missing Key Decisions
-
- ---
-
- ### arch-002-doc-type-tech (Story 3: Quick Doc-Type Decision)
-
- **Input:**
-
- > I need to document our decision to use PostgreSQL instead of MongoDB
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/architecture-guide.md`
-
- **Expected:** Agent chooses Architecture Doc
-
- **Rubric:**
-
- - EXCELLENT: Correctly identifies Architecture Doc + explains why (tech choice affects whole project)
- - ACCEPTABLE: Correctly identifies Architecture Doc
- - POOR: Suggests Design Doc
-
- ---
-
- ### arch-003-doc-type-feature (Story 3: Quick Doc-Type Decision)
-
- **Input:**
-
- > I need to document how the user profile feature will work
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/architecture-guide.md`
- - `framework/guides/design-doc-guide.md`
-
- **Expected:** Agent chooses Design Doc
-
- **Rubric:**
-
- - EXCELLENT: Correctly identifies Design Doc + checks for prerequisites (user stories, test defs)
- - ACCEPTABLE: Correctly identifies Design Doc
- - POOR: Suggests Architecture Doc
-
- ---
-
- ### arch-004-decision-fields (Story 4: Document Why, Not Just What)
-
- **Input:**
-
- > Document our decision to use Redis for caching
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/architecture-guide.md`
-
- **Expected:** Output includes What, Why, Trade-off, Alternatives Considered
-
- **Rubric:**
-
- - EXCELLENT: All 4 fields with specifics (numbers, metrics, concrete alternatives)
- - ACCEPTABLE: What/Why/Trade-off present
- - POOR: Missing Why or Trade-off
-
- ---
-
- ### arch-005-tie-breaker (Story 3: Quick Doc-Type Decision)
-
- **Input:**
-
- > I need to document adding a caching layer that will be used by multiple features
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/architecture-guide.md`
-
- **Expected:** Agent chooses Architecture Doc (affects 2+ features)
-
- **Rubric:**
-
- - EXCELLENT: Architecture Doc + cites tie-breaking rule (affects 2+ features)
- - ACCEPTABLE: Architecture Doc
- - POOR: Design Doc
-
- ---
-
- ### arch-006-code-refs (Story 5: Code References in Docs)
-
- **Input:**
-
- > Document the authentication flow architecture, including where the code lives
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/architecture-guide.md`
-
- **Expected:** Output includes code references with file paths
-
- **Rubric:**
-
- - EXCELLENT: 2+ code references with file:line format or function names
- - ACCEPTABLE: At least 1 file path reference
- - POOR: No code references
-
- ---
-
- ### arch-007-adr-migration (Story 1: Single Comprehensive Architecture Doc)
-
- **Input:**
-
- > Our project has 50 ADR files in docs/adr/. What should we do?
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/architecture-guide.md`
-
- **Expected:** Agent recommends consolidating into single ARCHITECTURE.md
-
- **Rubric:**
-
- - EXCELLENT: Recommends consolidation + provides migration steps (create ARCHITECTURE.md, consolidate active decisions, archive old ADRs)
- - ACCEPTABLE: Recommends consolidation
- - POOR: Suggests keeping separate ADRs
-
- ---
-
- ### arch-008-versioning (Story 6: Versioning and Status)
-
- **Input:**
-
- > Create architecture doc for a new project
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/architecture-guide.md`
-
- **Expected:** Header includes Version and Status
-
- **Rubric:**
-
- - EXCELLENT: Version + Status in header using valid values (Design/Production/Proposed/Deprecated)
- - ACCEPTABLE: Version and Status present somewhere
- - POOR: Missing Version or Status
-
- ---
-
- ### arch-009-workflow-order (Story 7: TDD Workflow Integration)
-
- **Input:**
-
- > Implement user authentication for my app
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/architecture-guide.md`
-
- **Expected:** Agent checks for user stories/test definitions before implementation
-
- **Rubric:**
-
- - EXCELLENT: Checks for user stories + test definitions + offers to create if missing
- - ACCEPTABLE: Mentions TDD workflow
- - POOR: Jumps straight to implementation
-
- ---
-
- ### arch-010-update-trigger (Story 7: TDD Workflow Integration)
-
- **Input:**
-
- > I just added PostgreSQL to our project that was using SQLite
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/architecture-guide.md`
-
- **Expected:** Agent suggests updating architecture doc
-
- **Rubric:**
-
- - EXCELLENT: Recommends architecture doc update + explains why (tech choice)
- - ACCEPTABLE: Mentions documenting the change
- - POOR: No mention of architecture doc
-
- ---
-
- ### arch-011-no-update (Story 8: Triggers to Update Architecture Doc)
-
- **Input:**
-
- > I just fixed a bug in the login form validation
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/architecture-guide.md`
-
- **Expected:** Agent does NOT suggest updating architecture doc
-
- **Rubric:**
-
- - EXCELLENT: No mention of architecture doc (bug fix doesn't warrant it)
- - ACCEPTABLE: Asks if it's architectural, then correctly says no
- - POOR: Suggests updating architecture doc
-
- ---
-
- ### arch-012-catch-antipattern (Story 9: Avoid Common Mistakes)
-
- **Input:**
-
- > Review this architecture doc section:
- >
- > ### State Management
- >
- > **What**: Using Zustand for global state
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/architecture-guide.md`
-
- **Expected:** Agent identifies missing "Why" and "Trade-off"
-
- **Rubric:**
-
- - EXCELLENT: Identifies missing Why/Trade-off + suggests adding rationale with specifics
- - ACCEPTABLE: Notes decision is incomplete
- - POOR: Says doc looks fine
-
- ---
-
- ### arch-013-file-location (Story 10: Standard File Organization)
-
- **Input:**
-
- > Create a design doc for the payment flow feature
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/architecture-guide.md`
-
- **Expected:** Agent creates file in `.safeword/planning/design/`
-
- **Rubric:**
-
- - EXCELLENT: Creates in `.safeword/planning/design/` + follows naming convention
- - ACCEPTABLE: Creates in a planning/design directory
- - POOR: Creates at root or wrong location
-
- ---
-
- ## design-doc-guide.md
-
- ### design-001-create-doc (Story 2: Design Docs for Features)
-
- **Input:**
-
- > Create a design doc for a three-pane layout feature
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/design-doc-guide.md`
-
- **Expected:** Output has required sections
-
- **Rubric:**
-
- - EXCELLENT: All required sections (Architecture, Components with [N]/[N+1], User Flow, Key Decisions) + references user stories/test defs
- - ACCEPTABLE: Missing 1-2 optional sections (Data Model, Component Interaction, Implementation Notes)
- - POOR: Missing User Flow or Components
-
- ---
-
- ### design-002-prereqs (Story 2: Design Docs for Features)
-
- **Input:**
-
- > Create a design doc for the payment flow feature
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/design-doc-guide.md`
-
- **Expected:** Agent asks about or offers to create user stories/test definitions first
-
- **Rubric:**
-
- - EXCELLENT: Checks for prerequisites before creating + offers to create them
- - ACCEPTABLE: Mentions prerequisites should exist
- - POOR: Creates design doc without checking prerequisites
-
- ---
-
- ### design-003-complexity (Story 2: Design Docs for Features)
-
- **Input:**
-
- > Do I need a design doc for adding a logout button?
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/design-doc-guide.md`
-
- **Expected:** Agent says no (simple, <3 components, single user story)
-
- **Rubric:**
-
- - EXCELLENT: Correctly assesses as too simple + explains why (doesn't meet complexity threshold)
- - ACCEPTABLE: Says probably not needed
- - POOR: Recommends creating design doc
-
- ---
-
- ### design-004-borderline (Story 2: Design Docs for Features)
-
- **Input:**
-
- > I'm building a feature that touches exactly 3 components and has 2 user stories. Do I need a design doc?
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/design-doc-guide.md`
-
- **Expected:** Agent recommends design doc (meets threshold: >3 components OR spans 2+ stories)
-
- **Rubric:**
-
- - EXCELLENT: Recommends design doc + cites complexity criteria (spans 2+ user stories)
- - ACCEPTABLE: Recommends design doc
- - POOR: Says skip design doc
-
- ---
-
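The complexity threshold quoted in design-004 (">3 components OR spans 2+ stories") can be sketched as a small predicate. This is a minimal illustration under that one quoted rule; the function name `needsDesignDoc` and its parameters are hypothetical, and the guide itself may define further criteria that this diff does not show.

```typescript
// Hypothetical helper: encodes only the threshold quoted in design-004,
// "more than 3 components OR spans 2 or more user stories".
function needsDesignDoc(componentCount: number, userStoryCount: number): boolean {
  return componentCount > 3 || userStoryCount >= 2;
}

// design-004: exactly 3 components, 2 user stories -> qualifies via the story count.
const borderline = needsDesignDoc(3, 2);

// design-003: a logout button, assumed here to be 1 component and 1 story -> does not qualify.
const logoutButton = needsDesignDoc(1, 1);
```

The borderline case is worth spelling out because 3 components alone does not pass the strict `>` comparison; it is the second user story that tips the decision.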
- ## Summary
-
- | Guide | Tests | Stories Covered |
- | -------------------------- | ------- | -------------------------- |
- | architecture-guide.md | 13 | 1, 3, 4, 5, 6, 7, 8, 9, 10 |
- | design-doc-guide.md | 13 | 1-2, 4-10 |
- | code-philosophy.md | 13 | 1-12, 14 |
- | context-files-guide.md | 11 | 11 |
- | data-architecture-guide.md | 8 | 1-8 |
- | **TOTAL** | **152** | **145** |
-
- ---
-
- ## code-philosophy.md
-
- ### phil-001-json-summary (Story 1: Response JSON Summary)
-
- **Input:**
-
- > Fix the typo in README.md (agent makes edit)
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/code-philosophy.md`
-
- **Expected:** Response ends with JSON containing madeChanges: true
-
- **Rubric:**
-
- - EXCELLENT: Valid JSON with madeChanges: true, other fields accurate
- - ACCEPTABLE: JSON present with correct madeChanges value
- - POOR: Missing JSON or madeChanges: false when edit was made
-
- ---
-
- ### phil-002-avoid-bloat (Story 2: Avoid Bloat, Prefer Elegant Code)
-
- **Input:**
-
- > Add a function to get the current date
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/code-philosophy.md`
-
- **Expected:** Agent writes simple function, not a date utility class
-
- **Rubric:**
-
- - EXCELLENT: Minimal implementation (1 function, no unnecessary abstractions)
- - ACCEPTABLE: Slightly over-engineered but functional
- - POOR: Creates utility class, factory pattern, or multiple unused methods
-
- ---
-
- ### phil-003-self-documenting (Story 3: Self-Documenting Code)
-
- **Input:**
-
- > Write a function that calculates the total price with tax
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/code-philosophy.md`
-
- **Expected:** Descriptive function name, no unnecessary comments
-
- **Rubric:**
-
- - EXCELLENT: Descriptive name (e.g., `calculateTotalWithTax`) + no comments on obvious code
- - ACCEPTABLE: Descriptive name with some redundant comments
- - POOR: Abbreviated name (e.g., `calcTot`) or comments explaining obvious code
-
- ---
-
- ### phil-004-error-handling (Story 4: Explicit Error Handling)
-
- **Input:**
-
- > Write a function that reads a JSON file and returns the parsed data
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/code-philosophy.md`
-
- **Expected:** Function includes error handling with context (file path, operation)
-
- **Rubric:**
-
- - EXCELLENT: Catches errors, includes file path in error message, re-throws or returns error
- - ACCEPTABLE: Has try/catch with some context
- - POOR: Empty catch block, swallows error, or no error handling
-
- ---
-
- ### phil-005-doc-verification (Story 5: Documentation Verification)
-
- **Input:**
-
- > Add a feature using the latest React Server Components API
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/code-philosophy.md`
-
- **Expected:** Agent verifies React version or looks up current docs
-
- **Rubric:**
-
- - EXCELLENT: Checks package.json for React version OR uses Context7/docs lookup
- - ACCEPTABLE: Mentions need to verify version
- - POOR: Assumes API exists without verification
-
- ---
-
- ### phil-006-tdd-workflow (Story 6: TDD Workflow)
-
- **Input:**
-
- > Add a function that validates email addresses
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/code-philosophy.md`
-
- **Expected:** Agent writes failing test first, then implements
-
- **Rubric:**
-
- - EXCELLENT: Writes test first, runs it (RED), then implements (GREEN)
- - ACCEPTABLE: Mentions TDD approach, writes test
- - POOR: Implements function without writing test first
-
- ---
-
- ### phil-007-self-testing (Story 7: Self-Testing Before Completion)
-
- **Input:**
-
- > Fix the login button bug (agent fixes it)
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/code-philosophy.md`
-
- **Expected:** Agent runs tests and reports results, doesn't ask user to verify
-
- **Rubric:**
-
- - EXCELLENT: Runs tests, reports "Tests pass ✓", doesn't ask user to verify
- - ACCEPTABLE: Mentions running tests
- - POOR: Asks user to test or verify the fix
-
- ---
-
- ### phil-008-debug-logging (Story 8: Debug Logging Hygiene)
-
- **Input:**
-
- > Debug why this test is failing (agent debugs)
-
- **Context files:**
-
- - `framework/SAFEWORD.md`
- - `framework/guides/code-philosophy.md`
-
- **Expected:** Agent adds logs showing actual vs expected, removes them after fix
-
- **Rubric:**
-
- - EXCELLENT: Logs actual vs expected values, removes debug logs after fix
- - ACCEPTABLE: Logs something useful for debugging
- - POOR: Leaves debug logs in code after fix
-
- ---
-
581
- ### phil-009-cross-platform (Story 9: Cross-Platform Paths)
582
-
583
- **Input:**
584
-
585
- > Create a function that builds a file path from directory and filename
586
-
587
- **Context files:**
588
-
589
- - `framework/SAFEWORD.md`
590
- - `framework/guides/code-philosophy.md`
591
-
592
- **Expected:** Agent uses path.join() or equivalent, not string concatenation
593
-
594
- **Rubric:**
595
-
596
- - EXCELLENT: Uses path.join() or path.resolve(), no hardcoded separators
597
- - ACCEPTABLE: Mentions cross-platform concerns
598
- - POOR: Uses string concat with hardcoded '/' or '\'
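A minimal sketch of what the EXCELLENT tier might look like in Node.js. The helper name `buildFilePath` and its optional `pathApi` parameter are assumptions made for illustration; `path.join`, `path.posix`, and `path.win32` are part of Node's built-in `path` module.

```javascript
const path = require("node:path");

// Hypothetical helper: joins a directory and a filename using the platform's
// separator. The optional pathApi parameter (an assumption for this sketch)
// lets callers force POSIX or Windows behavior, which is handy in tests.
function buildFilePath(directory, filename, pathApi = path) {
  return pathApi.join(directory, filename);
}

// The same call yields the right separator for each platform flavor.
const posixPath = buildFilePath("logs", "app.txt", path.posix); // "logs/app.txt"
const windowsPath = buildFilePath("logs", "app.txt", path.win32); // "logs\\app.txt"
```

By contrast, `directory + "/" + filename` hardcodes one separator, which is the pattern the POOR tier describes.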
599
-
600
- ---
601
-
602
- ### phil-010-best-practices (Story 10: Best Practices Research)
603
-
604
- **Input:**
605
-
606
- > Create a React component for a dropdown menu
607
-
608
- **Context files:**
609
-
610
- - `framework/SAFEWORD.md`
611
- - `framework/guides/code-philosophy.md`
612
-
613
- **Expected:** Agent follows React conventions (hooks, controlled components)
614
-
615
- **Rubric:**
616
-
617
- - EXCELLENT: Follows React best practices + mentions why (controlled vs uncontrolled)
618
- - ACCEPTABLE: Follows conventions without explicit mention
619
- - POOR: Ignores React conventions (e.g., direct DOM manipulation)
620
-
621
- ---
622
-
623
- ### phil-011-self-review (Story 11: Self-Review Gate)
624
-
625
- **Input:**
626
-
627
- > I've implemented the feature (agent completes work)
628
-
629
- **Context files:**
630
-
631
- - `framework/SAFEWORD.md`
632
- - `framework/guides/code-philosophy.md`
633
-
634
- **Expected:** Agent runs self-review checklist before declaring done
635
-
636
- **Rubric:**
637
-
638
- - EXCELLENT: Explicitly runs through checklist items, mentions test results
639
- - ACCEPTABLE: Mentions verification before completion
640
- - POOR: Declares done without any self-review
641
-
642
- ---
643
-
644
- ### phil-012-question-protocol (Story 12: Question-Asking Protocol)
645
-
646
- **Input:**
647
-
648
- > How should I structure the database schema? (requires domain knowledge)
649
-
650
- **Context files:**
651
-
652
- - `framework/SAFEWORD.md`
653
- - `framework/guides/code-philosophy.md`
654
-
655
- **Expected:** Agent asks after showing research attempt, focuses on domain preferences
656
-
657
- **Rubric:**
658
-
659
- - EXCELLENT: Shows what was researched + asks domain-specific question
660
- - ACCEPTABLE: Asks relevant question about domain preferences
661
- - POOR: Asks without showing any research attempt
662
-
663
- ---
664
-
665
- ### phil-013-tooling-currency (Story 13: Tooling Currency)
666
-
667
- **Input:**
668
-
669
- > I'm about to start a new project. Should I update my CLI tools?
670
-
671
- **Context files:**
672
-
673
- - `framework/guides/code-philosophy.md`
674
-
675
- **Expected:** Agent recommends checking/updating critical CLIs with workflow
676
-
677
- **Rubric:**
678
-
679
- - EXCELLENT: Recommends checking versions, lists critical CLIs (gh, aws), mentions breaking changes review, version pinning
680
- - ACCEPTABLE: Suggests updating tools before starting
681
- - POOR: Ignores tooling currency or says "no need to update"
682
-
683
- ---
684
-
685
- ### phil-014-git-workflow (Story 14: Git Workflow)
686
-
687
- **Input:**
688
-
689
- > Fix the login bug and add a new feature (two separate tasks)
690
-
691
- **Context files:**
692
-
693
- - `framework/SAFEWORD.md`
694
- - `framework/guides/code-philosophy.md`
695
-
696
- **Expected:** Agent makes separate commits for each task
697
-
698
- **Rubric:**
699
-
700
- - EXCELLENT: Separate atomic commits with descriptive messages for each task
701
- - ACCEPTABLE: Commits with clear messages
702
- - POOR: Single commit for unrelated changes or vague message like "misc fixes"
703
-
704
- ---
705
-
706
- ## context-files-guide.md
707
-
708
- ### ctx-001-file-selection (Story 1: Choose the Right Context File(s))
709
-
710
- **Input:**
711
-
712
- > Set up project context for a project using both Claude and Cursor
713
-
714
- **Context files:**
715
-
716
- - `framework/SAFEWORD.md`
717
- - `framework/guides/context-files-guide.md`
718
-
719
- **Expected:** Agent creates AGENTS.md (tool-agnostic) or both tool-specific files
720
-
721
- **Rubric:**
722
-
723
- - EXCELLENT: Creates AGENTS.md with clear rationale OR creates both tool-specific files
724
- - ACCEPTABLE: Creates appropriate context file
725
- - POOR: Creates wrong file type or doesn't explain choice
726
-
727
- ---
728
-
729
- ### ctx-002-safeword-trigger (Story 2: SAFEWORD Trigger Required)
730
-
731
- **Input:**
732
-
733
- > Create an AGENTS.md file for a new project
734
-
735
- **Context files:**
736
-
737
- - `framework/SAFEWORD.md`
738
- - `framework/guides/context-files-guide.md`
739
-
740
- **Expected:** Output includes SAFEWORD trigger at top with rationale
741
-
742
- **Rubric:**
743
-
744
- - EXCELLENT: Includes exact trigger format (`**⚠️ ALWAYS READ FIRST: @./.safeword/SAFEWORD.md**`) + brief rationale
745
- - ACCEPTABLE: Includes trigger but slightly different wording
746
- - POOR: Missing trigger or buried in middle of file
747
-
748
- ---
749
-
750
- ### ctx-003-no-duplication (Story 3: Respect Auto-Loading Behavior)
751
-
752
- **Input:**
753
-
754
- > Create a tests/AGENTS.md file for a project that already has a root AGENTS.md with TDD workflow documented
755
-
756
- **Context files:**
757
-
758
- - `framework/SAFEWORD.md`
759
- - `framework/guides/context-files-guide.md`
760
-
761
- **Expected:** Agent creates subdirectory file that references root for TDD, doesn't duplicate
762
-
763
- **Rubric:**
764
-
765
- - EXCELLENT: Uses cross-reference ("See root AGENTS.md for TDD workflow"), no duplication
766
- - ACCEPTABLE: Minimal duplication with cross-reference
767
- - POOR: Duplicates TDD workflow content from root
768
-
769
- ---
770
-
771
- ### ctx-004-modular-imports (Story 4: Modular File Structure)
772
-
773
- **Input:**
774
-
775
- > Create an AGENTS.md for a project with architecture decisions in docs/architecture.md and coding standards in docs/conventions.md
776
-
777
- **Context files:**
778
-
779
- - `framework/SAFEWORD.md`
780
- - `framework/guides/context-files-guide.md`
781
-
782
- **Expected:** Agent uses import syntax to reference external files
783
-
784
- **Rubric:**
785
-
786
- - EXCELLENT: Uses `@docs/architecture.md` and `@docs/conventions.md` imports, keeps root file under 50 lines
787
- - ACCEPTABLE: Uses imports but file is slightly over target
788
- - POOR: Duplicates content instead of importing
789
-
790
- ---
791
-
792
- ### ctx-005-content-rules (Story 5: Content Inclusion/Exclusion Rules)
793
-
794
- **Input:**
795
-
796
- > I want to add setup instructions and our TDD workflow to the AGENTS.md file
797
-
798
- **Context files:**
799
-
800
- - `framework/SAFEWORD.md`
801
- - `framework/guides/context-files-guide.md`
802
-
803
- **Expected:** Agent redirects setup to README.md, allows TDD workflow if project-specific
804
-
805
- **Rubric:**
806
-
807
- - EXCELLENT: Redirects setup to README.md, explains TDD belongs in root only if project-specific (otherwise tests/AGENTS.md)
808
- - ACCEPTABLE: Correctly redirects setup, allows TDD
809
- - POOR: Adds both to AGENTS.md without redirection
810
-
811
- ---
812
-
813
- ### ctx-006-size-targets (Story 6: Size Targets and Modularity)
814
-
815
- **Input:**
816
-
817
- > Review this AGENTS.md file that is 250 lines long
818
-
819
- **Context files:**
820
-
821
- - `framework/SAFEWORD.md`
822
- - `framework/guides/context-files-guide.md`
823
-
824
- **Expected:** Agent recommends extracting to subdirectory or using imports
825
-
826
- **Rubric:**
827
-
828
- - EXCELLENT: Identifies >200 line violation, recommends extraction or imports with specific suggestions
829
- - ACCEPTABLE: Identifies violation, recommends reduction
830
- - POOR: Accepts 250-line file without comment
831
-
832
- ---
833
-
834
- ### ctx-007-cross-reference (Story 7: Cross-Reference Pattern)
835
-
836
- **Input:**
837
-
838
- > Add a reference to the agents directory in the root AGENTS.md
839
-
840
- **Context files:**
841
-
842
- - `framework/SAFEWORD.md`
843
- - `framework/guides/context-files-guide.md`
844
-
845
- **Expected:** Agent uses the standard cross-reference pattern
846
-
847
- **Rubric:**
848
-
849
- - EXCELLENT: Uses pattern `**Agents** (\`path/\`) - Description. See \`path/AGENTS.md\`.`
850
- - ACCEPTABLE: Uses cross-reference with path and link
851
- - POOR: Duplicates content instead of cross-referencing
852
-
853
- ---
854
-
855
- ### ctx-008-maintenance (Story 8: Maintenance Rules)
856
-
857
- **Input:**
858
-
859
- > The project just underwent a major refactor. The AGENTS.md still references old directory structure.
860
-
861
- **Context files:**
862
-
863
- - `framework/SAFEWORD.md`
864
- - `framework/guides/context-files-guide.md`
865
-
866
- **Expected:** Agent recommends updating or removing outdated sections
867
-
868
- **Rubric:**
869
-
870
- - EXCELLENT: Identifies outdated content, recommends removal/update, mentions maintenance rules
871
- - ACCEPTABLE: Recommends updating the file
872
- - POOR: Ignores outdated content or suggests keeping it
873
-
874
- ---
875
-
876
- ### ctx-009-domain-requirements (Story 9: Domain Requirements Section)
877
-
878
- **Input:**
879
-
880
- > Create an AGENTS.md for a tabletop RPG game assistant project
881
-
882
- **Context files:**
883
-
884
- - `framework/SAFEWORD.md`
885
- - `framework/guides/context-files-guide.md`
886
-
887
- **Expected:** Agent includes Domain Requirements section with game mechanics
888
-
889
- **Rubric:**
890
-
891
- - EXCELLENT: Includes Domain Requirements with game mechanics (position/effect, fiction-first), uses template structure
892
- - ACCEPTABLE: Includes domain section but less detailed
893
- - POOR: Omits domain requirements for specialized project
894
-
895
- ---
896
-
897
- ### ctx-010-llm-checklist (Story 10: LLM Comprehension Checklist)
898
-
899
- **Input:**
900
-
901
- > Review this AGENTS.md file for LLM comprehension quality
902
-
903
- **Context files:**
904
-
905
- - `framework/SAFEWORD.md`
906
- - `framework/guides/context-files-guide.md`
907
-
908
- **Expected:** Agent applies the 8-point checklist from the guide
909
-
910
- **Rubric:**
911
-
912
- - EXCELLENT: Checks all 8 items (MECE, terms defined, no contradictions, examples, edge cases, actionable, no redundancy, size)
913
- - ACCEPTABLE: Checks 5+ items
914
- - POOR: Generic review without applying checklist
915
-
916
- ---
917
-
918
- ### ctx-011-token-efficiency (Story 11: Conciseness, Effectiveness, Token Budget)
919
-
920
- **Input:**
921
-
922
- > Review this 300-line AGENTS.md with narrative paragraphs for token efficiency
923
-
924
- **Context files:**
925
-
926
- - `framework/SAFEWORD.md`
927
- - `framework/guides/context-files-guide.md`
928
-
929
- **Expected:** Agent recommends converting to bullets, removing redundancy, using imports
930
-
931
- **Rubric:**
932
-
933
- - EXCELLENT: Identifies verbose content, recommends bullets over paragraphs, suggests imports for modularization
934
- - ACCEPTABLE: Recommends reducing size
935
- - POOR: Accepts verbose file without comment
936
-
937
- ---
938
-
939
- ## data-architecture-guide.md
940
-
941
- ### data-001-decision-tree (Story 1: Decide Where to Document)
942
-
943
- **Input:**
944
-
945
- > I'm adding a new Redis cache for session data. Where should I document this?
946
-
947
- **Context files:**
948
-
949
- - `framework/SAFEWORD.md`
950
- - `framework/guides/data-architecture-guide.md`
951
-
952
- **Expected:** Agent selects Architecture Doc (new data store)
953
-
954
- **Rubric:**
955
-
956
- - EXCELLENT: Correctly identifies Architecture Doc, cites "Adding new data store" from decision tree
957
- - ACCEPTABLE: Correctly identifies Architecture Doc
958
- - POOR: Suggests Design Doc for new data store
959
-
960
- ---
961
-
962
- ### data-002-principles (Story 2: Define Data Principles First)
963
-
964
- **Input:**
965
-
966
- > Create a data architecture section for a user management system
967
-
968
- **Context files:**
969
-
970
- - `framework/SAFEWORD.md`
971
- - `framework/guides/data-architecture-guide.md`
972
-
973
- **Expected:** Agent includes all 4 principles with What/Why/Document/Example format
974
-
975
- **Rubric:**
976
-
977
- - EXCELLENT: Includes all 4 principles (Quality, Governance, Accessibility, Living Doc) with What/Why/Document/Example format
978
- - ACCEPTABLE: Includes 3+ principles with consistent format
979
- - POOR: Missing principles or inconsistent format
980
-
981
- ---
982
-
983
- ### data-004-data-flows (Story 4: Document Data Flows)
984
-
985
- **Input:**
986
-
987
- > Document the data flow for user registration
988
-
989
- **Context files:**
990
-
991
- - `framework/SAFEWORD.md`
992
- - `framework/guides/data-architecture-guide.md`
993
-
994
- **Expected:** Agent documents sources → transformations → destinations with error handling at each step
995
-
996
- **Rubric:**
997
-
998
- - EXCELLENT: Documents full flow (input validation → business logic → persistence → UI update) with error handling for each step
999
- - ACCEPTABLE: Documents flow with some error handling
1000
- - POOR: Only documents happy path without error handling
1001
-
1002
- ---
1003
-
1004
- ### data-005-data-policies (Story 5: Specify Data Policies)
1005
-
1006
- **Input:**
1007
-
1008
- > Document data policies for a multi-tenant SaaS application
1009
-
1010
- **Context files:**
1011
-
1012
- - `framework/SAFEWORD.md`
1013
- - `framework/guides/data-architecture-guide.md`
1014
-
1015
- **Expected:** Agent documents access control, lifecycle, and conflict resolution
1016
-
1017
- **Rubric:**
1018
-
1019
- - EXCELLENT: Documents read/write/delete roles, lifecycle rules, conflict resolution strategy with justification
1020
- - ACCEPTABLE: Documents access control and lifecycle
1021
- - POOR: Missing conflict resolution or lifecycle rules
1022
-
1023
- ---
1024
-
1025
- ### data-006-tdd-triggers (Story 6: TDD Integration Triggers)
1026
-
1027
- **Input:**
1028
-
1029
- > I just added a new `payments` table to the database. What should I update?
1030
-
1031
- **Context files:**
1032
-
1033
- - `framework/SAFEWORD.md`
1034
- - `framework/guides/data-architecture-guide.md`
1035
-
1036
- **Expected:** Agent recommends updating architecture doc, cites data-specific triggers
1037
-
1038
- **Rubric:**
1039
-
1040
- - EXCELLENT: Recommends updating architecture doc, cites "Adding new data entities" trigger, mentions version/status update
1041
- - ACCEPTABLE: Recommends updating architecture doc
1042
- - POOR: Suggests only updating code without documentation
1043
-
1044
- ---
1045
-
1046
- ### data-007-common-mistakes (Story 7: Avoid Common Mistakes)
1047
-
1048
- **Input:**
1049
-
1050
- > Review this data architecture doc that has no migration strategy and uses vague performance targets like "fast queries"
1051
-
1052
- **Context files:**
1053
-
1054
- - `framework/SAFEWORD.md`
1055
- - `framework/guides/data-architecture-guide.md`
1056
-
1057
- **Expected:** Agent identifies both anti-patterns
1058
-
1059
- **Rubric:**
1060
-
1061
- - EXCELLENT: Identifies both issues (missing migration strategy, vague performance targets), cites Common Mistakes section
1062
- - ACCEPTABLE: Identifies at least one issue
1063
- - POOR: Accepts the doc without identifying anti-patterns
1064
-
1065
- ---
1066
-
1067
- ### data-008-checklist (Story 8: Best Practices Checklist Compliance)
1068
-
1069
- **Input:**
1070
-
1071
- > Review this data architecture doc for completeness before merge
1072
-
1073
- **Context files:**
1074
-
1075
- - `framework/SAFEWORD.md`
1076
- - `framework/guides/data-architecture-guide.md`
1077
-
1078
- **Expected:** Agent applies the 10-point checklist from the guide
1079
-
1080
- **Rubric:**
1081
-
1082
- - EXCELLENT: Checks all 10 items (principles format, entities, attributes, storage rationale, error handling, validation checkpoints, performance targets, migration strategy, version/status, cross-references)
1083
- - ACCEPTABLE: Checks 7+ items
1084
- - POOR: Generic review without applying checklist
1085
-
1086
- ---
1087
-
1088
- ## design-doc-guide.md
1089
-
1090
- ### design-001-prerequisites (Story 1: Verify Prerequisites)
1091
-
1092
- **Input:**
1093
-
1094
- > Create a design doc for a new search feature
1095
-
1096
- **Context files:**
1097
-
1098
- - `framework/SAFEWORD.md`
1099
- - `framework/guides/design-doc-guide.md`
1100
-
1101
- **Expected:** Agent first checks for user stories and test definitions before proceeding
1102
-
1103
- **Rubric:**
1104
-
1105
- - EXCELLENT: Asks about or checks for user stories and test definitions first, offers to create them if missing
1106
- - ACCEPTABLE: Mentions prerequisites exist/needed
1107
- - POOR: Creates design doc without checking prerequisites
1108
-
1109
- ---
1110
-
1111
- ### design-002-template (Story 2: Use Standard Template)
1112
-
1113
- **Input:**
1114
-
1115
- > Create a design doc for a notification system feature
1116
-
1117
- **Context files:**
1118
-
1119
- - `framework/SAFEWORD.md`
1120
- - `framework/guides/design-doc-guide.md`
1121
-
1122
- **Expected:** Agent uses the standard template structure
1123
-
1124
- **Rubric:**
1125
-
1126
- - EXCELLENT: Uses template structure (Architecture, Components, Data Model, User Flow, Key Decisions), marks optional sections "(if applicable)", saves to correct location
1127
- - ACCEPTABLE: Uses template structure with most sections
1128
- - POOR: Creates ad-hoc structure without following template
1129
-
1130
- ---
1131
-
1132
- ### design-004-components-pattern (Story 4: Components with [N]/[N+1] Pattern)
1133
-
1134
- **Input:**
1135
-
1136
- > Define the components for a file upload feature in a design doc
1137
-
1138
- **Context files:**
1139
-
1140
- - `framework/SAFEWORD.md`
1141
- - `framework/guides/design-doc-guide.md`
1142
-
1143
- **Expected:** Agent uses [N]/[N+1] pattern with full component definitions
1144
-
1145
- **Rubric:**
1146
-
1147
- - EXCELLENT: Defines Component 1 with all 5 attributes (name, responsibility, interface, dependencies, tests), then Component 2 showing variation
1148
- - ACCEPTABLE: Defines multiple components with most attributes
1149
- - POOR: Lists components without [N]/[N+1] pattern or missing key attributes
1150
-
1151
- ---
1152
-
1153
- ### design-005-data-model (Story 5: Data Model)
1154
-
1155
- **Input:**
1156
-
1157
- > Write the data model section for a design doc about a shopping cart feature
1158
-
1159
- **Context files:**
1160
-
1161
- - `framework/SAFEWORD.md`
1162
- - `framework/guides/design-doc-guide.md`
1163
-
1164
- **Expected:** Agent documents state shape, relationships, and flow
1165
-
1166
- **Rubric:**
1167
-
1168
- - EXCELLENT: Documents state shape/schema, shows type relationships, explains data flow through components
1169
- - ACCEPTABLE: Documents state shape with some relationships
1170
- - POOR: Skips data model for a feature that clearly needs one, or provides vague description
1171
-
1172
- ---
1173
-
1174
- ### design-006-component-interaction (Story 6: Component Interaction)
1175
-
1176
- **Input:**
1177
-
1178
- > Document the component interaction for a drag-and-drop file organizer feature
1179
-
1180
- **Context files:**
1181
-
1182
- - `framework/SAFEWORD.md`
1183
- - `framework/guides/design-doc-guide.md`
1184
-
1185
- **Expected:** Agent documents events, data flow between components, and edge cases
1186
-
1187
- **Rubric:**
1188
-
1189
- - EXCELLENT: Documents events/method calls, shows data flow (Component N → N+1), notes edge cases in interactions
1190
- - ACCEPTABLE: Documents communication pattern and data flow
1191
- - POOR: Skips interaction section for a multi-component feature
1192
-
1193
- ---
1194
-
1195
- ### design-007-user-flow (Story 7: User Flow)
1196
-
1197
- **Input:**
1198
-
1199
- > Write the user flow section for a design doc about a password reset feature
1200
-
1201
- **Context files:**
1202
-
1203
- - `framework/SAFEWORD.md`
1204
- - `framework/guides/design-doc-guide.md`
1205
-
1206
- **Expected:** Agent writes concrete step-by-step flow with specific UI interactions
1207
-
1208
- **Rubric:**
1209
-
1210
- - EXCELLENT: Concrete steps with specific UI elements (buttons, forms, keyboard shortcuts), references user stories/test definitions
1211
- - ACCEPTABLE: Step-by-step flow with some concrete details
1212
- - POOR: Vague flow like "user resets password" without concrete steps
1213
-
1214
- ---
1215
-
1216
- ### design-008-key-decisions (Story 8: Key Decisions with Trade-offs)
1217
-
1218
- **Input:**
1219
-
1220
- > Write the key decisions section for a design doc about choosing between REST and GraphQL for an API
1221
-
1222
- **Context files:**
1223
-
1224
- - `framework/SAFEWORD.md`
1225
- - `framework/guides/design-doc-guide.md`
1226
-
1227
- **Expected:** Agent documents decision with what/why/trade-off format using [N]/[N+1] pattern
1228
-
1229
- **Rubric:**
1230
-
1231
- - EXCELLENT: Decision 1 with what/why (specifics)/trade-off, Decision 2 showing variation, links to benchmarks if relevant
1232
- - ACCEPTABLE: Decisions with what/why/trade-off
1233
- - POOR: Decisions without trade-offs or vague rationale
1234
-
1235
- ---
1236
-
1237
- ### design-009-implementation-notes (Story 9: Implementation Notes)
1238
-
1239
- **Input:**
1240
-
1241
- > Write the implementation notes section for a design doc about a real-time collaborative editing feature
1242
-
1243
- **Context files:**
1244
-
1245
- - `framework/SAFEWORD.md`
1246
- - `framework/guides/design-doc-guide.md`
1247
-
1248
- **Expected:** Agent documents constraints, error handling, gotchas, and open questions
1249
-
1250
- **Rubric:**
1251
-
1252
- - EXCELLENT: Documents all 4 areas (constraints, error handling, gotchas, open questions) with specific details
1253
- - ACCEPTABLE: Documents 3+ areas
1254
- - POOR: Skips implementation notes for a complex feature with obvious risks
1255
-
1256
- ---
1257
-
1258
- ### design-010-quality-checklist (Story 10: Quality Checklist)
1259
-
1260
- **Input:**
1261
-
1262
- > Review this design doc for quality before merge
1263
-
1264
- **Context files:**
1265
-
1266
- - `framework/SAFEWORD.md`
1267
- - `framework/guides/design-doc-guide.md`
1268
-
1269
- **Expected:** Agent applies the 6-point checklist from the guide
1270
-
1271
- **Rubric:**
1272
-
1273
- - EXCELLENT: Checks all 6 items (references not duplicates, [N]/[N+1] examples, concrete user flow, what/why/trade-off, optional markers, ~121 lines)
1274
- - ACCEPTABLE: Checks 4+ items
1275
- - POOR: Generic review without applying checklist
1276
-
1277
- ---
1278
-
1279
- ## learning-extraction.md
1280
-
1281
- ### learn-001-triggers (Story 1: Trigger-Based Extraction)
1282
-
1283
- **Input:**
1284
-
1285
- > I've been debugging this React state issue for 6 cycles now, tried 4 different approaches, and finally found it's a race condition not documented in the React docs
1286
-
1287
- **Context files:**
1288
-
1289
- - `framework/SAFEWORD.md`
1290
- - `framework/guides/learning-extraction.md`
1291
-
1292
- **Expected:** Agent recognizes multiple triggers and suggests extraction
1293
-
1294
- **Rubric:**
1295
-
1296
- - EXCELLENT: Identifies 3+ triggers (observable complexity, trial-and-error, undocumented gotcha), suggests extraction after fix confirmed
1297
- - ACCEPTABLE: Identifies triggers, suggests extraction
1298
- - POOR: Doesn't recognize triggers or suggests extraction mid-debug
1299
-
1300
- ---
1301
-
1302
- ### learn-002-check-existing (Story 2: Check Existing Learnings First)
1303
-
1304
- **Input:**
1305
-
1306
- > I just discovered a gotcha about React hooks and async state updates
1307
-
1308
- **Context files:**
1309
-
1310
- - `framework/SAFEWORD.md`
1311
- - `framework/guides/learning-extraction.md`
1312
-
1313
- **Expected:** Agent checks for existing learnings before suggesting extraction
1314
-
1315
- **Rubric:**
1316
-
1317
- - EXCELLENT: Checks for existing learnings (`*react*.md`, `*hooks*.md`, `*async*.md`), reads if found, suggests update vs new
1318
- - ACCEPTABLE: Mentions checking for existing learnings
1319
- - POOR: Suggests creating new learning without checking existing
1320
-
1321
- ---
1322
-
1323
- ### learn-003-location (Story 3: Place Learnings in Correct Location)
1324
-
1325
- **Input:**
1326
-
1327
- > I learned that React useState is async - where should I document this?
1328
-
1329
- **Context files:**
1330
-
1331
- - `framework/SAFEWORD.md`
1332
- - `framework/guides/learning-extraction.md`
1333
-
1334
- **Expected:** Agent selects global learnings (applies to ALL React projects)
1335
-
1336
- **Rubric:**
1337
-
1338
- - EXCELLENT: Selects `.safeword/learnings/` (global), explains why (applies to all projects), cites decision tree
1339
- - ACCEPTABLE: Selects correct location
1340
- - POOR: Selects project-specific location for universal React pattern
1341
-
1342
- ---
1343
-
1344
- ### learn-004-precedence (Story 4: Respect Instruction Precedence)
1345
-
1346
- **Input:**
1347
-
1348
- > The global learning says use Redux, but the project learning says use Zustand. Which should I follow?
1349
-
1350
- **Context files:**
1351
-
1352
- - `framework/SAFEWORD.md`
1353
- - `framework/guides/learning-extraction.md`
1354
-
1355
- **Expected:** Agent follows project learning (higher precedence)
1356
-
1357
- **Rubric:**
1358
-
1359
- - EXCELLENT: Follows project learning, explains precedence order (project > global), cites cascading precedence
1360
- - ACCEPTABLE: Follows project learning
1361
- - POOR: Follows global learning or asks which to use
1362
-
1363
- ---
1364
-
1365
- ### learn-005-templates (Story 5: Use Templates)
1366
-
1367
- **Input:**
1368
-
1369
- > Create a learning about React useEffect cleanup functions
1370
-
1371
- **Context files:**
1372
-
1373
- - `framework/SAFEWORD.md`
1374
- - `framework/guides/learning-extraction.md`
1375
-
1376
- **Expected:** Agent uses the forward-looking learning template with all required sections
1377
-
1378
- **Rubric:**
1379
-
1380
- - EXCELLENT: Uses template with Principle, Gotcha (Bad/Good), Why, Examples, Testing Trap sections
1381
- - ACCEPTABLE: Uses template with most sections
1382
- - POOR: Creates ad-hoc structure without following template
1383
-
1384
- ---
1385
-
1386
- ### learn-006-cross-reference (Story 6: SAFEWORD.md Cross-Reference)
1387
-
1388
- **Input:**
1389
-
1390
- > I just created a learning at .safeword/learnings/electron-contexts.md about Electron renderer context
1391
-
1392
- **Context files:**
1393
-
1394
- - `framework/SAFEWORD.md`
1395
- - `framework/guides/learning-extraction.md`
1396
-
1397
- **Expected:** Agent suggests adding cross-reference to SAFEWORD.md Common Gotchas
1398
-
1399
- **Rubric:**
1400
-
1401
- - EXCELLENT: Suggests adding to SAFEWORD.md Common Gotchas with bold name + one-liner + link format
1402
- - ACCEPTABLE: Suggests adding cross-reference
1403
- - POOR: Doesn't mention cross-referencing in SAFEWORD.md
1404
-
1405
- ---
1406
-
1407
- ### learn-007-suggestion-timing (Story 7: Suggest Extraction at the Right Time)
1408
-
1409
- **Input:**
1410
-
1411
- > Fixed a typo in the config file
1412
-
1413
- **Context files:**
1414
-
1415
- - `framework/SAFEWORD.md`
1416
- - `framework/guides/learning-extraction.md`
1417
-
1418
- **Expected:** Agent does NOT suggest extraction (low confidence - trivial fix)
1419
-
1420
- **Rubric:**
1421
-
1422
- - EXCELLENT: Does not suggest extraction, recognizes trivial fix
1423
- - ACCEPTABLE: Doesn't mention extraction
1424
- - POOR: Suggests extraction for trivial fix
1425
-
1426
- ---
1427
-
1428
- ### learn-008-maintenance (Story 8: Review and Maintenance Cycle)
1429
-
1430
- **Input:**
1431
-
1432
- > This learning file is 250 lines and covers both React hooks and Redux patterns
1433
-
1434
- **Context files:**
1435
-
1436
- - `framework/SAFEWORD.md`
1437
- - `framework/guides/learning-extraction.md`
1438
-
1439
- **Expected:** Agent recommends splitting into focused files
1440
-
1441
- **Rubric:**
1442
-
1443
- - EXCELLENT: Recommends splitting (>200 lines, multiple concepts), suggests specific split
1444
- - ACCEPTABLE: Recommends splitting
1445
- - POOR: Accepts 250-line multi-concept file without comment
1446
-
1447
- ---
1448
-
1449
- ### learn-010-workflow (Story 10: Workflow Integration)
1450
-
1451
- **Input:**
1452
-
1453
- > I just finished implementing a complex feature and discovered a race condition pattern. Walk me through documenting this.
1454
-
1455
- **Context files:**
1456
-
1457
- - `framework/SAFEWORD.md`
1458
- - `framework/guides/learning-extraction.md`
1459
-
1460
- **Expected:** Agent follows the workflow steps
1461
-
1462
- **Rubric:**
1463
-
1464
- - EXCELLENT: Follows workflow (assess scope → choose location → extract using template → cross-reference in SAFEWORD.md → suggest commit message)
1465
- - ACCEPTABLE: Follows most workflow steps
1466
- - POOR: Ad-hoc extraction without following workflow
1467
-
1468
- ---
1469
-
1470
- ### learn-011-anti-patterns (Story 11: Anti-Patterns to Avoid)
1471
-
1472
- **Input:**
1473
-
1474
- > I want to create a learning that says "Changed == to ==="
1475
-
1476
- **Context files:**
1477
-
1478
- - `framework/SAFEWORD.md`
1479
- - `framework/guides/learning-extraction.md`
1480
-
1481
- **Expected:** Agent blocks this as trivial one-liner
1482
-
1483
- **Rubric:**
1484
-
1485
- - EXCELLENT: Blocks extraction, cites anti-pattern "One-line fixes without context"
1486
- - ACCEPTABLE: Suggests this is too trivial
1487
- - POOR: Proceeds with extraction
1488
-
1489
- ---
1490
-
1491
- ### learn-012-size-standards (Story 12: Directory & Size Standards)
1492
-
1493
- **Input:**
1494
-
1495
- > I'm creating a learning file that's 180 lines and covers both React hooks and Redux patterns
1496
-
1497
- **Context files:**
1498
-
1499
- - `framework/SAFEWORD.md`
1500
- - `framework/guides/learning-extraction.md`
1501
-
1502
- **Expected:** Agent recommends splitting based on size and scope
1503
-
1504
- **Rubric:**
1505
-
1506
- - EXCELLENT: Recommends splitting (>150 lines, multiple concepts), suggests specific split
1507
- - ACCEPTABLE: Notes it's borderline, recommends review
1508
- - POOR: Accepts 180-line multi-concept file without comment
1509
-
1510
- ---
1511
-
1512
- ## llm-instruction-design.md
1513
-
1514
- ### llm-001-mece (Story 1: MECE Decision Trees)
1515
-
1516
- **Input:**
1517
-
1518
- > I'm writing a decision tree for choosing between unit, integration, and E2E tests. Here's my draft:
1519
- >
1520
- > - Is it a pure function?
1521
- > - Does it interact with multiple components?
1522
- > - Does it test the full user flow?
1523
-
1524
- **Context files:**
1525
-
1526
- - `framework/guides/llm-instruction-design.md`
1527
-
1528
- **Expected:** Agent identifies overlapping branches and suggests sequential MECE structure
1529
-
1530
- **Rubric:**
1531
-
1532
- - EXCELLENT: Identifies overlap ("multiple components" and "full user flow" can both apply), suggests sequential ordering with first-match stop
1533
- - ACCEPTABLE: Notes ambiguity, suggests improvement
1534
- - POOR: Accepts overlapping branches without comment
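One way to picture "sequential ordering with first-match stop" is a function that asks the questions in a fixed order and returns at the first yes. The function name, the flag names, the chosen question order, and the fallback are all assumptions for this sketch; the guide itself may order the questions differently.

```javascript
// Hypothetical first-match decision function. Because evaluation stops at the
// first matching question, a case that satisfies several questions still gets
// exactly one answer, which removes the overlap in the draft tree.
function chooseTestType({ isPureFunction, spansFullUserFlow, touchesMultipleComponents }) {
  if (isPureFunction) return "unit";
  // Assumed order: a full user flow usually also touches multiple components,
  // so it is checked before the broader "multiple components" question.
  if (spansFullUserFlow) return "e2e";
  if (touchesMultipleComponents) return "integration";
  return "unit"; // Assumed fallback when no question matches.
}
```

With this ordering, code that both spans a full user flow and touches multiple components resolves to `"e2e"` instead of matching two branches at once.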
1535
-
1536
- ---
1537
-
1538
- ### llm-002-explicit-definitions (Story 2: Explicit Definitions)
1539
-
1540
- **Input:**
1541
-
1542
- > I'm writing documentation that says "Test critical paths at the lowest level possible"
1543
-
1544
- **Context files:**
1545
-
1546
- - `framework/guides/llm-instruction-design.md`
1547
-
1548
- **Expected:** Agent identifies vague terms and suggests explicit definitions
1549
-
1550
- **Rubric:**
1551
-
1552
- - EXCELLENT: Identifies both "critical paths" and "lowest level" as vague, suggests explicit definitions with examples
1553
- - ACCEPTABLE: Identifies at least one vague term
1554
- - POOR: Accepts vague phrasing without comment
1555
-
1556
- ---
1557
-
1558
- ### llm-003-no-contradictions (Story 3: No Contradictions)
1559
-
1560
- **Input:**
1561
-
1562
- > I'm updating our testing guide. Section A says "Write E2E tests for all user-facing features" but Section B says "E2E tests only for critical paths". Should I keep both?
1563
-
1564
- **Context files:**
1565
-
1566
- - `framework/guides/llm-instruction-design.md`
1567
-
1568
- **Expected:** Agent identifies contradiction and suggests reconciliation
1569
-
1570
- **Rubric:**
1571
-
1572
- - EXCELLENT: Identifies contradiction, suggests reconciling into single rule with explicit definition of "critical"
1573
- - ACCEPTABLE: Identifies contradiction, suggests removing one
1574
- - POOR: Accepts both statements without noting conflict
1575
-
1576
- ---
1577
-
1578
- ### llm-004-concrete-examples (Story 4: Concrete Examples)
1579
-
1580
- **Input:**
1581
-
1582
- > I'm writing a rule that says "Use meaningful variable names". Is this good enough?
1583
-
1584
- **Context files:**
1585
-
1586
- - `framework/guides/llm-instruction-design.md`
1587
-
1588
- **Expected:** Agent suggests adding BAD/GOOD examples
1589
-
1590
- **Rubric:**
1591
-
1592
- - EXCELLENT: Suggests adding 2-3 concrete BAD/GOOD examples (e.g., `x` vs `userCount`)
1593
- - ACCEPTABLE: Suggests adding at least one example
1594
- - POOR: Accepts abstract rule without examples
1595
-
1596
- ---
1597
-
1598
- ### llm-005-edge-cases (Story 5: Edge Cases Explicit)
1599
-
1600
- **Input:**
1601
-
1602
- > I'm writing a rule: "Unit test all pure functions". Is this complete?
1603
-
1604
- **Context files:**
1605
-
1606
- - `framework/guides/llm-instruction-design.md`
1607
-
1608
- **Expected:** Agent suggests adding edge cases section
1609
-
1610
- **Rubric:**
1611
-
1612
- - EXCELLENT: Suggests adding edge cases (Date.now(), process.env, mixed pure+I/O)
1613
- - ACCEPTABLE: Suggests adding at least one edge case
1614
- - POOR: Accepts rule without edge cases
1615
-
1616
- ---
1617
-
1618
- ### llm-006-actionable (Story 6: Actionable, Not Vague)
1619
-
1620
- **Input:**
1621
-
1622
- > I'm writing guidance: "Most of your tests should be fast, some can be slow". Is this clear enough?
1623
-
1624
- **Context files:**
1625
-
1626
- - `framework/guides/llm-instruction-design.md`
1627
-
1628
- **Expected:** Agent identifies vague terms and suggests actionable alternatives
1629
-
1630
- **Rubric:**
1631
-
1632
- - EXCELLENT: Identifies "most/some" as vague, suggests concrete rules with red flags
1633
- - ACCEPTABLE: Identifies vagueness, suggests improvement
1634
- - POOR: Accepts vague guidance without comment
1635
-
1636
- ---
1637
-
1638
- ### llm-007-sequential (Story 7: Sequential Decision Trees)
1639
-
1640
- **Input:**
1641
-
1642
- > I have a decision tree with three parallel branches:
1643
- >
1644
- > - Is it a pure function?
1645
- > - Does it interact with the database?
1646
- > - Does it render UI?
1647
-
1648
- **Context files:**
1649
-
1650
- - `framework/guides/llm-instruction-design.md`
1651
-
1652
- **Expected:** Agent suggests converting to sequential with first-match stop
1653
-
1654
- **Rubric:**
1655
-
1656
- - EXCELLENT: Suggests sequential ordering with explicit "stop at first match" instruction
1657
- - ACCEPTABLE: Suggests ordering the questions
1658
- - POOR: Accepts parallel structure without comment
1659
-
1660
- ---
1661
-
1662
- ### llm-008-tie-breaking (Story 8: Tie-Breaking Rules)
1663
-
1664
- **Input:**
1665
-
1666
- > I have a decision tree where both unit test and integration test could work for testing a calculation that uses a database. Which should I choose?
1667
-
1668
- **Context files:**
1669
-
1670
- - `framework/guides/llm-instruction-design.md`
1671
-
1672
- **Expected:** Agent applies tie-breaking rule (choose fastest)
1673
-
1674
- **Rubric:**
1675
-
1676
- - EXCELLENT: Applies tie-breaking rule, chooses unit test with mocked database (faster)
1677
- - ACCEPTABLE: Mentions tie-breaking, makes a choice
1678
- - POOR: Leaves choice ambiguous or doesn't mention tie-breaking
1679
-
1680
- ---
1681
-
1682
- ### llm-009-lookup-tables (Story 9: Lookup Tables for Complex Logic)
1683
-
1684
- **Input:**
1685
-
1686
- > I have 5 different scenarios for choosing between unit, integration, and E2E tests. Should I write them as prose paragraphs?
1687
-
1688
- **Context files:**
1689
-
1690
- - `framework/guides/llm-instruction-design.md`
1691
-
1692
- **Expected:** Agent suggests using a lookup table
1693
-
1694
- **Rubric:**
1695
-
1696
- - EXCELLENT: Suggests lookup table format with clear columns (Scenario/Unit/Integration/E2E/Best Choice)
1697
- - ACCEPTABLE: Suggests table format
1698
- - POOR: Accepts prose paragraphs for 5 scenarios
1699
-
1700
- ---
1701
-
1702
- ### llm-010-no-caveats (Story 10: No Caveats in Tables)
1703
-
1704
- **Input:**
1705
-
1706
- > I have a table cell that says "Unit test ✅ (unless it uses external APIs)". Is this okay?
1707
-
1708
- **Context files:**
1709
-
1710
- - `framework/guides/llm-instruction-design.md`
1711
-
1712
- **Expected:** Agent suggests removing caveat from cell
1713
-
1714
- **Rubric:**
1715
-
1716
- - EXCELLENT: Suggests creating separate row for external API case, removing parenthetical
1717
- - ACCEPTABLE: Identifies parenthetical as problem
1718
- - POOR: Accepts caveat in cell
1719
-
1720
- ---
1721
-
1722
- ### llm-011-percentages (Story 11: Percentages with Context)
1723
-
1724
- **Input:**
1725
-
1726
- > I'm writing guidance: "Aim for 80% unit tests, 15% integration tests, 5% E2E tests". Is this clear?
1727
-
1728
- **Context files:**
1729
-
1730
- - `framework/guides/llm-instruction-design.md`
1731
-
1732
- **Expected:** Agent suggests adding context/adjustments or principles-based alternative
1733
-
1734
- **Rubric:**
1735
-
1736
- - EXCELLENT: Suggests adding adjustments for different project types OR suggests principles-based alternative
1737
- - ACCEPTABLE: Notes percentages need context
1738
- - POOR: Accepts standalone percentages without comment
1739
-
1740
- ---
1741
-
1742
- ### llm-012-specific-questions (Story 12: Specific Questions)
1743
-
1744
- **Input:**
1745
-
1746
- > I'm writing a decision tree question: "Does this test need to see the UI?" Is this specific enough?
1747
-
1748
- **Context files:**
1749
-
1750
- - `framework/guides/llm-instruction-design.md`
1751
-
1752
- **Expected:** Agent suggests more specific wording
1753
-
1754
- **Rubric:**
1755
-
1756
- - EXCELLENT: Suggests tool-specific wording like "real browser (Playwright/Cypress)" and clarifies RTL distinction
1757
- - ACCEPTABLE: Suggests more specific wording
1758
- - POOR: Accepts vague "see the UI" phrasing
1759
-
1760
- ---
1761
-
1762
- ### llm-013-re-evaluation (Story 13: Re-evaluation Paths)
1763
-
1764
- **Input:**
1765
-
1766
- > I have a feature that doesn't fit any of my testing categories. What should I do?
1767
-
1768
- **Context files:**
1769
-
1770
- - `framework/guides/llm-instruction-design.md`
1771
-
1772
- **Expected:** Agent provides decomposition strategy
1773
-
1774
- **Rubric:**
1775
-
1776
- - EXCELLENT: Provides 3-step decomposition (separate concerns → test each → example)
1777
- - ACCEPTABLE: Suggests breaking down the feature
1778
- - POOR: Says "re-evaluate your approach" without concrete steps
1779
-
1780
- ---
1781
-
1782
- ### llm-014-anti-patterns (Story 14: Anti-Patterns Guard)
1783
-
1784
- **Input:**
1785
-
1786
- > I'm writing documentation that says "Follow the test pyramid - lots of unit tests at the base, integration in the middle, E2E at the top"
1787
-
1788
- **Context files:**
1789
-
1790
- - `framework/guides/llm-instruction-design.md`
1791
-
1792
- **Expected:** Agent identifies visual metaphor anti-pattern
1793
-
1794
- **Rubric:**
1795
-
1796
- - EXCELLENT: Identifies "test pyramid" as visual metaphor, suggests actionable alternative
1797
- - ACCEPTABLE: Notes visual metaphor issue
1798
- - POOR: Accepts visual metaphor without comment
1799
-
1800
- ---
1801
-
1802
- ### llm-015-quality-checklist (Story 15: Quality Checklist Compliance)
1803
-
1804
- **Input:**
1805
-
1806
- > I just finished writing an LLM instruction document. What should I check before committing?
1807
-
1808
- **Context files:**
1809
-
1810
- - `framework/guides/llm-instruction-design.md`
1811
-
1812
- **Expected:** Agent provides quality checklist items
1813
-
1814
- **Rubric:**
1815
-
1816
- - EXCELLENT: Lists most/all checklist items (MECE, definitions, examples, edge cases, etc.)
1817
- - ACCEPTABLE: Lists several key checklist items
1818
- - POOR: Generic advice without specific checklist
1819
-
1820
- ---
1821
-
1822
- ## llm-prompting.md
1823
-
1824
- ### prompt-001-concrete-examples (Story 1: Concrete Examples in Prompts)
1825
-
1826
- **Input:**
1827
-
1828
- > I'm writing a prompt that says "Return the user's intent". Is this good enough?
1829
-
1830
- **Context files:**
1831
-
1832
- - `framework/guides/llm-prompting.md`
1833
-
1834
- **Expected:** Agent suggests adding BAD/GOOD examples with concrete format
1835
-
1836
- **Rubric:**
1837
-
1838
- - EXCELLENT: Suggests adding structured JSON example showing BAD (prose) vs GOOD (JSON schema)
1839
- - ACCEPTABLE: Suggests being more specific
1840
- - POOR: Accepts vague prompt without examples
1841
-
1842
- ---
1843
-
1844
- ### prompt-002-structured-outputs (Story 2: Structured Outputs via JSON)
1845
-
1846
- **Input:**
1847
-
1848
- > I'm building an AI agent that needs to understand user intent. Should I have it return prose like "The user wants to create a campaign"?
1849
-
1850
- **Context files:**
1851
-
1852
- - `framework/guides/llm-prompting.md`
1853
-
1854
- **Expected:** Agent recommends structured JSON output
1855
-
1856
- **Rubric:**
1857
-
1858
- - EXCELLENT: Recommends JSON schema with explicit fields (intent, name, etc.), shows example
1859
- - ACCEPTABLE: Suggests structured output
1860
- - POOR: Accepts prose output for machine consumption
1861
-
1862
- ---
1863
-
1864
- ### prompt-003-caching (Story 3: Prompt Caching for Cost Reduction)
1865
-
1866
- **Input:**
1867
-
1868
- > I have a 500-line system prompt that includes both static rules and the current character state. How should I structure this?
1869
-
1870
- **Context files:**
1871
-
1872
- - `framework/guides/llm-prompting.md`
1873
-
1874
- **Expected:** Agent recommends separating static (cached) from dynamic (uncached)
1875
-
1876
- **Rubric:**
1877
-
1878
- - EXCELLENT: Recommends static rules with cache_control: ephemeral, dynamic state in user message, mentions cost reduction
1879
- - ACCEPTABLE: Suggests separating static from dynamic
1880
- - POOR: Accepts mixed static/dynamic in system prompt
1881
-
1882
- ---
1883
-
1884
- ### prompt-004-message-architecture (Story 4: Message Architecture)
1885
-
1886
- **Input:**
1887
-
1888
- > I'm interpolating the user's character state directly into my system prompt like this: systemPrompt = `Rules + Character: ${dynamicState}`. Is this okay?
1889
-
1890
- **Context files:**
1891
-
1892
- - `framework/guides/llm-prompting.md`
1893
-
1894
- **Expected:** Agent identifies this as BAD pattern
1895
-
1896
- **Rubric:**
1897
-
1898
- - EXCELLENT: Identifies as BAD (uncacheable), recommends moving dynamic state to user message
1899
- - ACCEPTABLE: Suggests separating static from dynamic
1900
- - POOR: Accepts dynamic state in system prompt
1901
-
1902
- ---
1903
-
1904
- ### prompt-005-cache-invalidation (Story 5: Cache Invalidation Discipline)
1905
-
1906
- **Input:**
1907
-
1908
- > I want to add a small clarification to my cached system prompt. Should I just make the change?
1909
-
1910
- **Context files:**
1911
-
1912
- - `framework/guides/llm-prompting.md`
1913
-
1914
- **Expected:** Agent warns about cache invalidation
1915
-
1916
- **Rubric:**
1917
-
1918
- - EXCELLENT: Warns "any change breaks all caches", suggests batching edits, mentions rebuild cost
1919
- - ACCEPTABLE: Notes cache invalidation concern
1920
- - POOR: Suggests making change without mentioning cache impact
1921
-
1922
- ---
1923
-
1924
- ### prompt-006-llm-as-judge (Story 6: LLM-as-Judge Evaluations)
1925
-
1926
- **Input:**
1927
-
1928
- > I want to test if my AI GM's responses have a "collaborative tone". Should I check for specific keywords like "together" or "we"?
1929
-
1930
- **Context files:**
1931
-
1932
- - `framework/guides/llm-prompting.md`
1933
-
1934
- **Expected:** Agent recommends LLM-as-judge with rubric
1935
-
1936
- **Rubric:**
1937
-
1938
- - EXCELLENT: Recommends LLM-as-judge pattern with EXCELLENT/ACCEPTABLE/POOR rubric, warns against brittle keywords
1939
- - ACCEPTABLE: Suggests rubric-based evaluation
1940
- - POOR: Accepts keyword matching for creative outputs
1941
-
1942
- ---
1943
-
1944
- ### prompt-007-eval-framework (Story 7: Evaluation Framework Mapping)
1945
-
1946
- **Input:**
1947
-
1948
- > I have a function that parses JSON, an agent that calls an LLM, and a judgment about narrative quality. What test types should I use?
1949
-
1950
- **Context files:**
1951
-
1952
- - `framework/guides/llm-prompting.md`
1953
-
1954
- **Expected:** Agent maps to correct test types
1955
-
1956
- **Rubric:**
1957
-
1958
- - EXCELLENT: JSON parsing → Unit test, Agent + LLM → Integration test, Narrative quality → LLM Eval
1959
- - ACCEPTABLE: Correctly identifies at least 2 mappings
1960
- - POOR: Suggests same test type for all
1961
-
1962
- ---
1963
-
1964
- ### prompt-008-cost-awareness (Story 8: Cost Awareness for Evals)
1965
-
1966
- **Input:**
1967
-
1968
- > I want to run 100 LLM evaluation scenarios in CI. What should I consider?
1969
-
1970
- **Context files:**
1971
-
1972
- - `framework/guides/llm-prompting.md`
1973
-
1974
- **Expected:** Agent provides cost guidance
1975
-
1976
- **Rubric:**
1977
-
1978
- - EXCELLENT: Mentions typical costs (~$0.15-0.30 for 30 scenarios with caching), suggests caching rubrics, budget expectations
1979
- - ACCEPTABLE: Notes cost considerations
1980
- - POOR: Ignores cost implications
1981
-
1982
- ---
1983
-
1984
- ### prompt-009-why-over-what (Story 9: "Why" Over "What" in Prompts)
1985
-
1986
- **Input:**
1987
-
1988
- > My prompt says "Use JSON output". Should I add more context?
1989
-
1990
- **Context files:**
1991
-
1992
- - `framework/guides/llm-prompting.md`
1993
-
1994
- **Expected:** Agent suggests adding rationale
1995
-
1996
- **Rubric:**
1997
-
1998
- - EXCELLENT: Suggests adding "why" (predictable parsing, validation), specific benefits, trade-offs
1999
- - ACCEPTABLE: Suggests adding rationale
2000
- - POOR: Accepts bare instruction without context
2001
-
2002
- ---
2003
-
2004
- ### prompt-010-precise-terms (Story 10: Precise Technical Terms)
2005
-
2006
- **Input:**
2007
-
2008
- > My decision tree asks "Does this test need to see the UI?"
2009
-
2010
- **Context files:**
2011
-
2012
- - `framework/guides/llm-prompting.md`
2013
-
2014
- **Expected:** Agent suggests more precise wording
2015
-
2016
- **Rubric:**
2017
-
2018
- - EXCELLENT: Suggests "real browser (Playwright/Cypress)", clarifies RTL is not a real browser
2019
- - ACCEPTABLE: Suggests more specific wording
2020
- - POOR: Accepts vague "see the UI" phrasing
2021
-
2022
- ---
2023
-
2024
- ## tdd-best-practices.md (formerly tdd-templates.md)
2025
-
2026
- ### tdd-001-template-selection (Story 1: Select Correct Template)
2027
-
2028
- **Input:**
2029
-
2030
- > I need to document: (1) a new user authentication feature, (2) the tests for that feature, (3) how the components will interact, and (4) the overall project data model. Which templates should I use?
2031
-
2032
- **Context files:**
2033
-
2034
- - `framework/guides/tdd-templates.md`
2035
-
2036
- **Expected:** Agent maps each to correct template
2037
-
2038
- **Rubric:**
2039
-
2040
- - EXCELLENT: (1) User stories, (2) Test definitions, (3) Design doc, (4) Architecture doc
2041
- - ACCEPTABLE: Correctly identifies at least 3 mappings
2042
- - POOR: Uses same template for all or incorrect mappings
2043
-
2044
- ---
2045
-
2046
- ### tdd-002-story-format (Story 2: Story Format Selection)
2047
-
2048
- **Input:**
2049
-
2050
- > I'm writing a user story for a login feature. Should I use "As a user..." or "Given I am..."?
2051
-
2052
- **Context files:**
2053
-
2054
- - `framework/guides/tdd-templates.md`
2055
-
2056
- **Expected:** Agent recommends appropriate format based on context
2057
-
2058
- **Rubric:**
2059
-
2060
- - EXCELLENT: Recommends standard "As a..." for features, Given-When-Then for behavior-focused
2061
- - ACCEPTABLE: Explains both formats
2062
- - POOR: No guidance on format selection
2063
-
2064
- ---
2065
-
2066
- ### tdd-003-acceptance-criteria (Story 3: Story Acceptance Criteria and Scope)
2067
-
2068
- **Input:**
2069
-
2070
- > My user story has 8 acceptance criteria and no out-of-scope section. Is this okay?
2071
-
2072
- **Context files:**
2073
-
2074
- - `framework/guides/tdd-templates.md`
2075
-
2076
- **Expected:** Agent suggests reducing AC and adding out-of-scope
2077
-
2078
- **Rubric:**
2079
-
2080
- - EXCELLENT: Suggests 2-5 AC, recommends adding out-of-scope to prevent creep
2081
- - ACCEPTABLE: Notes AC count is high
2082
- - POOR: Accepts 8 AC without comment
2083
-
2084
- ---
2085
-
2086
- ### tdd-004-story-anti-patterns (Story 4: Block Story Anti-Patterns)
2087
-
2088
- **Input:**
2089
-
2090
- > Here's my user story: "As a developer, I want to refactor the database layer so that the code is cleaner"
2091
-
2092
- **Context files:**
2093
-
2094
- - `framework/guides/tdd-templates.md`
2095
-
2096
- **Expected:** Agent identifies anti-pattern
2097
-
2098
- **Rubric:**
2099
-
2100
- - EXCELLENT: Identifies as technical task (not user story), suggests spike or task instead
2101
- - ACCEPTABLE: Notes it's too technical
2102
- - POOR: Accepts implementation-focused "story"
2103
-
2104
- ---
2105
-
2106
- ### tdd-005-test-definitions (Story 5: Create Test Definitions per Feature)
2107
-
2108
- **Input:**
2109
-
2110
- > I'm creating test definitions. What sections should I include?
2111
-
2112
- **Context files:**
2113
-
2114
- - `framework/guides/tdd-templates.md`
2115
-
2116
- **Expected:** Agent lists required sections
2117
-
2118
- **Rubric:**
2119
-
2120
- - EXCELLENT: Suites, individual tests, status per test, coverage summary, execution commands
2121
- - ACCEPTABLE: Lists most sections
2122
- - POOR: Vague or incomplete list
2123
-
2124
- ---
2125
-
2126
- ### tdd-006-good-story-examples (Story 6: GOOD Story Examples)
2127
-
2128
- **Input:**
2129
-
2130
- > Can you show me what a good user story looks like for a web app feature?
2131
-
2132
- **Context files:**
2133
-
2134
- - `framework/guides/tdd-templates.md`
2135
-
2136
- **Expected:** Agent provides concrete example
2137
-
2138
- **Rubric:**
2139
-
2140
- - EXCELLENT: Shows complete example with role, want, so that, AC (specific/testable), out-of-scope
2141
- - ACCEPTABLE: Shows basic structure
2142
- - POOR: Vague or incomplete example
2143
-
2144
- ---
2145
-
2146
- ### tdd-007-bad-story-examples (Story 7: BAD Story Examples)
2147
-
2148
- **Input:**
2149
-
2150
- > Is this a good story? "As a user, I want the app to work better so that I'm happy"
2151
-
2152
- **Context files:**
2153
-
2154
- - `framework/guides/tdd-templates.md`
2155
-
2156
- **Expected:** Agent identifies anti-patterns
2157
-
2158
- **Rubric:**
2159
-
2160
- - EXCELLENT: Identifies all issues (vague role, unmeasurable "work better", no AC)
2161
- - ACCEPTABLE: Identifies at least 2 issues
2162
- - POOR: Accepts vague story
2163
-
2164
- ---
2165
-
2166
- ### tdd-008-invest-criteria (Story 8: INVEST Criteria)
2167
-
2168
- **Input:**
2169
-
2170
- > How do I know if my user story is good enough?
2171
-
2172
- **Context files:**
2173
-
2174
- - `framework/guides/tdd-templates.md`
2175
-
2176
- **Expected:** Agent explains INVEST criteria
2177
-
2178
- **Rubric:**
2179
-
2180
- - EXCELLENT: Explains Independent, Negotiable, Valuable, Estimable, Small, Testable
2181
- - ACCEPTABLE: Mentions several INVEST criteria
2182
- - POOR: No structured validation criteria
2183
-
2184
- ---
2185
-
2186
- ### tdd-009-test-definition-format (Story 9: Test Definition Format)
2187
-
2188
- **Input:**
2189
-
2190
- > How should I format individual tests in my test definitions?
2191
-
2192
- **Context files:**
2193
-
2194
- - `framework/guides/tdd-templates.md`
2195
-
2196
- **Expected:** Agent shows test format
2197
-
2198
- **Rubric:**
2199
-
2200
- - EXCELLENT: Shows numbered format with description, status, steps, expected outcome
2201
- - ACCEPTABLE: Shows basic format
2202
- - POOR: Vague or no format guidance
2203
-
2204
- ---
2205
-
2206
- ### tdd-010-test-status-tracking (Story 10: Test Status Tracking)
2207
-
2208
- **Input:**
2209
-
2210
- > What status indicators should I use for tests?
2211
-
2212
- **Context files:**
2213
-
2214
- - `framework/guides/tdd-templates.md`
2215
-
2216
- **Expected:** Agent lists status indicators
2217
-
2218
- **Rubric:**
2219
-
2220
- - EXCELLENT: ✅ Passing, ⏭️ Skipped (with rationale), ❌ Not Implemented, 🔴 Failing
2221
- - ACCEPTABLE: Lists most statuses
2222
- - POOR: Inconsistent or missing statuses
2223
-
2224
- ---
2225
-
2226
- ### tdd-011-coverage-summary (Story 11: Coverage Summary)
2227
-
2228
- **Input:**
2229
-
2230
- > Should I include a coverage summary in my test definitions?
2231
-
2232
- **Context files:**
2233
-
2234
- - `framework/guides/tdd-templates.md`
2235
-
2236
- **Expected:** Agent recommends coverage summary
2237
-
2238
- **Rubric:**
2239
-
2240
- - EXCELLENT: Yes, with totals, percentages per status, rationale for skipped
2241
- - ACCEPTABLE: Recommends summary
2242
- - POOR: No guidance on coverage tracking
2243
-
2244
- ---
2245
-
2246
- ### tdd-012-test-data-builders (Story 12: Test Data Builders)
2247
-
2248
- **Input:**
2249
-
2250
- > I'm writing tests that need complex test data. How should I structure this?
2251
-
2252
- **Context files:**
2253
-
2254
- - `framework/guides/tdd-templates.md`
2255
-
2256
- **Expected:** Agent recommends test data builders
2257
-
2258
- **Rubric:**
2259
-
2260
- - EXCELLENT: Recommends builder pattern with defaults, explains benefits
2261
- - ACCEPTABLE: Suggests organizing test data
2262
- - POOR: No guidance on test data
2263
-
2264
- ---
2265
-
2266
- ### tdd-013-llm-as-judge (Story 13: LLM-as-Judge Rubrics)
2267
-
2268
- **Input:**
2269
-
2270
- > I need to test if my AI's narrative response has the right tone. How?
2271
-
2272
- **Context files:**
2273
-
2274
- - `framework/guides/tdd-templates.md`
2275
-
2276
- **Expected:** Agent recommends LLM-as-judge with rubric
2277
-
2278
- **Rubric:**
2279
-
2280
- - EXCELLENT: LLM-as-judge with EXCELLENT/ACCEPTABLE/POOR rubric, avoid keyword matching
2281
- - ACCEPTABLE: Suggests rubric-based evaluation
2282
- - POOR: Suggests keyword matching
2283
-
2284
- ---
2285
-
2286
- ### tdd-014-real-llm-integration (Story 14: Integration with Real LLM)
2287
-
2288
- **Input:**
2289
-
2290
- > Should my integration tests use a real LLM or mock it?
2291
-
2292
- **Context files:**
2293
-
2294
- - `framework/guides/tdd-templates.md`
2295
-
2296
- **Expected:** Agent provides guidance on real vs mock
2297
-
2298
- **Rubric:**
2299
-
2300
- - EXCELLENT: Real LLM for schema compliance, mock for unit tests, cost considerations
2301
- - ACCEPTABLE: Distinguishes use cases
2302
- - POOR: No guidance on when to use real vs mock
2303
-
2304
- ---
2305
-
2306
- ### tdd-015-invest-gate (Story 15: INVEST Gate for Stories)
2307
-
2308
- **Input:**
2309
-
2310
- > My story is too big to estimate. What should I do?
2311
-
2312
- **Context files:**
2313
-
2314
- - `framework/guides/tdd-templates.md`
2315
-
2316
- **Expected:** Agent suggests splitting
2317
-
2318
- **Rubric:**
2319
-
2320
- - EXCELLENT: Cites INVEST (Estimable, Small), suggests splitting into smaller stories
2321
- - ACCEPTABLE: Suggests splitting
2322
- - POOR: Accepts large story
2323
-
2324
- ---
2325
-
2326
- ### tdd-016-red-flags (Story 16: Red Flags and Ratios)
2327
-
2328
- **Input:**
2329
-
2330
- > I have 50 E2E tests and 20 unit tests. Is this a good ratio?
2331
-
2332
- **Context files:**
2333
-
2334
- - `framework/guides/tdd-templates.md`
2335
-
2336
- **Expected:** Agent identifies red flag
2337
-
2338
- **Rubric:**
2339
-
2340
- - EXCELLENT: Red flag - more E2E than unit is inverted pyramid, suggests adding unit tests
2341
- - ACCEPTABLE: Notes ratio concern
2342
- - POOR: Accepts inverted ratio
2343
-
2344
- ---
2345
-
2346
- ## test-definitions-guide.md
2347
-
2348
- ### testdef-001-template (Story 1: Use Standard Template)
2349
-
2350
- **Input:**
2351
-
2352
- > I need to create test definitions for a new feature. Where do I start?
2353
-
2354
- **Context files:**
2355
-
2356
- - `framework/guides/test-definitions-guide.md`
2357
-
2358
- **Expected:** Agent points to template and workflow
2359
-
2360
- **Rubric:**
2361
-
2362
- - EXCELLENT: Points to template, lists 8 steps (fill in feature name, organize into suites, etc.)
2363
- - ACCEPTABLE: Points to template
2364
- - POOR: No template reference
2365
-
2366
- ---
2367
-
2368
- ### testdef-002-suites (Story 2: Organize Tests into Suites)
2369
-
2370
- **Input:**
2371
-
2372
- > I have 15 tests for a feature. How should I organize them?
2373
-
2374
- **Context files:**
2375
-
2376
- - `framework/guides/test-definitions-guide.md`
2377
-
2378
- **Expected:** Agent suggests suite organization
2379
-
2380
- **Rubric:**
2381
-
2382
- - EXCELLENT: Suggests suites (Layout, Interactions, State, Accessibility, Edge Cases), numbered tests
2383
- - ACCEPTABLE: Suggests grouping logically
2384
- - POOR: No organization guidance
2385
-
2386
- ---
2387
-
2388
- ### testdef-003-status (Story 3: Track Test Status)
2389
-
2390
- **Input:**
2391
-
2392
- > What status indicators should I use for my tests?
2393
-
2394
- **Context files:**
2395
-
2396
- - `framework/guides/test-definitions-guide.md`
2397
-
2398
- **Expected:** Agent lists status indicators
2399
-
2400
- **Rubric:**
2401
-
2402
- - EXCELLENT: ✅ Passing, ⏭️ Skipped (with rationale), ❌ Not Implemented, 🔴 Failing
2403
- - ACCEPTABLE: Lists most statuses
2404
- - POOR: Inconsistent statuses
2405
-
2406
- ---
2407
-
2408
- ### testdef-004-steps (Story 4: Write Clear Steps)
2409
-
2410
- **Input:**
2411
-
2412
- > My test step says "Check panes". Is this good enough?
2413
-
2414
- **Context files:**
2415
-
2416
- - `framework/guides/test-definitions-guide.md`
2417
-
2418
- **Expected:** Agent identifies vague step
2419
-
2420
- **Rubric:**
2421
-
2422
- - EXCELLENT: Identifies as BAD (vague), shows GOOD example with numbered actionable steps
2423
- - ACCEPTABLE: Notes it's too vague
2424
- - POOR: Accepts vague step
2425
-
2426
- ---
2427
-
2428
- ### testdef-005-expected (Story 5: Define Specific Expected Outcomes)
2429
-
2430
- **Input:**
2431
-
2432
- > My expected outcome says "Everything works". Is this okay?
2433
-
2434
- **Context files:**
2435
-
2436
- - `framework/guides/test-definitions-guide.md`
2437
-
2438
- **Expected:** Agent identifies vague outcome
2439
-
2440
- **Rubric:**
2441
-
2442
- - EXCELLENT: Identifies as BAD, shows GOOD example with specific measurable assertions
2443
- - ACCEPTABLE: Notes it's too vague
2444
- - POOR: Accepts vague outcome
2445
-
2446
- ---
2447
-
2448
- ### testdef-006-coverage (Story 6: Coverage Summary)
2449
-
2450
- **Input:**
2451
-
2452
- > Should I include a coverage summary in my test definitions?
2453
-
2454
- **Context files:**
2455
-
2456
- - `framework/guides/test-definitions-guide.md`
2457
-
2458
- **Expected:** Agent recommends coverage summary
2459
-
2460
- **Rubric:**
2461
-
2462
- - EXCELLENT: Yes, with totals, percentages per status, rationale for skipped
2463
- - ACCEPTABLE: Recommends summary
2464
- - POOR: No guidance
2465
-
2466
- ---
2467
-
2468
- ### testdef-007-naming (Story 7: Test Naming)
2469
-
2470
- **Input:**
2471
-
2472
- > I named my test "Test 1". Is this okay?
2473
-
2474
- **Context files:**
2475
-
2476
- - `framework/guides/test-definitions-guide.md`
2477
-
2478
- **Expected:** Agent identifies bad naming
2479
-
2480
- **Rubric:**
2481
-
2482
- - EXCELLENT: Identifies as BAD, suggests descriptive name like "Render all three panes"
2483
- - ACCEPTABLE: Notes name is not descriptive
2484
- - POOR: Accepts "Test 1"
2485
-
2486
- ---
2487
-
2488
- ### testdef-008-commands (Story 8: Test Execution Commands)
2489
-
2490
- **Input:**
2491
-
2492
- > What should I include in the test execution section?
2493
-
2494
- **Context files:**
2495
-
2496
- - `framework/guides/test-definitions-guide.md`
2497
-
2498
- **Expected:** Agent lists command requirements
2499
-
2500
- **Rubric:**
2501
-
2502
- - EXCELLENT: Commands to run all tests, grep for specific test, match project tooling
2503
- - ACCEPTABLE: Suggests including commands
2504
- - POOR: No command guidance
2505
-
2506
- ---
2507
-
2508
- ### testdef-009-tdd-workflow (Story 9: TDD Workflow Integration)
2509
-
2510
- **Input:**
2511
-
2512
- > When should I create test definitions?
2513
-
2514
- **Context files:**
2515
-
2516
- - `framework/guides/test-definitions-guide.md`
2517
-
2518
- **Expected:** Agent explains TDD timing
2519
-
2520
- **Rubric:**
2521
-
2522
- - EXCELLENT: Before implementation, alongside user stories, update status as tests pass/fail
2523
- - ACCEPTABLE: Mentions before implementation
2524
- - POOR: No timing guidance
2525
-
2526
- ---
2527
-
2528
- ### testdef-010-user-story-mapping (Story 10: Map to User Stories)
2529
-
2530
- **Input:**
2531
-
2532
- > How do I connect my tests to user stories?
2533
-
2534
- **Context files:**
2535
-
2536
- - `framework/guides/test-definitions-guide.md`
2537
-
2538
- **Expected:** Agent explains mapping
2539
-
2540
- **Rubric:**
2541
-
2542
- - EXCELLENT: Each AC has at least one test, edge cases beyond AC, test file references
2543
- - ACCEPTABLE: Suggests mapping to AC
2544
- - POOR: No mapping guidance
2545
-
2546
- ---
2547
-
2548
- ### testdef-011-anti-patterns (Story 11: Avoid Common Mistakes)
2549
-
2550
- **Input:**
2551
-
2552
- > My test verifies "useUIStore hook works correctly". Is this a good test?
2553
-
2554
- **Context files:**
2555
-
2556
- - `framework/guides/test-definitions-guide.md`
2557
-
2558
- **Expected:** Agent identifies anti-pattern
2559
-
2560
- **Rubric:**
2561
-
2562
- - EXCELLENT: Identifies as BAD (implementation detail), suggests testing observable behavior
2563
- - ACCEPTABLE: Notes it's testing implementation
2564
- - POOR: Accepts implementation detail test
2565
-
2566
- ---
2567
-
2568
- ### testdef-012-llm-optimized (Story 12: Apply LLM Instruction Design)
2569
-
2570
- **Input:**
2571
-
2572
- > How do I make my test definitions LLM-friendly?
2573
-
2574
- **Context files:**
2575
-
2576
- - `framework/guides/test-definitions-guide.md`
2577
-
2578
- **Expected:** Agent provides LLM optimization guidance
2579
-
2580
- **Rubric:**
2581
-
2582
- - EXCELLENT: MECE decision trees, explicit definitions, concrete examples, actionable language
2583
- - ACCEPTABLE: Mentions clarity principles
2584
- - POOR: No LLM-specific guidance
2585
-
2586
- ---
2587
-
2588
- ## testing-methodology.md
2589
-
2590
- ### test-001-fastest-effective (Story 1: Fastest-Effective Test Rule)
2591
-
2592
- **Input:**
2593
-
2594
- > I need to test a discount calculation function. Should I use E2E or unit tests?
2595
-
2596
- **Context files:**
2597
-
2598
- - `framework/guides/testing-methodology.md`
2599
-
2600
- **Expected:** Agent recommends unit test (fastest)
2601
-
2602
- **Rubric:**
2603
-
2604
- - EXCELLENT: Unit test (pure function, milliseconds vs seconds), shows BAD E2E vs GOOD unit example
2605
- - ACCEPTABLE: Recommends unit test
2606
- - POOR: Suggests E2E for calculation
2607
-
2608
- ---
2609
-
2610
- ### test-002-component-vs-flow (Story 2: Component vs Flow Testing)
2611
-
2612
- **Input:**
2613
-
2614
- > I want to test a React header component. Should I use E2E or integration tests?
2615
-
2616
- **Context files:**
2617
-
2618
- - `framework/guides/testing-methodology.md`
2619
-
2620
- **Expected:** Agent recommends integration test for component
2621
-
2622
- **Rubric:**
2623
-
2624
- - EXCELLENT: Integration test for component behavior, E2E only for multi-page flows
2625
- - ACCEPTABLE: Distinguishes component vs flow
2626
- - POOR: Suggests E2E for single component
2627
-
2628
- ---
2629
-
2630
- ### test-003-distribution (Story 3: Target Distribution Guidance)
2631
-
2632
- **Input:**
2633
-
2634
- > I have 50 E2E tests and 20 integration tests. Is this a good ratio?
2635
-
2636
- **Context files:**
2637
-
2638
- - `framework/guides/testing-methodology.md`
2639
-
2640
- **Expected:** Agent identifies red flag
2641
-
2642
- **Rubric:**
2643
-
2644
- - EXCELLENT: Red flag - more E2E than integration is too slow, suggests adding integration tests
2645
- - ACCEPTABLE: Notes ratio concern
2646
- - POOR: Accepts inverted ratio
2647
-
2648
- ---
2649
-
2650
- ### test-004-tdd-phases (Story 4: TDD Phases with Guardrails)
2651
-
2652
- **Input:**
2653
-
2654
- > I wrote a test and it's passing. Should I implement the code now?
2655
-
2656
- **Context files:**
2657
-
2658
- - `framework/guides/testing-methodology.md`
2659
-
2660
- **Expected:** Agent identifies TDD violation
2661
-
2662
- **Rubric:**
2663
-
2664
- - EXCELLENT: RED phase violation - test must fail first, verify failure before implementation
2665
- - ACCEPTABLE: Notes test should fail first
2666
- - POOR: Accepts passing test before implementation
2667
-
2668
- ---
2669
-
2670
- ### test-005-decision-tree (Story 5: Test Type Decision Tree)
2671
-
2672
- **Input:**
2673
-
2674
- > I need to test narrative quality from my AI. What test type should I use?
2675
-
2676
- **Context files:**
2677
-
2678
- - `framework/guides/testing-methodology.md`
2679
-
2680
- **Expected:** Agent uses decision tree, selects LLM Eval
2681
-
2682
- **Rubric:**
2683
-
2684
- - EXCELLENT: Question 1 → AI content quality → LLM Evaluation
2685
- - ACCEPTABLE: Selects LLM Eval
2686
- - POOR: Suggests unit or E2E for AI quality
2687
-
2688
- ---
2689
-
2690
- ### test-006-bug-mapping (Story 6: Bug-to-Test Mapping Table)
2691
-
2692
- **Input:**
2693
-
2694
- > I have a CSS layout bug. What test type should I use?
2695
-
2696
- **Context files:**
2697
-
2698
- - `framework/guides/testing-methodology.md`
2699
-
2700
- **Expected:** Agent maps to E2E
2701
-
2702
- **Rubric:**
2703
-
2704
- - EXCELLENT: E2E (requires real browser for CSS), references lookup table
2705
- - ACCEPTABLE: Selects E2E
2706
- - POOR: Suggests unit test for CSS
2707
-
2708
- ---
2709
-
2710
- ### test-007-e2e-isolation (Story 7: E2E Dev/Test Server Isolation)
2711
-
2712
- **Input:**
2713
-
2714
- > My E2E tests keep failing because they conflict with my dev server. How do I fix this?
2715
-
2716
- **Context files:**
2717
-
2718
- - `framework/guides/testing-methodology.md`
2719
-
2720
- **Expected:** Agent suggests port isolation
2721
-
2722
- **Rubric:**
2723
-
2724
- - EXCELLENT: Dev on stable port, tests on devPort+1000, Playwright config with isolated port
2725
- - ACCEPTABLE: Suggests separate ports
2726
- - POOR: No isolation guidance
2727
-
2728
- ---
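The port isolation rule in the rubric above (tests on devPort+1000) can be sketched as follows. This is a plain object, not a real Playwright import: the field names are modeled on Playwright's `webServer` and `use.baseURL` options, and the helper names and the `npm run dev` start command are hypothetical.

```typescript
// Offset taken from the rubric: tests run on devPort + 1000.
const TEST_PORT_OFFSET = 1000;

// Hypothetical helper: derive the isolated test port from the dev port.
function testPortFor(devPort: number): number {
  const port = devPort + TEST_PORT_OFFSET;
  if (!Number.isInteger(devPort) || devPort < 1 || port > 65535) {
    throw new RangeError(`cannot derive a test port from ${devPort}`);
  }
  return port;
}

// Plain object shaped like a Playwright config fragment (an assumption;
// check the field names against the Playwright docs before relying on them).
function isolatedE2EConfig(devPort: number) {
  const port = testPortFor(devPort);
  return {
    use: { baseURL: `http://localhost:${port}` },
    webServer: {
      command: `PORT=${port} npm run dev`, // hypothetical start command
      port,
      reuseExistingServer: false, // never attach to the developer's own server
    },
  };
}

const config = isolatedE2EConfig(3000);
console.log(config.use.baseURL);
```

Keeping the offset in one constant means the dev server, the test server, and any cleanup script can all derive the same pair of ports.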
2729
-
2730
- ### test-008-llm-evals (Story 8: LLM Evaluations Usage)
2731
-
2732
- **Input:**
2733
-
2734
- > Should I use keyword matching to test if my AI response has a "collaborative tone"?
2735
-
2736
- **Context files:**
2737
-
2738
- - `framework/guides/testing-methodology.md`
2739
-
2740
- **Expected:** Agent recommends LLM-as-judge
2741
-
2742
- **Rubric:**
2743
-
2744
- - EXCELLENT: LLM-as-judge with rubric, avoid brittle keywords for creative outputs
2745
- - ACCEPTABLE: Suggests rubric-based evaluation
2746
- - POOR: Accepts keyword matching
2747
-
2748
- ---
2749
-
2750
- ### test-009-cost-controls (Story 9: Cost Controls for Evals)
2751
-
2752
- **Input:**
2753
-
2754
- > My LLM evals are getting expensive. How can I reduce costs?
2755
-
2756
- **Context files:**
2757
-
2758
- - `framework/guides/testing-methodology.md`
2759
-
2760
- **Expected:** Agent provides cost reduction strategies
2761
-
2762
- **Rubric:**
2763
-
2764
- - EXCELLENT: Cache static prompts, batch scenarios, schedule full evals (PR/weekly)
2765
- - ACCEPTABLE: Mentions caching
2766
- - POOR: No cost guidance
2767
-
2768
- ---
2769
-
2770
- ### test-010-coverage-goals (Story 10: Coverage Goals and Critical Paths)
2771
-
2772
- **Input:**
2773
-
2774
- > What should I aim for in test coverage?
2775
-
2776
- **Context files:**
2777
-
2778
- - `framework/guides/testing-methodology.md`
2779
-
2780
- **Expected:** Agent provides coverage guidance
2781
-
2782
- **Rubric:**
2783
-
2784
- - EXCELLENT: Unit 80%+ for pure functions, E2E for critical multi-page flows, defines "critical"
2785
- - ACCEPTABLE: Provides coverage targets
2786
- - POOR: Generic "100% coverage" advice
2787
-
2788
- ---
2789
-
2790
- ### test-011-quality-practices (Story 11: Test Quality Practices)
2791
-
2792
- **Input:**
2793
-
2794
- > My tests keep failing randomly. What should I check?
2795
-
2796
- **Context files:**
2797
-
2798
- - `framework/guides/testing-methodology.md`
2799
-
2800
- **Expected:** Agent identifies flakiness causes
2801
-
2802
- **Rubric:**
2803
-
2804
- - EXCELLENT: Check async (polling vs arbitrary timeouts), independent tests, AAA pattern
2805
- - ACCEPTABLE: Mentions async issues
2806
- - POOR: Suggests skipping flaky tests
2807
-
2808
- ---
2809
-
2810
- ### test-012-ci-cadence (Story 12: CI/CD Testing Cadence)
2811
-
2812
- **Input:**
2813
-
2814
- > When should I run different test types in CI?
2815
-
2816
- **Context files:**
2817
-
2818
- - `framework/guides/testing-methodology.md`
2819
-
2820
- **Expected:** Agent provides CI cadence
2821
-
2822
- **Rubric:**
2823
-
2824
- - EXCELLENT: Unit+integration every commit, E2E on PR, evals on schedule
2825
- - ACCEPTABLE: Distinguishes cadence by test type
2826
- - POOR: Run all tests on every commit
2827
-
2828
- ---
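The cadence in the EXCELLENT rubric above can be written down as a small lookup. The trigger and suite names are hypothetical, and treating the slower triggers as cumulative (a PR also runs unit and integration tests) is an assumption; the rubric only says which trigger each test type belongs to.

```typescript
type Trigger = "commit" | "pull_request" | "schedule";
type Suite = "unit" | "integration" | "e2e" | "llm-eval";

// Map a CI trigger to the suites it should run.
// Assumption: slower triggers also run the faster suites.
function suitesFor(trigger: Trigger): Suite[] {
  switch (trigger) {
    case "commit":
      return ["unit", "integration"];
    case "pull_request":
      return ["unit", "integration", "e2e"];
    case "schedule":
      return ["unit", "integration", "e2e", "llm-eval"];
  }
}

console.log(suitesFor("pull_request").join(", "));
```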
2829
-
2830
- ### test-013-project-doc (Story 13: Project-Specific Testing Doc)
2831
-
2832
- **Input:**
2833
-
2834
- > Where should I document my project's testing setup?
2835
-
2836
- **Context files:**
2837
-
2838
- - `framework/guides/testing-methodology.md`
2839
-
2840
- **Expected:** Agent points to tests/SAFEWORD.md
2841
-
2842
- **Rubric:**
2843
-
2844
- - EXCELLENT: tests/SAFEWORD.md with stack, commands, patterns, config
2845
- - ACCEPTABLE: Suggests documentation location
2846
- - POOR: No documentation guidance
2847
-
2848
- ---
2849
-
2850
- ## user-story-guide.md
2851
-
2852
- ### story-001-template (Story 1: Use Standard Template)
2853
-
2854
- **Input:**
2855
-
2856
- > I need to create user stories for a new feature. Where do I start?
2857
-
2858
- **Context files:**
2859
-
2860
- - `framework/guides/user-story-guide.md`
2861
-
2862
- **Expected:** Agent points to template and workflow
2863
-
2864
- **Rubric:**
2865
-
2866
- - EXCELLENT: Points to template, lists 7 steps (fill in feature name, create numbered stories, etc.)
2867
- - ACCEPTABLE: Points to template
2868
- - POOR: No template reference
2869
-
2870
- ---
2871
-
2872
- ### story-002-tracking (Story 2: Include Tracking Metadata)
2873
-
2874
- **Input:**
2875
-
2876
- > What metadata should I include in my user stories?
2877
-
2878
- **Context files:**
2879
-
2880
- - `framework/guides/user-story-guide.md`
2881
-
2882
- **Expected:** Agent lists required metadata
2883
-
2884
- **Rubric:**
2885
-
2886
- - EXCELLENT: Status (✅/❌), test file refs, completion %, phase tracking, next steps
2887
- - ACCEPTABLE: Lists most metadata
2888
- - POOR: No metadata guidance
2889
-
2890
- ---
2891
-
2892
- ### story-003-invest (Story 3: INVEST Validation Gate)
2893
-
2894
- **Input:**
2895
-
2896
- > How do I know if my user story is ready to implement?
2897
-
2898
- **Context files:**
2899
-
2900
- - `framework/guides/user-story-guide.md`
2901
-
2902
- **Expected:** Agent explains INVEST criteria
2903
-
2904
- **Rubric:**
2905
-
2906
- - EXCELLENT: Independent, Negotiable, Valuable, Estimable, Small, Testable - refine if any fail
2907
- - ACCEPTABLE: Mentions several INVEST criteria
2908
- - POOR: No validation criteria
2909
-
2910
- ---
2911
-
2912
- ### story-004-good-ac (Story 4: Write Good Acceptance Criteria)
2913
-
2914
- **Input:**
2915
-
2916
- > My acceptance criterion says "Campaign switching works". Is this good?
2917
-
2918
- **Context files:**
2919
-
2920
- - `framework/guides/user-story-guide.md`
2921
-
2922
- **Expected:** Agent identifies vague AC
2923
-
2924
- **Rubric:**
2925
-
2926
- - EXCELLENT: Identifies as BAD (too vague), suggests specific measurable AC like "Response time <200ms"
2927
- - ACCEPTABLE: Notes it's too vague
2928
- - POOR: Accepts vague AC
2929
-
2930
- ---
2931
-
2932
- ### story-005-size (Story 5: Size Guidelines Enforcement)
2933
-
2934
- **Input:**
2935
-
2936
- > My user story has 8 acceptance criteria. Is this okay?
2937
-
2938
- **Context files:**
2939
-
2940
- - `framework/guides/user-story-guide.md`
2941
-
2942
- **Expected:** Agent identifies story is too big
2943
-
2944
- **Rubric:**
2945
-
2946
- - EXCELLENT: Too big (6+ AC), suggests splitting into multiple stories, target 1-5 AC
2947
- - ACCEPTABLE: Notes it should be split
2948
- - POOR: Accepts 8 AC
2949
-
2950
- ---
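The size thresholds in the rubric above (target 1-5 AC, split at 6+) could be checked mechanically. This is a minimal sketch: the function name and verdict labels are hypothetical, and treating zero criteria as its own case is an assumption.

```typescript
type SizeVerdict = "missing-criteria" | "ok" | "too-big";

// Classify a story by its number of acceptance criteria (AC).
function storySizeVerdict(acceptanceCriteria: string[]): SizeVerdict {
  const count = acceptanceCriteria.length;
  if (count === 0) {
    return "missing-criteria"; // assumption: the rubric targets at least 1 AC
  }
  if (count <= 5) {
    return "ok"; // within the 1-5 target
  }
  return "too-big"; // 6+ AC: split into multiple stories
}

const eightCriteria = ["a", "b", "c", "d", "e", "f", "g", "h"];
console.log(storySizeVerdict(eightCriteria));
```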
2951
-
2952
- ### story-006-examples (Story 6: Good/Bad Examples Reference)
2953
-
2954
- **Input:**
2955
-
2956
- > Can you show me what a good user story looks like?
2957
-
2958
- **Context files:**
2959
-
2960
- - `framework/guides/user-story-guide.md`
2961
-
2962
- **Expected:** Agent provides concrete example
2963
-
2964
- **Rubric:**
2965
-
2966
- - EXCELLENT: Shows complete example with As a/I want/So that, 1-5 specific AC
2967
- - ACCEPTABLE: Shows basic structure
2968
- - POOR: Vague or incomplete example
2969
-
2970
- ---
2971
-
2972
- ### story-007-conversation (Story 7: Conversation, Not Contract)
2973
-
2974
- **Input:**
2975
-
2976
- > Should I include all implementation details in my user story?
2977
-
2978
- **Context files:**
2979
-
2980
- - `framework/guides/user-story-guide.md`
2981
-
2982
- **Expected:** Agent advises against implementation details
2983
-
2984
- **Rubric:**
2985
-
2986
- - EXCELLENT: No - stories are conversation starters, avoid implementation details, link to mockups
2987
- - ACCEPTABLE: Advises against implementation details
2988
- - POOR: Suggests including implementation details
2989
-
2990
- ---
2991
-
2992
- ### story-008-llm-wording (Story 8: LLM-Optimized Wording)
2993
-
2994
- **Input:**
2995
-
2996
- > How do I write user stories that AI agents can follow?
2997
-
2998
- **Context files:**
2999
-
3000
- - `framework/guides/user-story-guide.md`
3001
-
3002
- **Expected:** Agent provides LLM optimization guidance
3003
-
3004
- **Rubric:**
3005
-
3006
- - EXCELLENT: Specific concrete language, numbers, explicit definitions, examples over rules
3007
- - ACCEPTABLE: Mentions clarity principles
3008
- - POOR: No LLM-specific guidance
3009
-
3010
- ---
3011
-
3012
- ### story-009-token-efficiency (Story 9: Token Efficiency)
3013
-
3014
- **Input:**
3015
-
3016
- > How long should my user story template be?
3017
-
3018
- **Context files:**
3019
-
3020
- - `framework/guides/user-story-guide.md`
3021
-
3022
- **Expected:** Agent provides size guidance
3023
-
3024
- **Rubric:**
3025
-
3026
- - EXCELLENT: Keep lean (~9 lines), minimize overhead for prompting cost
3027
- - ACCEPTABLE: Suggests keeping it concise
3028
- - POOR: No size guidance
3029
-
3030
- ---
3031
-
3032
- ### story-010-technical-tasks (Story 10: Technical Tasks vs Stories)
3033
-
3034
- **Input:**
3035
-
3036
- > I want to write a user story: "As a developer, I want to refactor the database layer"
3037
-
3038
- **Context files:**
3039
-
3040
- - `framework/guides/user-story-guide.md`
3041
-
3042
- **Expected:** Agent identifies this as technical task
3043
-
3044
- **Rubric:**
3045
-
3046
- - EXCELLENT: This is a technical task/spike, not a user story - no user value
3047
- - ACCEPTABLE: Notes it lacks user value
3048
- - POOR: Accepts technical task as user story
3049
-
3050
- ---
3051
-
3052
- ## zombie-process-cleanup.md
3053
-
3054
- ### zombie-001-port-cleanup (Story 1: Prefer Port-Based Cleanup)
3055
-
3056
- **Input:**
3057
-
3058
- > My dev server is stuck on port 3000. How do I kill it safely?
3059
-
3060
- **Context files:**
3061
-
3062
- - `framework/guides/zombie-process-cleanup.md`
3063
-
3064
- **Expected:** Agent provides port-based cleanup
3065
-
3066
- **Rubric:**
3067
-
3068
- - EXCELLENT: `lsof -ti:3000 -ti:4000 | xargs kill -9` (both dev and test ports), explains why port-based is safe for multi-project
3069
- - ACCEPTABLE: Provides kill command for at least dev port
3070
- - POOR: Suggests `killall node`
3071
-
3072
- ---
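The command in the EXCELLENT rubric above follows a regular pattern, one `-ti:<port>` selector per port. The sketch below only builds that command string and does not execute anything; the function name is hypothetical.

```typescript
// Build the port-based cleanup command for a set of ports.
// Only the string is produced here; running it is left to the caller.
function portCleanupCommand(ports: number[]): string {
  if (ports.length === 0) {
    throw new Error("at least one port is required");
  }
  for (const port of ports) {
    if (!Number.isInteger(port) || port < 1 || port > 65535) {
      throw new RangeError(`invalid port: ${port}`);
    }
  }
  const selectors = ports.map((port) => `-ti:${port}`).join(" ");
  return `lsof ${selectors} | xargs kill -9`;
}

console.log(portCleanupCommand([3000, 4000]));
```

Because the command targets only processes listening on the listed ports, it leaves other projects' processes alone, which is the contrast with `killall node` that the rubric draws.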
3073
-
3074
- ### zombie-002-cleanup-script (Story 2: Project-Specific Cleanup Script)
3075
-
3076
- **Input:**
3077
-
3078
- > I need to clean up processes frequently. Should I create a script?
3079
-
3080
- **Context files:**
3081
-
3082
- - `framework/guides/zombie-process-cleanup.md`
3083
-
3084
- **Expected:** Agent recommends cleanup script
3085
-
3086
- **Rubric:**
3087
-
3088
- - EXCELLENT: Yes, create scripts/cleanup.sh with DEV_PORT, TEST_PORT (dev+1000), and PROJECT_DIR variables
3089
- - ACCEPTABLE: Suggests creating script
3090
- - POOR: No script guidance
3091
-
3092
- ---
3093
-
3094
- ### zombie-003-unique-ports (Story 3: Unique Port Assignment)
3095
-
3096
- **Input:**
3097
-
3098
- > I'm working on multiple projects. How do I avoid port conflicts?
3099
-
3100
- **Context files:**
3101
-
3102
- - `framework/guides/zombie-process-cleanup.md`
3103
-
3104
- **Expected:** Agent recommends unique ports
3105
-
3106
- **Rubric:**
3107
-
3108
- - EXCELLENT: Assign unique PORT per project (3000, 3001), document in README/env
3109
- - ACCEPTABLE: Suggests unique ports
3110
- - POOR: No port guidance
3111
-
3112
- ---
3113
-
3114
- ### zombie-004-tmux-isolation (Story 4: tmux/Screen Isolation)
3115
-
3116
- **Input:**
3117
-
3118
- > Is there a way to isolate terminal sessions per project?
3119
-
3120
- **Context files:**
3121
-
3122
- - `framework/guides/zombie-process-cleanup.md`
3123
-
3124
- **Expected:** Agent suggests tmux/screen
3125
-
3126
- **Rubric:**
3127
-
3128
- - EXCELLENT: Named tmux session per project, one command kills session, notes learning curve
3129
- - ACCEPTABLE: Suggests terminal isolation
3130
- - POOR: No isolation guidance
3131
-
3132
- ---
3133
-
3134
- ### zombie-005-debugging (Story 5: Debugging Zombie Processes)
3135
-
3136
- **Input:**
3137
-
3138
- > How do I find which processes are stuck?
3139
-
3140
- **Context files:**
3141
-
3142
- - `framework/guides/zombie-process-cleanup.md`
3143
-
3144
- **Expected:** Agent provides debugging commands
3145
-
3146
- **Rubric:**
3147
-
3148
- - EXCELLENT: Find by port, by process type, by project dir with $(pwd) pattern
3149
- - ACCEPTABLE: Provides find commands
3150
- - POOR: Generic advice
3151
-
3152
- ---
3153
-
3154
- ### zombie-006-best-practices (Story 6: Best Practices)
3155
-
3156
- **Input:**
3157
-
3158
- > What are the best practices for avoiding cross-project process kills?
3159
-
3160
- **Context files:**
3161
-
3162
- - `framework/guides/zombie-process-cleanup.md`
3163
-
3164
- **Expected:** Agent provides best practices
3165
-
3166
- **Rubric:**
3167
-
3168
- - EXCELLENT: Unique ports, port-based cleanup first, cleanup scripts, clean before start
3169
- - ACCEPTABLE: Lists some practices
3170
- - POOR: No best practices
3171
-
3172
- ---
3173
-
3174
- ### zombie-007-quick-reference (Story 7: Quick Reference)
3175
-
3176
- **Input:**
3177
-
3178
- > Give me a quick reference for safe cleanup commands.
3179
-
3180
- **Context files:**
3181
-
3182
- - `framework/guides/zombie-process-cleanup.md`
3183
-
3184
- **Expected:** Agent provides quick reference
3185
-
3186
- **Rubric:**
3187
-
3188
- - EXCELLENT: Kill by both dev+test ports (`$DEV_PORT`/`$TEST_PORT`), kill playwright for project, full cleanup script, warn against global kills
3189
- - ACCEPTABLE: Provides commands
3190
- - POOR: Suggests dangerous global kills
3191
-
3192
- ---
3193
-
3194
- ## code-philosophy.md (legacy tests)
3195
-
3196
- ### phil-legacy-model-levels (Story 3: Model at Three Levels - legacy test)
3197
-
3198
- **Input:**
3199
-
3200
- > Document the data model for a user management system
3201
-
3202
- **Context files:**
3203
-
3204
- - `framework/SAFEWORD.md`
3205
- - `framework/guides/data-architecture-guide.md`
3206
-
3207
- **Expected:** Output includes conceptual, logical, and physical model levels
3208
-
3209
- **Rubric:**
3210
-
3211
- - EXCELLENT: All 3 levels (conceptual entities, logical attributes/relationships, physical storage+WHY)
3212
- - ACCEPTABLE: 2 of 3 levels present
3213
- - POOR: Only 1 level or missing WHY for storage choice
3214
-
3215
- ---
3216
-
3217
- ## Adding New Tests
3218
-
3219
- When evaluating a new user story:
3220
-
3221
- 1. Identify testable behavior from the story
3222
- 2. Create test ID: `{prefix}-{num}-{slug}`
3223
- 3. Write input prompt that exercises the behavior
3224
- 4. Define rubric with EXCELLENT/ACCEPTABLE/POOR
3225
- 5. Add to this file under the appropriate guide section
3226
- 6. Update summary table
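Steps 2 to 4 above can be sketched in code. The `EvalTest` shape and the `makeTestId` helper are hypothetical; the three-digit zero padding is inferred from the IDs that appear in this file (for example `test-001-fastest-effective`) and is not a stated rule. The sketch assumes an ES2017+ target for `padStart`.

```typescript
type Grade = "EXCELLENT" | "ACCEPTABLE" | "POOR";

// Hypothetical record mirroring the fields each test entry uses in this file.
interface EvalTest {
  id: string;
  input: string;
  contextFiles: string[];
  expected: string;
  rubric: Record<Grade, string>;
}

// Build an ID of the form {prefix}-{num}-{slug}.
function makeTestId(prefix: string, num: number, title: string): string {
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric into one hyphen
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
  if (!/^[a-z]+$/.test(prefix) || !Number.isInteger(num) || num < 1 || slug === "") {
    throw new Error("prefix must be lowercase letters, num a positive integer, title non-empty");
  }
  return `${prefix}-${String(num).padStart(3, "0")}-${slug}`;
}

// Values copied from the test-001 entry earlier in this file.
const example: EvalTest = {
  id: makeTestId("test", 1, "Fastest Effective"),
  input: "I need to test a discount calculation function. Should I use E2E or unit tests?",
  contextFiles: ["framework/guides/testing-methodology.md"],
  expected: "Agent recommends unit test (fastest)",
  rubric: {
    EXCELLENT: "Unit test (pure function, milliseconds vs seconds), shows BAD E2E vs GOOD unit example",
    ACCEPTABLE: "Recommends unit test",
    POOR: "Suggests E2E for calculation",
  },
};
console.log(example.id);
```

Typing the rubric as `Record<Grade, string>` makes the compiler reject an entry that omits any of the three grades.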