safeword 0.2.4 → 0.2.5

Files changed (235)
  1. package/dist/check-3NGQ4NR5.js +129 -0
  2. package/dist/check-3NGQ4NR5.js.map +1 -0
  3. package/dist/chunk-2XWIUEQK.js +190 -0
  4. package/dist/chunk-2XWIUEQK.js.map +1 -0
  5. package/dist/chunk-GZRQL3SX.js +146 -0
  6. package/dist/chunk-GZRQL3SX.js.map +1 -0
  7. package/dist/chunk-ORQHKDT2.js +10 -0
  8. package/dist/chunk-ORQHKDT2.js.map +1 -0
  9. package/dist/chunk-W66Z3C5H.js +21 -0
  10. package/dist/chunk-W66Z3C5H.js.map +1 -0
  11. package/dist/cli.d.ts +1 -0
  12. package/dist/cli.js +34 -0
  13. package/dist/cli.js.map +1 -0
  14. package/dist/diff-Y6QTAW4O.js +166 -0
  15. package/dist/diff-Y6QTAW4O.js.map +1 -0
  16. package/dist/index.d.ts +11 -0
  17. package/dist/index.js +7 -0
  18. package/dist/index.js.map +1 -0
  19. package/dist/reset-3ACTIYYE.js +143 -0
  20. package/dist/reset-3ACTIYYE.js.map +1 -0
  21. package/dist/setup-RR4M334C.js +266 -0
  22. package/dist/setup-RR4M334C.js.map +1 -0
  23. package/dist/upgrade-6AR3DHUV.js +134 -0
  24. package/dist/upgrade-6AR3DHUV.js.map +1 -0
  25. package/package.json +44 -19
  26. package/{.safeword → templates}/hooks/agents-md-check.sh +0 -0
  27. package/{.safeword → templates}/hooks/post-tool.sh +0 -0
  28. package/{.safeword → templates}/hooks/pre-commit.sh +0 -0
  29. package/.claude/commands/arch-review.md +0 -32
  30. package/.claude/commands/lint.md +0 -6
  31. package/.claude/commands/quality-review.md +0 -13
  32. package/.claude/commands/setup-linting.md +0 -6
  33. package/.claude/hooks/auto-lint.sh +0 -6
  34. package/.claude/hooks/auto-quality-review.sh +0 -170
  35. package/.claude/hooks/check-linting-sync.sh +0 -17
  36. package/.claude/hooks/inject-timestamp.sh +0 -6
  37. package/.claude/hooks/question-protocol.sh +0 -12
  38. package/.claude/hooks/run-linters.sh +0 -8
  39. package/.claude/hooks/run-quality-review.sh +0 -76
  40. package/.claude/hooks/version-check.sh +0 -10
  41. package/.claude/mcp/README.md +0 -96
  42. package/.claude/mcp/arcade.sample.json +0 -9
  43. package/.claude/mcp/context7.sample.json +0 -7
  44. package/.claude/mcp/playwright.sample.json +0 -7
  45. package/.claude/settings.json +0 -62
  46. package/.claude/skills/quality-reviewer/SKILL.md +0 -190
  47. package/.claude/skills/safeword-quality-reviewer/SKILL.md +0 -13
  48. package/.env.arcade.example +0 -4
  49. package/.env.example +0 -11
  50. package/.gitmodules +0 -4
  51. package/.safeword/SAFEWORD.md +0 -33
  52. package/.safeword/eslint/eslint-base.mjs +0 -101
  53. package/.safeword/guides/architecture-guide.md +0 -404
  54. package/.safeword/guides/code-philosophy.md +0 -174
  55. package/.safeword/guides/context-files-guide.md +0 -405
  56. package/.safeword/guides/data-architecture-guide.md +0 -183
  57. package/.safeword/guides/design-doc-guide.md +0 -165
  58. package/.safeword/guides/learning-extraction.md +0 -515
  59. package/.safeword/guides/llm-instruction-design.md +0 -239
  60. package/.safeword/guides/llm-prompting.md +0 -95
  61. package/.safeword/guides/tdd-best-practices.md +0 -570
  62. package/.safeword/guides/test-definitions-guide.md +0 -243
  63. package/.safeword/guides/testing-methodology.md +0 -573
  64. package/.safeword/guides/user-story-guide.md +0 -237
  65. package/.safeword/guides/zombie-process-cleanup.md +0 -214
  66. package/.safeword/planning/002-user-story-quality-evaluation.md +0 -1840
  67. package/.safeword/planning/003-langsmith-eval-setup-prompt.md +0 -363
  68. package/.safeword/planning/004-llm-eval-test-cases.md +0 -3226
  69. package/.safeword/planning/005-architecture-enforcement-system.md +0 -169
  70. package/.safeword/planning/006-reactive-fix-prevention-research.md +0 -135
  71. package/.safeword/planning/011-cli-ux-vision.md +0 -330
  72. package/.safeword/planning/012-project-structure-cleanup.md +0 -154
  73. package/.safeword/planning/README.md +0 -39
  74. package/.safeword/planning/automation-plan-v2.md +0 -1225
  75. package/.safeword/planning/automation-plan-v3.md +0 -1291
  76. package/.safeword/planning/automation-plan.md +0 -3058
  77. package/.safeword/planning/design/005-cli-implementation.md +0 -343
  78. package/.safeword/planning/design/013-cli-self-contained-templates.md +0 -596
  79. package/.safeword/planning/design/013a-eslint-plugin-suite.md +0 -256
  80. package/.safeword/planning/design/013b-implementation-snippets.md +0 -385
  81. package/.safeword/planning/design/013c-config-isolation-strategy.md +0 -242
  82. package/.safeword/planning/design/code-philosophy-improvements.md +0 -60
  83. package/.safeword/planning/mcp-analysis.md +0 -545
  84. package/.safeword/planning/phase2-subagents-vs-skills-analysis.md +0 -451
  85. package/.safeword/planning/settings-improvements.md +0 -970
  86. package/.safeword/planning/test-definitions/005-cli-implementation.md +0 -1301
  87. package/.safeword/planning/test-definitions/cli-self-contained-templates.md +0 -205
  88. package/.safeword/planning/user-stories/001-guides-review-user-stories.md +0 -1381
  89. package/.safeword/planning/user-stories/003-reactive-fix-prevention.md +0 -132
  90. package/.safeword/planning/user-stories/004-technical-constraints.md +0 -86
  91. package/.safeword/planning/user-stories/005-cli-implementation.md +0 -311
  92. package/.safeword/planning/user-stories/cli-self-contained-templates.md +0 -172
  93. package/.safeword/planning/versioned-distribution.md +0 -740
  94. package/.safeword/prompts/arch-review.md +0 -43
  95. package/.safeword/prompts/quality-review.md +0 -11
  96. package/.safeword/scripts/arch-review.sh +0 -235
  97. package/.safeword/scripts/check-linting-sync.sh +0 -58
  98. package/.safeword/scripts/setup-linting.sh +0 -559
  99. package/.safeword/templates/architecture-template.md +0 -136
  100. package/.safeword/templates/ci/architecture-check.yml +0 -79
  101. package/.safeword/templates/design-doc-template.md +0 -127
  102. package/.safeword/templates/test-definitions-feature.md +0 -100
  103. package/.safeword/templates/ticket-template.md +0 -74
  104. package/.safeword/templates/user-stories-template.md +0 -82
  105. package/.safeword/tickets/001-guides-review-user-stories.md +0 -83
  106. package/.safeword/tickets/002-architecture-enforcement.md +0 -211
  107. package/.safeword/tickets/003-reactive-fix-prevention.md +0 -57
  108. package/.safeword/tickets/004-technical-constraints-in-user-stories.md +0 -39
  109. package/.safeword/tickets/005-cli-implementation.md +0 -248
  110. package/.safeword/tickets/006-flesh-out-skills.md +0 -43
  111. package/.safeword/tickets/007-flesh-out-questioning.md +0 -44
  112. package/.safeword/tickets/008-upgrade-questioning.md +0 -58
  113. package/.safeword/tickets/009-naming-conventions.md +0 -41
  114. package/.safeword/tickets/010-safeword-md-cleanup.md +0 -34
  115. package/.safeword/tickets/011-cursor-setup.md +0 -86
  116. package/.safeword/tickets/README.md +0 -73
  117. package/.safeword/version +0 -1
  118. package/AGENTS.md +0 -59
  119. package/CLAUDE.md +0 -12
  120. package/README.md +0 -347
  121. package/docs/001-cli-implementation-plan.md +0 -856
  122. package/docs/elite-dx-implementation-plan.md +0 -1034
  123. package/framework/README.md +0 -131
  124. package/framework/mcp/README.md +0 -96
  125. package/framework/mcp/arcade.sample.json +0 -8
  126. package/framework/mcp/context7.sample.json +0 -6
  127. package/framework/mcp/playwright.sample.json +0 -6
  128. package/framework/scripts/arch-review.sh +0 -235
  129. package/framework/scripts/check-linting-sync.sh +0 -58
  130. package/framework/scripts/load-env.sh +0 -49
  131. package/framework/scripts/setup-claude.sh +0 -223
  132. package/framework/scripts/setup-linting.sh +0 -559
  133. package/framework/scripts/setup-quality.sh +0 -477
  134. package/framework/scripts/setup-safeword.sh +0 -550
  135. package/framework/templates/ci/architecture-check.yml +0 -78
  136. package/learnings/ai-sdk-v5-breaking-changes.md +0 -178
  137. package/learnings/e2e-test-zombie-processes.md +0 -231
  138. package/learnings/milkdown-crepe-editor-property.md +0 -96
  139. package/learnings/prosemirror-fragment-traversal.md +0 -119
  140. package/packages/cli/AGENTS.md +0 -1
  141. package/packages/cli/ARCHITECTURE.md +0 -279
  142. package/packages/cli/package.json +0 -51
  143. package/packages/cli/src/cli.ts +0 -63
  144. package/packages/cli/src/commands/check.ts +0 -166
  145. package/packages/cli/src/commands/diff.ts +0 -209
  146. package/packages/cli/src/commands/reset.ts +0 -190
  147. package/packages/cli/src/commands/setup.ts +0 -325
  148. package/packages/cli/src/commands/upgrade.ts +0 -163
  149. package/packages/cli/src/index.ts +0 -3
  150. package/packages/cli/src/templates/config.ts +0 -58
  151. package/packages/cli/src/templates/content.ts +0 -18
  152. package/packages/cli/src/templates/index.ts +0 -12
  153. package/packages/cli/src/utils/agents-md.ts +0 -66
  154. package/packages/cli/src/utils/fs.ts +0 -179
  155. package/packages/cli/src/utils/git.ts +0 -124
  156. package/packages/cli/src/utils/hooks.ts +0 -29
  157. package/packages/cli/src/utils/output.ts +0 -60
  158. package/packages/cli/src/utils/project-detector.test.ts +0 -185
  159. package/packages/cli/src/utils/project-detector.ts +0 -44
  160. package/packages/cli/src/utils/version.ts +0 -28
  161. package/packages/cli/src/version.ts +0 -6
  162. package/packages/cli/templates/SAFEWORD.md +0 -776
  163. package/packages/cli/templates/doc-templates/architecture-template.md +0 -136
  164. package/packages/cli/templates/doc-templates/design-doc-template.md +0 -134
  165. package/packages/cli/templates/doc-templates/test-definitions-feature.md +0 -131
  166. package/packages/cli/templates/doc-templates/ticket-template.md +0 -82
  167. package/packages/cli/templates/doc-templates/user-stories-template.md +0 -92
  168. package/packages/cli/templates/guides/architecture-guide.md +0 -423
  169. package/packages/cli/templates/guides/code-philosophy.md +0 -195
  170. package/packages/cli/templates/guides/context-files-guide.md +0 -457
  171. package/packages/cli/templates/guides/data-architecture-guide.md +0 -200
  172. package/packages/cli/templates/guides/design-doc-guide.md +0 -171
  173. package/packages/cli/templates/guides/learning-extraction.md +0 -552
  174. package/packages/cli/templates/guides/llm-instruction-design.md +0 -248
  175. package/packages/cli/templates/guides/llm-prompting.md +0 -102
  176. package/packages/cli/templates/guides/tdd-best-practices.md +0 -615
  177. package/packages/cli/templates/guides/test-definitions-guide.md +0 -334
  178. package/packages/cli/templates/guides/testing-methodology.md +0 -618
  179. package/packages/cli/templates/guides/user-story-guide.md +0 -256
  180. package/packages/cli/templates/guides/zombie-process-cleanup.md +0 -219
  181. package/packages/cli/templates/hooks/agents-md-check.sh +0 -27
  182. package/packages/cli/templates/hooks/post-tool.sh +0 -4
  183. package/packages/cli/templates/hooks/pre-commit.sh +0 -10
  184. package/packages/cli/templates/prompts/arch-review.md +0 -43
  185. package/packages/cli/templates/prompts/quality-review.md +0 -10
  186. package/packages/cli/templates/skills/safeword-quality-reviewer/SKILL.md +0 -207
  187. package/packages/cli/tests/commands/check.test.ts +0 -129
  188. package/packages/cli/tests/commands/cli.test.ts +0 -89
  189. package/packages/cli/tests/commands/diff.test.ts +0 -115
  190. package/packages/cli/tests/commands/reset.test.ts +0 -310
  191. package/packages/cli/tests/commands/self-healing.test.ts +0 -170
  192. package/packages/cli/tests/commands/setup-blocking.test.ts +0 -71
  193. package/packages/cli/tests/commands/setup-core.test.ts +0 -135
  194. package/packages/cli/tests/commands/setup-git.test.ts +0 -139
  195. package/packages/cli/tests/commands/setup-hooks.test.ts +0 -334
  196. package/packages/cli/tests/commands/setup-linting.test.ts +0 -189
  197. package/packages/cli/tests/commands/setup-noninteractive.test.ts +0 -80
  198. package/packages/cli/tests/commands/setup-templates.test.ts +0 -181
  199. package/packages/cli/tests/commands/upgrade.test.ts +0 -215
  200. package/packages/cli/tests/helpers.ts +0 -243
  201. package/packages/cli/tests/npm-package.test.ts +0 -83
  202. package/packages/cli/tests/technical-constraints.test.ts +0 -96
  203. package/packages/cli/tsconfig.json +0 -25
  204. package/packages/cli/tsup.config.ts +0 -11
  205. package/packages/cli/vitest.config.ts +0 -23
  206. package/promptfoo.yaml +0 -3270
  207. /package/{framework → templates}/SAFEWORD.md +0 -0
  208. /package/{packages/cli/templates → templates}/commands/arch-review.md +0 -0
  209. /package/{packages/cli/templates → templates}/commands/lint.md +0 -0
  210. /package/{packages/cli/templates → templates}/commands/quality-review.md +0 -0
  211. /package/{framework/templates → templates/doc-templates}/architecture-template.md +0 -0
  212. /package/{framework/templates → templates/doc-templates}/design-doc-template.md +0 -0
  213. /package/{framework/templates → templates/doc-templates}/test-definitions-feature.md +0 -0
  214. /package/{framework/templates → templates/doc-templates}/ticket-template.md +0 -0
  215. /package/{framework/templates → templates/doc-templates}/user-stories-template.md +0 -0
  216. /package/{framework → templates}/guides/architecture-guide.md +0 -0
  217. /package/{framework → templates}/guides/code-philosophy.md +0 -0
  218. /package/{framework → templates}/guides/context-files-guide.md +0 -0
  219. /package/{framework → templates}/guides/data-architecture-guide.md +0 -0
  220. /package/{framework → templates}/guides/design-doc-guide.md +0 -0
  221. /package/{framework → templates}/guides/learning-extraction.md +0 -0
  222. /package/{framework → templates}/guides/llm-instruction-design.md +0 -0
  223. /package/{framework → templates}/guides/llm-prompting.md +0 -0
  224. /package/{framework → templates}/guides/tdd-best-practices.md +0 -0
  225. /package/{framework → templates}/guides/test-definitions-guide.md +0 -0
  226. /package/{framework → templates}/guides/testing-methodology.md +0 -0
  227. /package/{framework → templates}/guides/user-story-guide.md +0 -0
  228. /package/{framework → templates}/guides/zombie-process-cleanup.md +0 -0
  229. /package/{packages/cli/templates → templates}/hooks/inject-timestamp.sh +0 -0
  230. /package/{packages/cli/templates → templates}/lib/common.sh +0 -0
  231. /package/{packages/cli/templates → templates}/lib/jq-fallback.sh +0 -0
  232. /package/{packages/cli/templates → templates}/markdownlint.jsonc +0 -0
  233. /package/{framework → templates}/prompts/arch-review.md +0 -0
  234. /package/{framework → templates}/prompts/quality-review.md +0 -0
  235. /package/{framework/skills/quality-reviewer → templates/skills/safeword-quality-reviewer}/SKILL.md +0 -0
@@ -1,363 +0,0 @@
- # LangSmith Eval Setup Plan
-
- **Goal:** Test whether an LLM agent correctly follows SAFEWORD documentation/guides when given various scenarios.
-
- **Scale:** 140 user stories across 13 guides → expect 200+ test cases over time.
-
- ---
-
- ## Phase 1: Foundation (Do First)
-
- ### 1.1 Project Structure
-
- ```
- evals/
- ├── package.json  # Dependencies: langsmith, langchain, openai/anthropic SDK
- ├── tsconfig.json
- ├── .env.example  # LANGSMITH_API_KEY, OPENAI_API_KEY, etc.
- ├── src/
- │   ├── config.ts  # LangSmith project config, model selection
- │   ├── context-loader.ts  # Load SAFEWORD.md + guides as context
- │   ├── evaluators/
- │   │   ├── section-presence.ts  # Check if output has required sections
- │   │   ├── doc-type-decision.ts  # Check if correct doc type was chosen
- │   │   └── prerequisites.ts  # Check if prerequisites were verified
- │   └── runner.ts  # Execute evals, upload to LangSmith
- ├── datasets/
- │   ├── architecture-guide/  # Tests for architecture-guide.md
- │   │   ├── create-doc.json  # Test 1: Create Architecture Doc
- │   │   ├── tech-choice.json  # Test 2: Doc Type Decision - Tech Choice
- │   │   └── ...
- │   ├── design-doc-guide/  # Tests for design-doc-guide.md
- │   └── ... (one folder per guide)
- └── scripts/
-     ├── run-all.ts  # Run full eval suite
-     ├── run-guide.ts  # Run evals for one guide
-     └── upload-datasets.ts  # Sync datasets to LangSmith
- ```
-
- ### 1.2 Core Dependencies
-
- ```json
- {
-   "dependencies": {
-     "langsmith": "^0.1.0",
-     "openai": "^4.0.0",
-     "@anthropic-ai/sdk": "^0.30.0"
-   },
-   "devDependencies": {
-     "tsx": "^4.0.0",
-     "dotenv": "^16.0.0"
-   }
- }
- ```
-
- ### 1.3 Dataset Schema (Standard for All Tests)
-
- ```typescript
- interface TestCase {
-   id: string; // e.g., "arch-001-create-doc"
-   guide: string; // e.g., "architecture-guide"
-   story_id: number; // Maps to user story number
-   input: string; // User prompt
-   context_files: string[]; // Which files to load as context
-   expected_behavior: string; // What the agent should do
-   rubric: {
-     excellent: string;
-     acceptable: string;
-     poor: string;
-   };
- }
- ```
-
- ---
-
- ## Phase 2: Evaluator Patterns (Reusable)
-
- ### 2.1 Section Presence Evaluator
-
- Used by: Tests 1, 6 (any "Create X Doc" test)
-
- ```typescript
- // evaluators/section-presence.ts
- const evaluate = (output: string, requiredSections: string[]) => {
-   const found = requiredSections.filter(s => output.includes(s));
-   const score = found.length / requiredSections.length;
-
-   if (score === 1) return { score: 1, label: 'excellent' };
-   if (score >= 0.8) return { score: 0.8, label: 'acceptable' };
-   return { score: 0.4, label: 'poor' };
- };
- ```
-
- ### 2.2 Decision Evaluator
-
- Used by: Tests 2, 3, 5, 8, 10 (any "Which doc type?" test)
-
- ```typescript
- // evaluators/doc-type-decision.ts
- const evaluate = (output: string, expectedType: 'architecture' | 'design' | 'none') => {
-   const mentionsArch = /architecture\s*(doc|document)/i.test(output);
-   const mentionsDesign = /design\s*(doc|document)/i.test(output);
-
-   // Scoring logic based on expectedType
- };
- ```
-
- ### 2.3 Prerequisites Evaluator
-
- Used by: Tests 7 (any "Did agent check prerequisites?" test)
-
- ```typescript
- // evaluators/prerequisites.ts
- const evaluate = (output: string) => {
-   const asksAboutStories = /user stor(y|ies)/i.test(output);
-   const asksAboutTestDefs = /test definition/i.test(output);
-   const offersToCreate = /create|offer|would you like/i.test(output);
-
-   // Excellent = checks before creating
- };
- ```
-
- ### 2.4 LLM-as-Judge (General Purpose)
-
- Used by: Complex cases where regex isn't enough
-
- ```typescript
- // evaluators/llm-judge.ts
- const judgePrompt = `
- You are evaluating an AI coding assistant's response.
-
- Rubric:
- - EXCELLENT: {rubric.excellent}
- - ACCEPTABLE: {rubric.acceptable}
- - POOR: {rubric.poor}
-
- User Input: {input}
- Assistant Output: {output}
-
- Rate the response as EXCELLENT, ACCEPTABLE, or POOR. Explain briefly.
- `;
- ```
-
- ---
-
- ## Phase 3: Implementation Order
-
- ### Week 1: Setup + First 10 Tests
-
- | Task | Output |
- | ----------------------------------- | ------------------------- |
- | Create `evals/` directory structure | Scaffolding |
- | Install dependencies | `package.json` |
- | Create context loader | Load SAFEWORD.md + guides |
- | Create LangSmith project | `safeword-evals` project |
- | Implement Tests 1-10 from prompt | First dataset uploaded |
-
- ### Week 2: Evaluator Library
-
- | Task | Output |
- | -------------------------- | ----------------------------------- |
- | Section presence evaluator | Reusable for all "create doc" tests |
- | Decision evaluator | Reusable for all "which doc?" tests |
- | Prerequisites evaluator | Reusable for workflow tests |
- | LLM-as-judge wrapper | General fallback |
-
- ### Week 3+: Scale Up
-
- | Guide | # Stories | Priority |
- | ------------------------- | --------- | -------------- |
- | architecture-guide.md | 11 | HIGH (4 done) |
- | design-doc-guide.md | 10 | HIGH (partial) |
- | testing-methodology.md | 13 | MEDIUM |
- | tdd-templates.md | 16 | MEDIUM |
- | llm-instruction-design.md | 15 | LOW |
- | ... | ... | ... |
-
- ---
-
- ## Phase 4: CI Integration (Defer)
-
- ```yaml
- # .github/workflows/evals.yml
- name: Run Evals
- on:
-   push:
-     paths:
-       - 'framework/**' # Run when guides change
-   schedule:
-     - cron: '0 0 * * 0' # Weekly regression check
-
- jobs:
-   eval:
-     runs-on: ubuntu-latest
-     steps:
-       - uses: actions/checkout@v4
-       - run: cd evals && npm ci
-       - run: npm run eval:all
-         env:
-           LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
- ```
-
- ---
-
204
- ## Cost Controls
205
-
206
- | Control | Implementation |
207
- | ------------------- | ---------------------------------------------------------------- |
208
- | **Model selection** | Use GPT-4o-mini for judge (cheap), GPT-4o for target (expensive) |
209
- | **Sampling** | Run subset in CI, full suite on demand |
210
- | **Caching** | Cache context loading; SAFEWORD.md rarely changes |
211
- | **Budget alerts** | LangSmith has spend tracking |
212
-
213
- **Estimated costs (200 test cases):**
214
-
215
- - Full suite run: ~$5-10 (depends on output length)
216
- - CI weekly run: ~$50/month
217
-
218
- ---
219
-
220
- ## Naming Conventions
221
-
222
- ```
223
- Test ID format: {guide-prefix}-{story-num}-{test-slug}
224
-
225
- Examples:
226
- - arch-001-create-doc (architecture-guide, story 1)
227
- - arch-002-tech-choice (architecture-guide, story 3)
228
- - design-001-create-doc (design-doc-guide, story 2)
229
- - design-002-prereqs (design-doc-guide, story 1)
230
- ```
231
-
232
- | Guide | Prefix |
233
- | -------------------------- | ---------- |
234
- | architecture-guide.md | arch |
235
- | design-doc-guide.md | design |
236
- | testing-methodology.md | test |
237
- | tdd-templates.md | tdd |
238
- | code-philosophy.md | code |
239
- | context-files-guide.md | ctx |
240
- | data-architecture-guide.md | data |
241
- | learning-extraction.md | learn |
242
- | llm-instruction-design.md | llm-instr |
243
- | llm-prompting.md | llm-prompt |
244
- | test-definitions-guide.md | testdef |
245
- | user-story-guide.md | story |
246
- | zombie-process-cleanup.md | zombie |
247
-
248
- ---
249
-
250
- ## Success Criteria
251
-
252
- | Metric | Target |
253
- | ------------------ | ------------------------------------- |
254
- | Test coverage | ≥1 test per evaluated user story |
255
- | Pass rate baseline | Establish baseline, track regressions |
256
- | Cost per run | <$15 for full suite |
257
- | Run time | <10 min for full suite |
258
-
259
- ---
260
-
261
- ## Files to Create (Initial Implementation)
262
-
263
- 1. `evals/package.json` — Dependencies
264
- 2. `evals/src/config.ts` — LangSmith + model config
265
- 3. `evals/src/context-loader.ts` — Load guide files
266
- 4. `evals/src/evaluators/section-presence.ts` — First evaluator
267
- 5. `evals/src/runner.ts` — Execute tests
268
- 6. `evals/datasets/architecture-guide/create-doc.json` — First test case
269
- 7. `evals/scripts/run-all.ts` — Entry point
270
-
271
- ---
272
-
273
- ## Related Files
274
-
275
- - Evaluation plan: `.safeword/planning/002-user-story-quality-evaluation.md`
276
- - User stories source: `.safeword/planning/user-stories/001-guides-review-user-stories.md`
277
- - Original prompt: See "Prompt to Use" section below
278
-
279
- ---
280
-
281
- ## Prompt to Use in New Thread
282
-
283
- ```
284
- I want to set up LLM evaluations using LangSmith for my AI coding agent framework (SAFEWORD).
285
-
286
- **Goal:** Test whether an LLM agent correctly follows the documentation/guides when given various scenarios.
287
-
288
- **Tech context:**
289
- - Framework location: /Users/alex/projects/safeword/framework/
290
- - Guides location: framework/guides/
291
- - Main instruction file: framework/SAFEWORD.md
292
-
293
- **What I need:**
294
- 1. LangSmith project setup for evals
295
- 2. Dataset creation with test scenarios
296
- 3. Evaluator rubrics (LLM-as-judge)
297
- 4. Integration with CI (optional, can defer)
298
-
299
- **Test scenarios to implement:**
300
-
301
- ### Architecture Doc Tests (from architecture-guide.md)
302
-
303
- **Test 1: Create Architecture Doc**
304
- - Input: "Create an architecture doc for a new React + Supabase project"
305
- - Expected: Output contains all 10 required sections (Header, TOC, Overview, Data Principles, Data Model, Components, Data Flows, Key Decisions, Best Practices, Migration)
306
- - Rubric: EXCELLENT = all 10 sections with What/Why/Trade-off; ACCEPTABLE = 8+ sections; POOR = <8 sections
307
-
308
- **Test 2: Doc Type Decision - Tech Choice**
309
- - Input: "I need to document our decision to use PostgreSQL instead of MongoDB"
310
- - Expected: Agent chooses Architecture Doc (not Design Doc)
311
- - Rubric: EXCELLENT = correctly identifies Architecture Doc + explains why; POOR = suggests Design Doc
312
-
313
- **Test 3: Doc Type Decision - Feature**
314
- - Input: "I need to document how the user profile feature will work"
315
- - Expected: Agent chooses Design Doc (not Architecture Doc)
316
- - Rubric: EXCELLENT = correctly identifies Design Doc + checks for prerequisites; POOR = suggests Architecture Doc
317
-
318
- **Test 4: Decision Documentation**
319
- - Input: "Document our decision to use Redis for caching"
320
- - Expected: Output includes What, Why, Trade-off, Alternatives Considered
321
- - Rubric: EXCELLENT = all 4 fields with specifics; ACCEPTABLE = What/Why/Trade-off; POOR = missing Why or Trade-off
322
-
323
- **Test 5: Ambiguous Scenario - Tie-breaker**
324
- - Input: "I need to document adding a caching layer that will be used by multiple features"
325
- - Expected: Agent chooses Architecture Doc (affects 2+ features)
326
- - Rubric: EXCELLENT = Architecture Doc + cites tie-breaking rule; ACCEPTABLE = Architecture Doc; POOR = Design Doc
327
-
328
- ### Design Doc Tests (from design-doc-guide.md)
329
-
330
- **Test 6: Create Design Doc**
331
- - Input: "Create a design doc for a three-pane layout feature"
332
- - Expected: Output has required sections (Architecture, Components with [N]/[N+1], User Flow, Key Decisions)
333
- - Rubric: EXCELLENT = all required sections + references user stories/test defs; ACCEPTABLE = missing 1-2 optional sections; POOR = missing User Flow or Components
334
-
335
- **Test 7: Prerequisites Check**
336
- - Input: "Create a design doc for the payment flow feature" (assume no user stories exist)
337
- - Expected: Agent asks about or offers to create user stories first
338
- - Rubric: EXCELLENT = checks prerequisites before creating; POOR = creates design doc without checking
339
-
340
- **Test 8: Complexity Assessment**
341
- - Input: "Do I need a design doc for adding a logout button?"
342
- - Expected: Agent says no (simple, <3 components, single user story)
343
- - Rubric: EXCELLENT = correctly assesses as too simple + explains why; POOR = recommends design doc
344
-
345
- ### Edge Case Tests
346
-
347
- **Test 9: Scattered ADRs Migration**
348
- - Input: "Our project has 50 ADR files in docs/adr/. What should we do?"
349
- - Expected: Agent recommends consolidating into single ARCHITECTURE.md
350
- - Rubric: EXCELLENT = recommends consolidation + provides migration steps; POOR = suggests keeping ADRs
351
-
352
- **Test 10: Borderline Complexity**
353
- - Input: "I'm building a feature that touches exactly 3 components and has 2 user stories"
354
- - Expected: Agent recommends design doc (meets threshold)
355
- - Rubric: EXCELLENT = recommends design doc + cites complexity criteria; ACCEPTABLE = recommends design doc; POOR = says skip design doc
356
-
357
- **Evaluation approach:**
358
- - Use LLM-as-judge with Claude or GPT-4 as evaluator
359
- - Each test should have the SAFEWORD.md and relevant guide loaded as context
360
- - Track pass/fail rates over time to catch regressions
361
-
362
- Please help me set this up in LangSmith. Start by explaining the LangSmith concepts I need to know, then walk me through creating the first few test cases.
363
- ```
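
Note on section 2.4 of the removed file: the judge prompt is given only as a template with `{rubric.excellent}`, `{input}`, and `{output}` placeholders, and the file does not show how they are filled or how the judge's reply is read back. The following is a minimal TypeScript sketch of one way that could work, not code from either package version. The names `Rubric`, `Verdict`, `buildJudgePrompt`, and `parseVerdict` are hypothetical, and the rule of taking the first uppercase verdict keyword in the reply is an assumption.

```typescript
// Hypothetical sketch; none of these names appear in the removed file.
// Rubric mirrors the `rubric` field of the TestCase schema in section 1.3.
interface Rubric {
  excellent: string;
  acceptable: string;
  poor: string;
}

type Verdict = 'excellent' | 'acceptable' | 'poor';

// Fill the section 2.4 judge template with one test case's rubric,
// the user input, and the assistant output being graded.
function buildJudgePrompt(rubric: Rubric, input: string, output: string): string {
  return [
    "You are evaluating an AI coding assistant's response.",
    '',
    'Rubric:',
    `- EXCELLENT: ${rubric.excellent}`,
    `- ACCEPTABLE: ${rubric.acceptable}`,
    `- POOR: ${rubric.poor}`,
    '',
    `User Input: ${input}`,
    `Assistant Output: ${output}`,
    '',
    'Rate the response as EXCELLENT, ACCEPTABLE, or POOR. Explain briefly.',
  ].join('\n');
}

// Assumption: the judge states its rating as an uppercase keyword, and the
// first such keyword in the reply is the rating. Returns null if none is found.
function parseVerdict(reply: string): Verdict | null {
  const match = /\b(EXCELLENT|ACCEPTABLE|POOR)\b/.exec(reply);
  return match ? (match[0].toLowerCase() as Verdict) : null;
}
```

The keyword match is deliberately case-sensitive so that a lowercase "poor" inside the judge's explanation is not mistaken for the rating. A reply that names two ratings in uppercase would still be read as the first one, so a stricter output format may be preferable in practice.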