safeword 0.2.3 → 0.2.4
- package/.claude/commands/arch-review.md +32 -0
- package/.claude/commands/lint.md +6 -0
- package/.claude/commands/quality-review.md +13 -0
- package/.claude/commands/setup-linting.md +6 -0
- package/.claude/hooks/auto-lint.sh +6 -0
- package/.claude/hooks/auto-quality-review.sh +170 -0
- package/.claude/hooks/check-linting-sync.sh +17 -0
- package/.claude/hooks/inject-timestamp.sh +6 -0
- package/.claude/hooks/question-protocol.sh +12 -0
- package/.claude/hooks/run-linters.sh +8 -0
- package/.claude/hooks/run-quality-review.sh +76 -0
- package/.claude/hooks/version-check.sh +10 -0
- package/.claude/mcp/README.md +96 -0
- package/.claude/mcp/arcade.sample.json +9 -0
- package/.claude/mcp/context7.sample.json +7 -0
- package/.claude/mcp/playwright.sample.json +7 -0
- package/.claude/settings.json +62 -0
- package/.claude/skills/quality-reviewer/SKILL.md +190 -0
- package/.claude/skills/safeword-quality-reviewer/SKILL.md +13 -0
- package/.env.arcade.example +4 -0
- package/.env.example +11 -0
- package/.gitmodules +4 -0
- package/.safeword/SAFEWORD.md +33 -0
- package/.safeword/eslint/eslint-base.mjs +101 -0
- package/.safeword/guides/architecture-guide.md +404 -0
- package/.safeword/guides/code-philosophy.md +174 -0
- package/.safeword/guides/context-files-guide.md +405 -0
- package/.safeword/guides/data-architecture-guide.md +183 -0
- package/.safeword/guides/design-doc-guide.md +165 -0
- package/.safeword/guides/learning-extraction.md +515 -0
- package/.safeword/guides/llm-instruction-design.md +239 -0
- package/.safeword/guides/llm-prompting.md +95 -0
- package/.safeword/guides/tdd-best-practices.md +570 -0
- package/.safeword/guides/test-definitions-guide.md +243 -0
- package/.safeword/guides/testing-methodology.md +573 -0
- package/.safeword/guides/user-story-guide.md +237 -0
- package/.safeword/guides/zombie-process-cleanup.md +214 -0
- package/{templates → .safeword}/hooks/agents-md-check.sh +0 -0
- package/{templates → .safeword}/hooks/post-tool.sh +0 -0
- package/{templates → .safeword}/hooks/pre-commit.sh +0 -0
- package/.safeword/planning/002-user-story-quality-evaluation.md +1840 -0
- package/.safeword/planning/003-langsmith-eval-setup-prompt.md +363 -0
- package/.safeword/planning/004-llm-eval-test-cases.md +3226 -0
- package/.safeword/planning/005-architecture-enforcement-system.md +169 -0
- package/.safeword/planning/006-reactive-fix-prevention-research.md +135 -0
- package/.safeword/planning/011-cli-ux-vision.md +330 -0
- package/.safeword/planning/012-project-structure-cleanup.md +154 -0
- package/.safeword/planning/README.md +39 -0
- package/.safeword/planning/automation-plan-v2.md +1225 -0
- package/.safeword/planning/automation-plan-v3.md +1291 -0
- package/.safeword/planning/automation-plan.md +3058 -0
- package/.safeword/planning/design/005-cli-implementation.md +343 -0
- package/.safeword/planning/design/013-cli-self-contained-templates.md +596 -0
- package/.safeword/planning/design/013a-eslint-plugin-suite.md +256 -0
- package/.safeword/planning/design/013b-implementation-snippets.md +385 -0
- package/.safeword/planning/design/013c-config-isolation-strategy.md +242 -0
- package/.safeword/planning/design/code-philosophy-improvements.md +60 -0
- package/.safeword/planning/mcp-analysis.md +545 -0
- package/.safeword/planning/phase2-subagents-vs-skills-analysis.md +451 -0
- package/.safeword/planning/settings-improvements.md +970 -0
- package/.safeword/planning/test-definitions/005-cli-implementation.md +1301 -0
- package/.safeword/planning/test-definitions/cli-self-contained-templates.md +205 -0
- package/.safeword/planning/user-stories/001-guides-review-user-stories.md +1381 -0
- package/.safeword/planning/user-stories/003-reactive-fix-prevention.md +132 -0
- package/.safeword/planning/user-stories/004-technical-constraints.md +86 -0
- package/.safeword/planning/user-stories/005-cli-implementation.md +311 -0
- package/.safeword/planning/user-stories/cli-self-contained-templates.md +172 -0
- package/.safeword/planning/versioned-distribution.md +740 -0
- package/.safeword/prompts/arch-review.md +43 -0
- package/.safeword/prompts/quality-review.md +11 -0
- package/.safeword/scripts/arch-review.sh +235 -0
- package/.safeword/scripts/check-linting-sync.sh +58 -0
- package/.safeword/scripts/setup-linting.sh +559 -0
- package/.safeword/templates/architecture-template.md +136 -0
- package/.safeword/templates/ci/architecture-check.yml +79 -0
- package/.safeword/templates/design-doc-template.md +127 -0
- package/.safeword/templates/test-definitions-feature.md +100 -0
- package/.safeword/templates/ticket-template.md +74 -0
- package/.safeword/templates/user-stories-template.md +82 -0
- package/.safeword/tickets/001-guides-review-user-stories.md +83 -0
- package/.safeword/tickets/002-architecture-enforcement.md +211 -0
- package/.safeword/tickets/003-reactive-fix-prevention.md +57 -0
- package/.safeword/tickets/004-technical-constraints-in-user-stories.md +39 -0
- package/.safeword/tickets/005-cli-implementation.md +248 -0
- package/.safeword/tickets/006-flesh-out-skills.md +43 -0
- package/.safeword/tickets/007-flesh-out-questioning.md +44 -0
- package/.safeword/tickets/008-upgrade-questioning.md +58 -0
- package/.safeword/tickets/009-naming-conventions.md +41 -0
- package/.safeword/tickets/010-safeword-md-cleanup.md +34 -0
- package/.safeword/tickets/011-cursor-setup.md +86 -0
- package/.safeword/tickets/README.md +73 -0
- package/.safeword/version +1 -0
- package/AGENTS.md +59 -0
- package/CLAUDE.md +12 -0
- package/README.md +347 -0
- package/docs/001-cli-implementation-plan.md +856 -0
- package/docs/elite-dx-implementation-plan.md +1034 -0
- package/framework/README.md +131 -0
- package/framework/mcp/README.md +96 -0
- package/framework/mcp/arcade.sample.json +8 -0
- package/framework/mcp/context7.sample.json +6 -0
- package/framework/mcp/playwright.sample.json +6 -0
- package/framework/scripts/arch-review.sh +235 -0
- package/framework/scripts/check-linting-sync.sh +58 -0
- package/framework/scripts/load-env.sh +49 -0
- package/framework/scripts/setup-claude.sh +223 -0
- package/framework/scripts/setup-linting.sh +559 -0
- package/framework/scripts/setup-quality.sh +477 -0
- package/framework/scripts/setup-safeword.sh +550 -0
- package/framework/templates/ci/architecture-check.yml +78 -0
- package/learnings/ai-sdk-v5-breaking-changes.md +178 -0
- package/learnings/e2e-test-zombie-processes.md +231 -0
- package/learnings/milkdown-crepe-editor-property.md +96 -0
- package/learnings/prosemirror-fragment-traversal.md +119 -0
- package/package.json +19 -43
- package/packages/cli/AGENTS.md +1 -0
- package/packages/cli/ARCHITECTURE.md +279 -0
- package/packages/cli/package.json +51 -0
- package/packages/cli/src/cli.ts +63 -0
- package/packages/cli/src/commands/check.ts +166 -0
- package/packages/cli/src/commands/diff.ts +209 -0
- package/packages/cli/src/commands/reset.ts +190 -0
- package/packages/cli/src/commands/setup.ts +325 -0
- package/packages/cli/src/commands/upgrade.ts +163 -0
- package/packages/cli/src/index.ts +3 -0
- package/packages/cli/src/templates/config.ts +58 -0
- package/packages/cli/src/templates/content.ts +18 -0
- package/packages/cli/src/templates/index.ts +12 -0
- package/packages/cli/src/utils/agents-md.ts +66 -0
- package/packages/cli/src/utils/fs.ts +179 -0
- package/packages/cli/src/utils/git.ts +124 -0
- package/packages/cli/src/utils/hooks.ts +29 -0
- package/packages/cli/src/utils/output.ts +60 -0
- package/packages/cli/src/utils/project-detector.test.ts +185 -0
- package/packages/cli/src/utils/project-detector.ts +44 -0
- package/packages/cli/src/utils/version.ts +28 -0
- package/packages/cli/src/version.ts +6 -0
- package/packages/cli/templates/SAFEWORD.md +776 -0
- package/packages/cli/templates/doc-templates/architecture-template.md +136 -0
- package/packages/cli/templates/doc-templates/design-doc-template.md +134 -0
- package/packages/cli/templates/doc-templates/test-definitions-feature.md +131 -0
- package/packages/cli/templates/doc-templates/ticket-template.md +82 -0
- package/packages/cli/templates/doc-templates/user-stories-template.md +92 -0
- package/packages/cli/templates/guides/architecture-guide.md +423 -0
- package/packages/cli/templates/guides/code-philosophy.md +195 -0
- package/packages/cli/templates/guides/context-files-guide.md +457 -0
- package/packages/cli/templates/guides/data-architecture-guide.md +200 -0
- package/packages/cli/templates/guides/design-doc-guide.md +171 -0
- package/packages/cli/templates/guides/learning-extraction.md +552 -0
- package/packages/cli/templates/guides/llm-instruction-design.md +248 -0
- package/packages/cli/templates/guides/llm-prompting.md +102 -0
- package/packages/cli/templates/guides/tdd-best-practices.md +615 -0
- package/packages/cli/templates/guides/test-definitions-guide.md +334 -0
- package/packages/cli/templates/guides/testing-methodology.md +618 -0
- package/packages/cli/templates/guides/user-story-guide.md +256 -0
- package/packages/cli/templates/guides/zombie-process-cleanup.md +219 -0
- package/packages/cli/templates/hooks/agents-md-check.sh +27 -0
- package/packages/cli/templates/hooks/post-tool.sh +4 -0
- package/packages/cli/templates/hooks/pre-commit.sh +10 -0
- package/packages/cli/templates/prompts/arch-review.md +43 -0
- package/packages/cli/templates/prompts/quality-review.md +10 -0
- package/packages/cli/templates/skills/safeword-quality-reviewer/SKILL.md +207 -0
- package/packages/cli/tests/commands/check.test.ts +129 -0
- package/packages/cli/tests/commands/cli.test.ts +89 -0
- package/packages/cli/tests/commands/diff.test.ts +115 -0
- package/packages/cli/tests/commands/reset.test.ts +310 -0
- package/packages/cli/tests/commands/self-healing.test.ts +170 -0
- package/packages/cli/tests/commands/setup-blocking.test.ts +71 -0
- package/packages/cli/tests/commands/setup-core.test.ts +135 -0
- package/packages/cli/tests/commands/setup-git.test.ts +139 -0
- package/packages/cli/tests/commands/setup-hooks.test.ts +334 -0
- package/packages/cli/tests/commands/setup-linting.test.ts +189 -0
- package/packages/cli/tests/commands/setup-noninteractive.test.ts +80 -0
- package/packages/cli/tests/commands/setup-templates.test.ts +181 -0
- package/packages/cli/tests/commands/upgrade.test.ts +215 -0
- package/packages/cli/tests/helpers.ts +243 -0
- package/packages/cli/tests/npm-package.test.ts +83 -0
- package/packages/cli/tests/technical-constraints.test.ts +96 -0
- package/packages/cli/tsconfig.json +25 -0
- package/packages/cli/tsup.config.ts +11 -0
- package/packages/cli/vitest.config.ts +23 -0
- package/promptfoo.yaml +3270 -0
- package/dist/check-3NGQ4NR5.js +0 -129
- package/dist/check-3NGQ4NR5.js.map +0 -1
- package/dist/chunk-2XWIUEQK.js +0 -190
- package/dist/chunk-2XWIUEQK.js.map +0 -1
- package/dist/chunk-GZRQL3SX.js +0 -146
- package/dist/chunk-GZRQL3SX.js.map +0 -1
- package/dist/chunk-ORQHKDT2.js +0 -10
- package/dist/chunk-ORQHKDT2.js.map +0 -1
- package/dist/chunk-W66Z3C5H.js +0 -21
- package/dist/chunk-W66Z3C5H.js.map +0 -1
- package/dist/cli.d.ts +0 -1
- package/dist/cli.js +0 -34
- package/dist/cli.js.map +0 -1
- package/dist/diff-Y6QTAW4O.js +0 -166
- package/dist/diff-Y6QTAW4O.js.map +0 -1
- package/dist/index.d.ts +0 -11
- package/dist/index.js +0 -7
- package/dist/index.js.map +0 -1
- package/dist/reset-3ACTIYYE.js +0 -143
- package/dist/reset-3ACTIYYE.js.map +0 -1
- package/dist/setup-RR4M334C.js +0 -266
- package/dist/setup-RR4M334C.js.map +0 -1
- package/dist/upgrade-6AR3DHUV.js +0 -134
- package/dist/upgrade-6AR3DHUV.js.map +0 -1
- /package/{templates → framework}/SAFEWORD.md +0 -0
- /package/{templates → framework}/guides/architecture-guide.md +0 -0
- /package/{templates → framework}/guides/code-philosophy.md +0 -0
- /package/{templates → framework}/guides/context-files-guide.md +0 -0
- /package/{templates → framework}/guides/data-architecture-guide.md +0 -0
- /package/{templates → framework}/guides/design-doc-guide.md +0 -0
- /package/{templates → framework}/guides/learning-extraction.md +0 -0
- /package/{templates → framework}/guides/llm-instruction-design.md +0 -0
- /package/{templates → framework}/guides/llm-prompting.md +0 -0
- /package/{templates → framework}/guides/tdd-best-practices.md +0 -0
- /package/{templates → framework}/guides/test-definitions-guide.md +0 -0
- /package/{templates → framework}/guides/testing-methodology.md +0 -0
- /package/{templates → framework}/guides/user-story-guide.md +0 -0
- /package/{templates → framework}/guides/zombie-process-cleanup.md +0 -0
- /package/{templates → framework}/prompts/arch-review.md +0 -0
- /package/{templates → framework}/prompts/quality-review.md +0 -0
- /package/{templates/skills/safeword-quality-reviewer → framework/skills/quality-reviewer}/SKILL.md +0 -0
- /package/{templates/doc-templates → framework/templates}/architecture-template.md +0 -0
- /package/{templates/doc-templates → framework/templates}/design-doc-template.md +0 -0
- /package/{templates/doc-templates → framework/templates}/test-definitions-feature.md +0 -0
- /package/{templates/doc-templates → framework/templates}/ticket-template.md +0 -0
- /package/{templates/doc-templates → framework/templates}/user-stories-template.md +0 -0
- /package/{templates → packages/cli/templates}/commands/arch-review.md +0 -0
- /package/{templates → packages/cli/templates}/commands/lint.md +0 -0
- /package/{templates → packages/cli/templates}/commands/quality-review.md +0 -0
- /package/{templates → packages/cli/templates}/hooks/inject-timestamp.sh +0 -0
- /package/{templates → packages/cli/templates}/lib/common.sh +0 -0
- /package/{templates → packages/cli/templates}/lib/jq-fallback.sh +0 -0
- /package/{templates → packages/cli/templates}/markdownlint.jsonc +0 -0
@@ -0,0 +1,363 @@

# LangSmith Eval Setup Plan

**Goal:** Test whether an LLM agent correctly follows SAFEWORD documentation/guides when given various scenarios.

**Scale:** 140 user stories across 13 guides → expect 200+ test cases over time.

---

## Phase 1: Foundation (Do First)

### 1.1 Project Structure

```
evals/
├── package.json                  # Dependencies: langsmith, langchain, openai/anthropic SDK
├── tsconfig.json
├── .env.example                  # LANGSMITH_API_KEY, OPENAI_API_KEY, etc.
├── src/
│   ├── config.ts                 # LangSmith project config, model selection
│   ├── context-loader.ts         # Load SAFEWORD.md + guides as context
│   ├── evaluators/
│   │   ├── section-presence.ts   # Check if output has required sections
│   │   ├── doc-type-decision.ts  # Check if correct doc type was chosen
│   │   └── prerequisites.ts      # Check if prerequisites were verified
│   └── runner.ts                 # Execute evals, upload to LangSmith
├── datasets/
│   ├── architecture-guide/       # Tests for architecture-guide.md
│   │   ├── create-doc.json       # Test 1: Create Architecture Doc
│   │   ├── tech-choice.json      # Test 2: Doc Type Decision - Tech Choice
│   │   └── ...
│   ├── design-doc-guide/         # Tests for design-doc-guide.md
│   └── ... (one folder per guide)
└── scripts/
    ├── run-all.ts                # Run full eval suite
    ├── run-guide.ts              # Run evals for one guide
    └── upload-datasets.ts        # Sync datasets to LangSmith
```

### 1.2 Core Dependencies

```json
{
  "dependencies": {
    "langsmith": "^0.1.0",
    "openai": "^4.0.0",
    "@anthropic-ai/sdk": "^0.30.0"
  },
  "devDependencies": {
    "tsx": "^4.0.0",
    "dotenv": "^16.0.0"
  }
}
```

### 1.3 Dataset Schema (Standard for All Tests)

```typescript
interface TestCase {
  id: string; // e.g., "arch-001-create-doc"
  guide: string; // e.g., "architecture-guide"
  story_id: number; // Maps to user story number
  input: string; // User prompt
  context_files: string[]; // Which files to load as context
  expected_behavior: string; // What the agent should do
  rubric: {
    excellent: string;
    acceptable: string;
    poor: string;
  };
}
```
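
As an illustration of the schema above, a dataset file could be checked at load time before it is uploaded. The sketch below is hypothetical: `isTestCase` is not part of the plan, and the example record is only modelled on Test 1 from the prompt, with context paths assumed from the `framework/` layout mentioned there.

```typescript
// Hypothetical runtime check for the TestCase shape (not from the plan).
interface TestCase {
  id: string;
  guide: string;
  story_id: number;
  input: string;
  context_files: string[];
  expected_behavior: string;
  rubric: { excellent: string; acceptable: string; poor: string };
}

const isString = (v: unknown): v is string => typeof v === 'string';

function isTestCase(value: unknown): value is TestCase {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Partial<Record<keyof TestCase, unknown>>;

  // The rubric must be an object carrying all three grade descriptions.
  const rubric = v.rubric;
  if (typeof rubric !== 'object' || rubric === null) return false;
  const r = rubric as Partial<Record<'excellent' | 'acceptable' | 'poor', unknown>>;

  const files = v.context_files;
  return (
    isString(v.id) &&
    isString(v.guide) &&
    typeof v.story_id === 'number' &&
    isString(v.input) &&
    Array.isArray(files) &&
    files.every(isString) &&
    isString(v.expected_behavior) &&
    isString(r.excellent) &&
    isString(r.acceptable) &&
    isString(r.poor)
  );
}

// Example record modelled on Test 1; the context file paths are assumptions.
const exampleCase: unknown = JSON.parse(`{
  "id": "arch-001-create-doc",
  "guide": "architecture-guide",
  "story_id": 1,
  "input": "Create an architecture doc for a new React + Supabase project",
  "context_files": ["framework/SAFEWORD.md", "framework/guides/architecture-guide.md"],
  "expected_behavior": "Output contains all 10 required sections",
  "rubric": {
    "excellent": "All 10 sections with What/Why/Trade-off",
    "acceptable": "8+ sections",
    "poor": "<8 sections"
  }
}`);

console.log(isTestCase(exampleCase));
```

A check like this would let `upload-datasets.ts` reject malformed files early, though the plan itself does not specify any validation step.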

---

## Phase 2: Evaluator Patterns (Reusable)

### 2.1 Section Presence Evaluator

Used by: Tests 1, 6 (any "Create X Doc" test)

```typescript
// evaluators/section-presence.ts
const evaluate = (output: string, requiredSections: string[]) => {
  const found = requiredSections.filter(s => output.includes(s));
  const score = found.length / requiredSections.length;

  if (score === 1) return { score: 1, label: 'excellent' };
  if (score >= 0.8) return { score: 0.8, label: 'acceptable' };
  return { score: 0.4, label: 'poor' };
};
```
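
To make the thresholds concrete, here is a small usage sketch applying the same scoring to the ten sections named in Test 1. The `missing` field and the name `evaluateSections` are additions for illustration, not part of the plan; matching stays a case-sensitive substring check, as in the evaluator above.

```typescript
type Verdict = {
  score: number;
  label: 'excellent' | 'acceptable' | 'poor';
  missing: string[]; // illustrative addition: which sections were not found
};

// Same thresholds as the evaluator above, restated so the sketch is self-contained.
const evaluateSections = (output: string, requiredSections: string[]): Verdict => {
  const missing = requiredSections.filter((s) => !output.includes(s));
  const score = (requiredSections.length - missing.length) / requiredSections.length;

  if (score === 1) return { score: 1, label: 'excellent', missing };
  if (score >= 0.8) return { score: 0.8, label: 'acceptable', missing };
  return { score: 0.4, label: 'poor', missing };
};

// The ten required sections listed in Test 1.
const requiredSections = [
  'Header', 'TOC', 'Overview', 'Data Principles', 'Data Model',
  'Components', 'Data Flows', 'Key Decisions', 'Best Practices', 'Migration',
];

// A made-up agent output that covers eight of the ten sections.
const sampleOutput = [
  '## Overview', '## Data Principles', '## Data Model', '## Components',
  '## Data Flows', '## Key Decisions', '## Best Practices', '## Migration',
].join('\n');

const result = evaluateSections(sampleOutput, requiredSections);
console.log(result.label, result.missing);
```

Eight of ten sections gives a ratio of 0.8, which lands in the `acceptable` band and matches the "8+ sections" rubric in Test 1.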

### 2.2 Decision Evaluator

Used by: Tests 2, 3, 5, 8, 10 (any "Which doc type?" test)

```typescript
// evaluators/doc-type-decision.ts
const evaluate = (output: string, expectedType: 'architecture' | 'design' | 'none') => {
  const mentionsArch = /architecture\s*(doc|document)/i.test(output);
  const mentionsDesign = /design\s*(doc|document)/i.test(output);

  // Scoring logic based on expectedType
};
```
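
The scoring logic is left open in the plan. One possible way to fill it in is sketched below; every rule here is an assumption: the explanation and negation regexes, the choice to downgrade responses that mention both doc types, and the reuse of the 1 / 0.8 / 0.4 scores from the section presence evaluator.

```typescript
type DocType = 'architecture' | 'design' | 'none';
type Verdict = { score: number; label: 'excellent' | 'acceptable' | 'poor' };

const EXCELLENT: Verdict = { score: 1, label: 'excellent' };
const ACCEPTABLE: Verdict = { score: 0.8, label: 'acceptable' };
const POOR: Verdict = { score: 0.4, label: 'poor' };

const evaluateDecision = (output: string, expectedType: DocType): Verdict => {
  const mentionsArch = /architecture\s*(doc|document)/i.test(output);
  const mentionsDesign = /design\s*(doc|document)/i.test(output);
  // Assumed heuristic: a causal connective suggests the agent explained its choice.
  const explains = /\b(because|since|due to)\b/i.test(output);

  if (expectedType === 'none') {
    // Assumed heuristic: the agent must say a doc is not needed.
    const declines = /\b(no need|don't need|do not need|not needed|not necessary|skip)\b/i.test(output);
    if (!declines) return POOR;
    return explains ? EXCELLENT : ACCEPTABLE;
  }

  const mentionsExpected = expectedType === 'architecture' ? mentionsArch : mentionsDesign;
  const mentionsOther = expectedType === 'architecture' ? mentionsDesign : mentionsArch;

  if (!mentionsExpected) return POOR;
  // Both types mentioned: regex cannot tell which one was recommended, so stay conservative.
  if (mentionsOther) return ACCEPTABLE;
  return explains ? EXCELLENT : ACCEPTABLE;
};

console.log(
  evaluateDecision('Use an Architecture Doc because this is a project-wide tech choice.', 'architecture').label,
);
```

Responses that name both doc types ("an Architecture Doc, not a Design Doc") are exactly where keyword matching breaks down, so those cases are probably better routed to the LLM-as-judge evaluator.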

### 2.3 Prerequisites Evaluator

Used by: Test 7 (any "Did agent check prerequisites?" test)

```typescript
// evaluators/prerequisites.ts
const evaluate = (output: string) => {
  const asksAboutStories = /user stor(y|ies)/i.test(output);
  const asksAboutTestDefs = /test definition/i.test(output);
  const offersToCreate = /create|offer|would you like/i.test(output);

  // Excellent = checks before creating
};
```
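
As with the decision evaluator, the grading step is not spelled out. A minimal sketch, assuming that "checks before creating" can be approximated by a prerequisite mention combined with a question or an offer, might look like this. The `acceptable` band and the scores are assumptions; Test 7 itself only defines EXCELLENT and POOR.

```typescript
type Verdict = { score: number; label: 'excellent' | 'acceptable' | 'poor' };

const evaluatePrerequisites = (output: string): Verdict => {
  const asksAboutStories = /user stor(y|ies)/i.test(output);
  const asksAboutTestDefs = /test definition/i.test(output);
  const offersToCreate = /create|offer|would you like/i.test(output);

  const mentionsPrereq = asksAboutStories || asksAboutTestDefs;
  // Assumed heuristic: a question mark or an offer signals the agent paused to check.
  const pausesToCheck = output.includes('?') || offersToCreate;

  if (mentionsPrereq && pausesToCheck) return { score: 1, label: 'excellent' };
  // Mentions prerequisites but never asks or offers: assumed middle band.
  if (mentionsPrereq) return { score: 0.8, label: 'acceptable' };
  return { score: 0.4, label: 'poor' };
};

console.log(
  evaluatePrerequisites(
    "I don't see user stories for the payment flow. Would you like me to create them first?",
  ).label,
);
```

Note that the `create` keyword also matches a response that simply goes ahead and creates the design doc, so this heuristic can over-score; it is a starting point rather than a reliable grader.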

### 2.4 LLM-as-Judge (General Purpose)

Used by: Complex cases where regex isn't enough

```typescript
// evaluators/llm-judge.ts
const judgePrompt = `
You are evaluating an AI coding assistant's response.

Rubric:
- EXCELLENT: {rubric.excellent}
- ACCEPTABLE: {rubric.acceptable}
- POOR: {rubric.poor}

User Input: {input}
Assistant Output: {output}

Rate the response as EXCELLENT, ACCEPTABLE, or POOR. Explain briefly.
`;
```
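
The template uses `{name}` placeholders, so something has to fill them in and then read the judge's rating back out. The sketch below shows one way to do both without calling any model; `renderJudgePrompt` and `parseVerdict` are hypothetical helper names, and the rule "take the first rating word in the reply" is an assumption.

```typescript
type Label = 'EXCELLENT' | 'ACCEPTABLE' | 'POOR';

// Shortened copy of the template above, kept local so the sketch is self-contained.
const judgeTemplate = `
Rubric:
- EXCELLENT: {rubric.excellent}
- ACCEPTABLE: {rubric.acceptable}
- POOR: {rubric.poor}

User Input: {input}
Assistant Output: {output}

Rate the response as EXCELLENT, ACCEPTABLE, or POOR. Explain briefly.
`;

// Replace each {key} with its value; unknown placeholders are left untouched.
function renderJudgePrompt(template: string, values: Record<string, string>): string {
  return template.replace(/\{([\w.]+)\}/g, (match: string, key: string) =>
    Object.prototype.hasOwnProperty.call(values, key) ? values[key] : match,
  );
}

// Assumed parsing rule: the first upper-case rating word in the reply wins.
function parseVerdict(reply: string): Label | null {
  const m = /\b(EXCELLENT|ACCEPTABLE|POOR)\b/.exec(reply);
  return m ? (m[1] as Label) : null;
}

const prompt = renderJudgePrompt(judgeTemplate, {
  'rubric.excellent': 'Architecture Doc + cites tie-breaking rule',
  'rubric.acceptable': 'Architecture Doc',
  'rubric.poor': 'Design Doc',
  input: 'I need to document adding a caching layer that will be used by multiple features',
  output: 'This belongs in the Architecture Doc.',
});

console.log(prompt);
console.log(parseVerdict('ACCEPTABLE: correct doc type, but no rule cited.'));
```

Because `replace` scans the template once, placeholder-like text inside the assistant output is not substituted again, which avoids one simple form of prompt mangling.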

---

## Phase 3: Implementation Order

### Week 1: Setup + First 10 Tests

| Task                                | Output                    |
| ----------------------------------- | ------------------------- |
| Create `evals/` directory structure | Scaffolding               |
| Install dependencies                | `package.json`            |
| Create context loader               | Load SAFEWORD.md + guides |
| Create LangSmith project            | `safeword-evals` project  |
| Implement Tests 1-10 from prompt    | First dataset uploaded    |

### Week 2: Evaluator Library

| Task                       | Output                              |
| -------------------------- | ----------------------------------- |
| Section presence evaluator | Reusable for all "create doc" tests |
| Decision evaluator         | Reusable for all "which doc?" tests |
| Prerequisites evaluator    | Reusable for workflow tests         |
| LLM-as-judge wrapper       | General fallback                    |

### Week 3+: Scale Up

| Guide                     | # Stories | Priority       |
| ------------------------- | --------- | -------------- |
| architecture-guide.md     | 11        | HIGH (4 done)  |
| design-doc-guide.md       | 10        | HIGH (partial) |
| testing-methodology.md    | 13        | MEDIUM         |
| tdd-templates.md          | 16        | MEDIUM         |
| llm-instruction-design.md | 15        | LOW            |
| ...                       | ...       | ...            |

---

## Phase 4: CI Integration (Defer)

```yaml
# .github/workflows/evals.yml
name: Run Evals
on:
  push:
    paths:
      - 'framework/**' # Run when guides change
  schedule:
    - cron: '0 0 * * 0' # Weekly regression check

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cd evals && npm ci
      # Each run step starts in the repo root, so cd again
      # (assumes eval:all is defined in evals/package.json)
      - run: cd evals && npm run eval:all
        env:
          LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
```

---

## Cost Controls

| Control             | Implementation                                                   |
| ------------------- | ---------------------------------------------------------------- |
| **Model selection** | Use GPT-4o-mini for judge (cheap), GPT-4o for target (expensive) |
| **Sampling**        | Run subset in CI, full suite on demand                           |
| **Caching**         | Cache context loading; SAFEWORD.md rarely changes                |
| **Budget alerts**   | LangSmith has spend tracking                                     |

**Estimated costs (200 test cases):**

- Full suite run: ~$5-10 (depends on output length)
- CI weekly run: ~$50/month
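
The caching control could be as small as memoizing file reads inside the context loader. The following is a minimal sketch under that assumption; `createContextLoader` and the `--- path ---` separator format are hypothetical, and the reader is injected so the example runs without touching the file system.

```typescript
type ReadFile = (path: string) => string;

// Returns a loader that reads each context file at most once per process.
function createContextLoader(readFile: ReadFile) {
  const cache = new Map<string, string>();

  return (paths: string[]): string =>
    paths
      .map((p) => {
        let text = cache.get(p);
        if (text === undefined) {
          text = readFile(p);
          cache.set(p, text);
        }
        // Assumed separator so the model can tell files apart.
        return `--- ${p} ---\n${text}`;
      })
      .join('\n\n');
}

// Fake reader that counts calls, standing in for a real file read.
let reads = 0;
const fakeRead: ReadFile = (p) => {
  reads += 1;
  return `contents of ${p}`;
};

const loadContext = createContextLoader(fakeRead);
const first = loadContext(['framework/SAFEWORD.md', 'framework/guides/architecture-guide.md']);
const second = loadContext(['framework/SAFEWORD.md', 'framework/guides/design-doc-guide.md']);

console.log(reads);
```

In `context-loader.ts` the injected reader would presumably wrap Node's `fs.readFileSync(path, 'utf8')`. With roughly 200 test cases sharing SAFEWORD.md, this saves repeated disk reads only; it does not reduce token costs.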

---

## Naming Conventions

```
Test ID format: {guide-prefix}-{story-num}-{test-slug}

Examples:
- arch-001-create-doc      (architecture-guide, story 1)
- arch-002-tech-choice     (architecture-guide, story 3)
- design-001-create-doc    (design-doc-guide, story 2)
- design-002-prereqs       (design-doc-guide, story 1)
```

| Guide                      | Prefix     |
| -------------------------- | ---------- |
| architecture-guide.md      | arch       |
| design-doc-guide.md        | design     |
| testing-methodology.md     | test       |
| tdd-templates.md           | tdd        |
| code-philosophy.md         | code       |
| context-files-guide.md     | ctx        |
| data-architecture-guide.md | data       |
| learning-extraction.md     | learn      |
| llm-instruction-design.md  | llm-instr  |
| llm-prompting.md           | llm-prompt |
| test-definitions-guide.md  | testdef    |
| user-story-guide.md        | story      |
| zombie-process-cleanup.md  | zombie     |
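
A test ID in this format can be split back into its parts, which is handy for grouping results by guide. The sketch below is an illustration only: `parseTestId` is a hypothetical helper, and it assumes the numeric segment is always three digits, as in the examples above.

```typescript
// Prefixes from the table above.
const GUIDE_PREFIXES = [
  'arch', 'design', 'test', 'tdd', 'code', 'ctx', 'data',
  'learn', 'llm-instr', 'llm-prompt', 'testdef', 'story', 'zombie',
] as const;

type GuidePrefix = (typeof GUIDE_PREFIXES)[number];

interface ParsedTestId {
  prefix: GuidePrefix;
  num: number; // the three-digit numeric segment
  slug: string;
}

function parseTestId(id: string): ParsedTestId | null {
  // Lazy prefix match so hyphenated prefixes such as "llm-instr" still parse.
  const m = /^(.+?)-(\d{3})-([a-z0-9-]+)$/.exec(id);
  if (!m) return null;

  const [, prefix, num, slug] = m;
  if (!(GUIDE_PREFIXES as readonly string[]).includes(prefix)) return null;

  return { prefix: prefix as GuidePrefix, num: Number(num), slug };
}

console.log(parseTestId('arch-001-create-doc'));
console.log(parseTestId('llm-instr-004-example'));
```

Rejecting unknown prefixes at parse time would catch typos in dataset files before they reach LangSmith, assuming the prefix table stays the single source of truth.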

---

## Success Criteria

| Metric             | Target                                |
| ------------------ | ------------------------------------- |
| Test coverage      | ≥1 test per evaluated user story      |
| Pass rate baseline | Establish baseline, track regressions |
| Cost per run       | <$15 for full suite                   |
| Run time           | <10 min for full suite                |

---

## Files to Create (Initial Implementation)

1. `evals/package.json` — Dependencies
2. `evals/src/config.ts` — LangSmith + model config
3. `evals/src/context-loader.ts` — Load guide files
4. `evals/src/evaluators/section-presence.ts` — First evaluator
5. `evals/src/runner.ts` — Execute tests
6. `evals/datasets/architecture-guide/create-doc.json` — First test case
7. `evals/scripts/run-all.ts` — Entry point

---

## Related Files

- Evaluation plan: `.safeword/planning/002-user-story-quality-evaluation.md`
- User stories source: `.safeword/planning/user-stories/001-guides-review-user-stories.md`
- Original prompt: See the "Prompt to Use in New Thread" section below

---

## Prompt to Use in New Thread

```
I want to set up LLM evaluations using LangSmith for my AI coding agent framework (SAFEWORD).

**Goal:** Test whether an LLM agent correctly follows the documentation/guides when given various scenarios.

**Tech context:**
- Framework location: /Users/alex/projects/safeword/framework/
- Guides location: framework/guides/
- Main instruction file: framework/SAFEWORD.md

**What I need:**
1. LangSmith project setup for evals
2. Dataset creation with test scenarios
3. Evaluator rubrics (LLM-as-judge)
4. Integration with CI (optional, can defer)

**Test scenarios to implement:**

### Architecture Doc Tests (from architecture-guide.md)

**Test 1: Create Architecture Doc**
- Input: "Create an architecture doc for a new React + Supabase project"
- Expected: Output contains all 10 required sections (Header, TOC, Overview, Data Principles, Data Model, Components, Data Flows, Key Decisions, Best Practices, Migration)
- Rubric: EXCELLENT = all 10 sections with What/Why/Trade-off; ACCEPTABLE = 8+ sections; POOR = <8 sections

**Test 2: Doc Type Decision - Tech Choice**
- Input: "I need to document our decision to use PostgreSQL instead of MongoDB"
- Expected: Agent chooses Architecture Doc (not Design Doc)
- Rubric: EXCELLENT = correctly identifies Architecture Doc + explains why; POOR = suggests Design Doc

**Test 3: Doc Type Decision - Feature**
- Input: "I need to document how the user profile feature will work"
- Expected: Agent chooses Design Doc (not Architecture Doc)
- Rubric: EXCELLENT = correctly identifies Design Doc + checks for prerequisites; POOR = suggests Architecture Doc

**Test 4: Decision Documentation**
- Input: "Document our decision to use Redis for caching"
- Expected: Output includes What, Why, Trade-off, Alternatives Considered
- Rubric: EXCELLENT = all 4 fields with specifics; ACCEPTABLE = What/Why/Trade-off; POOR = missing Why or Trade-off

**Test 5: Ambiguous Scenario - Tie-breaker**
- Input: "I need to document adding a caching layer that will be used by multiple features"
- Expected: Agent chooses Architecture Doc (affects 2+ features)
- Rubric: EXCELLENT = Architecture Doc + cites tie-breaking rule; ACCEPTABLE = Architecture Doc; POOR = Design Doc

### Design Doc Tests (from design-doc-guide.md)

**Test 6: Create Design Doc**
- Input: "Create a design doc for a three-pane layout feature"
- Expected: Output has required sections (Architecture, Components with [N]/[N+1], User Flow, Key Decisions)
- Rubric: EXCELLENT = all required sections + references user stories/test defs; ACCEPTABLE = missing 1-2 optional sections; POOR = missing User Flow or Components

**Test 7: Prerequisites Check**
- Input: "Create a design doc for the payment flow feature" (assume no user stories exist)
- Expected: Agent asks about or offers to create user stories first
- Rubric: EXCELLENT = checks prerequisites before creating; POOR = creates design doc without checking

**Test 8: Complexity Assessment**
- Input: "Do I need a design doc for adding a logout button?"
- Expected: Agent says no (simple, <3 components, single user story)
- Rubric: EXCELLENT = correctly assesses as too simple + explains why; POOR = recommends design doc

### Edge Case Tests

**Test 9: Scattered ADRs Migration**
- Input: "Our project has 50 ADR files in docs/adr/. What should we do?"
- Expected: Agent recommends consolidating into single ARCHITECTURE.md
- Rubric: EXCELLENT = recommends consolidation + provides migration steps; POOR = suggests keeping ADRs

**Test 10: Borderline Complexity**
- Input: "I'm building a feature that touches exactly 3 components and has 2 user stories"
- Expected: Agent recommends design doc (meets threshold)
- Rubric: EXCELLENT = recommends design doc + cites complexity criteria; ACCEPTABLE = recommends design doc; POOR = says skip design doc

**Evaluation approach:**
- Use LLM-as-judge with Claude or GPT-4 as evaluator
- Each test should have the SAFEWORD.md and relevant guide loaded as context
- Track pass/fail rates over time to catch regressions

Please help me set this up in LangSmith. Start by explaining the LangSmith concepts I need to know, then walk me through creating the first few test cases.
```