@zigrivers/scaffold 2.1.2 → 2.28.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +272 -59
- package/knowledge/core/adr-craft.md +53 -0
- package/knowledge/core/ai-memory-management.md +246 -0
- package/knowledge/core/api-design.md +4 -0
- package/knowledge/core/claude-md-patterns.md +254 -0
- package/knowledge/core/coding-conventions.md +246 -0
- package/knowledge/core/database-design.md +4 -0
- package/knowledge/core/design-system-tokens.md +465 -0
- package/knowledge/core/dev-environment.md +223 -0
- package/knowledge/core/domain-modeling.md +4 -0
- package/knowledge/core/eval-craft.md +1008 -0
- package/knowledge/core/multi-model-review-dispatch.md +250 -0
- package/knowledge/core/operations-runbook.md +37 -226
- package/knowledge/core/project-structure-patterns.md +231 -0
- package/knowledge/core/review-step-template.md +247 -0
- package/knowledge/core/{security-review.md → security-best-practices.md} +5 -1
- package/knowledge/core/task-decomposition.md +57 -34
- package/knowledge/core/task-tracking.md +225 -0
- package/knowledge/core/tech-stack-selection.md +214 -0
- package/knowledge/core/testing-strategy.md +63 -70
- package/knowledge/core/user-stories.md +69 -60
- package/knowledge/core/user-story-innovation.md +57 -0
- package/knowledge/core/ux-specification.md +5 -148
- package/knowledge/finalization/apply-fixes-and-freeze.md +165 -14
- package/knowledge/product/prd-craft.md +55 -34
- package/knowledge/review/review-adr.md +32 -0
- package/knowledge/review/{review-api-contracts.md → review-api-design.md} +34 -1
- package/knowledge/review/{review-database-schema.md → review-database-design.md} +27 -1
- package/knowledge/review/review-domain-modeling.md +33 -0
- package/knowledge/review/review-implementation-tasks.md +50 -0
- package/knowledge/review/review-operations.md +55 -0
- package/knowledge/review/review-prd.md +33 -0
- package/knowledge/review/review-security.md +53 -0
- package/knowledge/review/review-system-architecture.md +28 -0
- package/knowledge/review/review-testing-strategy.md +51 -0
- package/knowledge/review/review-user-stories.md +54 -0
- package/knowledge/review/{review-ux-spec.md → review-ux-specification.md} +37 -1
- package/methodology/custom-defaults.yml +32 -3
- package/methodology/deep.yml +32 -3
- package/methodology/mvp.yml +32 -3
- package/package.json +2 -1
- package/pipeline/architecture/review-architecture.md +18 -6
- package/pipeline/architecture/system-architecture.md +14 -2
- package/pipeline/consolidation/claude-md-optimization.md +73 -0
- package/pipeline/consolidation/workflow-audit.md +73 -0
- package/pipeline/decisions/adrs.md +14 -2
- package/pipeline/decisions/review-adrs.md +18 -5
- package/pipeline/environment/ai-memory-setup.md +70 -0
- package/pipeline/environment/automated-pr-review.md +70 -0
- package/pipeline/environment/design-system.md +73 -0
- package/pipeline/environment/dev-env-setup.md +65 -0
- package/pipeline/environment/git-workflow.md +71 -0
- package/pipeline/finalization/apply-fixes-and-freeze.md +1 -1
- package/pipeline/finalization/developer-onboarding-guide.md +1 -1
- package/pipeline/finalization/implementation-playbook.md +3 -3
- package/pipeline/foundation/beads.md +68 -0
- package/pipeline/foundation/coding-standards.md +68 -0
- package/pipeline/foundation/project-structure.md +69 -0
- package/pipeline/foundation/tdd.md +60 -0
- package/pipeline/foundation/tech-stack.md +74 -0
- package/pipeline/integration/add-e2e-testing.md +65 -0
- package/pipeline/modeling/domain-modeling.md +14 -2
- package/pipeline/modeling/review-domain-modeling.md +18 -5
- package/pipeline/parity/platform-parity-review.md +70 -0
- package/pipeline/planning/implementation-plan-review.md +56 -0
- package/pipeline/planning/{implementation-tasks.md → implementation-plan.md} +29 -9
- package/pipeline/pre/create-prd.md +13 -4
- package/pipeline/pre/innovate-prd.md +37 -8
- package/pipeline/pre/innovate-user-stories.md +38 -7
- package/pipeline/pre/review-prd.md +18 -6
- package/pipeline/pre/review-user-stories.md +23 -6
- package/pipeline/pre/user-stories.md +12 -2
- package/pipeline/quality/create-evals.md +102 -0
- package/pipeline/quality/operations.md +38 -13
- package/pipeline/quality/review-operations.md +17 -5
- package/pipeline/quality/review-security.md +17 -5
- package/pipeline/quality/review-testing.md +20 -8
- package/pipeline/quality/security.md +25 -3
- package/pipeline/quality/story-tests.md +73 -0
- package/pipeline/specification/api-contracts.md +17 -2
- package/pipeline/specification/database-schema.md +17 -2
- package/pipeline/specification/review-api.md +18 -6
- package/pipeline/specification/review-database.md +18 -6
- package/pipeline/specification/review-ux.md +19 -7
- package/pipeline/specification/ux-spec.md +29 -10
- package/pipeline/validation/critical-path-walkthrough.md +34 -7
- package/pipeline/validation/cross-phase-consistency.md +34 -7
- package/pipeline/validation/decision-completeness.md +34 -7
- package/pipeline/validation/dependency-graph-validation.md +34 -7
- package/pipeline/validation/implementability-dry-run.md +34 -7
- package/pipeline/validation/scope-creep-check.md +34 -7
- package/pipeline/validation/traceability-matrix.md +34 -7
- package/skills/multi-model-dispatch/SKILL.md +326 -0
- package/skills/scaffold-pipeline/SKILL.md +195 -0
- package/skills/scaffold-runner/SKILL.md +465 -0
- package/pipeline/planning/review-tasks.md +0 -38
- package/pipeline/quality/testing-strategy.md +0 -42
@@ -0,0 +1,1008 @@

---
name: eval-craft
description: Writing effective project evals that verify AI-generated code meets documented standards
topics: [evals, verification, coverage, adherence, consistency, structure, code-quality, standards-compliance]
---

# Eval Craft

Evals are project-wide property checks — automated tests that verify AI-generated code meets the project's own documented standards. They sit alongside unit tests and integration tests but serve a fundamentally different purpose.

## Summary

### What Evals Are

Evals verify that a project follows its own documented rules. They operate at the project level — reading documentation, scanning source trees, parsing configuration, and checking git history. They do not test whether code is correct (that is what functional tests do). They test whether the project is internally consistent and complete.

Unit tests answer: "Does this function return the right result?"
Integration tests answer: "Do these components work together?"
Evals answer: "Does this project follow its own documented rules?"

### Eval Categories (up to 13)

**5 Core categories** (always generated):
1. **Consistency** — Doc-tooling sync: command tables ↔ build targets, commit format, cross-doc refs
2. **Structure** — File placement per project-structure.md, shared code 2+ consumers, test co-location
3. **Adherence** — Coding patterns from coding-standards.md: TODO format, mock rules, error handling, stack-specific
4. **Coverage** — Requirement→code keyword matching, AC→test mapping, API endpoint coverage
5. **Cross-doc** — Technology, path, terminology consistency across scaffold-produced docs

**8 Conditional categories** (generated when the source doc exists):
6. **Architecture conformance** ← system-architecture.md — Layer direction, module boundaries, circular deps
7. **API contract** ← api-contracts.md — Endpoint existence, HTTP methods, error response coverage
8. **Security patterns** ← security-review.md — Auth middleware, no secrets, input validation, SQL injection
9. **Database schema** ← database-schema.md — Migration coverage, columns, indexes, foreign keys
10. **Accessibility** ← ux-spec.md — ARIA, alt text, focus styles, color contrast
11. **Performance budget** ← plan.md NFRs — Budget files, bundle tracking, perf test existence
12. **Configuration** ← dev-setup.md — Env var docs, dead config, startup validation
13. **Error handling** ← coding-standards.md — No bare catches, error responses tested, custom error classes

### Design Principles

- **Binary PASS/FAIL, not scores** — prevents Goodhart's Law gaming
- **Every eval needs a false-positive mitigation strategy** — an exclusion mechanism is mandatory
- **Conditional on source doc** — never generate evals for docs that don't exist
- **Prefer grep over AST** — faster to write, run, and maintain
- **Evals must be fast** — full suite under 30 seconds, individual file under 2 seconds
- **One category per eval file** — don't mix consistency and adherence in one file
- **Document what evals don't check** — prevent false confidence

### What Evals Do NOT Verify

- Whether code is correct (functional tests)
- Whether code is elegant or well-designed (code review)
- Whether tests are good quality (manual review)
- Whether the UI looks right (visual testing, beyond automated a11y)
- Deep security vulnerabilities (use dedicated SAST/DAST tools)
- Actual runtime performance (evals verify tracking config, not measurements)

This boundary must be documented explicitly in `docs/eval-standards.md`.

## Deep Guidance

### Evals vs. Other Quality Tools

| Tool | What It Checks | Scope | When It Runs |
|------|---------------|-------|-------------|
| **Unit tests** | Algorithmic correctness of individual functions | Single function/class | Every commit |
| **Integration tests** | Component interaction correctness | Multi-component | Every commit |
| **E2E tests** | User-facing behavior correctness | Full system | Pre-merge/CI |
| **Linters** | Syntax rules, style enforcement | Single file | Pre-commit |
| **Evals** | Standards adherence, coverage completeness, doc-code sync | Entire project | On demand / CI opt-in |

### What Evals Verify

- Documentation says X commands exist → do they?
- Coding standards say no bare `any` types → are there any?
- Project structure says shared code needs 2+ consumers → does it?
- User stories define acceptance criteria → do tests reference them?
- Commit format is documented → do recent commits follow it?

### Consistency Evals — Deep Dive

#### What to Check

- **Command tables → build targets**: Every command listed in CLAUDE.md's Key Commands table has a corresponding target in Makefile, package.json scripts, or pyproject.toml. And vice versa — no orphan targets that aren't documented.
- **Commit message format**: Recent commits in `git log` follow the format documented in `docs/coding-standards.md`. Use regex matching against the documented pattern.
- **Cross-document references**: File paths referenced in documentation actually exist on disk. Markdown links between docs resolve. Section references point to real headings.
- **Configuration consistency**: Values in config files match what the docs describe (port numbers, environment variable names, feature flags).

#### Techniques

**Parsing markdown tables**: Extract rows from pipe-delimited tables, pull backtick-quoted commands, normalize whitespace.

```typescript
function extractCommandsFromTable(markdown: string): string[] {
  const tableRows = markdown.match(/\|[^|]+\|[^|]+\|/g) || [];
  return tableRows
    .map(row => row.match(/`([^`]+)`/)?.[1])
    .filter((cmd): cmd is string => cmd !== undefined);
}
```

**Matching commands to targets**: Different build systems have different lookup methods.

```typescript
import { readFileSync } from 'node:fs';

// package.json: direct key lookup
const scripts = JSON.parse(readFileSync('package.json', 'utf-8')).scripts || {};
const hasScriptTarget = (cmd: string) => cmd.replace(/^npm run /, '') in scripts;

// Makefile: parse target lines (lines starting with word characters followed by a colon)
const makefile = readFileSync('Makefile', 'utf-8');
const targets = makefile.match(/^[\w-]+(?=\s*:)/gm) || [];
const hasMakeTarget = (cmd: string) => targets.includes(cmd.replace(/^make /, ''));
```

**Validating cross-doc references**: Scan for markdown links and file path references, then verify the targets exist.

```typescript
import { existsSync, readFileSync } from 'node:fs';

function findBrokenRefs(docPath: string): string[] {
  const content = readFileSync(docPath, 'utf-8');
  const refs = content.match(/`((?:docs|src|tests|scripts)\/[^`]+)`/g) || [];
  return refs
    .map(ref => ref.replace(/`/g, ''))
    .filter(ref => !existsSync(ref));
}
```
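
**Checking commit message format**: This is `git log` plus a regex. A minimal sketch, assuming a conventional-commits-style pattern — substitute whatever format `docs/coding-standards.md` actually documents:

```typescript
import { execSync } from 'node:child_process';

// Assumed conventional-commits-style pattern; replace with the
// project's documented format from docs/coding-standards.md.
const COMMIT_FORMAT = /^(feat|fix|docs|chore|refactor|test)(\([\w-]+\))?: .+/;

function findBadCommits(subjects: string[]): string[] {
  // Merge commits are machine-generated; skip them.
  return subjects.filter(s => !s.startsWith('Merge') && !COMMIT_FORMAT.test(s));
}

// Guard against fresh repos: an empty or failing git log is a PASS, not a crash.
function recentSubjects(count = 20): string[] {
  try {
    const out = execSync(`git log -n ${count} --format=%s`, { encoding: 'utf-8' });
    return out.split('\n').filter(Boolean);
  } catch {
    return [];
  }
}
```

The try/catch around `git log` covers the fresh-repo false positive noted below.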

#### Common False Positives

- **Dynamic commands**: Commands with arguments (e.g., `scripts/setup-agent-worktree.sh <name>`) won't match a static Makefile target. Solution: match the base command, not the full invocation.
- **Aliased commands**: `npm test` matches `scripts.test`, not `scripts["npm test"]`. Normalize before matching.
- **Heading anchors in docs**: Markdown heading IDs are auto-generated and case-sensitive. Match against the raw heading text, not a guessed slug.
- **Git log on fresh repos**: A new project has no commit history. Guard against empty log output.

### Structure Evals — Deep Dive

#### What to Check

- **File placement**: Source files are in the directories their module defines. No stray files in the root. No feature code in shared directories.
- **Shared code consumer count**: Files in `shared/`, `common/`, or `lib/` directories must have 2+ distinct importers. A "shared" file used by only one module is misplaced — it belongs in that module.
- **Test co-location**: Test files follow the convention from `docs/tdd-standards.md` — either co-located with source (`foo.test.ts` next to `foo.ts`) or in a mirror directory (`tests/features/` mirroring `src/features/`).
- **No orphan files**: Every source file is either imported by another file or is an entry point (main, index, route handler). Dead files indicate incomplete cleanup.

#### Techniques

**Checking the 2+ consumer rule**: Scan all source files for imports of the shared file.

```typescript
import { readFileSync } from 'node:fs';

function countConsumers(sharedFile: string, allFiles: string[]): number {
  // Build the import path other files would use for this file
  const importPath = sharedFile
    .replace(/^src\//, '@/')
    .replace(/\.\w+$/, ''); // strip extension

  return allFiles.filter(f => {
    if (f === sharedFile) return false;
    const content = readFileSync(f, 'utf-8');
    // Check both ES-module and CommonJS import styles
    return content.includes(`from '${importPath}'`) ||
           content.includes(`require('${importPath}')`);
  }).length;
}
```

**Verifying test co-location**: Given the project's test convention, check that every source file has a corresponding test file.

```typescript
import path from 'node:path';

// Co-located: src/features/auth/login.ts -> src/features/auth/login.test.ts
function findColocatedTest(sourceFile: string): string {
  const ext = path.extname(sourceFile);
  return sourceFile.replace(ext, `.test${ext}`);
}

// Mirror directory: src/features/auth/login.ts -> tests/features/auth/login.test.ts
function findMirrorTest(sourceFile: string): string {
  const ext = path.extname(sourceFile);
  return sourceFile
    .replace(/^src\//, 'tests/')
    .replace(ext, `.test${ext}`);
}
```
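
**Detecting orphan files**: Orphan detection reuses the same consumer scan. A sketch, assuming a `countConsumers` function like the one above; the entry-point patterns are illustrative and should come from the project's own conventions:

```typescript
// Illustrative entry-point patterns; extend per project conventions.
const ENTRY_POINTS = [/(^|\/)main\.\w+$/, /(^|\/)index\.\w+$/, /\.config\.\w+$/];

function findOrphans(
  files: string[],
  countConsumers: (file: string, all: string[]) => number
): string[] {
  return files.filter(f =>
    !ENTRY_POINTS.some(p => p.test(f)) && countConsumers(f, files) === 0
  );
}
```

Passing `countConsumers` in keeps the check testable without touching the filesystem.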

#### Common False Positives

- **Entry points with no importers**: `main.ts`, `index.ts`, route files, and CLI entry points won't be imported by other files. Maintain an exclusion list for known entry-point patterns.
- **Type-only files**: TypeScript `.d.ts` files and type exports may have only type-level imports that are erased at runtime. Count type imports as consumers.
- **Generated files**: Files in `generated/` or `__generated__/` directories follow different rules. Exclude them from structure checks.
- **Config files**: `tailwind.config.ts`, `vite.config.ts`, etc. live in the root by convention, not by module placement rules.

### Adherence Evals — Deep Dive

#### What to Check

- **TODO format**: No `TODO`, `FIXME`, or `HACK` comments without a task ID (e.g., `[BD-123]`). Untagged TODOs are tracking gaps.
- **Mock patterns**: Test files follow the project's mocking strategy. If `docs/tdd-standards.md` says "don't mock the database in integration tests," scan for that pattern.
- **Error handling**: Code follows the error handling convention (e.g., custom error classes, no bare `catch {}`, no swallowed errors).
- **Stack-specific patterns**: Derived from `docs/tech-stack.md`:
  - TypeScript: no `any` type, no `@ts-ignore` without justification
  - Python: no bare `except:`, no `import *`
  - Go: no ignored error returns (errors discarded with `_`)
  - Shell: `set -euo pipefail` at the top of scripts

#### The Exclusion Mechanism

Adherence evals MUST support exclusions. Users will have legitimate reasons to violate patterns — third-party library types require `any`, a specific `@ts-ignore` is the only workaround for a known bug, etc.

**Exclusion pattern** (inline comment):
```typescript
// eval-ignore: any — third-party lib returns untyped response
const data: any = legacyApi.fetch();
```

**Exclusion pattern** (test file configuration):
```typescript
const EXCLUDED_FILES = [
  'src/legacy/**',    // pre-standards code, being migrated
  'src/generated/**', // auto-generated, not human-maintained
];

const EXCLUDED_PATTERNS = [
  { pattern: /: any\b/, file: 'src/adapters/legacy-api.ts', reason: 'legacy API returns untyped' },
];
```

The exclusion list lives in the adherence eval file itself. When the Create Evals prompt runs in update mode, it preserves these exclusions — they represent institutional knowledge about legitimate exceptions.

#### Techniques

**Scanning for pattern violations**: Use regex against file content, filtering by file extension and respecting exclusions.

```typescript
import { readFileSync } from 'node:fs';

function findViolations(
  pattern: RegExp,
  files: string[],
  exclusions: { file: string; reason: string }[]
): { file: string; line: number; match: string }[] {
  const excluded = new Set(exclusions.map(e => e.file));
  const violations: { file: string; line: number; match: string }[] = [];

  for (const file of files) {
    if (excluded.has(file)) continue;
    const lines = readFileSync(file, 'utf-8').split('\n');
    for (let i = 0; i < lines.length; i++) {
      if (lines[i].includes('eval-ignore')) continue; // inline exclusion
      const match = lines[i].match(pattern);
      if (match) {
        violations.push({ file, line: i + 1, match: match[0] });
      }
    }
  }
  return violations;
}
```

**Stack-specific pattern detection**: Read `docs/tech-stack.md` to determine which patterns apply.

```python
# pytest example: detect bare except clauses
import ast, pathlib

def find_bare_excepts(src_dir: str) -> list[dict]:
    violations = []
    for py_file in pathlib.Path(src_dir).rglob("*.py"):
        try:
            tree = ast.parse(py_file.read_text())
        except SyntaxError:
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.ExceptHandler) and node.type is None:
                violations.append({
                    "file": str(py_file),
                    "line": node.lineno,
                    "issue": "bare except: clause"
                })
    return violations
```

#### String Matching vs. AST Parsing

**Prefer string/regex matching** for most adherence checks:
- Faster to write and run
- Works across all languages without language-specific tooling
- Sufficient for pattern detection (TODO format, type annotations, comment conventions)
- Easier to maintain and debug

**Use AST parsing only when**:
- Regex would produce unacceptable false-positive rates (e.g., matching patterns inside string literals)
- The check requires structural understanding (e.g., "function has more than 5 parameters")
- The project already has AST tooling available (e.g., Python projects can use the `ast` module with zero dependencies)

Rule of thumb: if grep can find it, use grep. If grep finds too many false positives, try a more specific regex. Only reach for AST parsing as a last resort.

### Coverage Evals — Deep Dive

#### What to Check

- **Feature coverage**: Every Must-have feature in `docs/plan.md` maps to at least one implementation file. Match by keywords from the feature description against file names and file content.
- **AC coverage**: Every acceptance criterion in `docs/user-stories.md` is referenced by at least one test file. Match by domain keywords extracted from the AC text.
- **API coverage**: Every API endpoint described in the docs has a route definition and at least one test.

#### Extracting Requirements from Markdown

Parse structured markdown to extract testable requirements:

```typescript
interface Requirement {
  id: string;
  text: string;
  keywords: string[];
}

function extractACs(userStoriesContent: string): Requirement[] {
  const acs: Requirement[] = [];
  const lines = userStoriesContent.split('\n');
  let currentStoryId = '';

  for (const line of lines) {
    // Match story headers like "## US-001: User Login"
    const storyMatch = line.match(/^##\s+(US-\d+)/);
    if (storyMatch) {
      currentStoryId = storyMatch[1];
      continue;
    }

    // Match AC lines like "- AC-1: Given..." or "- [ ] User can..."
    const acMatch = line.match(/^[-*]\s+(?:(AC-\d+):\s*)?(.+)/);
    if (acMatch && currentStoryId) {
      const [, acId, text] = acMatch;
      acs.push({
        id: acId ? `${currentStoryId}/${acId}` : currentStoryId,
        text,
        keywords: extractKeywords(text),
      });
    }
  }
  return acs;
}

function extractKeywords(text: string): string[] {
  // Remove common filler words, keep domain terms
  const stopWords = new Set([
    'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be', 'been',
    'being', 'have', 'has', 'had', 'do', 'does', 'did', 'will',
    'shall', 'should', 'may', 'might', 'must', 'can', 'could',
    'would', 'i', 'me', 'my', 'we', 'our', 'you', 'your',
    'when', 'then', 'given', 'and', 'or', 'but', 'if', 'that',
    'this', 'with', 'for', 'from', 'to', 'in', 'on', 'at',
    'by', 'not', 'no', 'see', 'also',
  ]);
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, '')
    .split(/\s+/)
    .filter(w => w.length > 2 && !stopWords.has(w));
}
```

#### The Confidence Spectrum

Coverage matching is inherently fuzzy. Categorize matches by confidence:

| Confidence | Criteria | Action |
|-----------|---------|--------|
| **High** | Test file contains the exact requirement ID (e.g., `US-001`) or exact key phrase (e.g., `"5 failed attempts"`) | PASS |
| **Medium** | Test file contains 2+ domain keywords from the AC (e.g., `login` + `lockout`) | PASS with note |
| **Low** | Test file contains only 1 keyword that could be coincidental | WARNING — manual review needed |
| **None** | No test file references any keyword from the AC | FAIL — likely missing coverage |

Coverage evals should report the confidence level alongside the result, so that teams can calibrate their response.
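
A sketch of that classification, assuming the test corpus is passed in as one string. Only the requirement-ID path of the High tier is shown; exact key-phrase matching works the same way:

```typescript
type Confidence = 'high' | 'medium' | 'low' | 'none';

function classifyCoverage(
  requirementId: string,
  keywords: string[],
  testContent: string
): Confidence {
  // Exact requirement ID in a test file is the strongest signal
  if (testContent.includes(requirementId)) return 'high';
  const hits = keywords.filter(kw => testContent.includes(kw)).length;
  if (hits >= 2) return 'medium';
  if (hits === 1) return 'low';
  return 'none';
}
```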

#### Handling Compound ACs

Some ACs contain multiple verifiable behaviors:

```
AC-3: Given I enter an invalid email, When I submit,
Then I see "Invalid email format" AND the submit button is disabled
```

This AC has two checkable behaviors: the error message and the button state. Coverage evals should extract both key phrases and check for each independently. A test that checks the error message but not the button state gets a "partial" coverage note.

```typescript
function splitCompoundAC(acText: string): string[] {
  // Split on AND/and where it separates distinct behaviors
  return acText
    .split(/\b(?:AND|and)\b/)
    .map(s => s.trim())
    .filter(s => s.length > 10); // filter out fragments
}
```

#### Feature-to-Code Mapping Heuristics

When checking whether a feature has an implementation:

1. **File name matching**: Feature "user authentication" → look for files containing `auth`, `login`, `session` in their paths
2. **Content keyword matching**: Grep source files for domain terms from the feature description
3. **Route/endpoint matching**: Feature describes an API operation → look for the HTTP method + path pattern in route definitions
4. **Component matching**: Feature describes a UI screen → look for component files with matching names

```typescript
import { readFileSync } from 'node:fs';

function featureHasImplementation(
  featureKeywords: string[],
  sourceFiles: string[]
): { found: boolean; matchedFiles: string[] } {
  const matchedFiles = sourceFiles.filter(file => {
    const fileLower = file.toLowerCase();
    const content = readFileSync(file, 'utf-8').toLowerCase();

    // Check the file path for keyword matches
    const pathMatch = featureKeywords.some(kw => fileLower.includes(kw));
    // Check the file content for 2+ keyword matches
    const contentMatches = featureKeywords.filter(kw => content.includes(kw));

    return pathMatch || contentMatches.length >= 2;
  });

  return { found: matchedFiles.length > 0, matchedFiles };
}
```

### Architecture Conformance Evals — Deep Dive

**Source docs**: `docs/system-architecture.md` + `docs/project-structure.md`

Architecture conformance evals prevent code from silently diverging from the documented architecture. They check three things: import direction, module boundaries, and circular dependencies.

**Import direction checking** (grep-based):
1. Parse the architecture doc for layer definitions (e.g., "controllers depend on services, services depend on repositories")
2. For each source file, determine its layer from its file path (e.g., `src/controllers/` → controller layer)
3. Scan its import statements and verify they only reference allowed layers
4. Report violations: "src/repositories/user.ts imports from src/controllers/auth.ts — repositories should not depend on controllers"
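
Steps 2 through 4 can be sketched with a layer map derived from the architecture doc. The `ALLOWED` rules here are illustrative placeholders, not from any real doc:

```typescript
// Hypothetical layer rules from docs/system-architecture.md:
// each layer may import only from the layers listed for it.
const ALLOWED: Record<string, string[]> = {
  controllers: ['services'],
  services: ['repositories'],
  repositories: [],
};

// Step 2: derive the layer from the file path
function layerOf(file: string): string | undefined {
  return Object.keys(ALLOWED).find(layer => file.includes(`src/${layer}/`));
}

// Steps 3-4: return the imports that cross layers in the wrong direction
function importViolations(file: string, imports: string[]): string[] {
  const layer = layerOf(file);
  if (!layer) return []; // unlayered files are out of scope
  return imports.filter(imp => {
    const target = layerOf(imp);
    return target !== undefined && target !== layer && !ALLOWED[layer].includes(target);
  });
}
```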

**Module boundary checking**:
1. Parse `docs/project-structure.md` for feature directories (e.g., `src/features/auth/`, `src/features/billing/`)
2. For each feature directory, scan imports
3. Flag imports that reach into another feature's internal modules (e.g., `src/features/auth/services/token.ts` importing `src/features/billing/internal/invoice-calc.ts`)
4. Cross-feature imports should go through shared/public interfaces only

**False positive mitigation**: Exclude shared/common directories, type-only imports, and test files. Allow an `// eval-exclude: cross-feature` inline comment to suppress a finding.

**Tool recommendations**: For JS/TS projects with complex architecture, recommend `dependency-cruiser` or `eslint-plugin-boundaries` in `docs/eval-standards.md`. The eval checks for their config's existence as a positive signal.

### API Contract Evals — Deep Dive

**Source doc**: `docs/api-contracts.md`

API contract evals verify that documented API specifications match the actual code. They use grep/regex — not runtime testing (that's Dredd/Pact territory).

**Endpoint existence checking**:
1. Parse `docs/api-contracts.md` for endpoint definitions (look for patterns like `GET /api/v1/users`, `POST /api/auth/login`)
2. For each endpoint, search route definition files for the path pattern
3. Report missing routes: "POST /api/v1/orders documented but no route definition found"
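
The steps above can be sketched as two small functions. The route check assumes Express-style quoted paths (`app.get('/path')`), an assumption to adjust per framework:

```typescript
// Step 1: pull "METHOD /path" pairs out of the contract doc
function extractEndpoints(contractDoc: string): { method: string; path: string }[] {
  const matches = contractDoc.match(/\b(GET|POST|PUT|PATCH|DELETE)\s+(\/[\w/:.-]+)/g) || [];
  return matches.map(m => {
    const [method, path] = m.split(/\s+/);
    return { method, path };
  });
}

// Steps 2-3: grep the route source for each documented path
function findMissingRoutes(
  endpoints: { method: string; path: string }[],
  routeSource: string
): string[] {
  return endpoints
    .filter(ep => !routeSource.includes(`'${ep.path}'`))
    .map(ep => `${ep.method} ${ep.path} documented but no route definition found`);
}
```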
|
|
457
|
+
|
|
458
|
+
**Error response coverage**:
|
|
459
|
+
1. For each endpoint, extract documented error codes (400, 401, 403, 404, 422)
|
|
460
|
+
2. Search test files for tests that verify these status codes for that endpoint
|
|
461
|
+
3. Report: "GET /api/v1/users/:id documents 404 response but no test triggers it"
|
|
462
|
+
|
|
463
|
+
**False positive mitigation**: Route frameworks vary widely — check for common patterns (`app.get`, `router.post`, `@Get()`, `@app.route`). Allow pattern overrides via a config section in `docs/eval-standards.md`.
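
The endpoint-existence pass can be sketched as pure string functions. The `METHOD /path` doc format and substring route matching are assumptions; real projects would override both via `docs/eval-standards.md`:

```typescript
// Hypothetical sketch: extract "METHOD /path" pairs from the contracts doc,
// then look for each path in route-definition source.
const ENDPOINT_RE = /\b(GET|POST|PUT|PATCH|DELETE)\s+(\/[\w/:.-]+)/g;

function documentedEndpoints(doc: string): { method: string; path: string }[] {
  return [...doc.matchAll(ENDPOINT_RE)].map(m => ({ method: m[1], path: m[2] }));
}

function missingRoutes(doc: string, routeSource: string): string[] {
  return documentedEndpoints(doc)
    .filter(({ path }) => !routeSource.includes(path))
    .map(({ method, path }) => `${method} ${path} documented but no route definition found`);
}

const contracts = '- POST /api/v1/orders\n- GET /api/v1/users';
const routes = "router.get('/api/v1/users', listUsers);";
console.log(missingRoutes(contracts, routes));
```

Note a known limitation of path-only matching: it cannot distinguish HTTP methods that share a path.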

### Security Pattern Evals — Deep Dive

**Source doc**: `docs/security-review.md`

Security evals verify documented security controls are implemented. They check patterns, not vulnerabilities — use SAST/DAST tools for deep scanning.

**No hardcoded secrets** (regex patterns):
```
# Common secret patterns to flag
(?:api[_-]?key|secret|password|token|credential)\s*[:=]\s*['"][^'"]{8,}['"]
(?:sk|pk)_(?:live|test)_[a-zA-Z0-9]{16,}   # Stripe-style keys (sk_live_..., pk_test_...)
AKIA[0-9A-Z]{16}                           # AWS access key IDs
ghp_[a-zA-Z0-9]{36}                        # GitHub personal tokens
```

**Auth middleware presence**:
1. Parse security review for protected routes/resources
2. Check route definitions for auth middleware application (e.g., `requireAuth`, `@authenticated`, `authMiddleware`)
3. Flag unprotected routes that should be protected

**SQL injection prevention**:
1. Search for database query patterns (e.g., string concatenation in SQL: `` `SELECT * FROM ${table}` ``)
2. Flag string interpolation in query strings
3. Allow parameterized queries and ORM usage

**False positive mitigation**: Security evals are PRESERVED on re-run because teams customize exclusion patterns extensively. Inline `// eval-exclude: secret-pattern` for test fixtures and config examples.
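
The secret scan plus the line-level escape hatch can be sketched as below. Patterns follow the examples above (with the Stripe pattern in its real `sk_live_`/`sk_test_` shape); treat them as starting points, not a complete set:

```typescript
// Sketch: per-line secret scan with inline exclusion support.
const SECRET_PATTERNS = [
  /(?:api[_-]?key|secret|password|token|credential)\s*[:=]\s*['"][^'"]{8,}['"]/i,
  /(?:sk|pk)_(?:live|test)_[a-zA-Z0-9]{16,}/, // Stripe-style keys
  /AKIA[0-9A-Z]{16}/,                         // AWS access key IDs
  /ghp_[a-zA-Z0-9]{36}/,                      // GitHub personal tokens
];

function secretFindings(file: string, content: string): string[] {
  return content.split('\n').flatMap((line, i) => {
    // The documented escape hatch: skip excluded lines before pattern-testing.
    if (line.includes('eval-exclude: secret-pattern')) return [];
    return SECRET_PATTERNS.some(p => p.test(line)) ? [`${file}:${i + 1}`] : [];
  });
}

console.log(secretFindings(
  'src/config.ts',
  'const password = "correct-horse-battery";\n' +
  'const password = "correct-horse-battery"; // eval-exclude: secret-pattern'
));
```

Only the first line is reported; the second is suppressed by the inline comment.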

### Database Schema Evals — Deep Dive

**Source doc**: `docs/database-schema.md`

Database evals verify migration files produce the documented schema. Grep-based — no database connection needed.

**Migration existence checking**:
1. Parse `docs/database-schema.md` for table names
2. Search migration files for `CREATE TABLE` or equivalent ORM statements
3. Report: "Table 'order_items' documented but no migration creates it"

**Column coverage**:
1. For each documented table, extract column names
2. Search the table's migration file for column definitions
3. Report missing columns

**False positive mitigation**: ORM-generated migrations use different syntax than raw SQL. Check for both patterns. Allow `docs/eval-standards.md` to specify the migration framework for accurate matching.
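
The migration-existence check can be sketched as follows. The `### table_name` heading convention for the schema doc is an assumption — real parsing should follow the project's own doc format, and ORM-based migrations need different patterns, as noted above:

```typescript
// Hypothetical sketch: table names from markdown headings vs. CREATE TABLE statements.
function documentedTables(schemaDoc: string): string[] {
  return [...schemaDoc.matchAll(/^###\s+`?(\w+)`?\s*$/gm)].map(m => m[1]);
}

function tablesWithoutMigrations(schemaDoc: string, migrations: string): string[] {
  return documentedTables(schemaDoc).filter(
    t => !new RegExp(`CREATE TABLE (IF NOT EXISTS )?"?${t}"?`, 'i').test(migrations)
  );
}

const schemaDoc = '### users\n\ncolumns...\n\n### order_items\n\ncolumns...';
const migrations = 'CREATE TABLE users (id SERIAL PRIMARY KEY);';
console.log(tablesWithoutMigrations(schemaDoc, migrations)); // [ 'order_items' ]
```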

### Accessibility Evals — Deep Dive

**Source doc**: `docs/ux-spec.md` (accessibility section)

Only generated when the UX spec documents accessibility requirements (search for "WCAG", "accessibility", "a11y", "screen reader").

**Alt text checking**:
```
# Flag img elements without an alt attribute (the lookahead must come before the attributes)
<img(?![^>]*\balt=)[^>]*>
# Also check framework-specific: Image, next/image without alt
```

**Focus style checking**: Search CSS/styled-components for `:focus` or `:focus-visible` rules. Flag interactive elements (button, a, input) without visible focus styles.

**Tool recommendation**: If the project has a frontend, recommend `@axe-core/cli` or Playwright's built-in accessibility assertions in `docs/eval-standards.md`. The eval checks for axe-core in dependencies as a positive signal.
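
A minimal sketch of the alt-text check. The negative lookahead sits immediately after `<img`, before any attributes are consumed — placing it after a `[^>]*` would let the regex skip past `alt=` and match every image:

```typescript
// Sketch: report <img> tags that carry no alt attribute.
const IMG_WITHOUT_ALT = /<img(?![^>]*\balt=)[^>]*>/gi;

function missingAlt(markup: string): string[] {
  return [...markup.matchAll(IMG_WITHOUT_ALT)].map(m => m[0]);
}

console.log(missingAlt('<img src="logo.png"><img src="hero.png" alt="Hero banner">'));
// Only the logo tag is reported; the hero tag has alt text.
```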

### Performance Budget Evals — Deep Dive

**Source doc**: `docs/plan.md` (non-functional requirements section)

Only generated when the PRD contains performance targets. Search for patterns: "response time", "load time", "within X seconds", "under X ms", "bundle size".

**What to check**:
1. A performance budget file exists (`budget.json`, `.size-limit.json`, or equivalent)
2. CI config references performance testing (Lighthouse CI, k6, Artillery)
3. Critical user flows (from user stories) have corresponding performance test config

**False positive mitigation**: Skip if no performance-related NFRs are found in plan.md. Don't enforce specific tools — just verify that *some* performance tracking exists.
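
The "some tracking exists" check can be sketched as an OR over budget files and CI markers. The file names and marker strings below are common conventions (`lighthouserc.json` is an added assumption), not requirements:

```typescript
// Sketch: tool-agnostic check that *some* performance tracking exists.
function hasPerformanceTracking(repoFiles: string[], ciConfig: string): boolean {
  const budgetFiles = ['budget.json', '.size-limit.json', 'lighthouserc.json'];
  const ciMarkers = ['lhci', 'lighthouse', 'k6', 'artillery'];
  return repoFiles.some(f => budgetFiles.includes(f)) ||
         ciMarkers.some(m => ciConfig.toLowerCase().includes(m));
}

console.log(hasPerformanceTracking(['package.json', '.size-limit.json'], ''));
console.log(hasPerformanceTracking(['package.json'], 'jobs:\n  test:\n    run: npm test'));
```

The eval would pass on the first call (a budget file is present) and fail on the second (neither signal found).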

### Configuration Validation Evals — Deep Dive

**Source doc**: `docs/dev-setup.md`

Config evals prevent the "works on my machine" problem by verifying env vars are documented and validated.

**Env var scanning** (per-stack patterns):
- TypeScript/JS: `process.env.X` or `process.env['X']`
- Python: `os.environ["X"]` or `os.getenv("X")`
- Go: `os.Getenv("X")`
- Shell: `$X` or `${X}`

For each env var found in code, verify it appears in `.env.example` or `docs/dev-setup.md`. For each var in `.env.example`, verify it's actually referenced in code (detect dead config).

**Startup validation check**: Search for config schema validation at app startup — Zod parse of process.env, Pydantic BaseSettings, Go envconfig struct tags. The existence of startup validation is a positive signal.
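
The two-directional comparison can be sketched for the TS/JS stack; only the `process.env` patterns from the list above are handled here:

```typescript
// Sketch: env vars referenced in code vs. vars listed in .env.example.
function envVarsInCode(source: string): Set<string> {
  const re = /process\.env(?:\.(\w+)|\[['"](\w+)['"]\])/g;
  return new Set([...source.matchAll(re)].map(m => m[1] ?? m[2]));
}

function envVarsInExample(envExample: string): Set<string> {
  return new Set(
    envExample.split('\n')
      .map(l => l.trim())
      .filter(l => l && !l.startsWith('#'))
      .map(l => l.split('=')[0])
  );
}

const code = "const db = process.env.DATABASE_URL; const port = process.env['PORT'];";
const example = 'DATABASE_URL=postgres://localhost/app\nLEGACY_FLAG=0';
const inCode = envVarsInCode(code);
const documented = envVarsInExample(example);
console.log([...inCode].filter(v => !documented.has(v)));   // undocumented: [ 'PORT' ]
console.log([...documented].filter(v => !inCode.has(v)));   // dead config: [ 'LEGACY_FLAG' ]
```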

### Error Handling Completeness Evals — Deep Dive

**Source doc**: `docs/coding-standards.md` + `docs/api-contracts.md`

Error handling evals verify that documented error patterns are followed and documented error responses are tested.

**Bare catch detection** (per-stack):
- TypeScript/JS: `catch\s*\(\s*\w*\s*\)\s*\{\s*\}` (empty catch block)
- Python: `except:\s*$` or `except Exception:\s*pass`
- Go: `if err != nil \{\s*\}` (swallowed errors)

**Error response test coverage**: For each error code documented in API contracts, search test files for assertions on that status code + endpoint combination. Report: "PUT /api/users/:id documents 422 but no test verifies it."

**False positive mitigation**: Error handling evals are PRESERVED on re-run. Intentionally empty catches (e.g., cleanup code) can use `// eval-exclude: bare-catch` inline comments.
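
The empty-catch check with its inline escape hatch can be sketched as below. The regex extends the TS/JS pattern above to also cover optional catch bindings (`catch {}`); multi-line empty catches require scanning whole-file content rather than single lines:

```typescript
// Sketch: find empty catch blocks, honoring the eval-exclude comment.
const EMPTY_CATCH = /catch\s*(?:\(\s*\w*\s*\))?\s*\{\s*\}/g;

function emptyCatches(file: string, source: string): string[] {
  return [...source.matchAll(EMPTY_CATCH)]
    .filter(m => {
      // Suppress when the enclosing line carries the exclusion comment.
      const at = m.index ?? 0;
      const start = source.lastIndexOf('\n', at) + 1;
      const end = source.indexOf('\n', at);
      const line = source.slice(start, end === -1 ? source.length : end);
      return !line.includes('eval-exclude: bare-catch');
    })
    .map(() => `${file}: empty catch block`);
}

console.log(emptyCatches('src/a.ts', 'try { risky(); } catch (e) {}'));
console.log(emptyCatches('src/b.ts', 'try { tidy(); } catch {} // eval-exclude: bare-catch'));
```

The first call reports one finding; the second reports none because of the exclusion comment.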

### Eval Design Principles — Extended

#### 1. Binary PASS/FAIL, Not Scores

Evals produce pass or fail, not scores. A "compliance score of 87%" invites gaming — teams optimize the score instead of fixing the underlying issues. Binary results force a clear decision: either the project meets the standard or it doesn't.

If a finding is not worth failing the eval, it should be a warning in the output, not a reduced score. If it IS worth failing, it should be a hard failure with a clear remediation path.

This directly mitigates Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure."

#### 2. Every Eval Needs a False-Positive Mitigation Strategy

Before writing an eval, answer: "What legitimate code will this incorrectly flag?" If the answer is "nothing," you haven't thought hard enough. Every pattern check has false positives.

The mitigation strategy is the exclusion mechanism. Without it, teams disable noisy evals entirely — losing the signal along with the noise. With it, the eval stays active and exclusions document institutional knowledge.

#### 3. Prefer Grep Over AST

String and regex matching is:
- **Faster** to write (minutes, not hours)
- **Faster** to run (milliseconds, not seconds)
- **More portable** (the same approach works across languages)
- **Easier to debug** ("this regex matched this line" is obvious; "this AST visitor triggered on this node" is not)

The tradeoff is precision — regex can match patterns inside comments or string literals. In practice, eval-level checks (TODOs, type annotations, import patterns) rarely suffer from this. When they do, a more specific regex usually solves it.

#### 4. Evals Must Be Fast

The entire eval suite should run in seconds, not minutes. If evals are slow, developers won't run them.

**Performance targets:**
- Individual eval file: < 2 seconds
- Full eval suite (`make eval`): < 30 seconds
- File I/O: read files once, share across checks via helpers

**What makes evals slow:**
- Spawning subprocesses per check (shell out to `git` once, not per-commit)
- Reading the same file multiple times (cache file content in a helper)
- Globbing the entire source tree repeatedly (glob once, filter per-eval)

#### 5. One Category of Problem Per Eval

Each eval file checks one category: consistency, structure, adherence, or coverage. Don't mix them. A consistency eval that also checks adherence patterns is harder to maintain, harder to reason about when it fails, and harder to exclude false positives from.

If a check doesn't fit neatly into one category, it probably needs to be split or it doesn't belong in evals at all.

#### 6. Document What Evals Don't Check

The `docs/eval-standards.md` file must explicitly list what evals do NOT verify. This prevents false confidence and clearly delineates eval scope from functional testing, security scanning, and manual review.

### Framework-Specific Patterns

#### vitest / jest (TypeScript/JavaScript)

The most common eval framework for TypeScript projects. Evals are `.test.ts` files in `tests/evals/`.

```typescript
// tests/evals/helpers.ts — shared utilities
import { readFileSync } from 'fs';
import { globSync } from 'glob';
import { execSync } from 'child_process';

export function readFile(path: string): string {
  return readFileSync(path, 'utf-8');
}

export function sourceFiles(pattern = 'src/**/*.{ts,tsx}'): string[] {
  return globSync(pattern).filter(f => !f.includes('.test.'));
}

export function testFiles(pattern = 'src/**/*.test.{ts,tsx}'): string[] {
  return globSync(pattern);
}

export function gitLog(count = 20): string[] {
  try {
    return execSync(`git log --oneline -${count}`)
      .toString().trim().split('\n');
  } catch {
    return []; // no git history (fresh repo)
  }
}

export function markdownFiles(dir = 'docs'): string[] {
  return globSync(`${dir}/**/*.md`);
}
```

**Running evals separately**:
```json
// package.json
{
  "scripts": {
    "eval": "vitest run tests/evals/ --reporter=verbose"
  }
}
```

#### pytest (Python)

Python evals use pytest with `conftest.py` for shared fixtures.

```python
# tests/evals/conftest.py
import pathlib
import subprocess
import pytest

@pytest.fixture(scope="session")
def source_files():
    return list(pathlib.Path("src").rglob("*.py"))

@pytest.fixture(scope="session")
def test_files():
    return list(pathlib.Path("tests").rglob("test_*.py"))

@pytest.fixture(scope="session")
def git_log():
    try:
        result = subprocess.run(
            ["git", "log", "--oneline", "-20"],
            capture_output=True, text=True, check=True
        )
        return result.stdout.strip().split("\n")
    except subprocess.CalledProcessError:
        return []

@pytest.fixture(scope="session")
def doc_files():
    return list(pathlib.Path("docs").rglob("*.md"))
```

```python
# tests/evals/test_adherence.py
import re

def test_no_bare_except(source_files):
    """No bare except: clauses — must catch specific exception types."""
    violations = []
    for f in source_files:
        for i, line in enumerate(f.read_text().splitlines(), 1):
            if re.match(r'\s*except\s*:', line):
                if 'eval-ignore' not in line:
                    violations.append(f"{f}:{i}")
    assert violations == [], f"Bare except: found in: {violations}"

def test_no_import_star(source_files):
    """No wildcard imports — explicit imports only."""
    violations = []
    for f in source_files:
        for i, line in enumerate(f.read_text().splitlines(), 1):
            if re.match(r'from\s+\S+\s+import\s+\*', line):
                violations.append(f"{f}:{i}")
    assert violations == [], f"Wildcard imports found in: {violations}"
```

**Running evals separately**:
```toml
# pyproject.toml
[tool.pytest.ini_options]
markers = ["eval: project eval checks"]

# Or use a separate config
# pytest tests/evals/ -v
```

#### bats (Shell)

Shell project evals use bats-core. Simpler patterns — mostly file existence and grep checks.

```bash
#!/usr/bin/env bats
# tests/evals/consistency.bats

setup() {
  load '../test_helper/common-setup'
}

@test "every Makefile target is documented in CLAUDE.md" {
  # Extract targets from Makefile (lines matching "target:")
  local targets
  targets=$(grep -oE '^[a-zA-Z_-]+:' Makefile | tr -d ':' | sort)

  for target in $targets; do
    # Skip internal targets (prefixed with _)
    [[ "$target" == _* ]] && continue
    run grep -q "\`make $target\`" CLAUDE.md
    [ "$status" -eq 0 ] || fail "Makefile target '$target' not documented in CLAUDE.md"
  done
}

@test "all scripts have shellcheck directive" {
  for script in scripts/*.sh; do
    [ -f "$script" ] || continue
    run head -5 "$script"
    echo "$output" | grep -q 'shellcheck' || \
      fail "$script missing shellcheck directive in first 5 lines"
  done
}

@test "no TODO without task ID" {
  local violations
  violations=$(grep -rn 'TODO\|FIXME\|HACK' scripts/ lib/ --include='*.sh' \
    | grep -v '\[BD-' \
    | grep -v 'eval-ignore' || true)
  [ -z "$violations" ] || fail "Untagged TODOs found:\n$violations"
}
```

#### go test (Go)

Go evals use the standard `testing` package with file I/O helpers.

```go
// tests/evals/helpers_test.go
package evals

import (
    "os"
    "os/exec"
    "path/filepath"
    "strconv"
    "strings"
    "testing"
)

func sourceFiles(t *testing.T) []string {
    t.Helper()
    var files []string
    filepath.Walk(".", func(path string, info os.FileInfo, err error) error {
        if err != nil { return nil }
        if strings.HasSuffix(path, ".go") &&
            !strings.HasSuffix(path, "_test.go") &&
            !strings.Contains(path, "vendor/") {
            files = append(files, path)
        }
        return nil
    })
    return files
}

func gitLog(t *testing.T, count int) []string {
    t.Helper()
    out, err := exec.Command("git", "log", "--oneline",
        "-"+strconv.Itoa(count)).Output()
    if err != nil { return nil }
    return strings.Split(strings.TrimSpace(string(out)), "\n")
}

func readDoc(t *testing.T, path string) string {
    t.Helper()
    data, err := os.ReadFile(path)
    if err != nil {
        t.Skipf("doc not found: %s", path)
    }
    return string(data)
}
```

```go
// tests/evals/adherence_test.go
package evals

import (
    "os"
    "regexp"
    "strings"
    "testing"
)

func TestNoIgnoredErrors(t *testing.T) {
    // Go convention: error returns must be checked, not assigned to _
    pattern := regexp.MustCompile(`\b\w+,\s*_\s*:?=\s*\w+\(`)
    for _, file := range sourceFiles(t) {
        data, _ := os.ReadFile(file)
        lines := strings.Split(string(data), "\n")
        for i, line := range lines {
            if strings.Contains(line, "eval-ignore") { continue }
            if pattern.MatchString(line) {
                t.Errorf("%s:%d: unchecked error return: %s",
                    file, i+1, strings.TrimSpace(line))
            }
        }
    }
}
```

### Common Eval Anti-Patterns

#### 1. Evals That Are Too Specific

**Symptom**: Evals break on every refactor, even when the project is perfectly compliant.

**Example**: An eval that checks for exactly 5 entries in a Key Commands table. Adding a 6th command fails the eval even though the project is more documented, not less.

**Fix**: Check the property (every command has a target), not the count. Check patterns (files exist in the right directories), not exact paths.

#### 2. Evals That Are Too Vague

**Symptom**: Evals always pass, even on projects with clear problems.

**Example**: An eval that checks "at least one test file exists." Every project passes, even one with a single meaningless test.

**Fix**: Be specific about what the eval verifies. "Every module directory has at least one test file" is more useful than "some tests exist."

#### 3. Evals That Test the Framework

**Symptom**: Eval failures reveal framework behavior, not project problems.

**Example**: An eval that verifies `glob('**/*.ts')` returns files. If it returns nothing, the problem is the glob pattern or the working directory, not the project.

**Fix**: Evals should assume the testing framework works correctly. If a helper returns no files, skip the eval rather than failing it (use `test.skip` or `t.Skip()`).

#### 4. Evals That Duplicate Linter Rules

**Symptom**: Evals flag the same issues as ESLint, Ruff, or ShellCheck.

**Example**: An eval that checks for unused variables. ESLint already does this with better precision and better editor integration.

**Fix**: Don't duplicate what linters do. Evals check project-level properties that linters can't: cross-file consistency, doc-code sync, requirement coverage. If a linter rule covers it, defer to the linter.

#### 5. Evals Without Exclusion Mechanisms

**Symptom**: Teams disable entire eval files because they produce too many false positives that can't be individually suppressed.

**Example**: An adherence eval that flags every use of `any` in TypeScript, including legitimate uses in generic utility types, third-party library interfaces, and JSON parsing.

**Fix**: Every adherence eval must support both file-level exclusions (glob patterns) and line-level exclusions (inline comments). Document how to add exclusions in `docs/eval-standards.md`.
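
The two-level exclusion mechanism can be sketched as a small predicate consulted before reporting. The config shape below is illustrative; the real shape belongs in `docs/eval-standards.md`:

```typescript
// Sketch: file-glob plus inline-comment exclusions for an adherence check.
interface Exclusions {
  filePatterns: RegExp[]; // file-level: whole files opted out
  inlineTag: string;      // line-level: single lines opted out
}

function isExcluded(file: string, line: string, ex: Exclusions): boolean {
  return ex.filePatterns.some(p => p.test(file)) || line.includes(ex.inlineTag);
}

// Hypothetical exclusions for a "no any" adherence eval.
const noAnyExclusions: Exclusions = {
  filePatterns: [/\.d\.ts$/, /src\/types\/json\.ts$/],
  inlineTag: 'eval-exclude: no-any',
};

console.log(isExcluded('src/globals.d.ts', 'declare const x: any;', noAnyExclusions)); // true
console.log(isExcluded('src/app.ts', 'const y: any = load();', noAnyExclusions));      // false
```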

#### 6. Coverage Evals With Exact String Matching

**Symptom**: Coverage evals fail because the test uses slightly different wording than the acceptance criterion.

**Example**: AC says "user sees error message" — eval looks for the exact string "user sees error message" in test files. The test actually says `expect(screen.getByText('Invalid email')).toBeVisible()`, which validates the same requirement.

**Fix**: Use keyword extraction, not exact string matching. Extract domain terms from the AC ("error", "message", "invalid", "email") and match on 2+ keyword co-occurrence in test content.
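
The keyword co-occurrence fix can be sketched as below. The stopword list and the two-keyword threshold are illustrative tuning choices, not fixed rules:

```typescript
// Sketch: coverage matching via keyword co-occurrence instead of exact strings.
const STOPWORDS = new Set(['the', 'a', 'an', 'user', 'sees', 'is', 'can', 'when', 'for']);

function keywords(criterion: string): string[] {
  return (criterion.toLowerCase().match(/[a-z]+/g) ?? []).filter(w => !STOPWORDS.has(w));
}

function covers(criterion: string, testContent: string, threshold = 2): boolean {
  const text = testContent.toLowerCase();
  return keywords(criterion).filter(k => text.includes(k)).length >= threshold;
}

const ac = 'user sees error message for invalid email';
const testBody = "expect(screen.getByText('Invalid email')).toBeVisible()";
console.log(covers(ac, testBody)); // true — "invalid" and "email" co-occur
```

The differently-worded test from the example above now counts as covering the AC, because two of its domain terms appear in the test content.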

### The Update/Review Cycle

Evals evolve with the project. The Create Evals prompt handles both fresh creation and updates to existing evals.

#### Fresh Mode

1. Read all project docs to understand standards and conventions
2. Generate all four eval categories from documentation content
3. Generate the helpers file for shared utilities
4. Run `make eval` to verify no false positives on current codebase
5. Fix any failures — they represent either false positives (add exclusions) or real problems (report to user)
6. Create `docs/eval-standards.md` documenting scope and boundaries

#### Update Mode

1. Read existing evals and compare against what current docs would produce
2. Categorize content as ADD, RESTRUCTURE, or PRESERVE
3. **Consistency and structure evals**: Fully regenerated from current docs. These are derived directly from documentation content — regenerating ensures they stay in sync.
4. **Adherence evals**: PRESERVED on re-run. Users customize exclusion patterns over time, and regenerating would lose that institutional knowledge. Only add new adherence checks for newly documented patterns.
5. **Coverage evals**: Regenerated when `docs/plan.md` or `docs/user-stories.md` change. The keyword extraction must reflect current requirements.
6. **Helpers**: Regenerated. Shared utilities should always match current patterns.
7. Run `make eval` to verify updates don't break existing passing evals

#### When to Add vs. Modify Evals

**Add a new eval when**:
- A new document creates new standards to verify (e.g., adding `docs/design-system.md` enables design token adherence checks)
- A new type of requirement appears (e.g., adding API contracts enables endpoint coverage checks)
- A recurring issue is found that evals should catch (log to `tasks/lessons.md`, then add the eval)

**Modify an existing eval when**:
- A false positive pattern is identified (add exclusion)
- The underlying standard changes (update the check to match)
- A check is too broad or too narrow (refine the regex/pattern)

**Delete an eval when**:
- The standard it checks has been removed from documentation
- The check is fully covered by a linter rule that's now configured
- The eval has been disabled for so long that nobody remembers what it checked

#### The Feedback Loop

Eval failures create a feedback loop with `tasks/lessons.md`:

```
1. Eval fails → team investigates
2. If real problem: fix the code, close the finding
3. If false positive: add exclusion, document why
4. If recurring pattern: add to tasks/lessons.md
5. Next eval update: lessons.md patterns become new adherence checks
```

This cycle ensures evals get better over time rather than accumulating noise.

### Eval Severity and Triage

Not all eval failures are equal. Categorize findings by severity to guide response priority.

#### P0: Critical — Missing Coverage for Must-Have Features

Nothing tests a critical path. A Must-have feature from `docs/plan.md` has zero matching test files. An API endpoint has no integration test.

**Response**: Create a task immediately. This is a gap that can reach production.

**Examples**:
- "Feature 'user authentication' has no test files matching any auth-related keywords"
- "POST /api/v1/payments endpoint has no test file"
- "No tests reference any acceptance criteria from US-001 (core user flow)"

#### P1: High — Structure Violations, Consistency Gaps, Untested ACs

The project is out of sync or structurally incorrect. Not an immediate risk, but will compound.

**Response**: Fix in the current iteration. These indicate drift between documentation and implementation.

**Examples**:
- "CLAUDE.md lists `make deploy` but no Makefile target exists"
- "3 files in `src/shared/` are only imported by one module"
- "AC-3 of US-005 (account lockout after 5 failures) has no matching test assertions"
- "Commits from the last 5 PRs don't follow the documented format"

#### P2: Medium — Adherence Pattern Violations

Code doesn't follow a documented convention. May have legitimate exclusions.

**Response**: Review and either fix or add an exclusion with justification. These are the noisiest findings — handle carefully.

**Examples**:
- "14 uses of `any` type found across 8 files"
- "TODO without task ID in `src/features/billing/invoice.ts:42`"
- "Test file `src/features/auth/login.test.ts` mocks the database (docs say don't mock DB in integration tests)"

#### P3: Low — Informational Findings

Style observations, minor inconsistencies, documentation improvements. Not actionable as tasks.

**Response**: Note in the eval report. Fix opportunistically during related work. Don't create tasks for these.

**Examples**:
- "3 documentation files have no cross-references from other docs (possibly orphaned)"
- "Coverage eval matched 'user profile' by file name only, not by test content — confidence is low"
- "Makefile has 2 targets not listed in CLAUDE.md Key Commands, but they start with `_` (internal targets)"
|