@bhargavvc/sdd-cc 1.30.0 → 1.35.0
- package/README.ja-JP.md +144 -110
- package/README.ko-KR.md +143 -107
- package/README.md +183 -112
- package/README.pt-BR.md +90 -52
- package/README.zh-CN.md +141 -101
- package/agents/sdd-advisor-researcher.md +23 -0
- package/agents/sdd-ai-researcher.md +133 -0
- package/agents/sdd-code-fixer.md +516 -0
- package/agents/sdd-code-reviewer.md +355 -0
- package/agents/sdd-codebase-mapper.md +3 -3
- package/agents/sdd-debugger.md +17 -5
- package/agents/sdd-doc-verifier.md +201 -0
- package/agents/sdd-doc-writer.md +602 -0
- package/agents/sdd-domain-researcher.md +153 -0
- package/agents/sdd-eval-auditor.md +164 -0
- package/agents/sdd-eval-planner.md +154 -0
- package/agents/sdd-executor.md +87 -4
- package/agents/sdd-framework-selector.md +160 -0
- package/agents/sdd-intel-updater.md +314 -0
- package/agents/sdd-nyquist-auditor.md +1 -1
- package/agents/sdd-phase-researcher.md +71 -4
- package/agents/sdd-plan-checker.md +100 -6
- package/agents/sdd-planner.md +145 -206
- package/agents/sdd-project-researcher.md +25 -2
- package/agents/sdd-research-synthesizer.md +3 -3
- package/agents/sdd-roadmapper.md +6 -6
- package/agents/sdd-security-auditor.md +128 -0
- package/agents/sdd-ui-auditor.md +43 -3
- package/agents/sdd-ui-checker.md +5 -5
- package/agents/sdd-ui-researcher.md +27 -4
- package/agents/sdd-user-profiler.md +2 -2
- package/agents/sdd-verifier.md +142 -22
- package/bin/install.js +2151 -551
- package/commands/sdd/add-backlog.md +5 -5
- package/commands/sdd/add-tests.md +2 -2
- package/commands/sdd/ai-integration-phase.md +36 -0
- package/commands/sdd/analyze-dependencies.md +34 -0
- package/commands/sdd/audit-fix.md +33 -0
- package/commands/sdd/autonomous.md +7 -2
- package/commands/sdd/cleanup.md +5 -0
- package/commands/sdd/code-review-fix.md +52 -0
- package/commands/sdd/code-review.md +55 -0
- package/commands/sdd/complete-milestone.md +6 -6
- package/commands/sdd/debug.md +22 -9
- package/commands/sdd/discuss-phase.md +7 -2
- package/commands/sdd/do.md +1 -1
- package/commands/sdd/docs-update.md +48 -0
- package/commands/sdd/eval-review.md +32 -0
- package/commands/sdd/execute-phase.md +4 -0
- package/commands/sdd/explore.md +27 -0
- package/commands/sdd/fast.md +2 -2
- package/commands/sdd/from-sdd2.md +45 -0
- package/commands/sdd/help.md +2 -0
- package/commands/sdd/import.md +36 -0
- package/commands/sdd/intel.md +179 -0
- package/commands/sdd/join-discord.md +2 -1
- package/commands/sdd/manager.md +1 -0
- package/commands/sdd/map-codebase.md +3 -3
- package/commands/sdd/new-milestone.md +1 -1
- package/commands/sdd/new-project.md +5 -1
- package/commands/sdd/new-workspace.md +1 -1
- package/commands/sdd/next.md +2 -0
- package/commands/sdd/plan-milestone-gaps.md +2 -2
- package/commands/sdd/plan-phase.md +6 -1
- package/commands/sdd/plant-seed.md +1 -1
- package/commands/sdd/profile-user.md +1 -1
- package/commands/sdd/quick.md +5 -3
- package/commands/sdd/reapply-patches.md +230 -42
- package/commands/sdd/research-phase.md +3 -3
- package/commands/sdd/review-backlog.md +1 -0
- package/commands/sdd/review.md +6 -3
- package/commands/sdd/scan.md +26 -0
- package/commands/sdd/secure-phase.md +35 -0
- package/commands/sdd/ship.md +1 -1
- package/commands/sdd/thread.md +5 -5
- package/commands/sdd/undo.md +34 -0
- package/commands/sdd/verify-work.md +1 -1
- package/commands/sdd/workstreams.md +17 -11
- package/hooks/dist/sdd-check-update.js +33 -8
- package/hooks/dist/sdd-context-monitor.js +17 -8
- package/hooks/dist/sdd-phase-boundary.sh +27 -0
- package/hooks/dist/sdd-prompt-guard.js +1 -0
- package/hooks/dist/sdd-read-guard.js +82 -0
- package/hooks/dist/sdd-session-state.sh +33 -0
- package/hooks/dist/sdd-statusline.js +137 -15
- package/hooks/dist/sdd-validate-commit.sh +47 -0
- package/hooks/dist/sdd-workflow-guard.js +4 -4
- package/hooks/sdd-check-update.js +139 -0
- package/hooks/sdd-context-monitor.js +165 -0
- package/hooks/sdd-phase-boundary.sh +27 -0
- package/hooks/sdd-prompt-guard.js +97 -0
- package/hooks/sdd-read-guard.js +82 -0
- package/hooks/sdd-session-state.sh +33 -0
- package/hooks/sdd-statusline.js +241 -0
- package/hooks/sdd-validate-commit.sh +47 -0
- package/hooks/sdd-workflow-guard.js +94 -0
- package/package.json +3 -3
- package/scripts/build-hooks.js +18 -7
- package/scripts/prompt-injection-scan.sh +1 -0
- package/scripts/rebrand-gsd-to-sdd.sh +221 -220
- package/scripts/run-tests.cjs +5 -1
- package/scripts/sync-upstream.sh +1 -1
- package/sdd/bin/lib/commands.cjs +79 -17
- package/sdd/bin/lib/config.cjs +90 -48
- package/sdd/bin/lib/core.cjs +452 -87
- package/sdd/bin/lib/docs.cjs +267 -0
- package/sdd/bin/lib/frontmatter.cjs +381 -336
- package/sdd/bin/lib/init.cjs +110 -16
- package/sdd/bin/lib/intel.cjs +660 -0
- package/sdd/bin/lib/learnings.cjs +378 -0
- package/sdd/bin/lib/milestone.cjs +42 -11
- package/sdd/bin/lib/model-profiles.cjs +17 -15
- package/sdd/bin/lib/phase.cjs +367 -288
- package/sdd/bin/lib/profile-output.cjs +106 -10
- package/sdd/bin/lib/roadmap.cjs +146 -115
- package/sdd/bin/lib/schema-detect.cjs +238 -0
- package/sdd/bin/lib/sdd2-import.cjs +511 -0
- package/sdd/bin/lib/security.cjs +124 -3
- package/sdd/bin/lib/state.cjs +648 -264
- package/sdd/bin/lib/template.cjs +8 -4
- package/sdd/bin/lib/verify.cjs +209 -28
- package/sdd/bin/lib/workstream.cjs +7 -3
- package/sdd/bin/sdd-tools.cjs +184 -12
- package/sdd/contexts/dev.md +21 -0
- package/sdd/contexts/research.md +22 -0
- package/sdd/contexts/review.md +22 -0
- package/sdd/references/agent-contracts.md +79 -0
- package/sdd/references/ai-evals.md +156 -0
- package/sdd/references/ai-frameworks.md +186 -0
- package/sdd/references/artifact-types.md +113 -0
- package/sdd/references/common-bug-patterns.md +114 -0
- package/sdd/references/context-budget.md +49 -0
- package/sdd/references/continuation-format.md +25 -25
- package/sdd/references/domain-probes.md +125 -0
- package/sdd/references/few-shot-examples/plan-checker.md +73 -0
- package/sdd/references/few-shot-examples/verifier.md +109 -0
- package/sdd/references/gate-prompts.md +100 -0
- package/sdd/references/gates.md +70 -0
- package/sdd/references/git-integration.md +1 -1
- package/sdd/references/ios-scaffold.md +123 -0
- package/sdd/references/model-profile-resolution.md +2 -0
- package/sdd/references/model-profiles.md +24 -18
- package/sdd/references/planner-gap-closure.md +62 -0
- package/sdd/references/planner-reviews.md +39 -0
- package/sdd/references/planner-revision.md +87 -0
- package/sdd/references/planning-config.md +252 -0
- package/sdd/references/revision-loop.md +97 -0
- package/sdd/references/thinking-models-debug.md +44 -0
- package/sdd/references/thinking-models-execution.md +50 -0
- package/sdd/references/thinking-models-planning.md +62 -0
- package/sdd/references/thinking-models-research.md +50 -0
- package/sdd/references/thinking-models-verification.md +55 -0
- package/sdd/references/thinking-partner.md +96 -0
- package/sdd/references/ui-brand.md +4 -4
- package/sdd/references/universal-anti-patterns.md +63 -0
- package/sdd/references/verification-overrides.md +227 -0
- package/sdd/references/workstream-flag.md +56 -3
- package/sdd/templates/AI-SPEC.md +246 -0
- package/sdd/templates/DEBUG.md +1 -1
- package/sdd/templates/SECURITY.md +61 -0
- package/sdd/templates/UAT.md +4 -4
- package/sdd/templates/VALIDATION.md +4 -4
- package/sdd/templates/claude-md.md +32 -9
- package/sdd/templates/config.json +4 -0
- package/sdd/templates/debug-subagent-prompt.md +1 -1
- package/sdd/templates/dev-preferences.md +1 -1
- package/sdd/templates/discovery.md +2 -2
- package/sdd/templates/phase-prompt.md +1 -1
- package/sdd/templates/planner-subagent-prompt.md +3 -3
- package/sdd/templates/project.md +1 -1
- package/sdd/templates/research.md +1 -1
- package/sdd/templates/state.md +2 -2
- package/sdd/workflows/add-phase.md +8 -8
- package/sdd/workflows/add-tests.md +12 -9
- package/sdd/workflows/add-todo.md +5 -3
- package/sdd/workflows/ai-integration-phase.md +284 -0
- package/sdd/workflows/analyze-dependencies.md +96 -0
- package/sdd/workflows/audit-fix.md +157 -0
- package/sdd/workflows/audit-milestone.md +11 -11
- package/sdd/workflows/audit-uat.md +2 -2
- package/sdd/workflows/autonomous.md +195 -27
- package/sdd/workflows/check-todos.md +12 -10
- package/sdd/workflows/cleanup.md +2 -0
- package/sdd/workflows/code-review-fix.md +497 -0
- package/sdd/workflows/code-review.md +515 -0
- package/sdd/workflows/complete-milestone.md +56 -22
- package/sdd/workflows/diagnose-issues.md +10 -3
- package/sdd/workflows/discovery-phase.md +5 -3
- package/sdd/workflows/discuss-phase-assumptions.md +24 -6
- package/sdd/workflows/discuss-phase-power.md +291 -0
- package/sdd/workflows/discuss-phase.md +173 -21
- package/sdd/workflows/do.md +23 -21
- package/sdd/workflows/docs-update.md +1155 -0
- package/sdd/workflows/eval-review.md +155 -0
- package/sdd/workflows/execute-phase.md +594 -38
- package/sdd/workflows/execute-plan.md +67 -96
- package/sdd/workflows/explore.md +139 -0
- package/sdd/workflows/fast.md +5 -5
- package/sdd/workflows/forensics.md +2 -2
- package/sdd/workflows/health.md +4 -4
- package/sdd/workflows/help.md +122 -119
- package/sdd/workflows/import.md +276 -0
- package/sdd/workflows/inbox.md +387 -0
- package/sdd/workflows/insert-phase.md +7 -7
- package/sdd/workflows/list-phase-assumptions.md +4 -4
- package/sdd/workflows/list-workspaces.md +2 -2
- package/sdd/workflows/manager.md +35 -32
- package/sdd/workflows/map-codebase.md +7 -5
- package/sdd/workflows/milestone-summary.md +2 -2
- package/sdd/workflows/new-milestone.md +17 -9
- package/sdd/workflows/new-project.md +50 -25
- package/sdd/workflows/new-workspace.md +7 -5
- package/sdd/workflows/next.md +67 -11
- package/sdd/workflows/note.md +9 -7
- package/sdd/workflows/pause-work.md +75 -12
- package/sdd/workflows/plan-milestone-gaps.md +8 -8
- package/sdd/workflows/plan-phase.md +294 -42
- package/sdd/workflows/plant-seed.md +6 -3
- package/sdd/workflows/pr-branch.md +42 -14
- package/sdd/workflows/profile-user.md +9 -7
- package/sdd/workflows/progress.md +45 -45
- package/sdd/workflows/quick.md +195 -47
- package/sdd/workflows/remove-phase.md +6 -6
- package/sdd/workflows/remove-workspace.md +3 -1
- package/sdd/workflows/research-phase.md +2 -2
- package/sdd/workflows/resume-project.md +12 -12
- package/sdd/workflows/review.md +109 -9
- package/sdd/workflows/scan.md +102 -0
- package/sdd/workflows/secure-phase.md +166 -0
- package/sdd/workflows/session-report.md +2 -2
- package/sdd/workflows/settings.md +38 -12
- package/sdd/workflows/ship.md +21 -9
- package/sdd/workflows/stats.md +1 -1
- package/sdd/workflows/transition.md +23 -23
- package/sdd/workflows/ui-phase.md +15 -7
- package/sdd/workflows/ui-review.md +29 -4
- package/sdd/workflows/undo.md +314 -0
- package/sdd/workflows/update.md +171 -20
- package/sdd/workflows/validate-phase.md +6 -4
- package/sdd/workflows/verify-phase.md +210 -6
- package/sdd/workflows/verify-work.md +83 -9
- package/sdd/commands/sdd/workstreams.md +0 -63
package/sdd/bin/sdd-tools.cjs CHANGED
@@ -70,6 +70,16 @@
  *   audit-uat                            Scan all phases for unresolved UAT/verification items
  *   uat render-checkpoint --file <path>  Render the current UAT checkpoint block
  *
+ * Intel:
+ *   intel query <term>                   Query intel files for a term
+ *   intel status                         Show intel file freshness
+ *   intel update                         Trigger intel refresh (returns agent spawn hint)
+ *   intel diff                           Show changed intel entries since last snapshot
+ *   intel snapshot                       Save current intel state as diff baseline
+ *   intel patch-meta <file>              Update _meta.updated_at in an intel file
+ *   intel validate                       Validate intel file structure
+ *   intel extract-exports <file>         Extract exported symbols from a source file
+ *
  * Scaffolding:
  *   scaffold context --phase <N>         Create CONTEXT.md template
  *   scaffold uat --phase <N>             Create UAT.md template
@@ -93,6 +103,7 @@
  *   verify commits <h1> [h2] ...          Batch verify commit hashes
  *   verify artifacts <plan-file>          Check must_haves.artifacts
  *   verify key-links <plan-file>          Check must_haves.key_links
+ *   verify schema-drift <phase> [--skip]  Detect schema file changes without push
  *
  * Template Fill:
  *   template fill summary --phase N       Create pre-filled SUMMARY.md
@@ -133,6 +144,20 @@
  *   init milestone-op                   All context for milestone operations
  *   init map-codebase                   All context for map-codebase workflow
  *   init progress                       All context for progress workflow
+ *
+ * Documentation:
+ *   docs-init                           Project context for docs-update workflow
+ *
+ * Learnings:
+ *   learnings list                      List all global learnings (JSON)
+ *   learnings query --tag <tag>         Query learnings by tag
+ *   learnings copy                      Copy from current project's LEARNINGS.md
+ *   learnings prune --older-than <dur>  Remove entries older than duration (e.g. 90d)
+ *   learnings delete <id>               Delete a learning by ID
+ *
+ * SDD-2 Migration:
+ *   from-sdd2 [--path <dir>] [--force] [--dry-run]
+ *                                       Import a SDD-2 (.sdd/) project back to SDD v1 (.planning/) format
  */

 const fs = require('fs');
@@ -152,6 +177,8 @@ const frontmatter = require('./lib/frontmatter.cjs');
 const profilePipeline = require('./lib/profile-pipeline.cjs');
 const profileOutput = require('./lib/profile-output.cjs');
 const workstream = require('./lib/workstream.cjs');
+const docs = require('./lib/docs.cjs');
+const learnings = require('./lib/learnings.cjs');

 // ─── Arg parsing helpers ──────────────────────────────────────────────────────

@@ -230,7 +257,7 @@ async function main() {
   }

   // Optional workstream override for parallel milestone work.
-  // Priority: --ws flag > SDD_WORKSTREAM env var >
+  // Priority: --ws flag > SDD_WORKSTREAM env var > session-scoped pointer > shared legacy pointer > null
   const wsEqArg = args.find(arg => arg.startsWith('--ws='));
   const wsIdx = args.indexOf('--ws');
   let ws = null;
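The priority chain in the updated comment reads as a first-non-empty lookup. The following is a minimal sketch of that reading, not the package's implementation: `resolveWorkstream` and the two pointer-reader callbacks are hypothetical names, because the diff shows the order of precedence but not how the session-scoped or legacy pointers are stored.

```javascript
// Sketch only: resolveWorkstream, readSessionPointer and readLegacyPointer are
// hypothetical. The diff shows the priority order, not how pointers are read.
function resolveWorkstream(args, env, readSessionPointer, readLegacyPointer) {
  // 1. --ws=<name> or --ws <name> on the command line
  const eqArg = args.find(arg => arg.startsWith('--ws='));
  const fromEq = eqArg ? eqArg.slice('--ws='.length) : '';
  if (fromEq) return fromEq;
  const idx = args.indexOf('--ws');
  if (idx !== -1 && args[idx + 1]) return args[idx + 1];

  // 2. SDD_WORKSTREAM environment variable
  if (env.SDD_WORKSTREAM) return env.SDD_WORKSTREAM;

  // 3. session-scoped pointer, 4. shared legacy pointer, 5. null
  return readSessionPointer() || readLegacyPointer() || null;
}
```

Passing the pointer readers as callbacks keeps the sketch independent of wherever those pointers actually live.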
@@ -271,10 +298,31 @@ async function main() {
     args.splice(pickIdx, 2);
   }

+  // --default <value>: for config-get, return this value instead of erroring
+  // when the key is absent. Allows workflows to express optional config reads
+  // without defensive `2>/dev/null || true` boilerplate (#1893).
+  const defaultIdx = args.indexOf('--default');
+  let defaultValue = undefined;
+  if (defaultIdx !== -1) {
+    defaultValue = args[defaultIdx + 1];
+    if (defaultValue === undefined) defaultValue = '';
+    args.splice(defaultIdx, 2);
+  }
+
   const command = args[0];

   if (!command) {
-    error('Usage: sdd-tools <command> [args] [--raw] [--pick <field>] [--cwd <path>] [--ws <name>]\nCommands: state, resolve-model, find-phase, commit, verify-summary, verify, frontmatter, template, generate-slug, current-timestamp, list-todos, verify-path-exists, config-ensure-section, config-new-project, init, workstream');
+    error('Usage: sdd-tools <command> [args] [--raw] [--pick <field>] [--cwd <path>] [--ws <name>]\nCommands: state, resolve-model, find-phase, commit, verify-summary, verify, frontmatter, template, generate-slug, current-timestamp, list-todos, verify-path-exists, config-ensure-section, config-new-project, init, workstream, docs-init');
+  }
+
+  // Reject flags that are never valid for any sdd-tools command. AI agents
+  // sometimes hallucinate --help or --version on tool invocations; silently
+  // ignoring them can cause destructive operations to proceed unchecked.
+  const NEVER_VALID_FLAGS = new Set(['-h', '--help', '-?', '--h', '--version', '-v', '--usage']);
+  for (const arg of args) {
+    if (NEVER_VALID_FLAGS.has(arg)) {
+      error(`Unknown flag: ${arg}\nsdd-tools does not accept help or version flags. Run "sdd-tools" with no arguments for usage.`);
+    }
   }

   // Multi-repo guard: resolve project root for commands that read/write .planning/.
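The two argument checks added in this hunk can be exercised in isolation. The sketch below repackages the inline logic as pure functions; `extractDefault` and `findNeverValidFlag` are hypothetical helper names that do not appear in the package, and the config key in the usage line is made up for illustration.

```javascript
// Sketch only: extractDefault and findNeverValidFlag are hypothetical helpers
// that repackage the inline logic from the hunk so it can be tested directly.
const NEVER_VALID_FLAGS = new Set(['-h', '--help', '-?', '--h', '--version', '-v', '--usage']);

function extractDefault(args) {
  const rest = args.slice();              // leave the caller's array untouched
  const idx = rest.indexOf('--default');
  if (idx === -1) return { rest, defaultValue: undefined };
  const value = rest[idx + 1];
  rest.splice(idx, 2);                    // drop the flag and its value
  // A trailing --default with no value falls back to the empty string.
  return { rest, defaultValue: value === undefined ? '' : value };
}

function findNeverValidFlag(args) {
  // Returns the first rejected flag, or null when the arguments are clean.
  return args.find(arg => NEVER_VALID_FLAGS.has(arg)) || null;
}

// Usage ('workflow.research' is an illustrative key, not taken from the diff).
const parsed = extractDefault(['config-get', 'workflow.research', '--default', 'true']);
```

Unlike the original, which splices `args` in place, the sketch copies the array first so the input stays reusable in tests.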
@@ -313,7 +361,7 @@ async function main() {
       }
     };
     try {
-      await runCommand(command, args, cwd, raw);
+      await runCommand(command, args, cwd, raw, defaultValue);
       cleanup();
     } catch (e) {
       fs.writeSync = origWriteSync;
@@ -322,7 +370,27 @@ async function main() {
     return;
   }

-
+  // Intercept stdout to transparently resolve @file: references (#1891).
+  // core.cjs output() writes @file:<path> when JSON > 50KB. The --pick path
+  // already resolves this, but the normal path wrote @file: to stdout, forcing
+  // every workflow to have a bash-specific `if [[ "$INIT" == @file:* ]]` check
+  // that breaks on PowerShell and other non-bash shells.
+  const origWriteSync2 = fs.writeSync;
+  const outChunks = [];
+  fs.writeSync = function (fd, data, ...rest) {
+    if (fd === 1) { outChunks.push(String(data)); return; }
+    return origWriteSync2.call(fs, fd, data, ...rest);
+  };
+  try {
+    await runCommand(command, args, cwd, raw, defaultValue);
+  } finally {
+    fs.writeSync = origWriteSync2;
+  }
+  let captured = outChunks.join('');
+  if (captured.startsWith('@file:')) {
+    captured = fs.readFileSync(captured.slice(6), 'utf-8');
+  }
+  origWriteSync2.call(fs, 1, captured);
 }

 /**
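The capture-then-resolve pattern in this hunk can be tried outside `main()`. The sketch below is a synchronous approximation under stated assumptions: `captureStdout` and `resolveFileRef` are hypothetical helper names, and the temp-file path is invented for the demonstration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Sketch only: captureStdout and resolveFileRef are hypothetical names for the
// two steps the hunk performs inline.
function captureStdout(fn) {
  const orig = fs.writeSync;
  const chunks = [];
  fs.writeSync = function (fd, data, ...rest) {
    if (fd === 1) { chunks.push(String(data)); return; }  // swallow stdout writes
    return orig.call(fs, fd, data, ...rest);               // pass other fds through
  };
  try { fn(); } finally { fs.writeSync = orig; }           // always restore
  return chunks.join('');
}

function resolveFileRef(captured) {
  // An "@file:<path>" pointer is replaced by the contents of that file.
  if (!captured.startsWith('@file:')) return captured;
  return fs.readFileSync(captured.slice('@file:'.length), 'utf-8');
}

// Usage: emit a pointer to a temp file, then resolve it.
const tmp = path.join(os.tmpdir(), `sdd-demo-${process.pid}.json`);
fs.writeFileSync(tmp, JSON.stringify({ ok: true }));
const resolved = resolveFileRef(captureStdout(() => fs.writeSync(1, `@file:${tmp}`)));
fs.unlinkSync(tmp);
```

Restoring `fs.writeSync` in a `finally` block, as the hunk does, keeps a thrown error from leaving stdout silenced.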
@@ -348,7 +416,7 @@ function extractField(obj, fieldPath) {
   return current;
 }

-async function runCommand(command, args, cwd, raw) {
+async function runCommand(command, args, cwd, raw, defaultValue) {
   switch (command) {
     case 'state': {
       const subcommand = args[1];
@@ -394,6 +462,14 @@ async function runCommand(command, args, cwd, raw) {
         state.cmdSignalWaiting(cwd, type, question, options, p, raw);
       } else if (subcommand === 'signal-resume') {
         state.cmdSignalResume(cwd, raw);
+      } else if (subcommand === 'planned-phase') {
+        const { phase: p, name, plans } = parseNamedArgs(args, ['phase', 'name', 'plans']);
+        state.cmdStatePlannedPhase(cwd, p, plans !== null ? parseInt(plans, 10) : null, raw);
+      } else if (subcommand === 'validate') {
+        state.cmdStateValidate(cwd, raw);
+      } else if (subcommand === 'sync') {
+        const { verify } = parseNamedArgs(args, [], ['verify']);
+        state.cmdStateSync(cwd, { verify }, raw);
       } else {
         state.cmdStateLoad(cwd, raw);
       }
@@ -425,6 +501,11 @@ async function runCommand(command, args, cwd, raw) {
       break;
     }

+    case 'check-commit': {
+      commands.cmdCheckCommit(cwd, raw);
+      break;
+    }
+
     case 'commit-to-subrepo': {
       const message = args[1];
       const filesIndex = args.indexOf('--files');
@@ -498,8 +579,11 @@ async function runCommand(command, args, cwd, raw) {
         verify.cmdVerifyArtifacts(cwd, args[2], raw);
       } else if (subcommand === 'key-links') {
         verify.cmdVerifyKeyLinks(cwd, args[2], raw);
+      } else if (subcommand === 'schema-drift') {
+        const skipFlag = args.includes('--skip');
+        verify.cmdVerifySchemaDrift(cwd, args[2], skipFlag, raw);
       } else {
-        error('Unknown verify subcommand. Available: plan-structure, phase-completeness, references, commits, artifacts, key-links');
+        error('Unknown verify subcommand. Available: plan-structure, phase-completeness, references, commits, artifacts, key-links, schema-drift');
       }
       break;
     }
@@ -540,7 +624,7 @@ async function runCommand(command, args, cwd, raw) {
     }

     case 'config-get': {
-      config.cmdConfigGet(cwd, args[1], raw);
+      config.cmdConfigGet(cwd, args[1], raw, defaultValue);
       break;
     }

@@ -570,8 +654,10 @@ async function runCommand(command, args, cwd, raw) {
           includeArchived: args.includes('--include-archived'),
         };
         phase.cmdPhasesList(cwd, options, raw);
+      } else if (subcommand === 'clear') {
+        milestone.cmdPhasesClear(cwd, raw, args.slice(2));
       } else {
-        error('Unknown phases subcommand. Available: list');
+        error('Unknown phases subcommand. Available: list, clear');
       }
       break;
     }
@@ -712,12 +798,16 @@ async function runCommand(command, args, cwd, raw) {
     case 'init': {
       const workflow = args[1];
       switch (workflow) {
-        case 'execute-phase':
-
+        case 'execute-phase': {
+          const { validate: epValidate } = parseNamedArgs(args, [], ['validate']);
+          init.cmdInitExecutePhase(cwd, args[2], raw, { validate: epValidate });
           break;
-
-
+        }
+        case 'plan-phase': {
+          const { validate: ppValidate } = parseNamedArgs(args, [], ['validate']);
+          init.cmdInitPlanPhase(cwd, args[2], raw, { validate: ppValidate });
           break;
+        }
         case 'new-project':
           init.cmdInitNewProject(cwd, raw);
           break;
@@ -910,6 +1000,88 @@ async function runCommand(command, args, cwd, raw) {
       break;
     }

+    // ─── Intel ────────────────────────────────────────────────────────────
+
+    case 'intel': {
+      const intel = require('./lib/intel.cjs');
+      const subcommand = args[1];
+      if (subcommand === 'query') {
+        const term = args[2];
+        if (!term) error('Usage: sdd-tools intel query <term>');
+        const planningDir = path.join(cwd, '.planning');
+        core.output(intel.intelQuery(term, planningDir), raw);
+      } else if (subcommand === 'status') {
+        const planningDir = path.join(cwd, '.planning');
+        core.output(intel.intelStatus(planningDir), raw);
+      } else if (subcommand === 'diff') {
+        const planningDir = path.join(cwd, '.planning');
+        core.output(intel.intelDiff(planningDir), raw);
+      } else if (subcommand === 'snapshot') {
+        const planningDir = path.join(cwd, '.planning');
+        core.output(intel.intelSnapshot(planningDir), raw);
+      } else if (subcommand === 'patch-meta') {
+        const filePath = args[2];
+        if (!filePath) error('Usage: sdd-tools intel patch-meta <file-path>');
+        core.output(intel.intelPatchMeta(path.resolve(cwd, filePath)), raw);
+      } else if (subcommand === 'validate') {
+        const planningDir = path.join(cwd, '.planning');
+        core.output(intel.intelValidate(planningDir), raw);
+      } else if (subcommand === 'extract-exports') {
+        const filePath = args[2];
+        if (!filePath) error('Usage: sdd-tools intel extract-exports <file-path>');
+        core.output(intel.intelExtractExports(path.resolve(cwd, filePath)), raw);
+      } else if (subcommand === 'update') {
+        const planningDir = path.join(cwd, '.planning');
+        core.output(intel.intelUpdate(planningDir), raw);
+      } else {
+        error('Unknown intel subcommand. Available: query, status, update, diff, snapshot, patch-meta, validate, extract-exports');
+      }
+      break;
+    }
+
+    // ─── Documentation ────────────────────────────────────────────────────
+
+    case 'docs-init': {
+      docs.cmdDocsInit(cwd, raw);
+      break;
+    }
+
+    // ─── Learnings ─────────────────────────────────────────────────────────
+
+    case 'learnings': {
+      const subcommand = args[1];
+      if (subcommand === 'list') {
+        learnings.cmdLearningsList(raw);
+      } else if (subcommand === 'query') {
+        const tagIdx = args.indexOf('--tag');
+        const tag = tagIdx !== -1 ? args[tagIdx + 1] : null;
+        if (!tag) error('Usage: sdd-tools learnings query --tag <tag>');
+        learnings.cmdLearningsQuery(tag, raw);
+      } else if (subcommand === 'copy') {
+        learnings.cmdLearningsCopy(cwd, raw);
+      } else if (subcommand === 'prune') {
+        const olderIdx = args.indexOf('--older-than');
+        const olderThan = olderIdx !== -1 ? args[olderIdx + 1] : null;
+        if (!olderThan) error('Usage: sdd-tools learnings prune --older-than <duration>');
+        learnings.cmdLearningsPrune(olderThan, raw);
+      } else if (subcommand === 'delete') {
+        const id = args[2];
+        if (!id) error('Usage: sdd-tools learnings delete <id>');
+        learnings.cmdLearningsDelete(id, raw);
+      } else {
+        error('Unknown learnings subcommand. Available: list, query, copy, prune, delete');
+      }
+      break;
+    }
+
+    // ─── SDD-2 Reverse Migration ───────────────────────────────────────────
+
+    case 'from-sdd2': {
+      const sdd2Import = require('./lib/sdd2-import.cjs');
+      sdd2Import.cmdFromSdd2(args.slice(1), cwd, raw);
+      break;
+    }
+
     default:
       error(`Unknown command: ${command}`);
   }
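The dispatcher forwards the raw `--older-than` string to `learnings.cmdLearningsPrune`, whose body is not part of this hunk. As a rough sketch of what handling a value such as `90d` might involve, assuming a `<number><unit>` format: `parseDuration` and `isOlderThan` are hypothetical names, and every unit other than `d` is an assumption, since the usage text only gives `90d` as an example.

```javascript
// Sketch only: parseDuration and isOlderThan are hypothetical helpers. The diff
// documents "90d" as an example; the h and w units here are assumptions.
const UNIT_MS = { h: 3600e3, d: 86400e3, w: 7 * 86400e3 };

function parseDuration(text) {
  const match = /^(\d+)([hdw])$/.exec(String(text).trim());
  if (!match) return null;                      // unrecognised format
  return Number(match[1]) * UNIT_MS[match[2]];  // duration in milliseconds
}

function isOlderThan(isoTimestamp, duration, now = Date.now()) {
  const ms = parseDuration(duration);
  if (ms === null) throw new Error(`Invalid duration: ${duration}`);
  return now - Date.parse(isoTimestamp) > ms;
}
```

Taking `now` as a parameter keeps the age check deterministic in tests.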
@@ -0,0 +1,21 @@
+# Dev Context Profile
+
+Agent output guidance for dev mode. Loaded when `context: dev` is set in config.json.
+
+## Output Style
+
+- Concise, action-oriented responses
+- Lead with the code change or command, follow with brief rationale
+- Skip preamble — assume the developer has full context
+- Use inline code references (`file:line`) over prose descriptions
+
+## Focus Areas
+
+- Working code that compiles and passes tests
+- Minimal diff — change only what is necessary
+- Flag side effects or breaking changes immediately
+- Surface the next actionable step at the end of every response
+
+## Verbosity
+
+Low. One-liner explanations unless the change is non-obvious. Omit background theory, alternative approaches, and caveats that do not affect the current task.
@@ -0,0 +1,22 @@
+# Research Context Profile
+
+Agent output guidance for research mode. Loaded when `context: research` is set in config.json.
+
+## Output Style
+
+- Verbose, exploratory responses that surface trade-offs and alternatives
+- Present multiple approaches with pros and cons before recommending one
+- Include links, references, and citations where available
+- Use structured headings and bullet lists for scan-ability
+
+## Focus Areas
+
+- Breadth of options — enumerate before narrowing
+- Prior art and ecosystem conventions
+- Risks, edge cases, and failure modes
+- Dependencies and compatibility implications
+- Long-term maintainability of each approach
+
+## Verbosity
+
+High. Explain reasoning, show evidence, and document assumptions. Include background context even if the developer likely knows it — research artifacts are read by future contributors who may not.
@@ -0,0 +1,22 @@
+# Review Context Profile
+
+Agent output guidance for review mode. Loaded when `context: review` is set in config.json.
+
+## Output Style
+
+- Critical, detail-focused responses that prioritize correctness
+- Organize findings by severity: blocking, important, nit
+- Reference specific lines and files for every finding
+- State what is correct as well as what needs change — confirm the good parts
+
+## Focus Areas
+
+- Correctness — logic errors, off-by-ones, missing edge cases
+- Security — input validation, injection vectors, secret exposure
+- Performance — unnecessary allocations, O(n^2) patterns, missing caching
+- Style and consistency — naming, formatting, import order
+- Test coverage — untested branches, missing assertions, flaky patterns
+
+## Verbosity
+
+Medium. Be thorough on findings but terse in explanation. Each issue should be one to three sentences: what is wrong, why it matters, and how to fix it.
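Each of the three profiles above says it is loaded when `context` is set to its name in config.json. The loader itself is not shown in this diff, so the following is only a sketch of one way the lookup could work; `resolveContextProfile`, `KNOWN_CONTEXTS` and the `contextsDir` argument are all hypothetical.

```javascript
const path = require('path');

// Sketch only: resolveContextProfile and the directory layout are assumptions.
// The diff shows three profile files and the `context` key, not the loader.
const KNOWN_CONTEXTS = new Set(['dev', 'research', 'review']);

function resolveContextProfile(config, contextsDir) {
  const name = config && config.context;
  if (!KNOWN_CONTEXTS.has(name)) return null;   // unset or unknown: load nothing
  return path.join(contextsDir, `${name}.md`);
}
```

Returning `null` for an unknown value, rather than throwing, is a design choice of the sketch and not something the diff specifies.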
@@ -0,0 +1,79 @@
|
|
|
1
|
+
# Agent Contracts
|
|
2
|
+
|
|
3
|
+
Completion markers and handoff schemas for all SDD agents. Workflows use these markers to detect agent completion and route accordingly.
|
|
4
|
+
|
|
5
|
+
This doc describes what IS, not what should be. Casing inconsistencies are documented as they appear in agent source files.
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
## Agent Registry
|
|
10
|
+
|
|
11
|
+
| Agent | Role | Completion Markers |
|-------|------|--------------------|
| sdd-planner | Plan creation | `## PLANNING COMPLETE` |
| sdd-executor | Plan execution | `## PLAN COMPLETE`, `## CHECKPOINT REACHED` |
| sdd-phase-researcher | Phase-scoped research | `## RESEARCH COMPLETE`, `## RESEARCH BLOCKED` |
| sdd-project-researcher | Project-wide research | `## RESEARCH COMPLETE`, `## RESEARCH BLOCKED` |
| sdd-plan-checker | Plan validation | `## VERIFICATION PASSED`, `## ISSUES FOUND` |
| sdd-research-synthesizer | Multi-research synthesis | `## SYNTHESIS COMPLETE`, `## SYNTHESIS BLOCKED` |
| sdd-debugger | Debug investigation | `## DEBUG COMPLETE`, `## ROOT CAUSE FOUND`, `## CHECKPOINT REACHED` |
| sdd-roadmapper | Roadmap creation/revision | `## ROADMAP CREATED`, `## ROADMAP REVISED`, `## ROADMAP BLOCKED` |
| sdd-ui-auditor | UI review | `## UI REVIEW COMPLETE` |
| sdd-ui-checker | UI validation | `## ISSUES FOUND` |
| sdd-ui-researcher | UI spec creation | `## UI-SPEC COMPLETE`, `## UI-SPEC BLOCKED` |
| sdd-verifier | Post-execution verification | `## Verification Complete` (title case) |
| sdd-integration-checker | Cross-phase integration check | `## Integration Check Complete` (title case) |
| sdd-nyquist-auditor | Sampling audit | `## PARTIAL`, `## ESCALATE` (non-standard) |
| sdd-security-auditor | Security audit | `## OPEN_THREATS`, `## ESCALATE` (non-standard) |
| sdd-codebase-mapper | Codebase analysis | No marker (writes docs directly) |
| sdd-assumptions-analyzer | Assumption extraction | No marker (returns `## Assumptions` sections) |
| sdd-doc-verifier | Doc validation | No marker (writes JSON to `.planning/tmp/`) |
| sdd-doc-writer | Doc generation | No marker (writes docs directly) |
| sdd-advisor-researcher | Advisory research | No marker (utility agent) |
| sdd-user-profiler | User profiling | No marker (returns JSON in analysis tags) |
| sdd-intel-updater | Codebase intelligence analysis | `## INTEL UPDATE COMPLETE`, `## INTEL UPDATE FAILED` |
|
|
35
|
+
|
|
36
|
+
## Marker Rules
|
|
37
|
+
|
|
38
|
+
1. **ALL-CAPS markers** (e.g., `## PLANNING COMPLETE`) are the standard convention
|
|
39
|
+
2. **Title-case markers** (e.g., `## Verification Complete`) exist in sdd-verifier and sdd-integration-checker; they are intentional and documented as-is, not bugs
|
|
40
|
+
3. **Non-standard markers** (e.g., `## PARTIAL`, `## ESCALATE`) in audit agents indicate partial results requiring orchestrator judgment
|
|
41
|
+
4. **Agents without markers** either write artifacts directly to disk or return structured data (JSON/sections) that the caller parses
|
|
42
|
+
5. Markers must appear as H2 headings (`## `) at the start of a line in the agent's final output
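Rule 5 can be illustrated with a line-anchored regex. This is a minimal sketch, not the matching code the workflows actually use; `find_markers` and the `KNOWN_MARKERS` subset are hypothetical names introduced for this example.

```python
import re

# A few markers copied from the Agent Registry table; the set name is illustrative.
KNOWN_MARKERS = {
    "PLANNING COMPLETE",
    "PLAN COMPLETE",
    "CHECKPOINT REACHED",
    "Verification Complete",  # title case, as it appears in sdd-verifier
}

# An H2 heading: exactly two '#' at the start of a line, one space, then the text.
H2_LINE = re.compile(r"^## (.+?)\s*$", re.MULTILINE)


def find_markers(output: str) -> list[str]:
    """Return known markers that appear as H2 headings, in order of appearance."""
    return [
        m.group(1)
        for m in H2_LINE.finditer(output)
        if m.group(1) in KNOWN_MARKERS
    ]
```

Matching is case-sensitive on purpose, so the title-case and ALL-CAPS variants stay distinct, and a marker mentioned mid-sentence or under an H3 heading is ignored.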
|
|
43
|
+
|
|
44
|
+
## Key Handoff Contracts
|
|
45
|
+
|
|
46
|
+
### Planner -> Executor (via PLAN.md)
|
|
47
|
+
|
|
48
|
+
| Field | Required | Description |
|-------|----------|-------------|
| Frontmatter | Yes | phase, plan, type, wave, depends_on, files_modified, autonomous, requirements |
| `<objective>` | Yes | What the plan achieves |
| `<tasks>` | Yes | Ordered task list with type, files, action, verify, acceptance_criteria |
| `<verification>` | Yes | Overall verification steps |
| `<success_criteria>` | Yes | Measurable completion criteria |
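A contract like this can be checked mechanically before a plan is handed to the executor. The sketch below is illustrative only: it assumes the frontmatter is a leading block delimited by `---` lines with `key: value` entries, and `missing_plan_fields` is a hypothetical helper, not part of the package.

```python
import re

REQUIRED_FRONTMATTER = [
    "phase", "plan", "type", "wave",
    "depends_on", "files_modified", "autonomous", "requirements",
]
REQUIRED_SECTIONS = ["objective", "tasks", "verification", "success_criteria"]


def missing_plan_fields(plan_text: str) -> list[str]:
    """List required frontmatter keys and XML-style sections absent from a PLAN.md."""
    missing = []
    # Assumption: frontmatter is a leading block delimited by '---' lines.
    fm = re.match(r"^---\n(.*?)\n---\n", plan_text, re.DOTALL)
    front = fm.group(1) if fm else ""
    keys = set(re.findall(r"^([A-Za-z_][\w-]*):", front, re.MULTILINE))
    missing += [k for k in REQUIRED_FRONTMATTER if k not in keys]
    # Each section must have both an opening and a closing tag.
    for tag in REQUIRED_SECTIONS:
        if not re.search(rf"<{tag}>.*?</{tag}>", plan_text, re.DOTALL):
            missing.append(f"<{tag}>")
    return missing
```

An empty list means every required field in the table is present; the check says nothing about whether the values themselves are sensible.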
|
|
55
|
+
|
|
56
|
+
### Executor -> Verifier (via SUMMARY.md)
|
|
57
|
+
|
|
58
|
+
| Field | Required | Description |
|-------|----------|-------------|
| Frontmatter | Yes | phase, plan, subsystem, tags, key-files, metrics |
| Commits table | Yes | Per-task commit hashes and descriptions |
| Deviations section | Yes | Auto-fixed issues or "None" |
| Self-Check | Yes | PASSED or FAILED with details |
|
|
64
|
+
|
|
65
|
+
## Workflow Regex Patterns
|
|
66
|
+
|
|
67
|
+
Workflows match these markers to detect agent completion:
|
|
68
|
+
|
|
69
|
+
**plan-phase.md matches:**
|
|
70
|
+
- `## RESEARCH COMPLETE` / `## RESEARCH BLOCKED` (researcher output)
|
|
71
|
+
- `## PLANNING COMPLETE` (planner output)
|
|
72
|
+
- `## CHECKPOINT REACHED` (planner/executor pause)
|
|
73
|
+
- `## VERIFICATION PASSED` / `## ISSUES FOUND` (plan-checker output)
|
|
74
|
+
|
|
75
|
+
**execute-phase.md matches:**
|
|
76
|
+
- `## PHASE COMPLETE` (all plans in phase done)
|
|
77
|
+
- `## Self-Check: FAILED` (summary self-check)
|
|
78
|
+
|
|
79
|
+
> **NOTE:** `## PLAN COMPLETE` is the sdd-executor's completion marker but execute-phase.md does not regex-match it. Instead, it detects executor completion via spot-checks (SUMMARY.md existence, git commit state). This is intentional behavior, not a mismatch.
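The file-based half of such a spot-check might look like the sketch below. It is an assumption-laden illustration: the file is taken to be named exactly `SUMMARY.md` inside a per-plan directory, the git commit check is left out entirely, and `summary_spot_check` is a hypothetical name rather than anything defined by execute-phase.md.

```python
from pathlib import Path


def summary_spot_check(plan_dir: str) -> str:
    """Classify a plan directory as 'missing', 'failed', or 'ok' from SUMMARY.md alone."""
    summary = Path(plan_dir) / "SUMMARY.md"  # assumed file name and location
    if not summary.is_file():
        return "missing"  # the executor has not produced its summary yet
    text = summary.read_text(encoding="utf-8")
    # The self-check marker must occupy a whole line, starting at column 0.
    if any(line.rstrip() == "## Self-Check: FAILED" for line in text.splitlines()):
        return "failed"
    return "ok"
```

A real orchestrator would combine this result with the git commit state mentioned in the note before treating a plan as finished.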
|
|
@@ -0,0 +1,156 @@
|
|
|
1
|
+
# AI Evaluation Reference
|
|
2
|
+
|
|
3
|
+
> Reference used by `sdd-eval-planner` and `sdd-eval-auditor`.
|
|
4
|
+
> Based on "AI Evals for Everyone" course (Reganti & Badam) + industry practice.
|
|
5
|
+
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
## Core Concepts
|
|
9
|
+
|
|
10
|
+
### Why Evals Exist
|
|
11
|
+
AI systems are non-deterministic. Input X does not reliably produce output Y across runs, users, or edge cases. Evals are the continuous process of assessing whether your system's behavior meets expectations under real-world conditions — unit tests and integration tests alone are insufficient.
|
|
12
|
+
|
|
13
|
+
### Model vs. Product Evaluation
|
|
14
|
+
- **Model evals** (MMLU, HumanEval, GSM8K) — measure general capability in standardized conditions. Use as initial filter only.
|
|
15
|
+
- **Product evals** — measure behavior inside your specific system, with your data, your users, your domain rules. This is where 80% of eval effort belongs.
|
|
16
|
+
|
|
17
|
+
### The Three Components of Every Eval
|
|
18
|
+
- **Input** — everything affecting the system: query, history, retrieved docs, system prompt, config
|
|
19
|
+
- **Expected** — what good behavior looks like, defined through rubrics
|
|
20
|
+
- **Actual** — what the system produced, including intermediate steps, tool calls, and reasoning traces
|
|
21
|
+
|
|
22
|
+
### Three Measurement Approaches
|
|
23
|
+
1. **Code-based metrics** — deterministic checks: JSON validation, required disclaimers, performance thresholds, classification flags. Fast, cheap, reliable. Use first.
|
|
24
|
+
2. **LLM judges** — one model evaluates another against a rubric. Powerful for subjective qualities (tone, reasoning, escalation). Requires calibration against human judgment before trusting.
|
|
25
|
+
3. **Human evaluation** — gold standard for nuanced judgment. Doesn't scale. Use for calibration, edge cases, periodic sampling, and high-stakes decisions.
|
|
26
|
+
|
|
27
|
+
The most effective systems combine all three.
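As an illustration of the first approach, a code-based metric can be a few deterministic checks on one response. The sketch below is a minimal example under stated assumptions: the response is expected to be a JSON object, the free-text field is called `answer`, and `check_output` is a hypothetical helper name.

```python
import json


def check_output(raw: str, required_keys: set[str], disclaimer: str) -> dict[str, bool]:
    """Run deterministic checks on one model response and report each result."""
    results = {"valid_json": False, "has_required_keys": False, "has_disclaimer": False}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return results  # nothing else can be checked on unparseable output
    results["valid_json"] = True
    if isinstance(data, dict):
        results["has_required_keys"] = required_keys.issubset(data)
        # 'answer' is an assumed field name for the user-visible text.
        answer = str(data.get("answer", ""))
        results["has_disclaimer"] = disclaimer.lower() in answer.lower()
    return results
```

Because every check is a plain boolean, results can be aggregated across a reference dataset without any model calls.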
|
|
28
|
+
|
|
29
|
+
---
|
|
30
|
+
|
|
31
|
+
## Evaluation Dimensions
|
|
32
|
+
|
|
33
|
+
### Pre-Deployment (Development Phase)
|
|
34
|
+
|
|
35
|
+
| Dimension | What It Measures | When It Matters |
|-----------|-----------------|-----------------|
| **Factual accuracy** | Correctness of claims against ground truth | RAG, knowledge bases, any factual assertions |
| **Context faithfulness** | Response grounded in provided context vs. fabricated | RAG pipelines, document Q&A, retrieval-augmented systems |
| **Hallucination detection** | Plausible but unsupported claims | All generative systems, high-stakes domains |
| **Escalation accuracy** | Correct identification of when human intervention is needed | Customer service, healthcare, financial advisory |
| **Policy compliance** | Adherence to business rules, legal requirements, disclaimers | Regulated industries, enterprise deployments |
| **Tone/style appropriateness** | Match with brand voice, audience expectations, emotional context | Customer-facing systems, content generation |
| **Output structure validity** | Schema compliance, required fields, format correctness | Structured extraction, API integrations, data pipelines |
| **Task completion** | Whether the system accomplished the stated goal | Agentic workflows, multi-step tasks |
| **Tool use correctness** | Correct selection and invocation of tools | Agent systems with tool calls |
| **Safety** | Absence of harmful, biased, or inappropriate outputs | All user-facing systems |
|
|
47
|
+
|
|
48
|
+
### Production Monitoring
|
|
49
|
+
|
|
50
|
+
| Dimension | Monitoring Approach |
|-----------|---------------------|
| **Safety violations** | Online guardrail: real-time, immediate intervention |
| **Compliance failures** | Online guardrail: block or escalate before user sees output |
| **Quality degradation trends** | Offline flywheel: batch analysis of sampled interactions |
| **Emerging failure modes** | Signal-metric divergence: when user behavior signals diverge from metric scores, investigate manually |
| **Cost/latency drift** | Code-based metrics: automated threshold alerts |
|
|
57
|
+
|
|
58
|
+
---
|
|
59
|
+
|
|
60
|
+
## The Guardrail vs. Flywheel Decision
|
|
61
|
+
|
|
62
|
+
Ask: "If this behavior goes wrong, would it be catastrophic for my business?"
|
|
63
|
+
|
|
64
|
+
- **Yes → Guardrail** — run online, real-time, with immediate intervention (block, escalate, hand off). Be selective: guardrails add latency.
|
|
65
|
+
- **No → Flywheel** — run offline as batch analysis feeding system refinements over time.
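The decision reduces to one boolean per check, which can be made explicit in an eval plan. The sketch below is illustrative; `EvalCheck`, `placement`, and `split_checks` are hypothetical names, and the `catastrophic_if_wrong` flag is assumed to be set by a human who has answered the question above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCheck:
    name: str
    catastrophic_if_wrong: bool  # the answer to the question above


def placement(check: EvalCheck) -> str:
    """Return 'guardrail' (online, blocking) or 'flywheel' (offline, batch)."""
    return "guardrail" if check.catastrophic_if_wrong else "flywheel"


def split_checks(checks: list[EvalCheck]) -> dict[str, list[str]]:
    """Group check names by where they should run."""
    plan: dict[str, list[str]] = {"guardrail": [], "flywheel": []}
    for check in checks:
        plan[placement(check)].append(check.name)
    return plan
```

Keeping the guardrail list short is the point of the exercise, since each entry adds latency to every request.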
|
|
66
|
+
|
|
67
|
+
---
|
|
68
|
+
|
|
69
|
+
## Rubric Design
|
|
70
|
+
|
|
71
|
+
Generic metrics are meaningless without context. "Helpfulness" in real estate means summarizing listings clearly. In healthcare it means knowing when *not* to answer.
|
|
72
|
+
|
|
73
|
+
A rubric must define:
|
|
74
|
+
1. The dimension being measured
|
|
75
|
+
2. What scores 1, 3, and 5 on a 5-point scale (or pass/fail criteria)
|
|
76
|
+
3. Domain-specific examples of acceptable vs. unacceptable behavior
|
|
77
|
+
|
|
78
|
+
Without rubrics, LLM judges produce noise rather than signal.
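The three required parts of a rubric map naturally onto a small data structure. The sketch below covers only the 5-point variant (not pass/fail criteria), and the `Rubric` class with its field names is a hypothetical shape, not a format defined by the package.

```python
from dataclasses import dataclass, field


@dataclass
class Rubric:
    dimension: str
    anchors: dict[int, str]  # score -> description; 1, 3, and 5 are required
    acceptable_examples: list[str] = field(default_factory=list)
    unacceptable_examples: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return the reasons this rubric is incomplete (empty list means usable)."""
        issues = []
        if not self.dimension.strip():
            issues.append("dimension is empty")
        for score in (1, 3, 5):
            if not self.anchors.get(score, "").strip():
                issues.append(f"no description for score {score}")
        if not self.acceptable_examples or not self.unacceptable_examples:
            issues.append("needs both acceptable and unacceptable examples")
        return issues
```

Running `problems()` before a rubric is handed to an LLM judge catches the most common gap: anchor scores that were never written down.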
|
|
79
|
+
|
|
80
|
+
---
|
|
81
|
+
|
|
82
|
+
## Reference Dataset Guidelines
|
|
83
|
+
|
|
84
|
+
- Start with **10-20 high-quality examples** — not 200 mediocre ones
|
|
85
|
+
- Cover: critical success scenarios, common user workflows, known edge cases, historical failure modes
|
|
86
|
+
- Have domain experts label the examples (not just engineers)
|
|
87
|
+
- Expand based on what you learn in production — don't build for hypothetical coverage
|
|
88
|
+
|
|
89
|
+
---
|
|
90
|
+
|
|
91
|
+
## Eval Tooling Guide
|
|
92
|
+
|
|
93
|
+
| Tool | Type | Best For | Key Strength |
|------|------|----------|-------------|
| **RAGAS** | Python library | RAG evaluation | Purpose-built metrics: faithfulness, answer relevance, context precision/recall |
| **Langfuse** | Platform (open-source, self-hostable) | All system types | Strong tracing, prompt management, good for teams wanting infrastructure control |
| **LangSmith** | Platform (commercial) | LangChain/LangGraph ecosystems | Tightest integration with LangChain; best if already in that ecosystem |
| **Arize Phoenix** | Platform (open-source + hosted) | RAG + multi-agent tracing | Strong RAG eval + trace visualization; open-source with hosted option |
| **Braintrust** | Platform (commercial) | Model-agnostic evaluation | Dataset and experiment management; good for comparing across frameworks |
| **Promptfoo** | CLI tool (open-source) | Prompt testing, CI/CD | CLI-first, excellent for CI/CD prompt regression testing |
|
|
101
|
+
|
|
102
|
+
### Tool Selection by System Type
|
|
103
|
+
|
|
104
|
+
| System Type | Recommended Tooling |
|-------------|---------------------|
| RAG / Knowledge Q&A | RAGAS + Arize Phoenix or Braintrust |
| Multi-agent systems | Langfuse + Arize Phoenix |
| Conversational / single-model | Promptfoo + Braintrust |
| Structured extraction | Promptfoo + code-based validators |
| LangChain/LangGraph projects | LangSmith (native integration) |
| Production monitoring (all types) | Langfuse, Arize Phoenix, or LangSmith |
|
|
112
|
+
|
|
113
|
+
---
|
|
114
|
+
|
|
115
|
+
## Evals in the Development Lifecycle
|
|
116
|
+
|
|
117
|
+
### Plan Phase (Evaluation-Aware Design)
|
|
118
|
+
Before writing code, define:
|
|
119
|
+
1. What type of AI system is being built → determines framework and dominant eval concerns
|
|
120
|
+
2. Critical failure modes (3-5 behaviors that cannot go wrong)
|
|
121
|
+
3. Rubrics — explicit definitions of acceptable/unacceptable behavior per dimension
|
|
122
|
+
4. Evaluation strategy — which dimensions use code metrics, LLM judges, or human review
|
|
123
|
+
5. Reference dataset requirements — size, composition, labeling approach
|
|
124
|
+
6. Eval tooling selection
|
|
125
|
+
|
|
126
|
+
Output: EVALS-SPEC section of AI-SPEC.md
|
|
127
|
+
|
|
128
|
+
### Execute Phase (Instrument While Building)
|
|
129
|
+
- Add tracing from day one (Langfuse, Arize Phoenix, or LangSmith)
|
|
130
|
+
- Build reference dataset concurrently with implementation
|
|
131
|
+
- Implement code-based checks first; add LLM judges only for subjective dimensions
|
|
132
|
+
- Run evals in CI/CD via Promptfoo or Braintrust
|
|
133
|
+
|
|
134
|
+
### Verify Phase (Pre-Deployment Validation)
|
|
135
|
+
- Run full reference dataset against all metrics
|
|
136
|
+
- Conduct human review of edge cases and LLM judge disagreements
|
|
137
|
+
- Calibrate LLM judges against human scores (target a correlation of at least 0.7 before trusting them)
|
|
138
|
+
- Define and configure production guardrails
|
|
139
|
+
- Establish monitoring baseline
|
|
140
|
+
|
|
141
|
+
### Monitor Phase (Production Evaluation Loop)
|
|
142
|
+
- Smart sampling — weight toward interactions with concerning signals (retries, unusual length, explicit escalations)
|
|
143
|
+
- Online guardrails on every interaction
|
|
144
|
+
- Offline flywheel on sampled batch
|
|
145
|
+
- Watch for signal-metric divergence — the early warning system for evaluation gaps
|
|
146
|
+
|
|
147
|
+
---
|
|
148
|
+
|
|
149
|
+
## Common Pitfalls
|
|
150
|
+
|
|
151
|
+
1. **Assuming benchmarks predict product success** — they don't; model evals are a filter, not a verdict
|
|
152
|
+
2. **Engineering evals in isolation** — domain experts must co-define rubrics; engineers alone miss critical nuances
|
|
153
|
+
3. **Building comprehensive coverage on day one** — start small (10-20 examples), expand from real failure modes
|
|
154
|
+
4. **Trusting uncalibrated LLM judges** — validate against human judgment before relying on them
|
|
155
|
+
5. **Measuring everything** — only track metrics that drive decisions; "collect it all" produces noise
|
|
156
|
+
6. **Treating evaluation as one-time setup** — user behavior evolves, requirements change, failure modes emerge; evaluation is continuous
|