agentic-sdlc-wizard 1.22.0 → 1.24.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +37 -0
- package/CLAUDE_CODE_SDLC_WIZARD.md +139 -27
- package/README.md +2 -0
- package/cli/templates/hooks/instructions-loaded-check.sh +43 -0
- package/cli/templates/hooks/sdlc-prompt-check.sh +1 -1
- package/cli/templates/settings.json +1 -0
- package/cli/templates/skills/sdlc/SKILL.md +260 -148
- package/cli/templates/skills/setup/SKILL.md +10 -0
- package/cli/templates/skills/update/SKILL.md +4 -3
- package/package.json +1 -1
package/CHANGELOG.md
CHANGED
@@ -4,6 +4,43 @@ All notable changes to the SDLC Wizard.
 
 > **Note:** This changelog is for humans to read. Don't manually apply these changes - just run the wizard ("Check for SDLC wizard updates") and it handles everything automatically.
 
+## [1.24.0] - 2026-04-04
+
+### Added
+- Hook `if` conditionals — CC v2.1.85+ `if` field on PreToolUse hook. TDD check only spawns for source files (repo: `.github/workflows/*`, template: `src/**`). Documented in wizard CC features section with matcher-vs-if comparison table (#68)
+- Autocompact tuning guidance — `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE` and `CLAUDE_CODE_AUTO_COMPACT_WINDOW` env vars with community-recommended thresholds (75% for 200K, 30% for 1M). 1M vs 200K context window comparison table. Setup wizard Step 9.5 for context window configuration (#88)
+- 6 hook tests for `if` field (52 total hook tests)
+- 5 autocompact/context tests (70 total self-update tests)
+
+### Fixed
+- E2E tdd_red detection — three bugs since inception: test-only scenarios scored 0 (missing elif branch), golden outputs were .txt not JSON, golden-scores.json encoded the bug it was meant to catch. Codex cross-model review caught regex false-positive + missing JSON pairing (#86)
+- 29 deterministic + 9 regression tests for tdd_red fix
+
+## [1.23.0] - 2026-04-01
+
+### Added
+- Update notification hook — `instructions-loaded-check.sh` checks npm for a newer wizard version each session. Non-blocking, graceful on network failure. One-liner: "SDLC Wizard update available: X → Y (run /update-wizard)" (#64)
+- Cross-model review standardization — mission-first handoff (mission/success/failure fields), preflight self-review doc, verification checklist, adversarial framing, domain template guidance, convergence reduced to 2-3 rounds. Audited 4 repos + 14 external repos + 7 papers (#72, #56)
+- Release Planning Gate — section in SDLC skill. Before implementing release items: list all, plan each at 95% confidence, identify blockers, present plans as a batch. Prove It Gate strengthened with absorption check (#73)
+- 6 quality tests for update notification (fake npm in PATH, version comparison, failure modes)
+- 12 quality tests for cross-model review, context position, release planning
+- Testing Diamond boundary table — explicit E2E (UI/browser ~5%) vs Integration (API/no UI ~90%) vs Unit (pure logic ~5%) in SKILL.md and wizard doc (#65)
+- Skill frontmatter docs — expanded to full table covering `paths:`, `context: fork`, `effort:`, `disable-model-invocation:`, `argument-hint:` (#69)
+- `--bare` mode documentation in SKILL.md — complete wizard-bypass warning for scripted headless calls (#70)
+- 6 quality tests for #65/#69/#70
+- "NEVER AUTO-MERGE" enforcement gate in CI Shepherd section — same weight as "ALL TESTS MUST PASS." Full shepherd sequence documented as mandatory (post-mortem from PR #145 incident)
+- Post-Mortem pattern — when process fails, feed it back: Incident → Root Cause → New Rule → Test → Ship. "Every mistake becomes a rule"
+- 4 quality tests for enforcement gate + post-mortem
+
+### Fixed
+- Dead-code pipe in `test_prove_it_absorption()` — `grep -qi | grep -qi` was a no-op (P1 from PR #145 CI review)
+
+### Changed
+- Moved "ALL TESTS MUST PASS" from 61% depth to 11% depth in SDLC skill (Lost in the Middle fix) (#57)
+- Prove It Gate now requires an absorption check — "can this be a section in an existing skill?" — before proposing new skills/components
+- Wizard "E2E vs Manual Testing" section replaced with "E2E vs Integration — The Critical Boundary" (#65)
+- Wizard "Skill Effort Frontmatter" section expanded to "Skill Frontmatter Fields" with full field reference (#69)
+
 ## [1.22.0] - 2026-04-01
 
 ### Added
package/CLAUDE_CODE_SDLC_WIZARD.md
CHANGED

@@ -307,9 +307,24 @@ New built-in commands available to use alongside the wizard:
 
 **Tip**: `/simplify` pairs well with the self-review phase. Run it after implementation as an additional quality check.
 
-### Skill
+### Skill Frontmatter Fields (v2.1.80+)
 
-Skills
+Skills support these frontmatter fields:
+
+| Field | Purpose | Example |
+|-------|---------|---------|
+| `name` | Skill name (matches `/command`) | `name: sdlc` |
+| `description` | Trigger description for auto-invocation | `description: Full SDLC workflow...` |
+| `effort` | Set reasoning effort level | `effort: high` |
+| `paths` | Restrict skill to specific file patterns | `paths: ["src/**/*.ts", "tests/**"]` |
+| `context` | Context mode (`fork` = isolated subagent) | `context: fork` |
+| `argument-hint` | Hint for `$ARGUMENTS` placeholder | `argument-hint: [task description]` |
+| `disable-model-invocation` | Prevent skill from being auto-invoked by model | `disable-model-invocation: true` |
+
+**Key fields explained:**
+- **`effort: high`** — The wizard's `/sdlc` skill uses this to ensure Claude gives full attention. `max` is available but costs significantly more tokens.
+- **`paths:`** — Limits when a skill activates based on files being worked on. Useful for language-specific or directory-specific skills.
+- **`context: fork`** — Runs the skill in an isolated subagent context. The subagent gets its own context window, so it won't pollute the main conversation. Useful for review skills or analysis that should run independently.
 
 ### InstructionsLoaded Hook (v2.1.69+)
 
@@ -323,6 +338,29 @@ Skills can now reference companion files using `${CLAUDE_SKILL_DIR}`. Useful if
 
 Hook events now include `agent_id` and `agent_type` fields. Hooks can behave differently for subagents vs the main agent if needed.
 
+### Hook `if` Conditionals (v2.1.85+)
+
+The `if` field on individual hook handlers filters by tool name AND arguments using permission rule syntax. The hook process only spawns when the condition matches, avoiding unnecessary process launches.
+
+```json
+{
+  "type": "command",
+  "if": "Write(src/**) Edit(src/**) MultiEdit(src/**)",
+  "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/tdd-pretool-check.sh"
+}
+```
+
+| Field | Level | Matches On | Syntax |
+|-------|-------|------------|--------|
+| `matcher` | Group (all hooks in array) | Tool name only | Regex (`Write\|Edit`) |
+| `if` | Individual handler | Tool name + arguments | Permission rule (`Edit(src/**)`) |
+
+**Pattern examples:** `Edit(*.ts)`, `Write(src/**)`, `Bash(git *)`. Same syntax as `allowedTools` in settings.json.
+
+**Only works on tool-use events:** `PreToolUse`, `PostToolUse`, `PostToolUseFailure`. Adding `if` to non-tool events prevents the hook from running.
+
+**CUSTOMIZE:** Replace `src/**` with your source directory pattern. The wizard generates this based on your project structure detected in Step 0.4.
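
*Editor's note:* assembled into a full `PreToolUse` group, the handler fragment above would sit under a `matcher` roughly like this. This is a sketch based on the fragment and the table above; only `matcher`, `hooks`, `type`, `if`, and `command` are taken from this document, the surrounding nesting is assumed:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Write|Edit|MultiEdit",
        "hooks": [
          {
            "type": "command",
            "if": "Write(src/**) Edit(src/**) MultiEdit(src/**)",
            "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/tdd-pretool-check.sh"
          }
        ]
      }
    ]
  }
}
```

The `matcher` narrows by tool name first (cheap regex at the group level), then `if` narrows by arguments, so an edit to `README.md` never spawns the hook process at all.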
 
 ### Security Hardening (v2.1.49-v2.1.78)
 
 Several fixes that strengthen wizard enforcement:
@@ -475,10 +513,11 @@ Here's the "Testing Diamond" approach (recommended for AI agents):
 
 - **Confidence**: If integration tests pass, production usually works
 - **AI-friendly**: Give Claude concrete pass/fail feedback on real behavior
 
-**E2E vs
-- **E2E (
-- **
-- **
+**E2E vs Integration — The Critical Boundary:**
+- **E2E**: Tests that go through the user's actual UI/browser (Playwright, Cypress). ~5% of suite.
+- **Integration**: Tests that hit real systems via API without UI — real DB, real cache, real services. ~90% of suite.
+- **Unit**: Pure logic only — no DB, no API, no filesystem. ~5% of suite.
+- **The rule**: If your test doesn't open a browser or render a UI, it's not E2E — it's integration. Mislabeling leads to overinvestment in slow browser tests.
 
 **But your team decides:**
 
@@ -677,12 +716,66 @@ Two tools for managing context — use the right one:
 
 - `/clear` after 2+ failed corrections on the same issue (context is polluted with bad approaches — start fresh with a better prompt)
 - After committing a PR, `/clear` before starting the next feature
 
-**Auto-compact** fires automatically at ~95% context capacity.
+**Auto-compact** fires automatically at ~95% context capacity. Claude Code handles this by default — but the default threshold may not be ideal for all use cases (see "Autocompact Tuning" below). The SDLC skill suggests `/compact` during CI idle time as a "context GC" opportunity.
 
 **What survives `/compact`:** Key decisions, code changes, task state (as a summary). What can be lost: detailed early-conversation instructions not in CLAUDE.md, specific file contents read long ago.
 
 **Best practice:** Put persistent instructions in CLAUDE.md (survives both `/compact` and `/clear`), not in conversation.
 
+### Autocompact Tuning
+
+Override the default auto-compact threshold with environment variables. These are community-discovered settings referenced in upstream issues ([#34332](https://github.com/anthropics/claude-code/issues/34332), [#42375](https://github.com/anthropics/claude-code/issues/42375)) — not yet officially documented by Anthropic:
+
+| Variable | What It Does | Default |
+|----------|-------------|---------|
+| `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE` | Trigger compaction at this % of context capacity (1-100) | ~95% |
+| `CLAUDE_CODE_AUTO_COMPACT_WINDOW` | Override context capacity in tokens (useful for 1M models) | Model default |
+
+Set these in your shell profile (`~/.bashrc`, `~/.zshrc`) or per-project `.envrc`:
+
+```bash
+# Example: compact earlier on a 200K model
+export CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=75
+```
+
+**Community-recommended thresholds by use case:**
+
+| Use Case | AUTOCOMPACT % | Why |
+|----------|--------------|-----|
+| General development (200K) | 75% | Leaves room for implementation after planning |
+| Complex refactors (200K) | 80% | Slightly more context before compaction |
+| CI pipelines | 60% | Short tasks, compact early to stay fast |
+| 1M context model | 30% | See "1M vs 200K" below — 95% on 1M wastes budget |
+| Short tasks | 60-70% | Less context needed, compact early |
+
+**Important:** Values above the default ~95% threshold have no effect — you can only trigger compaction *earlier*, not later. Noise (progress ticks, thinking blocks, stale reads) makes up 50-70% of session tokens, so threshold tuning matters less than noise reduction (scoped reads, subagents, `/compact` between phases).
+
+**Note:** These env vars may change as Claude Code evolves. Check [Claude Code settings docs](https://docs.anthropic.com/en/docs/claude-code/settings) for the latest supported configuration.
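
*Editor's note:* to sanity-check a threshold before setting it, the percentage can be converted into the absolute token count at which compaction would fire. A minimal sketch; the helper name is ours, not part of Claude Code:

```shell
# Given a context window (tokens) and a threshold percent, print the
# token count at which auto-compact would fire.
compact_trigger_tokens() {
  echo $(( $1 * $2 / 100 ))
}

compact_trigger_tokens 200000 75    # 200K window at 75%: prints 150000
compact_trigger_tokens 1000000 30   # 1M window at 30%: prints 300000
```
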
+
+### 1M vs 200K Context Window
+
+Claude Code supports both 200K and 1M context windows. Choose based on your task:
+
+| | 200K Context | 1M Context |
+|---|---|---|
+| **Best for** | Normal SDLC cycles (plan → TDD → review) | Multi-feature releases, deep codebase exploration |
+| **Typical usage** | 50-80K tokens per task | 200K+ tokens for complex workflows |
+| **Cost** | Lower total cost per session | ~5x more tokens consumed (cost scales linearly) |
+| **Auto-compact** | Default 95% works well | Reported to fire at ~76K ([issue #34332](https://github.com/anthropics/claude-code/issues/34332)) |
+| **Suggested override** | `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=75` | `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=30` or `CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000` |
+
+**Default to 200K.** Normal SDLC tasks (single feature, bug fix, refactor) rarely exceed 80K tokens. The 200K window handles this with room to spare.
+
+**Switch to 1M when:**
+- Implementing multiple related features in one session
+- Deep research across a large codebase (reading 20+ files)
+- Multi-agent workflows that accumulate context
+- Complex debugging sessions that need full history
+
+**Cost awareness:** No per-token premium since March 2026, but total cost scales linearly with context consumed. A 900K-token session costs ~$4.50 in input alone. Use `/cost` to monitor.
+
+**1M autocompact workaround:** On 1M models, the default auto-compact has been reported to fire too early (~76K, per [issue #34332](https://github.com/anthropics/claude-code/issues/34332)). Community workaround: set `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=30` or `CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000` to use more of the window.
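
*Editor's note:* the "$4.50 for 900K input tokens" figure above implies roughly $5 per million input tokens. A small estimator under that assumed rate (the rate and helper name are ours, not official pricing):

```shell
# Rough input-token cost estimate at an assumed ~$5 per million input
# tokens (the rate implied by "$4.50 for 900K" above).
estimate_input_cost() {
  awk -v t="$1" 'BEGIN { printf "%.2f\n", t / 1000000 * 5 }'
}

estimate_input_cost 900000   # prints 4.50
estimate_input_cost 80000    # typical 200K-window task: prints 0.40
```
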
+
 ---
 
 ## Example Workflow (End-to-End)
@@ -1043,8 +1136,10 @@ Feature branches still recommended for solo devs (keeps main clean, easy rollbac
 
 - **No** → Skip CI shepherd entirely (Claude still runs local tests, just doesn't interact with CI after pushing)
 
 **What the CI shepherd does:**
-1. **CI fix loop:** After pushing, Claude watches CI via `gh pr checks`, reads
-2. **
+1. **CI fix loop:** After pushing, Claude watches CI via `gh pr checks`, reads logs on **pass and fail** (`gh run view <RUN_ID> --log`, not just `--log-failed`), diagnoses and fixes failures, pushes again (max 2 attempts)
+2. **Log review on pass:** Passing CI can still hide warnings, skipped steps, degraded scores, or silent test exclusions. A green checkmark is necessary but not sufficient — always read the logs
+3. **Review feedback loop:** After CI passes and logs look clean, Claude reads automated review comments, implements valid suggestions, pushes and re-reviews (max 3 iterations)
+4. **Pre-release CI audit:** Before cutting any release, review CI runs across ALL PRs merged since last release. Look for warnings in passing runs, degraded scores, skipped suites. Use `gh run list` + `gh run view <ID> --log`
 
 **Recommendation:** Yes if you have CI configured. The shepherd closes the loop between "local tests pass" and "PR is actually ready to merge."
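
*Editor's note:* step 2 above ("read the logs even on green") can be approximated with a small filter. A sketch with illustrative grep patterns, not the wizard's actual list:

```shell
# Count suspicious lines in a CI log even when the run passed.
# Patterns are illustrative; tune for your CI.
flag_log_noise() {
  grep -icE 'warn|skip|deprecat' || true
}

# Usage against a real run (requires gh):
#   RUN_ID=$(gh run list --limit 1 --json databaseId --jq '.[0].databaseId')
#   gh run view "$RUN_ID" --log | flag_log_noise
printf 'ok\nWARNING: flaky\n3 tests skipped\n' | flag_log_noise   # prints 2
```
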
@@ -1111,7 +1206,7 @@ Claude scans for:
 ├── Test frameworks: detected from config files and test patterns
 ├── Lint/format tools: from config files
 ├── CI/CD: .github/workflows/, .gitlab-ci.yml, etc.
-├── Feature docs: *
+├── Feature docs: *_DOCS.md, docs/features/, docs/decisions/
 ├── README, CLAUDE.md, ARCHITECTURE.md
 │
 ├── Deployment targets (for ARCHITECTURE.md environments):
@@ -1528,6 +1623,7 @@ Create `.claude/settings.json`:
     "hooks": [
       {
         "type": "command",
+        "if": "Write(src/**) Edit(src/**) MultiEdit(src/**)",
         "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/tdd-pretool-check.sh"
       }
     ]
@@ -1579,7 +1675,7 @@ The `allowedTools` array is auto-generated based on your stack detected in Step
 | Hook | When It Fires | Purpose |
 |------|---------------|---------|
 | `UserPromptSubmit` | Every message you send | Baseline SDLC reminder, skill auto-invoke |
-| `PreToolUse` | Before Claude edits files | TDD check: "Did you write the test first?" |
+| `PreToolUse` | Before Claude edits files | TDD check: "Did you write the test first?" Uses `if` field to only fire on source files |
 
 ### How Skill Auto-Invoke Works
 
@@ -1630,7 +1726,7 @@ Workflow phases:
 3. Implementation (TDD after compact)
 4. SELF-REVIEW (/code-review) → BEFORE presenting to user
 
-Quick refs: SDLC.md | TESTING.md | *
+Quick refs: SDLC.md | TESTING.md | *_DOCS.md for feature
 EOF
 ```
 
@@ -1713,7 +1809,7 @@ TodoWrite([
   { content: "Present approach + STATE CONFIDENCE LEVEL", status: "pending", activeForm: "Presenting approach" },
   { content: "Signal ready - user exits plan mode", status: "pending", activeForm: "Awaiting plan approval" },
   // TRANSITION PHASE (After plan mode, before compact)
-  { content: "Doc sync: update feature docs
+  { content: "Doc sync: update or create feature docs — MUST be current before commit", status: "pending", activeForm: "Syncing feature docs" },
   { content: "Request /compact before TDD", status: "pending", activeForm: "Requesting compact" },
   // IMPLEMENTATION PHASE (After compact)
   { content: "TDD RED: Write failing test FIRST", status: "pending", activeForm: "Writing failing test" },
@@ -1770,7 +1866,7 @@ TodoWrite([
 
 **Workflow:**
 1. **Plan Mode** (editing blocked): Research → Write plan file → Present approach + confidence
-2. **Transition** (after approval): Doc sync (update feature docs
+2. **Transition** (after approval): Doc sync (update or create feature docs — MUST be current before commit) → Request /compact
 3. **Implementation** (after compact): TDD RED → GREEN → PASS
 
 **Before TDD, MUST ask:** "Docs updated. Run `/compact` before implementation?"
@@ -2270,10 +2366,10 @@ Create `CLAUDE.md` in your project root. This is your project-specific configura
 - Follow conventional commits: `type(scope): description`
 - NEVER commit with failing tests
 
-##
+## Feature Docs
 
-- Before coding a feature: READ its `*
-- After completing work: UPDATE the
+- Before coding a feature: READ its `*_DOCS.md` file
+- After completing work: UPDATE the feature doc (or create one if 3+ files touched)
 
 ## Testing Notes
 
@@ -2401,7 +2497,7 @@ If deployment fails or post-deploy verification catches issues:
 
 **SDLC.md:**
 ```markdown
-<!-- SDLC Wizard Version: 1.
+<!-- SDLC Wizard Version: 1.24.0 -->
 <!-- Setup Date: [DATE] -->
 <!-- Completed Steps: step-0.1, step-0.2, step-0.4, step-1, step-2, step-3, step-4, step-5, step-6, step-7, step-8, step-9 -->
 <!-- Git Workflow: [PRs or Solo] -->
@@ -2725,7 +2821,7 @@ Want me to file these? (yes/no/not now)
 
 | Learning Type | Update Where |
 |---------------|--------------|
-| Feature-specific gotchas, decisions | Feature docs (`*
+| Feature-specific gotchas, decisions | Feature docs (`*_DOCS.md`, e.g., `AUTH_DOCS.md`) |
 | Testing patterns, gotchas | `TESTING.md` |
 | Architecture decisions | `ARCHITECTURE.md` |
 | Commands, general project context | `CLAUDE.md` (or `/revise-claude-md`) |
@@ -2744,14 +2840,16 @@ Want me to file these? (yes/no/not now)
 
 ### Feature Documentation
 
+Feature docs are living documents — the single source of truth for each feature, kept current just like `TESTING.md` and `ARCHITECTURE.md`. Use `*_DOCS.md` as the standard pattern:
 
 | Pattern | When to Use | Example |
 |---------|-------------|---------|
-| `*
+| `*_DOCS.md` | Per-feature living docs (primary) | `AUTH_DOCS.md`, `PAYMENTS_DOCS.md`, `SEARCH_DOCS.md` |
 | `docs/decisions/NNN-title.md` (ADR) | Architecture decisions that need rationale | `docs/decisions/001-use-postgres.md` |
 | `docs/features/name.md` | Feature docs in a `docs/` directory | `docs/features/auth.md` |
+
+**When to create a feature doc:** If a feature touches 3+ files and no `*_DOCS.md` exists, create one. Keep it simple — what the feature does, key decisions, gotchas. The doc grows with the feature over time.
 
 **Feature doc template:**
 
 ```markdown
@@ -2790,13 +2888,16 @@ What are the trade-offs? What becomes easier/harder?
 
 Store ADRs in `docs/decisions/`. Number sequentially. Claude reads these during planning to understand why things are built the way they are.
 
-**Keeping docs in sync with code:**
+**Keeping docs in sync with code (REQUIRED):**
 
-Docs
+Docs MUST be current before commit. Stale docs mislead future sessions, waste tokens, and cause wrong implementations. The SDLC skill enforces this:
 
 - During planning, Claude reads feature docs for the area being changed
-- If the code change contradicts what the doc says
+- If the code change contradicts what the doc says → MUST update the doc
+- If the code change extends documented behavior → MUST add to the doc
+- If a `ROADMAP.md` exists → update it (mark items done, add new items). ROADMAP feeds CHANGELOG — keeping it current means releases write themselves
 - The "After Session" step routes learnings to the right doc
+- Plan files get closed out — if the session's work came from a plan, it gets deleted or marked complete so future sessions aren't misled
 - Stale docs cause low confidence — if Claude struggles, the doc may need updating
 
 **CLAUDE.md health:** Run `/claude-md-improver` periodically (quarterly or after major changes). It audits CLAUDE.md specifically — structure, clarity, completeness (6 criteria, 100-point rubric). It does NOT cover feature docs, TESTING.md, or ADRs — the SDLC workflow handles those.
@@ -3014,7 +3115,7 @@ Use an independent AI model from a different company as a code reviewer. The aut
 **The Protocol:**
 
 1. Create a `.reviews/` directory in your project
-2. After Claude completes its SDLC loop (self-review passes), write a handoff file:
+2. After Claude completes its SDLC loop (self-review passes), write a preflight doc (what you already checked) then a mission-first handoff file:
 
 ```jsonc
 // .reviews/handoff.json
@@ -3022,12 +3123,22 @@ Use an independent AI model from a different company as a code reviewer. The aut
   "review_id": "feature-xyz-001",
   "status": "PENDING_REVIEW",
   "round": 1,
+  "mission": "What changed and why — context for the reviewer",
+  "success": "What 'correctly reviewed' looks like",
+  "failure": "What gets missed if the reviewer is superficial",
   "files_changed": ["src/auth.ts", "tests/auth.test.ts"],
-  "
+  "verification_checklist": [
+    "(a) Verify input validation at auth.ts:45",
+    "(b) Verify test covers null-token edge case"
+  ],
+  "review_instructions": "Focus on security and edge cases. Assume bugs may be present until proven otherwise.",
+  "preflight_path": ".reviews/preflight-feature-xyz-001.md",
   "artifact_path": ".reviews/feature-xyz-001/"
 }
 ```
 
+The `mission/success/failure` fields give the reviewer context. Without them, you get generic "looks good" feedback. With them, reviewers dig into source files and verify specific claims. The `verification_checklist` tells the reviewer exactly what to verify — not "review this" but specific items with file:line references.
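
*Editor's note:* the required fields can be sanity-checked before invoking the reviewer. A sketch of ours, not part of the wizard; the field names come from the handoff.json example above:

```shell
# Check that a handoff file mentions every required field before
# handing it to the independent reviewer.
check_handoff() {
  missing=0
  for field in mission success failure verification_checklist; do
    grep -q "\"$field\"" "$1" || { echo "missing: $field"; missing=1; }
  done
  [ "$missing" -eq 0 ] && echo "handoff OK"
}

cat > /tmp/handoff.json <<'EOF'
{ "mission": "x", "success": "y", "failure": "z", "verification_checklist": ["a"] }
EOF
check_handoff /tmp/handoff.json   # prints "handoff OK"
```
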
 
 3. Run the independent reviewer (Round 1 — full review). These commands use your Codex default model — configure it to the latest, most capable model available:
 
 ```bash
@@ -3131,6 +3242,7 @@ Claude writes code → self-review passes → handoff.json (round 1)
 - `-c 'model_reasoning_effort="xhigh"'` — Maximum reasoning depth. This is where you get the most value. Testing showed `xhigh` caught 3 findings that `high` missed on the same content.
 - `-s danger-full-access` — Full filesystem read/write so the reviewer can read your actual code.
 - `-o .reviews/latest-review.md` — Save the review output for Claude to read back.
+- **Claude Code sandbox bypass required:** Codex's Rust binary needs access to macOS system configuration APIs (`SCDynamicStore`) during initialization. Claude Code's sandbox blocks this, causing `codex exec` to crash with `panicked: Attempted to create a NULL object`. When running from within Claude Code, use `dangerouslyDisableSandbox: true` on the Bash tool call. This only bypasses CC's sandbox for the Codex process — Codex's own sandbox (`-s danger-full-access`) still applies. Known issue: [openai/codex#15640](https://github.com/openai/codex/issues/5914).
 
 **Tool-agnostic principle:** The core idea is "use a different model as an independent reviewer." Codex CLI is the concrete example today, but any competing AI tool that can read files and produce structured feedback works. The value comes from the independence and different training, not the specific tool.
@@ -3302,7 +3414,7 @@ Walk through updates? (y/n)
 Store wizard state in `SDLC.md` as metadata comments (invisible to readers, parseable by Claude):
 
 ```markdown
-<!-- SDLC Wizard Version: 1.
+<!-- SDLC Wizard Version: 1.24.0 -->
 <!-- Setup Date: 2026-01-24 -->
 <!-- Completed Steps: step-0.1, step-0.2, step-1, step-2, step-3, step-4, step-5, step-6, step-7, step-8, step-9 -->
 <!-- Git Workflow: PRs -->
package/README.md
CHANGED
@@ -2,6 +2,8 @@
 
 A **self-evolving Software Development Life Cycle (SDLC) enforcement system for AI coding agents**. Makes Claude plan before coding, test before shipping, and ask when uncertain. Measures itself getting better over time.
 
+**Built on 15+ years of SDET and QA engineering experience** — battle-tested patterns from real production systems, baked into an AI agent that follows tried-and-true software quality practices so you don't have to enforce them manually.
+
 ## Install
 
 **Requires [Claude Code](https://docs.anthropic.com/en/docs/claude-code/overview)** (Anthropic's CLI for Claude).
package/cli/templates/hooks/instructions-loaded-check.sh
CHANGED

@@ -20,4 +20,47 @@ if [ -n "$MISSING" ]; then
   echo "Invoke Skill tool, skill=\"setup-wizard\" to generate them."
 fi
 
+# Version update check (non-blocking, best-effort)
+SDLC_MD="$PROJECT_DIR/SDLC.md"
+if [ -f "$SDLC_MD" ]; then
+  INSTALLED_VERSION=$(grep -o 'SDLC Wizard Version: [0-9.]*' "$SDLC_MD" | head -1 | sed 's/SDLC Wizard Version: //')
+  if [ -n "$INSTALLED_VERSION" ] && command -v npm > /dev/null 2>&1; then
+    LATEST_VERSION=$(npm view agentic-sdlc-wizard version 2>/dev/null) || true
+    if [ -n "$LATEST_VERSION" ] && [ "$LATEST_VERSION" != "$INSTALLED_VERSION" ]; then
+      echo "SDLC Wizard update available: ${INSTALLED_VERSION} → ${LATEST_VERSION} (run /update-wizard)"
+    fi
+  fi
+fi
+
+# Cross-model review staleness check (non-blocking, best-effort)
+if command -v codex > /dev/null 2>&1 && [ -d "$PROJECT_DIR/.reviews" ]; then
+  REVIEW_FILE="$PROJECT_DIR/.reviews/latest-review.md"
+  if [ -f "$REVIEW_FILE" ]; then
+    # Get file modification time (macOS stat -f %m, Linux stat -c %Y)
+    if stat -f %m "$REVIEW_FILE" > /dev/null 2>&1; then
+      REVIEW_MTIME=$(stat -f %m "$REVIEW_FILE")
+    else
+      REVIEW_MTIME=$(stat -c %Y "$REVIEW_FILE" 2>/dev/null || echo "0")
+    fi
+    NOW=$(date +%s)
+    REVIEW_AGE=$(( (NOW - REVIEW_MTIME) / 86400 ))
+    # Count commits since last review
+    COMMITS_SINCE=$(git -C "$PROJECT_DIR" log --oneline --after="@${REVIEW_MTIME}" 2>/dev/null | wc -l | tr -d ' ') || true
+    if [ "$REVIEW_AGE" -gt 3 ] && [ "${COMMITS_SINCE:-0}" -gt 5 ]; then
+      echo "WARNING: ${COMMITS_SINCE} commits over ${REVIEW_AGE}d since last cross-model review — reviews may not be running. Verify: codex exec \"echo test\""
+    fi
+  fi
+fi
+
+# Claude Code version check (non-blocking, best-effort)
+if command -v claude > /dev/null 2>&1 && command -v npm > /dev/null 2>&1; then
+  CC_LOCAL=$(claude --version 2>/dev/null | grep -o '[0-9][0-9.]*' | head -1) || true
+  if [ -n "$CC_LOCAL" ]; then
+    CC_LATEST=$(npm view @anthropic-ai/claude-code version 2>/dev/null) || true
+    if [ -n "$CC_LATEST" ] && [ "$CC_LATEST" != "$CC_LOCAL" ]; then
+      echo "Claude Code update available: ${CC_LOCAL} → ${CC_LATEST} (run: npm install -g @anthropic-ai/claude-code)"
+    fi
+  fi
+fi
+
 exit 0
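
*Editor's note:* the template's `!=` comparison also fires when the local version is *newer* than the registry (e.g. working from a pre-release). An optional direction-aware refinement, not part of the template, using `sort -V`:

```shell
# True only when the second (registry) version sorts strictly higher
# than the first (installed) version.
is_newer() {
  [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$2" ]
}

if is_newer "1.22.0" "1.24.0"; then
  echo "update available"   # registry is strictly newer
fi
```
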
package/cli/templates/hooks/sdlc-prompt-check.sh
CHANGED

@@ -24,7 +24,7 @@ SDLC BASELINE:
 5. ALL TESTS MUST PASS BEFORE COMMIT - NO EXCEPTIONS
 
 AUTO-INVOKE SKILL (Claude MUST do this FIRST):
-- implement/fix/refactor/feature/bug/build/test/TDD → Invoke: Skill tool, skill="sdlc"
+- implement/fix/refactor/feature/bug/build/test/TDD/release/publish/deploy → Invoke: Skill tool, skill="sdlc"
 - DON'T invoke for: questions, explanations, reading/exploring code, simple queries
 - DON'T wait for user to type /sdlc - AUTO-INVOKE based on task type
 
@@ -1,6 +1,6 @@
 ---
 name: sdlc
-description: Full SDLC workflow for implementing features, fixing bugs, refactoring code, testing,
+description: Full SDLC workflow for implementing features, fixing bugs, refactoring code, testing, releasing, publishing, and deploying. Use this skill when implementing, fixing, refactoring, testing, adding features, building new code, or releasing/publishing/deploying.
 argument-hint: [task description]
 effort: high
 ---
@@ -27,7 +27,7 @@ TodoWrite([
   { content: "Present approach + STATE CONFIDENCE LEVEL", status: "pending", activeForm: "Presenting approach" },
   { content: "Signal ready - user exits plan mode", status: "pending", activeForm: "Awaiting plan approval" },
   // TRANSITION PHASE (After plan mode)
-  { content: "Doc sync: update feature docs
+  { content: "Doc sync: update or create feature docs — MUST be current before commit", status: "pending", activeForm: "Syncing feature docs" },
   // IMPLEMENTATION PHASE
   { content: "TDD RED: Write failing test FIRST", status: "pending", activeForm: "Writing failing test" },
   { content: "TDD GREEN: Implement, verify test passes", status: "pending", activeForm: "Implementing feature" },
@@ -48,7 +48,8 @@ TodoWrite([
   { content: "Post-deploy verification (if deploy task — see Deployment Tasks)", status: "pending", activeForm: "Verifying deployment" },
   // FINAL
   { content: "Present summary: changes, tests, CI status", status: "pending", activeForm: "Presenting final summary" },
-  { content: "Capture learnings (if any — update TESTING.md, CLAUDE.md, or feature docs)", status: "pending", activeForm: "Capturing session learnings" }
+  { content: "Capture learnings (if any — update TESTING.md, CLAUDE.md, or feature docs)", status: "pending", activeForm: "Capturing session learnings" },
+  { content: "Close out plan files: if task came from a plan, mark complete or delete", status: "pending", activeForm: "Closing plan artifacts" }
 ])
 ```
 
@@ -72,6 +73,42 @@ Your work is scored on these criteria. **Critical** criteria are must-pass.
 
 Critical miss on `tdd_red` or `self_review` = process failure regardless of total score.
 
+## Test Failure Recovery (SDET Philosophy)
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│ ALL TESTS MUST PASS. NO EXCEPTIONS.                                 │
+│                                                                     │
+│ This is not negotiable. This is not flexible. This is absolute.     │
+└─────────────────────────────────────────────────────────────────────┘
+```
+
+**Not acceptable:**
+- "Those were already failing" → Fix them first
+- "Not related to my changes" → Doesn't matter, fix it
+- "It's flaky" → Flaky = bug, investigate
+
+**Treat test code like app code.** Test failures are bugs. Investigate them the way a 15-year SDET would - with thought and care, not by brushing them aside.
+
+If tests fail:
+1. Identify which test(s) failed
+2. Diagnose WHY - this is the important part:
+   - Your code broke it? Fix your code (regression)
+   - Test is for deleted code? Delete the test
+   - Test has wrong assertions? Fix the test
+   - Test is "flaky"? Investigate - flakiness is just another word for bug
+3. Fix appropriately (fix code, fix test, or delete dead test)
+4. Run specific test individually first
+5. Then run ALL tests
+6. Still failing? ASK USER - don't spin your wheels
+
+**Flaky tests are bugs, not mysteries:**
+- Sometimes the bug is in app code (race condition, timing issue)
+- Sometimes the bug is in test code (shared state, not parallel-safe)
+- Sometimes the bug is in the test environment (improper cleanup)
+
+Debug it. Find root cause. Fix it properly. Tests ARE code.
+
 ## New Pattern & Test Design Scrutiny (PLANNING)
 
 **New design patterns require human approval:**
@@ -89,11 +126,12 @@ Critical miss on `tdd_red` or `self_review` = process failure regardless of tota
 
 **Adding a new skill, hook, workflow, or component? PROVE IT FIRST:**
 
-1. **
-2. **
-3. **If
-4. **
-5. **
+1. **Absorption check:** Can this be added as a section in an existing skill instead of a new component? Default is YES — new skills/hooks need strong justification. Releasing is SDLC, not a separate skill. Debugging is SDLC, not a separate skill. Keep it lean
+2. **Research:** Does something equivalent already exist (native CC, third-party plugin, existing skill)?
+3. **If YES:** Why is yours better? Show evidence (A/B test, quality comparison, gap analysis)
+4. **If NO:** What gap does this fill? Is the gap real or theoretical?
+5. **Quality tests:** New additions MUST have tests that prove OUTPUT QUALITY, not just existence
+6. **Less is more:** Every addition is maintenance burden. Default answer is NO unless proven YES
 
 **Existence tests are NOT quality tests:**
 - BAD: "ci-analyzer skill file exists" — proves nothing about quality
@@ -131,9 +169,9 @@ Before presenting approach, STATE your confidence:
 |-------|---------|--------|--------|
 | HIGH (90%+) | Know exactly what to do | Present approach, proceed after approval | `high` (default) |
 | MEDIUM (60-89%) | Solid approach, some uncertainty | Present approach, highlight uncertainties | `high` (default) |
-| LOW (<60%) | Not sure | ASK USER
-| FAILED 2x | Something's wrong |
-| CONFUSED | Can't diagnose why something is failing | STOP. Describe what you tried, ask for help | Try `/effort max` |
+| LOW (<60%) | Not sure | Do more research or try cross-model research (Codex) to get to 95%. If still LOW after research, ASK USER | Consider `/effort max` |
+| FAILED 2x | Something's wrong | Try cross-model research (Codex) for a fresh perspective. If still stuck, STOP and ASK USER | Try `/effort max` |
+| CONFUSED | Can't diagnose why something is failing | Try cross-model research (Codex). If still confused, STOP. Describe what you tried, ask for help | Try `/effort max` |
 
 ## Self-Review Loop (CRITICAL)
 
@@ -166,36 +204,78 @@ PLANNING -> DOCS -> TDD RED -> TDD GREEN -> Tests Pass -> Self-Review
 
 **Prerequisites:** Codex CLI installed (`npm i -g @openai/codex`), OpenAI API key set.
 
-
+**The core insight:** The review PROTOCOL is universal across domains. Only the review INSTRUCTIONS change. Code review is the default template below. For non-code domains (research, persuasion, medical content), adapt the `review_instructions` and `verification_checklist` fields while keeping the same handoff/dialogue/convergence loop.
 
-
-```jsonc
-{
-  "review_id": "feature-xyz-001",
-  "status": "PENDING_REVIEW",
-  "round": 1,
-  "files_changed": ["src/auth.ts", "tests/auth.test.ts"],
-  "review_instructions": "Review for security, edge cases, and correctness",
-  "artifact_path": ".reviews/feature-xyz-001/"
-}
-```
-2. Run the independent reviewer:
-```bash
-codex exec \
-  -c 'model_reasoning_effort="xhigh"' \
-  -s danger-full-access \
-  -o .reviews/latest-review.md \
-  "You are an independent code reviewer. Read .reviews/handoff.json, \
-  review the listed files. Output each finding with: an ID (1, 2, ...), \
-  severity (P0/P1/P2), description, and a 'certify condition' stating \
-  what specific change would resolve it. \
-  End with CERTIFIED or NOT CERTIFIED."
-```
-3. If CERTIFIED → proceed to CI. If NOT CERTIFIED → go to Round 2.
+### Step 0: Write Preflight Self-Review Doc
 
-
+Before submitting to an external reviewer, document what YOU already checked. This is proven to reduce reviewer findings to 0-1 per round (evidence: anticheat repo preflight discipline).
 
-
+Write `.reviews/preflight-{review_id}.md`:
+```markdown
+## Preflight Self-Review: {feature}
+- [ ] Self-review via /code-review passed
+- [ ] All tests passing
+- [ ] Checked for: [specific concerns for this change]
+- [ ] Verified: [what you manually confirmed]
+- [ ] Known limitations: [what you couldn't verify]
+```
+
+### Step 1: Write Mission-First Handoff
+
+After self-review and preflight pass, write `.reviews/handoff.json`:
+```jsonc
+{
+  "review_id": "feature-xyz-001",
+  "status": "PENDING_REVIEW",
+  "round": 1,
+  "mission": "What changed and why — 2-3 sentences of context",
+  "success": "What 'correctly reviewed' looks like — the reviewer's goal",
+  "failure": "What gets missed if the reviewer is superficial",
+  "files_changed": ["src/auth.ts", "tests/auth.test.ts"],
+  "fixes_applied": [],
+  "previous_score": null,
+  "verification_checklist": [
+    "(a) Verify input validation at auth.ts:45 handles empty strings",
+    "(b) Verify test covers the null-token edge case",
+    "(c) Check no hardcoded secrets in diff"
+  ],
+  "review_instructions": "Focus on security and edge cases. Be strict — assume bugs may be present until proven otherwise.",
+  "preflight_path": ".reviews/preflight-feature-xyz-001.md",
+  "artifact_path": ".reviews/feature-xyz-001/"
+}
+```
+
+**Key fields explained:**
+- `mission/success/failure` — Gives the reviewer context. Without this, you get generic "looks good" feedback. With it, reviewers read raw source files and verify specific claims (proven across 4 repos)
+- `verification_checklist` — Specific things to verify with file:line references. NOT "review for correctness" — that's too vague. Each item is independently verifiable
+- `preflight_path` — Shows the reviewer what you already checked, so they focus on what you might have missed
+
+### Step 2: Run the Independent Reviewer
+
+```bash
+codex exec \
+  -c 'model_reasoning_effort="xhigh"' \
+  -s danger-full-access \
+  -o .reviews/latest-review.md \
+  "You are an independent code reviewer performing a certification audit. \
+  Read .reviews/handoff.json for full context — mission, success/failure \
+  conditions, and verification checklist. \
+  Verify each checklist item with evidence (file:line, grep results, test output). \
+  Output each finding with: ID (1, 2, ...), severity (P0/P1/P2), evidence, \
+  and a 'certify condition' (what specific change resolves it). \
+  Re-verify any prior-round passes still hold. \
+  End with: score (1-10), CERTIFIED or NOT CERTIFIED."
+```
+
+**Always use `xhigh` reasoning effort.** Lower settings miss subtle errors (wrong-generation references, stale pricing, cross-file inconsistencies).
+
+**Sandbox note:** Codex's Rust binary requires access to macOS system configuration APIs (`SCDynamicStore`) during sandbox initialization. Claude Code's sandbox blocks this access, causing `codex exec` to crash with `system-configuration panicked: Attempted to create a NULL object`. When running `codex exec` from within Claude Code, you MUST use `dangerouslyDisableSandbox: true` on the Bash tool call. This is safe — Codex has its own sandbox (`-s danger-full-access` is already specified), and the CC sandbox bypass only affects the Codex process. This is a known Codex issue ([#15640](https://github.com/openai/codex/issues/15640)).
+
+If CERTIFIED → proceed to CI. If NOT CERTIFIED → go to dialogue loop.
+
+### Step 3: Dialogue Loop
+
+Respond per-finding — don't silently fix everything:
 
 1. Write `.reviews/response.json`:
 ```jsonc
@@ -205,16 +285,16 @@ When the reviewer finds issues, respond per-finding instead of silently fixing e
   "responding_to": ".reviews/latest-review.md",
   "responses": [
     { "finding": "1", "action": "FIXED", "summary": "Added missing validation" },
-    { "finding": "2", "action": "DISPUTED", "justification": "
+    { "finding": "2", "action": "DISPUTED", "justification": "Intentional — see CODE_REVIEW_EXCEPTIONS.md" },
     { "finding": "3", "action": "ACCEPTED", "summary": "Will add test coverage" }
   ]
 }
 ```
-- **FIXED**: "I fixed this. Here
-- **DISPUTED**: "This is intentional/incorrect. Here
-- **ACCEPTED**: "You
+- **FIXED**: "I fixed this. Here's what changed." Reviewer verifies against certify condition.
+- **DISPUTED**: "This is intentional/incorrect. Here's why." Reviewer accepts or rejects with reasoning.
+- **ACCEPTED**: "You're right. Fixing now." (Same as FIXED, batched.)
 
-2. Update `handoff.json`
+2. Update `handoff.json`: increment `round`, set `"status": "PENDING_RECHECK"`, add `fixes_applied` list with numbered items and file:line references, update `previous_score`.
 
 3. Run targeted recheck (NOT a full re-review):
 ```bash
@@ -222,86 +302,88 @@ When the reviewer finds issues, respond per-finding instead of silently fixing e
 -c 'model_reasoning_effort="xhigh"' \
 -s danger-full-access \
 -o .reviews/latest-review.md \
-"
-
-
-
-FIXED → verify the fix against the original certify condition. \
-DISPUTED → evaluate the justification (ACCEPT if sound, REJECT if not). \
+"TARGETED RECHECK — not a full re-review. Read .reviews/handoff.json \
+for previous_review path and response.json for the author's responses. \
+For each finding: FIXED → verify against original certify condition. \
+DISPUTED → evaluate justification (ACCEPT if sound, REJECT with reasoning). \
 ACCEPTED → verify it was applied. \
 Do NOT raise new findings unless P0 (critical/security). \
 New observations go in 'Notes for next review' (non-blocking). \
-
+Re-verify all prior passes still hold. \
+End with: score (1-10), CERTIFIED or NOT CERTIFIED."
 ```
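The round-bump bookkeeping in step 2 can be done mechanically with `jq`. A sketch against a minimal stand-in file; the field values here are hypothetical, and the real file follows the Step 1 handoff schema:

```shell
mkdir -p .reviews
# Minimal stand-in for .reviews/handoff.json (illustrative values only).
printf '{"round":1,"status":"PENDING_REVIEW","fixes_applied":[]}\n' > .reviews/handoff.json

# Increment the round, flag the handoff for a targeted recheck,
# and record one applied fix with a file:line reference.
jq '.round += 1
    | .status = "PENDING_RECHECK"
    | .fixes_applied += ["1: added input validation (src/auth.ts:45)"]' \
  .reviews/handoff.json > .reviews/handoff.tmp \
  && mv .reviews/handoff.tmp .reviews/handoff.json

jq -r '.status' .reviews/handoff.json   # prints PENDING_RECHECK
```

Note `jq` requires plain JSON, so if the working file keeps `// comments` from the jsonc template, strip them first or edit by hand.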
 
-4. If CERTIFIED → done. If NOT CERTIFIED (rejected disputes or failed fixes) → fix rejected items and repeat.
-
 ### Convergence
 
-
+**2 rounds is the sweet spot. 3 max.** Research across 14 repos and 7 papers confirms additional rounds beyond 3 produce <5% position shift.
+
+Max 2 recheck rounds (3 total including initial review). If still NOT CERTIFIED after round 3, escalate to the user with a summary of open findings.
 
 ```
-
-
-
-
-
-
-
-
-
-
-
-
-
-Reviewer: TARGETED RECHECK (previous findings only)
-    |
-All resolved? → YES → CERTIFIED
-    |
-    NO → fix rejected items, repeat
-    (max 3 rechecks, then escalate to user)
+Preflight → handoff.json (round 1) → FULL REVIEW
+    |
+CERTIFIED? → YES → CI
+    |
+    NO (scored findings)
+    |
+response.json (FIXED/DISPUTED/ACCEPTED)
+    |
+handoff.json (round 2+) → TARGETED RECHECK
+    |
+CERTIFIED? → YES → CI
+    |
+    NO → one more round, then escalate
 ```
 
 **Tool-agnostic:** The value is adversarial diversity (different model, different blind spots), not the specific tool. Any competing AI reviewer works.
 
-
+### Anti-Patterns to Avoid
+
+- **"Find at least N problems"** — Incentivizes false positives. Use adversarial framing ("assume bugs may be present") instead
+- **"Review this"** — Too vague, gets generic feedback. Use mission + verification checklist
+- **Numeric 1-10 scales without criteria** — Unreliable. Decompose into specific checklist items
+- **Letting reviewer see author's reasoning** — Causes anchoring bias. Let them form independent opinion from code
 
 ### Release Review Focus
 
-Before any release/publish, add these to `
+Before any release/publish, add these to `verification_checklist`:
 - **CHANGELOG consistency** — all sections present, no lost entries during consolidation
 - **Version parity** — package.json, SDLC.md, CHANGELOG, wizard metadata all match
 - **Stale examples** — hardcoded version strings in docs match current release
 - **Docs accuracy** — README, ARCHITECTURE.md reflect current feature set
 - **CLI-distributed file parity** — live skills, hooks, settings match CLI templates
 
-Evidence: v1.20.0 cross-model review caught CHANGELOG section loss and stale wizard version examples that passed all tests and self-review. Tests catch version mismatches; cross-model review catches semantic issues tests cannot.
-
 ### Multiple Reviewers (N-Reviewer Pipeline)
 
 When multiple reviewers comment on a PR (Claude PR review, Codex, human reviewers), address each reviewer independently:
 
 1. **Read all reviews** — `gh api repos/OWNER/REPO/pulls/PR/comments` to get every reviewer's feedback
 2. **Respond per-reviewer** — Each reviewer has different blind spots and priorities. Address each one's findings separately
-3. **Resolve conflicts** — If reviewers disagree,
-4. **Iterate until all approve** — Don't merge until every active reviewer is satisfied
-5. **Max 3 iterations per reviewer** — If a reviewer keeps finding new things
+3. **Resolve conflicts** — If reviewers disagree, pick the stronger argument, note why
+4. **Iterate until all approve** — Don't merge until every active reviewer is satisfied
+5. **Max 3 iterations per reviewer** — If a reviewer keeps finding new things, escalate to the user
 
-
+### Adapting for Non-Code Domains
 
-
+The handoff format and dialogue loop work for ANY domain. Only `review_instructions` and `verification_checklist` change:
+
+| Domain | Instructions Focus | Checklist Example |
+|--------|-------------------|-------------------|
+| **Code (default)** | Security, logic bugs, test coverage | "Verify input validation at file:line" |
+| **Research/Docs** | Factual accuracy, source verification, overclaims | "Verify $736-$804 appears in both docs, no stale $695-$723 remains" |
+| **Persuasion** | Audience psychology, tone, trust | "If you were [audience], what's the moment you'd stop reading?" |
 
-
+For non-code: add `"audience"` and `"stakes"` fields to handoff.json. For code, these are implied (audience = other developers, stakes = production impact).
 
-
-
-
+### Custom Subagents (`.claude/agents/`)
+
+Claude Code supports custom subagents in `.claude/agents/`:
 
-
--
--
+- **`sdlc-reviewer`** — SDLC compliance review (planning, TDD, self-review checks)
+- **`ci-debug`** — CI failure diagnosis (reads logs, identifies root cause, suggests fix)
+- **`test-writer`** — Quality tests following TESTING.md philosophies
 
-
+**Skills** guide Claude's behavior. **Agents** run autonomously and return results. Use agents for parallel work or fresh context windows.
 
 ## Test Review (Harder Than Implementation)
 
@@ -315,6 +397,16 @@ During self-review, critique tests HARDER than app code:
 
 **Tests are the foundation.** Bad tests = false confidence = production bugs.
 
+### Testing Diamond — Know Your Layers
+
+| Layer | What It Tests | % of Suite | Key Trait |
+|-------|--------------|------------|-----------|
+| **E2E** | Full user flow through UI/browser (Playwright, Cypress) | ~5% | Slow, brittle, but proves the real thing works |
+| **Integration** | Real systems via API without UI — real DB, real cache, real services | ~90% | **Best bang for buck.** Fast, stable, high confidence |
+| **Unit** | Pure logic only — no DB, no API, no filesystem | ~5% | Fast but limited scope |
+
+**The critical boundary:** E2E tests go through the user's actual UI/browser. Integration tests hit real systems via API but without UI. If your test doesn't open a browser or render a UI, it's not E2E — it's integration. This distinction matters because mislabeling integration tests as E2E leads to overinvestment in slow browser tests when fast API-level tests would suffice.
+
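A quick sanity check against those ratios is plain arithmetic over your runner's per-layer counts. The counts below are made up for illustration:

```shell
# Hypothetical suite counts; substitute your test runner's real totals.
unit=12; integration=180; e2e=10
total=$((unit + integration + e2e))

# Integer percentage of the suite that is integration-level.
echo "integration share: $((100 * integration / total))%"   # prints 89% for these counts
```

If the integration share is nowhere near ~90%, that is a signal the suite leans on slow E2E runs or narrow unit tests instead of the cheap high-confidence middle layer.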
### Minimal Mocking Philosophy
|
|
319
411
|
|
|
320
412
|
| What | Mock? | Why |
|
|
@@ -366,42 +458,6 @@ If you notice something else that should be fixed:
 
 **Why this matters:** AI agents can drift into "helpful" changes that weren't requested. This creates unexpected diffs, breaks unrelated things, and makes code review harder.
 
-## Test Failure Recovery (SDET Philosophy)
-
-```
-┌─────────────────────────────────────────────────────────────────────┐
-│ ALL TESTS MUST PASS. NO EXCEPTIONS.                                 │
-│                                                                     │
-│ This is not negotiable. This is not flexible. This is absolute.     │
-└─────────────────────────────────────────────────────────────────────┘
-```
-
-**Not acceptable:**
-- "Those were already failing" → Fix them first
-- "Not related to my changes" → Doesn't matter, fix it
-- "It's flaky" → Flaky = bug, investigate
-
-**Treat test code like app code.** Test failures are bugs. Investigate them the way a 15-year SDET would - with thought and care, not by brushing them aside.
-
-If tests fail:
-1. Identify which test(s) failed
-2. Diagnose WHY - this is the important part:
-   - Your code broke it? Fix your code (regression)
-   - Test is for deleted code? Delete the test
-   - Test has wrong assertions? Fix the test
-   - Test is "flaky"? Investigate - flakiness is just another word for bug
-3. Fix appropriately (fix code, fix test, or delete dead test)
-4. Run specific test individually first
-5. Then run ALL tests
-6. Still failing? ASK USER - don't spin your wheels
-
-**Flaky tests are bugs, not mysteries:**
-- Sometimes the bug is in app code (race condition, timing issue)
-- Sometimes the bug is in test code (shared state, not parallel-safe)
-- Sometimes the bug is in test environment (cleanup not proper)
-
-Debug it. Find root cause. Fix it properly. Tests ARE code.
-
 ## Debugging Workflow (Systematic Investigation)
 
 When something breaks and the cause isn't obvious, follow this systematic debugging workflow:
@@ -451,25 +507,28 @@ Local tests pass -> Commit -> Push -> Watch CI
 STOP and ASK USER
 ```
 
-
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│ NEVER AUTO-MERGE. NO EXCEPTIONS.                                    │
+│                                                                     │
+│ Do NOT run `gh pr merge --auto`. Ever.                              │
+│ Auto-merge fires before you can read review feedback.               │
+│ The shepherd loop IS the process. Skipping it = shipping bugs.      │
+└─────────────────────────────────────────────────────────────────────┘
+```
+
+**The full shepherd sequence — every step is mandatory:**
 1. Push changes to remote
-2.
-
-
-
+2. Watch CI: `gh pr checks --watch`
+3. Read CI logs — **pass or fail**: `gh run view <RUN_ID> --log` (not just `--log-failed`). Passing CI can still hide warnings, skipped steps, or degraded scores. Don't just check the green checkmark
+4. If CI fails → diagnose from logs, fix, push again (max 2 attempts)
+5. If CI passes → read ALL review comments: `gh api repos/OWNER/REPO/pulls/PR/comments`
+6. Fix valid suggestions, push, iterate until clean
+7. Only then: explicit merge with `gh pr merge --squash`
 
-
-gh pr checks
+**Why this is non-negotiable:** PR #145 auto-merged a release before review feedback was read. CI reviewer found a P1 dead-code bug that shipped to main. The fix required a follow-up commit. Auto-merge cost more time than the shepherd loop would have taken.
 
-
-gh run view <RUN_ID> --log-failed
-```
-3. If CI fails:
-- Read failure logs: `gh run view <RUN_ID> --log-failed`
-- Diagnose root cause (same philosophy as local test failures)
-- Fix and push again
-4. Max 2 fix attempts - if still failing, ASK USER
-5. If CI passes - proceed to present final summary
+**Why read passing logs:** v1.24.0 release only read logs on failure (round 1), then just checked the green checkmark on round 2. Passing CI can hide warnings, skipped steps, degraded E2E scores, or silent test exclusions. A green checkmark is necessary but not sufficient.
 
 **Context GC (compact during idle):** While waiting for CI (typically 3-5 min), suggest `/compact` if the conversation is long. Think of it like a time-based garbage collector — idle time + high memory pressure = good time to collect. Don't suggest on short conversations.
 
|
|
|
518
577
|
- `/clear` after 2+ failed corrections (context polluted — start fresh with better prompt)
|
|
519
578
|
- Auto-compact fires at ~95% capacity — no manual management needed
|
|
520
579
|
- After committing a PR, `/clear` before starting the next feature
|
|
580
|
+
- **Autocompact tuning:** Set `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE` to trigger compaction earlier (75% for 200K, 30% for 1M). On 1M models, the default fires at ~76K — set 30% or `CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000` to use the full context window. See wizard doc "Autocompact Tuning" for full details
|
|
581
|
+
|
|
582
|
+
**`--bare` mode (v2.1.81+):** `claude -p "prompt" --bare` skips ALL hooks, skills, LSP, and plugins. This is a complete wizard bypass — no SDLC enforcement, no TDD checks, no planning hooks. Use only for scripted headless calls (CI pipelines, automation) where you explicitly don't want wizard enforcement. Never use `--bare` for normal development work.
|
|
521
583
|
|
|
522
584
|
## DRY Principle
|
|
523
585
|
|
|
@@ -543,6 +605,24 @@ CI passes -> Read review suggestions
 
 **If no DESIGN_SYSTEM.md exists:** Skip these checks (project has no documented design system).
 
+## Release Planning (If Task Involves a Release)
+
+**When to check:** Task mentions "release", "publish", "version bump", "npm publish", or multiple items being shipped together.
+**When to skip:** Single feature implementation, bug fix, or anything that isn't a release.
+
+Before implementing any release items:
+
+1. **List all items** — Read ROADMAP.md (or equivalent), identify every item planned for this release
+2. **Plan each at 95% confidence** — For each item: what files change, what tests prove it works, what's the blast radius. If confidence < 95% on any item, flag it
+3. **Identify blocks** — Which items depend on others? What must go first?
+4. **Present all plans together** — User reviews the complete batch, not one at a time. This catches conflicts, sequencing issues, and scope creep before any code is written
+5. **Pre-release CI audit** — Before cutting the release, review CI runs across ALL PRs merged since last release. Look for: warnings in passing runs, degraded E2E scores, skipped test suites, silent failures masked by `continue-on-error`. Use `gh run list` + `gh run view <ID> --log` to audit. A green checkmark is necessary but not sufficient
+6. **User approves, then implement** — Full SDLC per item (TDD RED → GREEN → self-review), in the prioritized order
+
+**Why batch planning works:** Ad-hoc one-at-a-time implementation leads to unvalidated additions and scope creep. Batch planning catches problems early — if you can't plan it at 95%, you're not ready to ship it.
+
+**Why pre-release CI audit:** v1.24.0 shipped without auditing CI logs across merged PRs #150-#152. Passing CI doesn't mean nothing fishy got through — warnings, degraded scores, and skipped steps can hide in green runs.
+
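Part of the audit can be scripted: save each run's log and grep for fishy markers. A sketch; the marker list is illustrative, and the printf fixture stands in for real `gh run view <RUN_ID> --log` output:

```shell
# Stand-in for a saved CI log (normally: gh run view <RUN_ID> --log > ci.log).
printf 'build ok\nWARNING: retried flaky step\n3 tests skipped\nall checks green\n' > ci.log

# Count lines that deserve a human look even though the run was green.
FISHY=$(grep -Ec 'WARNING|skipped|continue-on-error' ci.log)
echo "suspicious lines: $FISHY"   # prints 2 for this fixture
```

A nonzero count does not mean the release is broken, only that the log needs reading rather than a glance at the checkmark.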
 ## Deployment Tasks (If Task Involves Deploy)
 
 **When to check:** Task mentions "deploy", "release", "push to prod", "staging", etc.
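Post-deploy verification can start with a tiny health probe. A sketch; the URL and endpoint are hypothetical, not something the wizard ships:

```shell
# deploy_healthy URL — succeed only if the endpoint answers 2xx within 5s.
# -s silences progress, -f makes HTTP errors exit nonzero.
deploy_healthy() {
  curl -sf --max-time 5 "$1" > /dev/null
}

# TCP port 9 (discard) refuses connections, so this exercises the failure branch.
if deploy_healthy "http://127.0.0.1:9/health"; then
  echo "deploy healthy"
else
  echo "deploy NOT healthy — roll back or investigate"
fi
```

In a real post-deploy step you would probe the staging or production health endpoint, retry a few times to allow for warm-up, and fail the task loudly if the probe never passes.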
@@ -588,14 +668,26 @@ CI passes -> Read review suggestions
|
|
|
588
668
|
|
|
589
669
|
**THE RULE:** Delete old code first. If it breaks, fix it properly.
|
|
590
670
|
|
|
591
|
-
## Documentation Sync (During Planning)
|
|
671
|
+
## Documentation Sync (REQUIRED — During Planning)
|
|
672
|
+
|
|
673
|
+
Feature docs MUST be current before commit. Docs are code — stale docs mislead future sessions, waste tokens, and cause wrong implementations.
|
|
592
674
|
|
|
593
|
-
|
|
675
|
+
**Standard pattern:** `*_DOCS.md` — living documents that grow with the feature (e.g., `AUTH_DOCS.md`, `PAYMENTS_DOCS.md`, `SEARCH_DOCS.md`). Same philosophy as `TESTING.md` and `ARCHITECTURE.md` — one source of truth per topic, kept current.
|
|
676
|
+
|
|
677
|
+
```
|
|
678
|
+
┌─────────────────────────────────────────────────────────────────────┐
|
|
679
|
+
│ DOCS MUST BE CURRENT BEFORE COMMIT. │
|
|
680
|
+
│ │
|
|
681
|
+
│ Stale docs = wrong implementations = wasted sessions. │
|
|
682
|
+
│ If you changed the feature, update its doc. No exceptions. │
|
|
683
|
+
└─────────────────────────────────────────────────────────────────────┘
|
|
684
|
+
```
|
|
594
685
|
|
|
595
|
-
-1. **During planning**, read feature docs for the area being changed (`*
-2. If your code change contradicts what the doc says → update the doc
-3. If your code change extends behavior the doc describes → add to the doc
-4. If no
+1. **During planning**, read feature docs for the area being changed (`*_DOCS.md`, `docs/features/`, `docs/decisions/`)
+2. If your code change contradicts what the doc says → MUST update the doc
+3. If your code change extends behavior the doc describes → MUST add to the doc
+4. If no `*_DOCS.md` exists and the feature touches 3+ files → create one. Keep it simple: what the feature does, key decisions, gotchas. Same structure as TESTING.md (topic-focused, not exhaustive)
+5. If the project has a `ROADMAP.md` → update it (mark items done, add new items). ROADMAP feeds CHANGELOG — keeping it current means releases write themselves
 
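Editor's note: a gate in this spirit can be sketched as a pre-commit style check. Illustrative only; the `src/auth/` path, the `AUTH_DOCS.md` name, and the staged-file wiring are assumptions, not part of the wizard.

```shell
#!/usr/bin/env bash
# Hypothetical docs-sync gate: fail when a feature's source changes are
# staged but its *_DOCS.md is not. All paths below are illustrative.
set -euo pipefail

docs_sync_check() {
  local changed="$1"      # newline-separated list of changed files
  local feature_dir="$2"  # e.g. src/auth/
  local doc_file="$3"     # e.g. AUTH_DOCS.md
  if printf '%s\n' "$changed" | grep -q "^${feature_dir}" &&
     ! printf '%s\n' "$changed" | grep -qx "$doc_file"; then
    echo "STALE-DOC: ${feature_dir} changed without ${doc_file}"
    return 1
  fi
  echo "docs-sync ok"
}

# Real usage would feed the output of: git diff --cached --name-only
docs_sync_check "src/auth/login.ts" "src/auth/" "AUTH_DOCS.md" || true
```

Wired into a pre-commit hook, the non-zero return blocks the commit, turning the guidance into a gate.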
 **Doc staleness signals:** Low confidence in an area often means the docs are stale, missing, or misleading. If you struggle during planning, check whether the docs match the actual code.
 
@@ -605,9 +697,29 @@ When a code change affects a documented feature, update the doc in the same PR:
 
 If this session revealed insights, update the right place:
 - **Testing patterns, gotchas** → `TESTING.md`
-- **Feature-specific quirks** → Feature docs (`*
+- **Feature-specific quirks** → Feature docs (`*_DOCS.md`, e.g., `AUTH_DOCS.md`)
 - **Architecture decisions** → `docs/decisions/` (ADR format) or `ARCHITECTURE.md`
 - **General project context** → `CLAUDE.md` (or `/revise-claude-md`)
+- **Plan files** → If this session's work came from a plan file, delete it or mark it complete. Stale plans mislead future sessions into thinking work is still pending
+
+## Post-Mortem: When Process Fails, Feed It Back
+
+**Every process failure becomes an enforcement rule.** When you skip a step and it causes a problem, don't just fix the symptom — add a gate so it can't happen again.
+
+```
+Incident → Root Cause → New Rule → Test That Proves the Rule → Ship
+```
+
+**How to post-mortem a process failure:**
+1. **What happened?** — Describe the incident (what went wrong, what was the impact)
+2. **Root cause** — Not "I forgot" — what structurally allowed the skip? Was it guidance (easy to ignore) instead of a gate (impossible to skip)?
+3. **New rule** — Turn the failure into an enforcement rule in the SDLC skill
+4. **Test** — Write a test that proves the rule exists (TDD — the rule is code too)
+5. **Evidence** — Reference the incident so future readers understand WHY the rule exists
+
+**Example (real incident):** PR #145 auto-merged before CI review was read. Root cause: auto-merge was enabled by default, no enforcement gate existed. New rule: "NEVER AUTO-MERGE" block added to CI Shepherd section with the same weight as "ALL TESTS MUST PASS." Test: `test_never_auto_merge_gate` verifies the block exists.
+
+**Industry pattern:** "Every mistake becomes a rule" — the best SDLC systems are built from accumulated incident learnings, not theoretical best practices.
 
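Editor's note: a "test that proves the rule" can be as small as a grep against the skill file. A sketch, assuming the rule text from the incident description and a local `SKILL.md` path; the function name mirrors the `test_never_auto_merge_gate` mentioned in the example.

```shell
#!/usr/bin/env bash
# Sketch of a rule-existence test: fail if the enforcement block ever
# disappears from the skill file. File path and rule text are assumptions.
set -euo pipefail

test_never_auto_merge_gate() {
  local skill_file="$1"
  grep -q 'NEVER AUTO-MERGE' "$skill_file" ||
    { echo "FAIL: NEVER AUTO-MERGE gate missing from $skill_file"; return 1; }
  echo "PASS: gate present"
}

# Demo against a temp file standing in for the real SKILL.md:
tmp="$(mktemp)"
echo '## CI Shepherd: NEVER AUTO-MERGE. Read the review first.' > "$tmp"
test_never_auto_merge_gate "$tmp"
rm -f "$tmp"
```

Because the test targets the rule's text rather than behavior, it survives refactors of the surrounding section while still catching accidental deletion.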
 ---
 
@@ -171,6 +171,16 @@ Based on detected stack, suggest `allowedTools` entries for `.claude/settings.js
 
 Present suggestions and let the user confirm.
 
### Step 9.5: Context Window Configuration
|
|
175
|
+
|
|
176
|
+
Recommend autocompact settings based on the user's context window:
|
|
177
|
+
|
|
178
|
+
- **200K models (default):** Suggest `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=75` — leaves room for implementation after planning
|
|
179
|
+
- **1M models:** Suggest `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=30` or `CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000` — the default fires at ~76K on 1M, wasting 92% of the window
|
|
180
|
+
- **CI pipelines:** Suggest 60% — short tasks, compact early
|
|
181
|
+
|
|
182
|
+
Tell the user to add the export to their shell profile (`~/.bashrc`, `~/.zshrc`) or project `.envrc`. This is guidance, not enforcement — the wizard doesn't write shell profiles.
|
|
183
|
+
|
|
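Editor's note: the profile addition this step recommends might look like the snippet below. The values mirror the step's own guidance; which of the two variables to use, and the exact threshold, are the user's call.

```shell
# Sketch of a shell-profile addition (~/.bashrc, ~/.zshrc, or .envrc)
# for autocompact tuning. Pick ONE approach.

# 200K context window (default): compact when 75% full.
export CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=75

# 1M context window alternative: compact at an absolute token budget.
# export CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000
```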
 ### Step 10: Customize Hooks
 
 Update `tdd-pretool-check.sh` with the actual source directory (replace the generic `/src/` pattern).
 
@@ -45,13 +45,14 @@ Extract the latest version from the first `## [X.X.X]` line.
 
 Parse all CHANGELOG entries between the user's installed version and the latest. Present a clear summary:
 
 ```
-Installed: 1.
-Latest: 1.
+Installed: 1.22.0
+Latest: 1.24.0
 
 What changed:
+- [1.24.0] Hook if conditionals, autocompact tuning + 1M/200K guidance, tdd_red fix, ...
+- [1.23.0] Update notification hook, cross-model review standardization, ...
 - [1.22.0] Plan auto-approval, debugging workflow, /feedback skill, BRANDING.md detection, ...
 - [1.21.0] Confidence-driven setup, prove-it gate, cross-model release review, ...
-- [1.20.0] Version-pinned CC update gate, Tier 1 flakiness fix, flaky test guidance, ...
 ```
 
 **If versions match:** Say "You're up to date! (version X.X.X)" and stop.
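Editor's note: the "extract the latest version" step can be sketched as below. Illustrative; it assumes release headers of the form `## [1.24.0]` as shown in this skill, and the demo CHANGELOG content is fabricated.

```shell
#!/usr/bin/env bash
# Sketch: pull the first `## [X.X.X]` heading out of a CHANGELOG.
set -euo pipefail

latest_version() {
  # -m1: stop at the first match; -o: print only the matched heading
  grep -m1 -oE '^## \[[0-9]+\.[0-9]+\.[0-9]+\]' "$1" | tr -d '#[] '
}

# Demo against a fabricated CHANGELOG:
tmp="$(mktemp)"
printf '# Changelog\n\n## [1.24.0] - 2026-04-04\n\n## [1.23.0]\n' > "$tmp"
latest_version "$tmp"
rm -f "$tmp"
```

Comparing the result against the installed version then drives the "up to date" vs. "what changed" branch.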
|