@ai-dev-methodologies/rlp-desk 0.2.3 → 0.3.0

package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@ai-dev-methodologies/rlp-desk",
3
- "version": "0.2.3",
3
+ "version": "0.3.0",
4
4
  "description": "Fresh-context iterative loops for Claude Code — autonomous task completion with independent verification",
5
5
  "scripts": {
6
6
  "postinstall": "node scripts/postinstall.js",
@@ -24,6 +24,17 @@ Ask about these items one by one (or in small groups):
24
24
  1. **Slug** — short identifier (e.g., `auth-refactor`). Suggest one, ask if OK.
25
25
  2. **Objective** — what the loop achieves
26
26
  3. **User Stories** — discrete units with testable acceptance criteria. Propose a breakdown, ask the user to confirm/modify.
27
+ - Apply INVEST criteria: each US must be Independent, Negotiable, Valuable, Estimable, Small, Testable.
28
+ - Each AC MUST use Given/When/Then format with **domain language only** (no class names, API paths, DB tables):
29
+ ```
30
+ Given [precondition in domain language]
31
+ When [action in domain language]
32
+ Then [expected outcome with quantitative criteria]
33
+ ```
34
+ - Include at least 1 negative test per US ("must NOT happen").
35
+ - Include boundary cases per US (empty, max, zero, concurrent).
36
+ - **Task Type** per US: `code` | `visual` | `content` | `integration` | `infra`
37
+ - **Risk Level** per US (governance §1c): `LOW` | `MEDIUM` | `HIGH` | `CRITICAL`
27
38
  4. **Iteration Unit** — what one worker does per iteration. Explicitly ask:
28
39
  - "One US per iteration (bounded, incremental verification)?"
29
40
  - "All stories at once (faster, single verification)?"
@@ -42,8 +53,17 @@ Ask about these items one by one (or in small groups):
42
53
  11. **Consensus Scope** — If consensus enabled, ask: "Consensus on every verify (all, default) or only on final verify (final-only)?" Default: all.
43
54
  12. **Max Iterations** — suggest based on story count, ask if OK.
44
55
 
45
- After all items are confirmed, present the full contract summary.
46
- On approval, offer to run `init`.
56
+ After all items are confirmed:
57
+
58
+ 1. **Ambiguity Gate (IL-2)** — score each AC per governance §1a IL-2 (6 dimensions, 0-12 points).
59
+ If ANY AC scores below 6: **REJECT** — refine that AC before proceeding.
60
+ If no AC scores below 6 and at least one AC scores 6-9: **WARN** — proceed with logged warning, show low-scoring dimensions.
61
+ If all ACs score 10-12: **PASS** — clean.
62
+ Present the score table to the user before proceeding.
63
+ 2. Present the full contract summary.
64
+ 3. **Self-Verification** — Ask: "Enable self-verification? Worker records step-by-step evidence, Verifier cross-validates process. Recommended for MEDIUM+ risk." Default: yes for HIGH/CRITICAL, no for LOW/MEDIUM.
65
+ 4. On approval, offer to run `init`.
66
+
47
67
  Do NOT create files during brainstorm.
48
68
  Do NOT auto-decide iteration unit — the user MUST explicitly choose.
49
69
 
@@ -97,7 +117,8 @@ Options (parse from `$ARGUMENTS`):
97
117
  - `--consensus-scope all|final-only` — when consensus runs (default: `all`)
98
118
  - `all`: consensus runs on every verify (current behavior)
99
119
  - `final-only`: consensus only on final ALL verify
100
- - `--debug` — enable debug logging (tmux mode only, writes to logs/<slug>/debug.log)
120
+ - `--debug` — enable debug logging (writes to logs/<slug>/debug.log)
121
+ - `--with-self-verification` — enable campaign-level self-verification analysis. After COMPLETE, Leader analyzes all iteration records (done-claims + verdicts) and generates a campaign self-verification summary with patterns and recommendations for next planning cycle. (Note: execution_steps and reasoning are ALWAYS recorded per governance §1f — this flag adds post-campaign analysis.)
101
122
 
102
123
  ### Mode Selection
103
124
 
@@ -144,11 +165,14 @@ DEBUG=<1 if --debug, else 0> \
144
165
  1. Validate scaffold: `.claude/ralph-desk/prompts/<slug>.worker.prompt.md` etc.
145
166
  2. Check sentinels (complete/blocked). Found → tell user `/rlp-desk clean <slug>`.
146
167
  3. Clean previous `done-claim.json`, `verify-verdict.json`.
168
+ 4. If `--debug`: create/clear `logs/<slug>/debug.log`. Define a helper: to "debug_log" means append a timestamped line to this file via `Bash("echo \"[$(date '+%Y-%m-%d %H:%M:%S')] $msg\" >> .claude/ralph-desk/logs/<slug>/debug.log")`.
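A minimal sketch of the `debug_log` helper from step 4, written in Python rather than the `Bash("echo ...")` form the prompt uses. The names `format_debug_line`, `debug_log`, `log_path`, and the injectable `now` argument are illustrative assumptions. Only the `[YYYY-MM-DD HH:MM:SS] msg` line shape and the append-to-file behavior come from the text.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def format_debug_line(msg: str, now: Optional[datetime] = None) -> str:
    """Return one log line in the '[YYYY-MM-DD HH:MM:SS] msg' shape."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    return f"[{stamp}] {msg}"


def debug_log(log_path: Path, msg: str, now: Optional[datetime] = None) -> None:
    """Append a timestamped line, creating parent directories if needed."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(format_debug_line(msg, now) + "\n")
```

Passing `now` explicitly is only there to make the helper testable; a real call would omit it and pass a message such as `[EXEC] iter=1 phase=read_memory ...`.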
147
169
 
148
170
  ### Leader Loop
149
171
 
150
172
  **CRITICAL: DO NOT STOP between iterations.** You MUST continue the loop automatically until a sentinel is written (COMPLETE or BLOCKED) or max_iter is reached. Do NOT pause to ask the user. Do NOT wait for confirmation. The loop is fully autonomous — just report each iteration result briefly and immediately proceed to the next iteration.
151
173
 
174
+ If `--debug`, at loop start debug_log: `[PLAN] slug=<slug> max_iter=<N> worker_engine=<engine> worker_model=<model> verifier_engine=<engine> verifier_model=<model> verify_mode=<mode> consensus=<0|1> consensus_scope=<scope>`
175
+
152
176
  For each iteration (1 to max_iter):
153
177
 
154
178
  **① Check sentinels**
@@ -166,18 +190,27 @@ rm -f .claude/ralph-desk/memos/<slug>-verify-verdict.json
166
190
  **② Read memory.md** → Stop Status, Next Iteration Contract
167
191
  - Also read **Completed Stories** → verified work so far
168
192
  - Also read **Key Decisions** → settled architectural choices
193
+ - If `--debug`: debug_log `[EXEC] iter=N phase=read_memory stop_status=<status> contract="<summary>"`
169
194
 
170
195
  **③ Decide model** (§4 of governance.md)
171
196
  - Previous iteration failed → upgrade model
172
197
  - Simple task → downgrade
173
198
  - User specified → use that
199
+ - If `--debug`: debug_log `[EXEC] iter=N phase=model_select worker_model=<model> reason=<reason>`
174
200
 
175
201
  **④ Build worker prompt**
176
202
  - Read `.claude/ralph-desk/prompts/<slug>.worker.prompt.md`
177
203
  - Combine with iteration number + memory contract
178
204
  - Write to `.claude/ralph-desk/logs/<slug>/iter-NNN.worker-prompt.md` (audit trail)
205
+ - Note: Worker ALWAYS records execution_steps in done-claim.json per governance §1f. No flag needed.
206
+
207
+ **④½ Contract review** (agent mode only)
208
+ - Before dispatching Worker, spawn a lightweight review: "Is this iteration contract sufficient to achieve the US's AC? Any missing steps?"
209
+ - If `--debug`: debug_log `[EXEC] iter=N phase=contract_review result=<ok|issues>`
210
+ - In tmux mode: skip (shell leader cannot reason). Log: `[EXEC] iter=N phase=contract_review skipped=tmux_mode`
179
211
 
180
212
  **⑤ Execute Worker**
213
+ - If `--debug`: debug_log `[EXEC] iter=N phase=worker engine=<engine> model=<model> dispatched=true`
181
214
 
182
215
  If `--worker-engine claude` (default):
183
216
  ```
@@ -199,11 +232,14 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
199
232
  - Codex runs as a subprocess via Bash(), not Agent().
200
233
  - Each Bash() call = fresh context for codex.
201
234
 
235
+ - If `--debug`: debug_log `[EXEC] iter=N phase=worker_done engine=<engine>`
236
+
202
237
  **⑥ Read memory.md again** (Worker updated it)
203
238
  - `stop=continue` → go to ⑧
204
239
  - `stop=verify` → go to ⑦
205
240
  - `stop=blocked` → write BLOCKED sentinel, stop
206
241
  - Also read `iter-signal.json` for `us_id` field (which US was just completed)
242
+ - If `--debug`: debug_log `[EXEC] iter=N phase=worker_signal status=<stop_status> us_id=<us_id>`
207
243
 
208
244
  **⑦ Execute Verifier**
209
245
 
@@ -225,6 +261,8 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
225
261
  - Verifier checks all AC at once
226
262
 
227
263
  **⑦a Dispatch Verifier**
264
+ - Note: Verifier ALWAYS records reasoning in verify-verdict.json per governance §1f. No flag needed.
265
+ - If `--debug`: debug_log `[EXEC] iter=N phase=verifier engine=<engine> model=<model> scope=<us_id> dispatched=true`
228
266
 
229
267
  If `--verifier-engine claude` (default):
230
268
  ```
@@ -259,10 +297,17 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
259
297
  1. Read `issues` array, sort by severity (`critical` → `major` → `minor`)
260
298
  2. Build structured fix contract with traceability rule
261
299
  3. Include `fix_hint` values labeled `(suggestion, non-authoritative)` if present
262
- 4. Increment `consecutive_failures` in `status.json`
263
- 5. Go to with fix contract as next Worker contract
300
+ 4. Include impacted tests from test-spec (so Worker can run them before and after the fix)
301
+ 5. Increment `consecutive_failures` in `status.json`
302
+ 6. If `consecutive_failures >= 3` for same US → **Architecture Escalation** (governance §7¾): stop fixing, report to user
303
+ 7. Go to ⑧ with fix contract as next Worker contract
264
304
  - `request_info` → Leader reads Verifier's questions, decides outcome (or relays to Worker in next contract) → go to ⑧
265
305
  - `blocked` → write BLOCKED sentinel, stop
306
+ - If `--debug`: debug_log `[EXEC] iter=N phase=verdict engine=<engine> verdict=<pass|fail|request_info> us_id=<us_id>`
307
+ - If `--debug`: debug_log `[EXEC] iter=N phase=layer_check L1=<status> L2=<status> L3=<status> L4=<status>`
308
+ - If `--debug`: debug_log `[EXEC] iter=N phase=sufficiency test_count=<N> ac_count=<N> ratio=<N> verdict=<pass|fail>`
309
+ - If `--debug`: debug_log `[EXEC] iter=N phase=checkpoint level=<1|2> evidence=<summary>`
310
+ - If `--debug` and consensus: debug_log `[EXEC] iter=N phase=consensus claude=<verdict> codex=<verdict> round=<N>`
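The `fail` branch above (sort issues by severity, then attach `fix_hint` values as non-authoritative suggestions) could be sketched as follows. The issue shape is an assumption: `severity` and `fix_hint` are named in the text, while the `description` key and the function names are hypothetical.

```python
from typing import Dict, List

# Lower rank sorts first: critical -> major -> minor.
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}


def sort_issues(issues: List[Dict]) -> List[Dict]:
    """Order issues by severity; unknown severities go last (stable sort)."""
    return sorted(
        issues,
        key=lambda issue: SEVERITY_ORDER.get(issue.get("severity"), len(SEVERITY_ORDER)),
    )


def build_fix_lines(issues: List[Dict]) -> List[str]:
    """Render one fix-contract line per issue, labeling hints as suggestions."""
    lines = []
    for issue in sort_issues(issues):
        line = f"[{issue.get('severity', 'unknown')}] {issue.get('description', '')}"
        hint = issue.get("fix_hint")
        if hint:
            line += f" | fix_hint (suggestion, non-authoritative): {hint}"
        lines.append(line)
    return lines
```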
266
311
 
267
312
  **⑧ Write result log and report to user, continue loop**
268
313
  - Write `logs/<slug>/iter-NNN.result.md`:
@@ -271,6 +316,67 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
271
316
  - Verifier verdict `[leader-measured]`
272
317
  - Write `status.json`
273
318
  - Report: iteration N, phase, model used, result
319
+ - If `--debug`: debug_log `[EXEC] iter=N phase=result status=<result> consecutive_failures=<N> verified_us=<list>`
320
+
321
+ At loop end (COMPLETE, BLOCKED, or TIMEOUT):
322
+ - If `--debug`: debug_log `[VALIDATE] result=<COMPLETE|BLOCKED|TIMEOUT> iterations=<N> verified_us=<list>`
323
+
324
+ **⑨ Campaign Self-Verification** (when `--with-self-verification` is enabled):
325
+
326
+ After the loop ends, the Leader performs post-campaign analysis:
327
+
328
+ 1. **Collect data**: Read all archived `iter-NNN.result.md`, done-claim.json (with execution_steps), and verify-verdict.json (with reasoning) from `logs/<slug>/`
329
+ 2. **Write cumulative data**: `logs/<slug>/self-verification-data.json` — normalized iteration records
330
+ 3. **Generate versioned report**: `logs/<slug>/self-verification-report-NNN.md` (NNN = auto-increment from existing reports)
331
+ 4. **Report to user**: Display the full report content
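Because reports are kept across `clean`, step 3 has to derive NNN from the files already in `logs/<slug>/`. A minimal sketch, assuming three-digit zero padding (mirroring `iter-NNN`) and numbering that starts at 001; both are assumptions, as is the function name `next_report_path`.

```python
import re
from pathlib import Path

# Matches only numbered reports, e.g. self-verification-report-007.md
REPORT_RE = re.compile(r"^self-verification-report-(\d{3,})\.md$")


def next_report_path(log_dir: Path) -> Path:
    """Return the path for the next report: highest existing NNN plus one."""
    numbers = [
        int(match.group(1))
        for path in log_dir.glob("self-verification-report-*.md")
        if (match := REPORT_RE.match(path.name))
    ]
    next_number = max(numbers, default=0) + 1
    return log_dir / f"self-verification-report-{next_number:03d}.md"
```

Taking the maximum rather than the file count keeps numbering monotonic even if an older report was deleted by hand.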
332
+
333
+ Report template (9 sections):
334
+
335
+ ```
336
+ # Campaign Self-Verification Report: <slug>
337
+ Report Version: NNN | Generated: timestamp | Campaign: slug — objective
338
+ Schema Version: governance hash | Data Quality: N% iterations complete
339
+
340
+ ## 1. Automated Validation Summary
341
+ Table: Iter | US | Worker Verdict | Verifier Verdict | Outcome
342
+
343
+ ## 2. Failure Deep Dive (per failed iteration)
344
+ Per failure: Worker steps → Verifier reasoning → Root cause → Resolution
345
+
346
+ ## 3. Worker Process Quality (§1f audit)
347
+ Table: Iter | US | Steps | verify_red? | RED exit≠0? | verify_green? | Test-First? | E2E? | AC linked?
348
+ Aggregate: TDD compliance %, RED confirmation %, E2E evidence %, step completeness %
349
+ Audit: each step must have type from §1f vocabulary + ac_id + command + exit_code
350
+
351
+ ## 4. Verifier Judgment Quality (§1f audit)
352
+ Table: Iter | US | Checks | All Basis? | Independent? | IL-1? | Layer? | Sufficiency? | Anti-Gaming? | Worker Audit?
353
+ Aggregate: Reasoning completeness %, Independent verification %, §1f category coverage %
354
+ Audit: verify all 5 mandatory check categories (IL-1, Layer Enforcement, Test Sufficiency, Anti-Gaming, Worker Process Audit) are present
355
+
356
+ ## 5. AC Lifecycle
357
+ Table: US | AC | First Claimed (iter) | First Verified (iter) | Reopen Count | Final Status
358
+
359
+ ## 6. Test-Spec Adherence
360
+ Spec completeness (layers/commands/mappings present)
361
+ Spec execution fidelity (exact checks run and cited)
362
+
363
+ ## 7. Patterns: Strengths & Weaknesses
364
+ Strengths: what worked well
365
+ Weaknesses: systemic issues
366
+
367
+ ## 8. Recommendations for Next Cycle
368
+ ### Brainstorm (missing scenarios/constraints) — citing iter/AC
369
+ ### PRD (ambiguous or oversized ACs) — citing iter/AC
370
+ ### Test-Spec (missing layers, weak mappings) — citing iter/AC
371
+
372
+ ## 9. Blind Spots
373
+ What this report CANNOT prove from available data
374
+
375
+ ## Data Provenance Rule
376
+ Report content MUST be derivable from: done-claim.json (execution_steps), verify-verdict.json (reasoning),
377
+ PRD, and test-spec. Information from source code inspection that is not in these files must be excluded
378
+ or explicitly marked as "[source-inspection]" with justification.
379
+ ```
274
380
 
275
381
  ### Circuit Breaker
276
382
  - context-latest.md unchanged 3 iterations → BLOCKED
@@ -313,6 +419,8 @@ Remove:
313
419
  - `.claude/ralph-desk/logs/<slug>/session-config.json`
314
420
  - `.claude/ralph-desk/logs/<slug>/worker-heartbeat.json`
315
421
  - `.claude/ralph-desk/logs/<slug>/verifier-heartbeat.json`
422
+ - `.claude/ralph-desk/memos/<slug>-escalation.md`
423
+ Note: `logs/<slug>/self-verification-data.json` and `self-verification-report-NNN.md` are intentionally preserved across clean for historical comparison.
316
424
 
317
425
  If `--kill-session` is passed, also kill any tmux session matching `rlp-desk-<slug>-*`:
318
426
  ```bash
@@ -342,7 +450,8 @@ Run options:
342
450
  --verify-mode per-us|batch Verification strategy (default: per-us)
343
451
  --verify-consensus Cross-engine consensus verification
344
452
  --consensus-scope SCOPE When consensus runs: all|final-only (default: all)
345
- --debug Debug logging (tmux mode only)
453
+ --debug Debug logging (logs/<slug>/debug.log)
454
+ --with-self-verification Campaign self-verification analysis (post-loop report)
346
455
  ```
347
456
 
348
457
  ## Architecture
package/src/governance.md CHANGED
@@ -10,10 +10,208 @@ The Leader orchestrates, while Worker/Verifier run in isolated fresh contexts ev
10
10
  - **Fresh context per iteration**: Worker/Verifier start fresh every time. No prior conversation.
11
11
  - **Filesystem = memory**: State exists only on the filesystem (PRD, memory, context, memos).
12
12
  - **Worker claim ≠ complete**: A Worker's DONE is merely a claim. The Verifier must independently verify before it's confirmed.
13
+ - **Worker scope is bounded**: Worker implements only the contracted US per iteration (Scope Lock). Out-of-scope changes are flagged by the Verifier.
13
14
  - **Verifier is independent**: The Verifier judges based on evidence alone, without knowledge of the Worker's reasoning process.
14
15
  - **Sentinels are Leader-owned**: Only the Leader writes COMPLETE/BLOCKED sentinels.
15
16
  - **Supported engines**: claude (default; models: haiku, sonnet, opus) and codex (opt-in via `--worker-engine codex` / `--verifier-engine codex`).
16
17
 
18
+ ## 1a. Iron Laws
19
+
20
+ Absolute rules that cannot be violated under any circumstance.
21
+
22
+ ```
23
+ IL-1: NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE
24
+ IL-2: NO INIT WITHOUT AC QUALITY SCORE >= 6
25
+ IL-3: NO PASS WITH TODO IN ANY REQUIRED VERIFICATION LAYER
26
+ IL-4: NO PASS WITHOUT TEST COUNT >= AC COUNT x 3
27
+ ```
28
+
29
+ **IL-1: Evidence Mandate**
30
+ Required: every verdict must reference at least one command execution with its exit code.
31
+ A verdict without command output evidence is automatically invalid.
32
+ Additional signal: hedging phrases such as "should pass", "probably works", "seems to",
33
+ "looks correct", or "appears to" (the list is not exhaustive), used without command evidence,
34
+ corroborate a violation. They are not the primary check; the presence of command output is.
35
+ `request_info` is not an escape from IL-1 — if evidence was collectible, it must be collected.
36
+
37
+ **IL-2: Ambiguity Gate**
38
+ Each AC is scored 0-12 on 6 dimensions (0/1/2 points each):
39
+ - Single behavior: 0=multiple behaviors, 1=mostly single, 2=exactly one testable behavior
40
+ - Domain language: 0=technical terms (class names, API paths, DB tables), 1=mixed, 2=pure domain language
41
+ - Stakeholder clarity: 0=unclear who benefits, 1=implied, 2=explicitly stated
42
+ - Portability: 0=tech-stack specific, 1=mostly portable, 2=fully stack-independent
43
+ - Concrete example: 0=vague ("some input"), 1=partial specifics, 2=exact values with expected results
44
+ - Independence: 0=requires another AC to pass first, 1=loosely coupled, 2=fully self-contained
45
+
46
+ Score interpretation:
47
+ - 0-5: **REJECT** — init blocked. ACs too ambiguous.
48
+ - 6-9: **WARN** — init proceeds with logged warning. Show which dimensions scored low.
49
+ - 10-12: **PASS** — clean.
50
+
51
+ IL-2 is a pre-run gate: scoring MUST happen during brainstorm or at init time.
52
+ In tmux mode, IL-2 must be satisfied before `/rlp-desk run` is invoked.
53
+
54
+ Calibration example:
55
+ - Score 5 (REJECT): "Given a user, When they log in, Then the system works correctly"
56
+ → single:1 (login is one behavior), domain:2 (domain terms only), stakeholder:1 (implied user), portability:1 (mostly portable), concrete:0 ("works correctly" is vague), independence:0 (implies registration AC) = 5
57
+ - Score 7 (WARN): "Given a registered user with email 'test@example.com' and valid password, When they submit the login form, Then they are redirected to the dashboard within 2 seconds"
58
+ → single:2 (exactly one: login redirect), domain:1 ("submit" is slightly technical), stakeholder:1 (implied end user), portability:1 (web-specific "form"), concrete:2 (specific email, 2s threshold), independence:0 (requires registration) = 7
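The IL-2 rubric and its thresholds can be expressed as a small scoring sketch. The dimension keys reuse the shorthand from the calibration example. The `gate` function, which lets the lowest-scoring AC decide the overall outcome, is an assumption about how mixed scores combine.

```python
from typing import Dict, List

DIMENSIONS = ("single", "domain", "stakeholder", "portability", "concrete", "independence")


def score_ac(points: Dict[str, int]) -> int:
    """Sum the six 0/1/2 dimension scores for one AC (0-12 total)."""
    if set(points) != set(DIMENSIONS):
        raise ValueError("expected exactly the six IL-2 dimensions")
    if any(value not in (0, 1, 2) for value in points.values()):
        raise ValueError("each dimension scores 0, 1, or 2")
    return sum(points.values())


def verdict(score: int) -> str:
    """Map a 0-12 score to REJECT (0-5), WARN (6-9), or PASS (10-12)."""
    if score <= 5:
        return "REJECT"
    if score <= 9:
        return "WARN"
    return "PASS"


def gate(scores: List[int]) -> str:
    """Overall gate across ACs: the lowest score decides (assumption)."""
    return verdict(min(scores))
```

Feeding the two calibration examples through `score_ac` gives 5 and 7, which map to REJECT and WARN respectively.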
59
+
60
+ **IL-3: Layer Completeness**
61
+ Verification layers:
62
+ - L1: Unit Test — function-level, mocks allowed. Always required.
63
+ - L2: Integration — real external services (DB, API, Redis). Required when external dependencies exist.
64
+ - L3: E2E Simulation — known input → full pipeline → output comparison. Always required.
65
+ - L4: Deploy Verify — production environment checks. Required when deploying.
66
+
67
+ Layer requirements per US are determined by risk classification (§1c).
68
+ Non-applicable layers must be explicitly marked "N/A — {reason}" in the test-spec.
69
+ Any required layer section with TODO or blank = automatic Verifier FAIL.
70
+ See §1d for full layer definitions.
71
+
72
+ **IL-4: Test Sufficiency**
73
+ Each AC must have >= 3 tests covering >= 2 of 3 categories (happy path, negative/error, boundary).
74
+ Tests must be mapped to ACs in the test-spec's criteria-to-test mapping table.
75
+ Only tests listed in this mapping count toward IL-4.
76
+ Fewer than 3 mapped tests for any AC = FAIL.
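A sketch of the IL-4 check over a criteria-to-test mapping. The mapping shape (AC id to a list of `(test_name, category)` pairs) and the category labels `happy`, `negative`, and `boundary` are assumptions chosen for illustration; the test-spec's actual table format is not shown here.

```python
from typing import Dict, List, Tuple

CATEGORIES = {"happy", "negative", "boundary"}


def check_sufficiency(mapping: Dict[str, List[Tuple[str, str]]]) -> Dict[str, bool]:
    """Per AC: True when it has >= 3 mapped tests spanning >= 2 categories."""
    result = {}
    for ac_id, tests in mapping.items():
        covered = {category for _, category in tests if category in CATEGORIES}
        result[ac_id] = len(tests) >= 3 and len(covered) >= 2
    return result


def il4_passes(mapping: Dict[str, List[Tuple[str, str]]]) -> bool:
    """IL-4 holds only if every AC is sufficient."""
    return all(check_sufficiency(mapping).values())
```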
77
+
78
+ ### Enforcement
79
+
80
+ | Iron Law | Checked by | When | Method |
81
+ |----------|-----------|------|--------|
82
+ | IL-1 | Verifier | verification time | mechanical (command output presence) |
83
+ | IL-2 | Leader | brainstorm/init | scored (6-dimension rubric) |
84
+ | IL-3 | Verifier | verification time | mechanical (TODO/blank scan) |
85
+ | IL-4 | Verifier | verification time | scored (test count per AC) |
86
+
87
+ - Violation of any Iron Law overrides all other verdict considerations — verdict MUST be FAIL.
88
+ - When an Iron Law is violated, the verdict MUST be `fail` regardless of uncertainty.
89
+ `request_info` remains valid only when the Verifier cannot determine whether an Iron Law
90
+ was violated (e.g., cannot access test files, command execution blocked).
91
+ - You (the Leader) cannot waive Iron Laws. Only the user can explicitly waive an Iron Law
92
+ for a specific US with documented justification in the PRD.
93
+
94
+ ## 1b. Evidence Gate
95
+
96
+ This section operationalizes IL-1 (Evidence Mandate) into a concrete step-by-step protocol.
97
+
98
+ Before any verdict, the Verifier MUST follow this 5-step process:
99
+
100
+ 1. **IDENTIFY**: What command proves this claim?
101
+ 2. **RUN**: Execute the command (fresh, not cached or recalled)
102
+ 3. **READ**: Full output + exit code + failure count
103
+ 4. **VERIFY**: Does output confirm the claim?
104
+ - YES → state claim WITH evidence (command + output + exit code)
105
+ - NO → state actual status with evidence
106
+ 5. **ONLY THEN**: Issue verdict
107
+
108
+ Skipping any step = invalid verification (IL-1 violation).
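The RUN, READ, and VERIFY steps can be sketched with a fresh subprocess call that keeps the command, its full output, and its exit code together as evidence. The `Evidence` type and function names are hypothetical, and treating exit code 0 as the sole pass criterion is a simplifying assumption (the protocol also asks for the failure count).

```python
import subprocess
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    command: List[str]
    exit_code: int
    output: str


def run_fresh(command: List[str], timeout: float = 60.0) -> Evidence:
    """RUN + READ: execute the command now and capture output and exit code."""
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return Evidence(command, proc.returncode, proc.stdout + proc.stderr)


def verify(evidence: Evidence) -> str:
    """VERIFY: exit code 0 is taken as confirming the claim (assumption)."""
    return "pass" if evidence.exit_code == 0 else "fail"
```

The verdict is only derived from an `Evidence` value, so a verdict with no command behind it cannot be constructed through this path.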
109
+
110
+ ### Forbidden Patterns
111
+ - "should pass", "probably works" without command output
112
+ - Trusting Worker's success reports without independent re-execution
113
+ - Partial verification ("linter passed" ≠ "tests passed" ≠ "all AC met")
114
+ - "Code inspection" as substitute for automated command execution
115
+ - Citing cached/prior results instead of fresh execution
116
+
117
+ ## 1c. Risk Classification
118
+
119
+ Each US is classified by risk level during brainstorm. Higher risk = more verification layers.
120
+
121
+ | Level | Description | Required Layers | Extra Requirements |
122
+ |-------|-------------|-----------------|-------------------|
123
+ | LOW | Read-only, docs, config | L1 + L3 | — |
124
+ | MEDIUM | New feature, refactor | L1 + L2 (if external deps) + L3 | — |
125
+ | HIGH | Production deploy, data migration | L1 + L2 + L3 + L4 | — |
126
+ | CRITICAL | Financial, security, medical | L1 + L2 + L3 + L4 | consensus + mutation testing (when mutation testing tool is configured in test-spec) |
127
+
128
+ L2 is included in MEDIUM+ rows but is marked N/A when no external services exist (see §1d L2 "When N/A" clause).
129
+
130
+ ### Who Decides
131
+ - During brainstorm: user assigns risk level per US (Leader suggests, user confirms).
132
+ - If brainstorm was skipped: Leader assigns based on PRD content during the first iteration of the run.
133
+ - Risk level is recorded in PRD per US. Cannot be downgraded without user approval.
134
+
135
+ ### Examples
136
+ - LOW: README update, adding comments, .env.example
137
+ - MEDIUM: REST API endpoint, React component, business rule
138
+ - HIGH: database migration, CI/CD change, deployment config
139
+ - CRITICAL: payment processing, auth, encryption, PII handling
140
+
141
+ ## 1d. Verification Layers
142
+
143
+ Four layers of verification, each targeting a different failure mode.
144
+
145
+ ### L1: Unit Test (always required)
146
+ - Scope: function/method level, isolated logic
147
+ - Mocks: allowed for external boundaries only
148
+ - Evidence: test runner output with pass/fail count + exit code
149
+
150
+ ### L2: Integration (required when external dependencies exist)
151
+ - Scope: interaction with real external services (database, API, message queue, cache)
152
+ - Mocks: NOT allowed — use real or containerized services
153
+ - Evidence: integration test output with connection confirmation + data verification
154
+ - When N/A: "N/A — no external services (pure computation/transformation)"
155
+
156
+ ### L3: E2E Simulation (always required)
157
+ - Scope: known input → full pipeline → quantitative output comparison
158
+ - Evidence: input data + actual output + expected output + comparison result
159
+ - For simple utilities: E2E = "run function with known input, verify output matches expected"
160
+
161
+ ### L4: Deploy Verify (required when deploying)
162
+ - Scope: production/staging environment health after deployment
163
+ - Evidence: health check response + deployment status + monitoring state
164
+ - When N/A: "N/A — no deployment (library/tool, local-only change)"
165
+
166
+ ### Rules
167
+ - L1 and L3: always required regardless of risk level.
168
+ - L2: required for MEDIUM+ risk when external services are involved.
169
+ - L4: required for HIGH+ risk when deployment occurs.
170
+ - Layer requirements per US are determined by risk classification (§1c).
171
+ - Non-required layers must be marked "N/A — {reason}" per IL-3. Blank or TODO = FAIL.
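The Rules list above reduces to a small lookup. This sketch follows those rules literally; the function name and the two boolean inputs are assumptions, and the returned `N/A` marker stands in for the `N/A — {reason}` text that the test-spec must still carry.

```python
from typing import Dict

# Ordered from lowest to highest risk so the index can be compared.
LEVELS = ("LOW", "MEDIUM", "HIGH", "CRITICAL")


def layer_status(risk: str, has_external_services: bool, deploys: bool) -> Dict[str, str]:
    """Return 'required' or 'N/A' for each of L1-L4."""
    rank = LEVELS.index(risk)  # raises ValueError on an unknown level
    return {
        "L1": "required",  # always required
        "L2": "required" if rank >= 1 and has_external_services else "N/A",
        "L3": "required",  # always required
        "L4": "required" if rank >= 2 and deploys else "N/A",
    }
```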
172
+
173
+ ## 1e. Verification Checkpoints
174
+
175
+ Verification occurs at two boundaries, not as a single final event.
176
+
177
+ ### Checkpoint 1: Story/Unit (per-US)
178
+ - Trigger: Worker signals verify with us_id = specific US
179
+ - Scope: that US's acceptance criteria (L1 pass is verified as part of layer enforcement in Verifier step 5)
180
+ - On fail: fix loop (§7½)
181
+
182
+ ### Checkpoint 2: Release Readiness (us_id=ALL)
183
+ - Trigger: all individual US pass Checkpoint 1 → Worker signals verify with us_id = "ALL"
184
+ - Scope: all AC + L2 integration (if applicable) + L3 E2E Simulation + L4 deploy (if applicable) + mutation score (if CRITICAL, when mutation testing tool is configured in test-spec)
185
+ - On fail: fix loop; escalation to user if 3 consecutive failures
186
+
187
+ ### Relationship to Existing Flow
188
+ - Checkpoint 1 = existing per-US verify (§7a). No change.
189
+ - Checkpoint 2 = existing "us_id=ALL final full verify" (§7a). Adds explicit layer scope.
190
+ - No new iteration steps are introduced.
191
+
192
+ ## 1f. Execution & Judgment Traceability
193
+
194
+ Every iteration, Worker and Verifier MUST record their process and reasoning — not just results.
195
+ This is the default behavior, not an optional flag. Without it, IL-1 (Evidence Mandate) is incomplete.
196
+
197
+ ### Worker: execution_steps in done-claim.json
198
+ Worker records what was done, in what order, with command evidence in `done-claim.json`:
199
+ - Each step includes: what action, which AC, command executed, exit code, summary
200
+ - Step types: `write_test`, `verify_red`, `implement`, `verify_green`, `refactor`, `commit`, `verify`
201
+ - This proves the Worker followed test-first approach and did not skip steps
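A sketch of how `execution_steps` might be audited. The step types are the ones listed above. The JSON key names `type`, `ac_id`, `command`, and `exit_code` are assumptions based on the audit wording in the report template, and the test-first check (a `verify_red` before the first `implement`) is one possible reading of "followed test-first approach".

```python
from typing import Dict, List

STEP_TYPES = {"write_test", "verify_red", "implement", "verify_green", "refactor", "commit", "verify"}
REQUIRED_FIELDS = ("type", "ac_id", "command", "exit_code")


def audit_steps(steps: List[Dict]) -> List[str]:
    """Return a list of problems; an empty list means every step is well-formed."""
    problems = []
    for idx, step in enumerate(steps):
        missing = [field for field in REQUIRED_FIELDS if field not in step]
        if missing:
            problems.append(f"step {idx}: missing {', '.join(missing)}")
        if step.get("type") not in STEP_TYPES:
            problems.append(f"step {idx}: unknown type {step.get('type')!r}")
    return problems


def is_test_first(steps: List[Dict]) -> bool:
    """True if a verify_red step appears before the first implement step."""
    types = [step.get("type") for step in steps]
    if "verify_red" not in types or "implement" not in types:
        return False
    return types.index("verify_red") < types.index("implement")
```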
202
+
203
+ ### Verifier: reasoning in verify-verdict.json
204
+ Verifier records WHY each judgment was made in `verify-verdict.json`:
205
+ - Each check includes: what was checked, decision (pass/fail), and the specific evidence basis
206
+ - Checks include: IL-1 Evidence Gate, Layer Enforcement, Test Sufficiency, Anti-Gaming, Worker Process Audit
207
+ - This proves the Verifier actually performed each check rather than rubber-stamping
208
+
209
+ ### Why This Is Default (Not Optional)
210
+ - IL-1 says "no claims without evidence" — this applies to Worker AND Verifier
211
+ - Without execution_steps, Worker's done-claim is an unsubstantiated assertion
212
+ - Without reasoning, Verifier's verdict is an unsubstantiated judgment
213
+ - Both are archived in `logs/<slug>/` per existing audit trail pattern
214
+
17
215
  ## 2. Roles
18
216
 
19
217
  ### Leader (current session)
@@ -182,6 +380,7 @@ Characteristics:
182
380
  │ ├── <slug>-done-claim.json # Worker's completion claim (runtime)
183
381
  │ ├── <slug>-iter-signal.json # Worker's iteration signal (runtime)
184
382
  │ ├── <slug>-verify-verdict.json # Verifier's verdict (runtime)
383
+ │ ├── <slug>-escalation.md # Architecture escalation report (tmux mode, §7¾)
185
384
  │ ├── <slug>-complete.md # SENTINEL (Leader only)
186
385
  │ └── <slug>-blocked.md # SENTINEL (Leader only)
187
386
  ├── plans/
@@ -191,6 +390,8 @@ Characteristics:
191
390
  ├── iter-NNN.worker-prompt.md # Audit trail prompt copy
192
391
  ├── iter-NNN.verifier-prompt.md # Audit trail prompt copy
193
392
  ├── iter-NNN.result.md # Iteration result (leader-measured + git-measured)
393
+ ├── self-verification-data.json # Cumulative campaign data (--with-self-verification)
394
+ ├── self-verification-report-NNN.md # Versioned campaign analysis report (--with-self-verification)
194
395
  └── status.json # Leader's loop state
195
396
  ```
196
397
 
@@ -311,12 +512,29 @@ Traceability: only changes that resolve a listed issue are allowed.
311
512
  Every change must be justified by the issue it addresses.
312
513
  ```
313
514
 
515
+ ## 7¾. Architecture Escalation
516
+
517
+ Note: Circuit Breaker (§8) fires first at 2 consecutive failures (model upgrade + retry). If the retry also fails (3rd consecutive failure), Architecture Escalation applies. The CB retry counts toward the consecutive_failures counter.
518
+
519
+ If 3+ consecutive fix attempts fail for the same US:
520
+
521
+ 1. **STOP fixing symptoms** — the problem is likely architectural, not a bug.
522
+ 2. **Leader reports to user**: "3 consecutive fix attempts failed for US-{id}. This suggests an architectural issue, not a simple bug."
523
+ 3. **Include in report**:
524
+ - What was attempted in each fix
525
+ - What specifically kept failing
526
+ - Hypothesis: why fixes are not sticking
527
+ 4. **Do NOT attempt fix #4** without user guidance.
528
+ 5. **Options**: refactor architecture, simplify the US, split the US, or mark BLOCKED.
529
+
530
+ In tmux mode: Leader writes `<slug>-escalation.md` with the report and sets BLOCKED sentinel with reason "architecture-escalation".
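The interplay between the Circuit Breaker retry and Architecture Escalation comes down to the value of `consecutive_failures` for the US. A minimal sketch under that reading; the function name and the returned action labels are hypothetical.

```python
def failure_response(consecutive_failures: int) -> str:
    """Map the per-US consecutive failure count to the Leader's next action."""
    if consecutive_failures >= 3:
        # The Circuit Breaker retry also failed: stop fixing, report to user.
        return "architecture_escalation"
    if consecutive_failures == 2:
        # Circuit Breaker fires first: upgrade the model and retry once.
        return "upgrade_model_and_retry"
    if consecutive_failures == 1:
        return "fix_loop"
    return "continue"
```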
531
+
314
532
  ## 8. Circuit Breaker
315
533
 
316
534
  | Condition | Verdict |
317
535
  |-----------|---------|
318
536
  | context-latest.md unchanged for 3 consecutive iterations | BLOCKED |
319
- | Same acceptance criterion fails 2 consecutive iterations | Upgrade model, retry once; if still failing → BLOCKED |
537
+ | Same acceptance criterion fails 2 consecutive iterations | Upgrade model, retry once; if still failing → Architecture Escalation (§7¾) → BLOCKED |
320
538
  | 3 consecutive **fail** verdicts on 3 unique criterion IDs | Upgrade to opus, retry once; if still failing → BLOCKED |
321
539
  | max_iter reached | TIMEOUT (report to user) |
322
540