npm - @ai-dev-methodologies/rlp-desk - Versions diffs - 0.2.3 → 0.3.0 - Mend

@ai-dev-methodologies/rlp-desk 0.2.3 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

package/docs/TODO-verification-next.md +59 -0
package/docs/internal/verification-policy-gap-analysis.md +523 -0
package/docs/internal/verification-strategy-research.md +2097 -0
package/package.json +1 -1
package/src/commands/rlp-desk.md +115 -6
package/src/governance.md +219 -1
package/src/scripts/init_ralph_desk.zsh +221 -12

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@ai-dev-methodologies/rlp-desk",
-  "version": "0.2.3",
+  "version": "0.3.0",
   "description": "Fresh-context iterative loops for Claude Code — autonomous task completion with independent verification",
   "scripts": {
     "postinstall": "node scripts/postinstall.js",

package/src/commands/rlp-desk.md CHANGED

@@ -24,6 +24,17 @@ Ask about these items one by one (or in small groups):
 1. **Slug** — short identifier (e.g., `auth-refactor`). Suggest one, ask if OK.
 2. **Objective** — what the loop achieves
 3. **User Stories** — discrete units with testable acceptance criteria. Propose a breakdown, ask the user to confirm/modify.
+   - Apply INVEST criteria: each US must be Independent, Negotiable, Valuable, Estimable, Small, Testable.
+   - Each AC MUST use Given/When/Then format with **domain language only** (no class names, API paths, DB tables):
+     ```
+     Given [precondition in domain language]
+     When [action in domain language]
+     Then [expected outcome with quantitative criteria]
+     ```
+   - Include at least 1 negative test per US ("must NOT happen").
+   - Include boundary cases per US (empty, max, zero, concurrent).
+   - **Task Type** per US: `code` | `visual` | `content` | `integration` | `infra`
+   - **Risk Level** per US (governance §1c): `LOW` | `MEDIUM` | `HIGH` | `CRITICAL`
 4. **Iteration Unit** — what one worker does per iteration. Explicitly ask:
    - "One US per iteration (bounded, incremental verification)?"
    - "All stories at once (faster, single verification)?"
@@ -42,8 +53,17 @@ Ask about these items one by one (or in small groups):
 11. **Consensus Scope** — If consensus enabled, ask: "Consensus on every verify (all, default) or only on final verify (final-only)?" Default: all.
 12. **Max Iterations** — suggest based on story count, ask if OK.
-After all items are confirmed, present the full contract summary.
-On approval, offer to run `init`.
+After all items are confirmed:
+1. **Ambiguity Gate (IL-2)** — score each AC per governance §1a IL-2 (6 dimensions, 0-12 points).
+   If ANY AC scores below 6: **REJECT** — refine that AC before proceeding.
+   If all ACs score 6-9: **WARN** — proceed with logged warning, show low-scoring dimensions.
+   If all ACs score 10-12: **PASS** — clean.
+   Present the score table to the user before proceeding.
+2. Present the full contract summary.
+3. **Self-Verification** — Ask: "Enable self-verification? Worker records step-by-step evidence, Verifier cross-validates process. Recommended for MEDIUM+ risk." Default: yes for HIGH/CRITICAL, no for LOW/MEDIUM.
+4. On approval, offer to run `init`.
 Do NOT create files during brainstorm.
 Do NOT auto-decide iteration unit — the user MUST explicitly choose.
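The Ambiguity Gate added in this hunk can be read as a small scoring function. The following is a minimal Python sketch, not code from the package: the dimension keys and the function names `score_ac`, `classify`, and `gate` are hypothetical. The aggregation rule for mixed results (any REJECT blocks, otherwise any WARN warns) is an assumption, since the text only states the all-WARN and all-PASS cases.

```python
from typing import Dict, List

# The six rubric dimensions from governance §1a IL-2 (key names are illustrative).
DIMENSIONS = (
    "single_behavior", "domain_language", "stakeholder_clarity",
    "portability", "concrete_example", "independence",
)


def score_ac(points: Dict[str, int]) -> int:
    """Sum the six 0/1/2 dimension scores into a 0-12 total."""
    for name in DIMENSIONS:
        if points.get(name) not in (0, 1, 2):
            raise ValueError(f"{name} must be scored 0, 1, or 2")
    return sum(points[name] for name in DIMENSIONS)


def classify(total: int) -> str:
    """Map a 0-12 total to REJECT (0-5), WARN (6-9), or PASS (10-12)."""
    if total < 6:
        return "REJECT"
    if total < 10:
        return "WARN"
    return "PASS"


def gate(acs: List[Dict[str, int]]) -> str:
    """Overall result: any REJECT blocks; otherwise any WARN warns (assumed)."""
    verdicts = [classify(score_ac(ac)) for ac in acs]
    if "REJECT" in verdicts:
        return "REJECT"
    if "WARN" in verdicts:
        return "WARN"
    return "PASS"
```

Fed the two calibration examples from governance §1a (dimension scores 1, 2, 1, 1, 0, 0 and 2, 1, 1, 1, 2, 0), `score_ac` gives 5 and 7, which classify as REJECT and WARN respectively.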
@@ -97,7 +117,8 @@ Options (parse from `$ARGUMENTS`):
 - `--consensus-scope all|final-only` — when consensus runs (default: `all`)
   - `all`: consensus runs on every verify (current behavior)
   - `final-only`: consensus only on final ALL verify
-- `--debug` — enable debug logging (tmux mode only, writes to logs/<slug>/debug.log)
+- `--debug` — enable debug logging (writes to logs/<slug>/debug.log)
+- `--with-self-verification` — enable campaign-level self-verification analysis. After COMPLETE, Leader analyzes all iteration records (done-claims + verdicts) and generates a campaign self-verification summary with patterns and recommendations for next planning cycle. (Note: execution_steps and reasoning are ALWAYS recorded per governance §1f — this flag adds post-campaign analysis.)
 ### Mode Selection
@@ -144,11 +165,14 @@ DEBUG=<1 if --debug, else 0> \
 1. Validate scaffold: `.claude/ralph-desk/prompts/<slug>.worker.prompt.md` etc.
 2. Check sentinels (complete/blocked). Found → tell user `/rlp-desk clean <slug>`.
 3. Clean previous `done-claim.json`, `verify-verdict.json`.
+4. If `--debug`: create/clear `logs/<slug>/debug.log`. Define a helper: to "debug_log" means append a timestamped line to this file via `Bash("echo \"[$(date '+%Y-%m-%d %H:%M:%S')] $msg\" >> .claude/ralph-desk/logs/<slug>/debug.log")`.
 ### Leader Loop
 **CRITICAL: DO NOT STOP between iterations.** You MUST continue the loop automatically until a sentinel is written (COMPLETE or BLOCKED) or max_iter is reached. Do NOT pause to ask the user. Do NOT wait for confirmation. The loop is fully autonomous — just report each iteration result briefly and immediately proceed to the next iteration.
+If `--debug`, at loop start debug_log: `[PLAN] slug=<slug> max_iter=<N> worker_engine=<engine> worker_model=<model> verifier_engine=<engine> verifier_model=<model> verify_mode=<mode> consensus=<0|1> consensus_scope=<scope>`
 For each iteration (1 to max_iter):
 **① Check sentinels**
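The `debug_log` helper described in this hunk is a shell one-liner. For readers who prefer to see the behavior spelled out, here is a rough Python equivalent under stated assumptions: the function signature and the directory-creation step are illustrative and not part of the package, which appends via `echo` in `Bash()`.

```python
from datetime import datetime
from pathlib import Path


def debug_log(log_path: Path, msg: str) -> str:
    """Append one '[YYYY-MM-DD HH:MM:SS] msg' line to log_path and return it."""
    line = f"[{datetime.now():%Y-%m-%d %H:%M:%S}] {msg}"
    # Assumption: create the log directory if missing; the original relies on init.
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line
```

A call such as `debug_log(Path(".claude/ralph-desk/logs/demo/debug.log"), "[PLAN] slug=demo max_iter=3")` would append a single timestamped `[PLAN]` line; the slug `demo` is a placeholder.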
@@ -166,18 +190,27 @@ rm -f .claude/ralph-desk/memos/<slug>-verify-verdict.json
 **② Read memory.md** → Stop Status, Next Iteration Contract
 - Also read **Completed Stories** → verified work so far
 - Also read **Key Decisions** → settled architectural choices
+- If `--debug`: debug_log `[EXEC] iter=N phase=read_memory stop_status=<status> contract="<summary>"`
 **③ Decide model** (§4 of governance.md)
 - Previous iteration failed → upgrade model
 - Simple task → downgrade
 - User specified → use that
+- If `--debug`: debug_log `[EXEC] iter=N phase=model_select worker_model=<model> reason=<reason>`
 **④ Build worker prompt**
 - Read `.claude/ralph-desk/prompts/<slug>.worker.prompt.md`
 - Combine with iteration number + memory contract
 - Write to `.claude/ralph-desk/logs/<slug>/iter-NNN.worker-prompt.md` (audit trail)
+- Note: Worker ALWAYS records execution_steps in done-claim.json per governance §1f. No flag needed.
+**④½ Contract review** (agent mode only)
+- Before dispatching Worker, spawn a lightweight review: "Is this iteration contract sufficient to achieve the US's AC? Any missing steps?"
+- If `--debug`: debug_log `[EXEC] iter=N phase=contract_review result=<ok|issues>`
+- In tmux mode: skip (shell leader cannot reason). Log: `[EXEC] iter=N phase=contract_review skipped=tmux_mode`
 **⑤ Execute Worker**
+- If `--debug`: debug_log `[EXEC] iter=N phase=worker engine=<engine> model=<model> dispatched=true`
 If `--worker-engine claude` (default):
 ```
@@ -199,11 +232,14 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
 - Codex runs as a subprocess via Bash(), not Agent().
 - Each Bash() call = fresh context for codex.
+- If `--debug`: debug_log `[EXEC] iter=N phase=worker_done engine=<engine>`
 **⑥ Read memory.md again** (Worker updated it)
 - `stop=continue` → go to ⑧
 - `stop=verify` → go to ⑦
 - `stop=blocked` → write BLOCKED sentinel, stop
 - Also read `iter-signal.json` for `us_id` field (which US was just completed)
+- If `--debug`: debug_log `[EXEC] iter=N phase=worker_signal status=<stop_status> us_id=<us_id>`
 **⑦ Execute Verifier**
@@ -225,6 +261,8 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
 - Verifier checks all AC at once
 **⑦a Dispatch Verifier**
+- Note: Verifier ALWAYS records reasoning in verify-verdict.json per governance §1f. No flag needed.
+- If `--debug`: debug_log `[EXEC] iter=N phase=verifier engine=<engine> model=<model> scope=<us_id> dispatched=true`
 If `--verifier-engine claude` (default):
 ```
@@ -259,10 +297,17 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
     1. Read `issues` array, sort by severity (`critical` → `major` → `minor`)
     2. Build structured fix contract with traceability rule
     3. Include `fix_hint` values labeled `(suggestion, non-authoritative)` if present
-    4. Increment `consecutive_failures` in `status.json`
-    5. Go to ⑧ with fix contract as next Worker contract
+    4. Include impacted tests from test-spec (so Worker can run them before and after the fix)
+    5. Increment `consecutive_failures` in `status.json`
+    6. If `consecutive_failures >= 3` for same US → **Architecture Escalation** (governance §7¾): stop fixing, report to user
+    7. Go to ⑧ with fix contract as next Worker contract
   - `request_info` → Leader reads Verifier's questions, decides outcome (or relays to Worker in next contract) → go to ⑧
   - `blocked` → write BLOCKED sentinel, stop
+- If `--debug`: debug_log `[EXEC] iter=N phase=verdict engine=<engine> verdict=<pass|fail|request_info> us_id=<us_id>`
+- If `--debug`: debug_log `[EXEC] iter=N phase=layer_check L1=<status> L2=<status> L3=<status> L4=<status>`
+- If `--debug`: debug_log `[EXEC] iter=N phase=sufficiency test_count=<N> ac_count=<N> ratio=<N> verdict=<pass|fail>`
+- If `--debug`: debug_log `[EXEC] iter=N phase=checkpoint level=<1|2> evidence=<summary>`
+- If `--debug` and consensus: debug_log `[EXEC] iter=N phase=consensus claude=<verdict> codex=<verdict> round=<N>`
 **⑧ Write result log and report to user, continue loop**
 - Write `logs/<slug>/iter-NNN.result.md`:
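The fix-contract steps in this hunk (sort issues by severity, carry `fix_hint` as a labeled suggestion) can be sketched as follows. This is an illustration only: the `id` and `description` fields and the output line format are hypothetical, while `severity`, `fix_hint`, and the `(suggestion, non-authoritative)` label come from the text.

```python
from typing import Dict, List

# Severity order stated in the text: critical, then major, then minor.
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}


def build_fix_contract(issues: List[Dict[str, str]]) -> List[str]:
    """Return one contract line per issue, most severe first."""
    ordered = sorted(issues, key=lambda issue: SEVERITY_ORDER[issue["severity"]])
    lines = []
    for issue in ordered:
        line = f"[{issue['severity']}] {issue['id']}: {issue['description']}"
        hint = issue.get("fix_hint")
        if hint:
            # Hints are advisory; the Worker is not bound to follow them.
            line += f" | fix_hint (suggestion, non-authoritative): {hint}"
        lines.append(line)
    return lines
```

Because `sorted` is stable, issues of equal severity keep the order in which the Verifier listed them.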
@@ -271,6 +316,67 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
   - Verifier verdict `[leader-measured]`
 - Write `status.json`
 - Report: iteration N, phase, model used, result
+- If `--debug`: debug_log `[EXEC] iter=N phase=result status=<result> consecutive_failures=<N> verified_us=<list>`
+At loop end (COMPLETE, BLOCKED, or TIMEOUT):
+- If `--debug`: debug_log `[VALIDATE] result=<COMPLETE|BLOCKED|TIMEOUT> iterations=<N> verified_us=<list>`
+**⑨ Campaign Self-Verification** (when `--with-self-verification` is enabled):
+After the loop ends, the Leader performs post-campaign analysis:
+1. **Collect data**: Read all archived `iter-NNN.result.md`, done-claim.json (with execution_steps), and verify-verdict.json (with reasoning) from `logs/<slug>/`
+2. **Write cumulative data**: `logs/<slug>/self-verification-data.json` — normalized iteration records
+3. **Generate versioned report**: `logs/<slug>/self-verification-report-NNN.md` (NNN = auto-increment from existing reports)
+4. **Report to user**: Display the full report content
+Report template (9 sections):
+```
+# Campaign Self-Verification Report: <slug>
+Report Version: NNN | Generated: timestamp | Campaign: slug — objective
+Schema Version: governance hash | Data Quality: N% iterations complete
+## 1. Automated Validation Summary
+Table: Iter | US | Worker Verdict | Verifier Verdict | Outcome
+## 2. Failure Deep Dive (per failed iteration)
+Per failure: Worker steps → Verifier reasoning → Root cause → Resolution
+## 3. Worker Process Quality (§1f audit)
+Table: Iter | US | Steps | verify_red? | RED exit≠0? | verify_green? | Test-First? | E2E? | AC linked?
+Aggregate: TDD compliance %, RED confirmation %, E2E evidence %, step completeness %
+Audit: each step must have type from §1f vocabulary + ac_id + command + exit_code
+## 4. Verifier Judgment Quality (§1f audit)
+Table: Iter | US | Checks | All Basis? | Independent? | IL-1? | Layer? | Sufficiency? | Anti-Gaming? | Worker Audit?
+Aggregate: Reasoning completeness %, Independent verification %, §1f category coverage %
+Audit: verify all 5 mandatory check categories (IL-1, Layer Enforcement, Test Sufficiency, Anti-Gaming, Worker Process Audit) are present
+## 5. AC Lifecycle
+Table: US | AC | First Claimed (iter) | First Verified (iter) | Reopen Count | Final Status
+## 6. Test-Spec Adherence
+Spec completeness (layers/commands/mappings present)
+Spec execution fidelity (exact checks run and cited)
+## 7. Patterns: Strengths & Weaknesses
+Strengths: what worked well
+Weaknesses: systemic issues
+## 8. Recommendations for Next Cycle
+### Brainstorm (missing scenarios/constraints) — citing iter/AC
+### PRD (ambiguous or oversized ACs) — citing iter/AC
+### Test-Spec (missing layers, weak mappings) — citing iter/AC
+## 9. Blind Spots
+What this report CANNOT prove from available data
+## Data Provenance Rule
+Report content MUST be derivable from: done-claim.json (execution_steps), verify-verdict.json (reasoning),
+PRD, and test-spec. Information from source code inspection that is not in these files must be excluded
+or explicitly marked as "[source-inspection]" with justification.
+```
 ### Circuit Breaker
 - context-latest.md unchanged 3 iterations → BLOCKED
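Step 3 of the campaign self-verification flow names reports `self-verification-report-NNN.md` with NNN auto-incremented from existing reports. One possible way to compute the next name is sketched below; starting at `001` and using the highest existing number plus one are assumptions, and `next_report_path` is a hypothetical helper.

```python
import re
from pathlib import Path

REPORT_RE = re.compile(r"^self-verification-report-(\d{3})\.md$")


def next_report_path(log_dir: Path) -> Path:
    """Return the path for the next versioned report in log_dir."""
    existing = [
        int(match.group(1))
        for path in log_dir.glob("self-verification-report-*.md")
        if (match := REPORT_RE.match(path.name))
    ]
    # Assumption: numbering starts at 001 and follows the highest existing report.
    number = max(existing, default=0) + 1
    return log_dir / f"self-verification-report-{number:03d}.md"
```

Using the maximum rather than the file count keeps numbering monotonic even if an older report was deleted by hand.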
@@ -313,6 +419,8 @@ Remove:
 - `.claude/ralph-desk/logs/<slug>/session-config.json`
 - `.claude/ralph-desk/logs/<slug>/worker-heartbeat.json`
 - `.claude/ralph-desk/logs/<slug>/verifier-heartbeat.json`
+- `.claude/ralph-desk/memos/<slug>-escalation.md`
+Note: `logs/<slug>/self-verification-data.json` and `self-verification-report-NNN.md` are intentionally preserved across clean for historical comparison.
 If `--kill-session` is passed, also kill any tmux session matching `rlp-desk-<slug>-*`:
 ```bash
@@ -342,7 +450,8 @@ Run options:
   --verify-mode per-us|batch Verification strategy (default: per-us)
   --verify-consensus         Cross-engine consensus verification
   --consensus-scope SCOPE    When consensus runs: all|final-only (default: all)
-  --debug                    Debug logging (tmux mode only)
+  --debug                    Debug logging (logs/<slug>/debug.log)
+  --with-self-verification   Campaign self-verification analysis (post-loop report)
 ```
 ## Architecture

package/src/governance.md CHANGED

@@ -10,10 +10,208 @@ The Leader orchestrates, while Worker/Verifier run in isolated fresh contexts ev
 - **Fresh context per iteration**: Worker/Verifier start fresh every time. No prior conversation.
 - **Filesystem = memory**: State exists only on the filesystem (PRD, memory, context, memos).
 - **Worker claim ≠ complete**: A Worker's DONE is merely a claim. The Verifier must independently verify before it's confirmed.
+- **Worker scope is bounded**: Worker implements only the contracted US per iteration (Scope Lock). Out-of-scope changes are flagged by the Verifier.
 - **Verifier is independent**: The Verifier judges based on evidence alone, without knowledge of the Worker's reasoning process.
 - **Sentinels are Leader-owned**: Only the Leader writes COMPLETE/BLOCKED sentinels.
 - **Supported engines**: claude (default; models: haiku, sonnet, opus) and codex (opt-in via `--worker-engine codex` / `--verifier-engine codex`).
+## 1a. Iron Laws
+Absolute rules that cannot be violated under any circumstance.
+```
+IL-1: NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE
+IL-2: NO INIT WITHOUT AC QUALITY SCORE >= 6
+IL-3: NO PASS WITH TODO IN ANY REQUIRED VERIFICATION LAYER
+IL-4: NO PASS WITHOUT TEST COUNT >= AC COUNT x 3
+```
+**IL-1: Evidence Mandate**
+Required: every verdict must reference at least one command execution with its exit code.
+A verdict without command output evidence is automatically invalid.
+Additional signal: phrases such as "should pass", "probably works", "seems to",
+"looks correct", "appears to" (including but not limited to) without command evidence
+confirm the violation but are not the primary check — command output presence is.
+`request_info` is not an escape from IL-1 — if evidence was collectible, it must be collected.
+**IL-2: Ambiguity Gate**
+Each AC is scored 0-12 on 6 dimensions (0/1/2 points each):
+- Single behavior: 0=multiple behaviors, 1=mostly single, 2=exactly one testable behavior
+- Domain language: 0=technical terms (class names, API paths, DB tables), 1=mixed, 2=pure domain language
+- Stakeholder clarity: 0=unclear who benefits, 1=implied, 2=explicitly stated
+- Portability: 0=tech-stack specific, 1=mostly portable, 2=fully stack-independent
+- Concrete example: 0=vague ("some input"), 1=partial specifics, 2=exact values with expected results
+- Independence: 0=requires another AC to pass first, 1=loosely coupled, 2=fully self-contained
+Score interpretation:
+- 0-5: **REJECT** — init blocked. ACs too ambiguous.
+- 6-9: **WARN** — init proceeds with logged warning. Show which dimensions scored low.
+- 10-12: **PASS** — clean.
+IL-2 is a pre-run gate: scoring MUST happen during brainstorm or at init time.
+In tmux mode, IL-2 must be satisfied before `/rlp-desk run` is invoked.
+Calibration example:
+- Score 5 (REJECT): "Given a user, When they log in, Then the system works correctly"
+  → single:1 (login is one behavior), domain:2 (domain terms only), stakeholder:1 (implied user), portability:1 (mostly portable), concrete:0 ("works correctly" is vague), independence:0 (implies registration AC) = 5
+- Score 7 (WARN): "Given a registered user with email 'test@example.com' and valid password, When they submit the login form, Then they are redirected to the dashboard within 2 seconds"
+  → single:2 (exactly one: login redirect), domain:1 ("submit" is slightly technical), stakeholder:1 (implied end user), portability:1 (web-specific "form"), concrete:2 (specific email, 2s threshold), independence:0 (requires registration) = 7
+**IL-3: Layer Completeness**
+Verification layers:
+- L1: Unit Test — function-level, mocks allowed. Always required.
+- L2: Integration — real external services (DB, API, Redis). Required when external dependencies exist.
+- L3: E2E Simulation — known input → full pipeline → output comparison. Always required.
+- L4: Deploy Verify — production environment checks. Required when deploying.
+Layer requirements per US are determined by risk classification (§1c).
+Non-applicable layers must be explicitly marked "N/A — {reason}" in the test-spec.
+Any required layer section with TODO or blank = automatic Verifier FAIL.
+See §1d for full layer definitions.
+**IL-4: Test Sufficiency**
+Each AC must have >= 3 tests covering >= 2 of 3 categories (happy path, negative/error, boundary).
+Tests must be mapped to ACs in the test-spec's criteria-to-test mapping table.
+Only tests listed in this mapping count toward IL-4.
+Count < 3 per any AC = FAIL.
+### Enforcement
+| Iron Law | Checked by | When | Method |
+|----------|-----------|------|--------|
+| IL-1 | Verifier | verification time | mechanical (command output presence) |
+| IL-2 | Leader | brainstorm/init | scored (6-dimension rubric) |
+| IL-3 | Verifier | verification time | mechanical (TODO/blank scan) |
+| IL-4 | Verifier | verification time | scored (test count per AC) |
+- Violation of any Iron Law overrides all other verdict considerations — verdict MUST be FAIL.
+- When an Iron Law is violated, the verdict MUST be `fail` regardless of uncertainty.
+  `request_info` remains valid only when the Verifier cannot determine whether an Iron Law
+  was violated (e.g., cannot access test files, command execution blocked).
+- You (the Leader) cannot waive Iron Laws. Only the user can explicitly waive an Iron Law
+  for a specific US with documented justification in the PRD.
+## 1b. Evidence Gate
+This section operationalizes IL-1 (Evidence Mandate) into a concrete step-by-step protocol.
+Before any verdict, the Verifier MUST follow this 5-step process:
+1. **IDENTIFY**: What command proves this claim?
+2. **RUN**: Execute the command (fresh, not cached or recalled)
+3. **READ**: Full output + exit code + failure count
+4. **VERIFY**: Does output confirm the claim?
+   - YES → state claim WITH evidence (command + output + exit code)
+   - NO → state actual status with evidence
+5. **ONLY THEN**: Issue verdict
+Skipping any step = invalid verification (IL-1 violation).
+### Forbidden Patterns
+- "should pass", "probably works" without command output
+- Trusting Worker's success reports without independent re-execution
+- Partial verification ("linter passed" ≠ "tests passed" ≠ "all AC met")
+- "Code inspection" as substitute for automated command execution
+- Citing cached/prior results instead of fresh execution
+## 1c. Risk Classification
+Each US is classified by risk level during brainstorm. Higher risk = more verification layers.
+| Level | Description | Required Layers | Extra Requirements |
+|-------|-------------|-----------------|-------------------|
+| LOW | Read-only, docs, config | L1 + L3 | — |
+| MEDIUM | New feature, refactor | L1 + L2 (if external deps) + L3 | — |
+| HIGH | Production deploy, data migration | L1 + L2 + L3 + L4 | — |
+| CRITICAL | Financial, security, medical | L1 + L2 + L3 + L4 | consensus + mutation testing (when mutation testing tool is configured in test-spec) |
+L2 is included in MEDIUM+ rows but is marked N/A when no external services exist (see §1d L2 "When N/A" clause).
+### Who Decides
+- During brainstorm: user assigns risk level per US (Leader suggests, user confirms).
+- If brainstorm was skipped: Leader assigns based on PRD content at first run iteration.
+- Risk level is recorded in PRD per US. Cannot be downgraded without user approval.
+### Examples
+- LOW: README update, adding comments, .env.example
+- MEDIUM: REST API endpoint, React component, business rule
+- HIGH: database migration, CI/CD change, deployment config
+- CRITICAL: payment processing, auth, encryption, PII handling
+## 1d. Verification Layers
+Four layers of verification, each targeting a different failure mode.
+### L1: Unit Test (always required)
+- Scope: function/method level, isolated logic
+- Mocks: allowed for external boundaries only
+- Evidence: test runner output with pass/fail count + exit code
+### L2: Integration (required when external dependencies exist)
+- Scope: interaction with real external services (database, API, message queue, cache)
+- Mocks: NOT allowed — use real or containerized services
+- Evidence: integration test output with connection confirmation + data verification
+- When N/A: "N/A — no external services (pure computation/transformation)"
+### L3: E2E Simulation (always required)
+- Scope: known input → full pipeline → quantitative output comparison
+- Evidence: input data + actual output + expected output + comparison result
+- For simple utilities: E2E = "run function with known input, verify output matches expected"
+### L4: Deploy Verify (required when deploying)
+- Scope: production/staging environment health after deployment
+- Evidence: health check response + deployment status + monitoring state
+- When N/A: "N/A — no deployment (library/tool, local-only change)"
+### Rules
+- L1 and L3: always required regardless of risk level.
+- L2: required for MEDIUM+ risk when external services are involved.
+- L4: required for HIGH+ risk when deployment occurs.
+- Layer requirements per US are determined by risk classification (§1c).
+- Non-required layers must be marked "N/A — {reason}" per IL-3. Blank or TODO = FAIL.
+## 1e. Verification Checkpoints
+Verification occurs at two boundaries, not as a single final event.
+### Checkpoint 1: Story/Unit (per-US)
+- Trigger: Worker signals verify with us_id = specific US
+- Scope: that US's acceptance criteria (L1 pass is verified as part of layer enforcement in Verifier step 5)
+- On fail: fix loop (§7½)
+### Checkpoint 2: Release Readiness (us_id=ALL)
+- Trigger: all individual US pass Checkpoint 1 → Worker signals verify with us_id = "ALL"
+- Scope: all AC + L2 integration (if applicable) + L3 E2E Simulation + L4 deploy (if applicable) + mutation score (if CRITICAL, when mutation testing tool is configured in test-spec)
+- On fail: fix loop; escalation to user if 3 consecutive failures
+### Relationship to Existing Flow
+- Checkpoint 1 = existing per-US verify (§7a). No change.
+- Checkpoint 2 = existing "us_id=ALL final full verify" (§7a). Adds explicit layer scope.
+- No new iteration steps are introduced.
+## 1f. Execution & Judgment Traceability
+Every iteration, Worker and Verifier MUST record their process and reasoning — not just results.
+This is the default behavior, not an optional flag. Without it, IL-1 (Evidence Mandate) is incomplete.
+### Worker: execution_steps in done-claim.json
+Worker records what was done, in what order, with command evidence in `done-claim.json`:
+- Each step includes: what action, which AC, command executed, exit code, summary
+- Step types: `write_test`, `verify_red`, `implement`, `verify_green`, `refactor`, `commit`, `verify`
+- This proves the Worker followed test-first approach and did not skip steps
+### Verifier: reasoning in verify-verdict.json
+Verifier records WHY each judgment was made in `verify-verdict.json`:
+- Each check includes: what was checked, decision (pass/fail), and the specific evidence basis
+- Checks include: IL-1 Evidence Gate, Layer Enforcement, Test Sufficiency, Anti-Gaming, Worker Process Audit
+- This proves the Verifier actually performed each check rather than rubber-stamping
+### Why This Is Default (Not Optional)
+- IL-1 says "no claims without evidence" — this applies to Worker AND Verifier
+- Without execution_steps, Worker's done-claim is an unsubstantiated assertion
+- Without reasoning, Verifier's verdict is an unsubstantiated judgment
+- Both are archived in `logs/<slug>/` per existing audit trail pattern
 ## 2. Roles
 ### Leader (current session)
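The layer rules in §1c/§1d and the IL-4 sufficiency rule added in this hunk are mechanical enough to sketch in code. The Python below is an illustration under assumptions: `required_layers` and `il4_failures` are hypothetical names, and the mapping shape (AC id to a list of category labels, one per mapped test) is invented for the example. It follows the per-AC reading of IL-4 (at least 3 tests across at least 2 of the 3 categories).

```python
from typing import Dict, List, Set

RISK_LEVELS = ("LOW", "MEDIUM", "HIGH", "CRITICAL")
CATEGORIES = {"happy", "negative", "boundary"}


def required_layers(risk: str, external_deps: bool, deploys: bool) -> Set[str]:
    """Layers that must be filled in; all others are marked 'N/A' with a reason."""
    if risk not in RISK_LEVELS:
        raise ValueError(f"unknown risk level: {risk}")
    layers = {"L1", "L3"}  # always required regardless of risk level
    if risk != "LOW" and external_deps:
        layers.add("L2")  # MEDIUM+ with external services
    if risk in ("HIGH", "CRITICAL") and deploys:
        layers.add("L4")  # HIGH+ when a deployment occurs
    return layers


def il4_failures(mapping: Dict[str, List[str]]) -> List[str]:
    """Return AC ids with fewer than 3 mapped tests or fewer than 2 categories."""
    failing = []
    for ac_id, test_categories in mapping.items():
        distinct = set(test_categories) & CATEGORIES
        if len(test_categories) < 3 or len(distinct) < 2:
            failing.append(ac_id)
    return failing
```

A non-empty result from `il4_failures` corresponds to a Verifier FAIL under IL-4; only tests listed in the test-spec mapping table would be passed in.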
@@ -182,6 +380,7 @@ Characteristics:
 │   ├── <slug>-done-claim.json       # Worker's completion claim (runtime)
 │   ├── <slug>-iter-signal.json      # Worker's iteration signal (runtime)
 │   ├── <slug>-verify-verdict.json   # Verifier's verdict (runtime)
+│   ├── <slug>-escalation.md          # Architecture escalation report (tmux mode, §7¾)
 │   ├── <slug>-complete.md           # SENTINEL (Leader only)
 │   └── <slug>-blocked.md            # SENTINEL (Leader only)
 ├── plans/
@@ -191,6 +390,8 @@ Characteristics:
     ├── iter-NNN.worker-prompt.md    # Audit trail prompt copy
     ├── iter-NNN.verifier-prompt.md  # Audit trail prompt copy
     ├── iter-NNN.result.md           # Iteration result (leader-measured + git-measured)
+    ├── self-verification-data.json              # Cumulative campaign data (--with-self-verification)
+    ├── self-verification-report-NNN.md          # Versioned campaign analysis report (--with-self-verification)
     └── status.json                  # Leader's loop state
 ```
@@ -311,12 +512,29 @@ Traceability: only changes that resolve a listed issue are allowed.
 Every change must be justified by the issue it addresses.
 ```
+## 7¾. Architecture Escalation
+Note: Circuit Breaker (§8) fires first at 2 consecutive failures (model upgrade + retry). If the retry also fails (3rd consecutive failure), Architecture Escalation applies. The CB retry counts toward the consecutive_failures counter.
+If 3+ consecutive fix attempts fail for the same US:
+1. **STOP fixing symptoms** — the problem is likely architectural, not a bug.
+2. **Leader reports to user**: "3 consecutive fix attempts failed for US-{id}. This suggests an architectural issue, not a simple bug."
+3. **Include in report**:
+   - What was attempted in each fix
+   - What specifically kept failing
+   - Hypothesis: why fixes are not sticking
+4. **Do NOT attempt fix #4** without user guidance.
+5. **Options**: refactor architecture, simplify the US, split the US, or mark BLOCKED.
+In tmux mode: Leader writes `<slug>-escalation.md` with the report and sets BLOCKED sentinel with reason "architecture-escalation."
 ## 8. Circuit Breaker
 | Condition | Verdict |
 |-----------|---------|
 | context-latest.md unchanged for 3 consecutive iterations | BLOCKED |
-| Same acceptance criterion fails 2 consecutive iterations | Upgrade model, retry once; if still failing → BLOCKED |
+| Same acceptance criterion fails 2 consecutive iterations | Upgrade model, retry once; if still failing → Architecture Escalation (§7¾) → BLOCKED |
 | 3 consecutive **fail** verdicts on 3 unique criterion IDs | Upgrade to opus, retry once; if still failing → BLOCKED |
 | max_iter reached | TIMEOUT (report to user) |
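The interaction between the Circuit Breaker row for a repeatedly failing criterion and Architecture Escalation (§7¾) amounts to a small decision ladder over `consecutive_failures`. A minimal sketch follows, assuming the counter tracks failures for the same US and that the returned action names (all hypothetical) are interpreted by the Leader.

```python
def next_action(consecutive_failures: int) -> str:
    """Pick the Leader's next step from the consecutive failure count."""
    if consecutive_failures < 0:
        raise ValueError("consecutive_failures must be non-negative")
    if consecutive_failures >= 3:
        # §7¾: stop fixing, report to the user (BLOCKED sentinel in tmux mode).
        return "architecture_escalation"
    if consecutive_failures == 2:
        # §8: the Circuit Breaker fires first, with a model upgrade and one retry.
        return "upgrade_model_and_retry"
    return "continue_fix_loop"
```

The retry triggered at two failures counts toward the same counter, so a failed retry lands on the escalation branch.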