npm - @ai-dev-methodologies/rlp-desk - Versions diffs - 0.3.5 → 0.4.0 - Mend

@ai-dev-methodologies/rlp-desk 0.3.5 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7) hide show

package/docs/blueprints/blueprint-v0.4-evolution.md +347 -0
package/docs/prompts/ralplan-codex-review.md +55 -0
package/package.json +1 -1
package/src/commands/rlp-desk.md +73 -23
package/src/governance.md +39 -22
package/src/scripts/init_ralph_desk.zsh +139 -4
package/src/scripts/run_ralph_desk.zsh +358 -80

package/docs/blueprints/blueprint-v0.4-evolution.md ADDED Viewed

@@ -0,0 +1,347 @@
+# Blueprint: rlp-desk v0.4 Evolution
+> Design blueprint for rlp-desk's next major direction.
+> Status: CONFIRMED (Deep Interview 4.9% ambiguity) | Author: kyjin | Date: 2026-03-26
+---
+## Vision
+rlp-desk is both a **task execution tool** and a **workflow generator**.
+Users start with unstructured work, iterate through self-verification cycles,
+and the accumulated process naturally becomes a reusable, formalized workflow.
+```
+[Unstructured]                    [Structured]
+brainstorm → run → verify          Workflow (skill + command composition)
+  → re-brainstorm → run → verify   Feedback loop enforcement
+  → run → verify                    Reproducible process
+  → final result            ──▶     (P3: determined after P0-P2 iteration)
+```
+---
+## 1. Debug (`--debug`) — Execution Trace
+### Purpose
+Trace rlp-desk's execution process. Not verbose data dumps — focused logging
+of whether rules were followed and options behaved as configured.
+### Two audiences
+| Audience | Use case |
+|----------|----------|
+| Developer (self) | Verify governance compliance, catch erroneous execution (e.g., codex consensus FAIL treated as PASS) |
+| External users (npm) | Run with `--debug`, attach debug.log + version to bug report |
+### Scope
+- Governance rule compliance (IL-1 through IL-5, checkpoint enforcement)
+- Option behavior verification (consensus, per-us, model routing)
+- Decision points (model upgrades, circuit breaker triggers)
+- NOT implementation details of Worker/Verifier content
+### Current state
+Basic implementation exists in v0.3.6 (debug.log with phase-level entries).
+Refine to match the scoped purpose above — no expansion needed, possibly trimming.
+### Versioning
+debug.log is versioned on re-execution: `debug-v1.log`, `debug-v2.log`, ...
+Preserved for bug tracking across versions.
+---
+## 2. Self-Verification (`--with-self-verification`) — Quality Feedback Loop
+### Purpose
+Evaluate whether the AI implementation meets quality expectations.
+When the user is unsatisfied, self-verification becomes the **input for the next
+execution cycle** — same goal, improved strategy.
+### Status
+Remains an **optional flag** (`--with-self-verification`). Not always-on.
+Simple tasks don't need the re-execution cycle.
+### Current vs. New
+| Aspect | Current (v0.3.x) | New vision |
+|--------|-------------------|------------|
+| Timing | Post-campaign report only | Input for re-execution cycle |
+| Output | Static report | Living document that drives next iteration |
+| Scope | Analysis of what happened | Analysis + recommendations that reshape execution plan |
+| PRD | Not touched | May be refined based on self-verification findings |
+### Re-execution Cycle (same slug)
+```
+brainstorm("auth-refactor")
+  → init → run → self-verification-v1 + campaign-report-v1 + debug-v1
+                              │
+              user: "not satisfied, re-run"
+                              │
+                              ▼
+re-brainstorm("auth-refactor")
+  │
+  ├─ PRD: single file, updated in place if needed (no versioning)
+  ├─ SV report: renamed to self-verification-v1.md (preserved)
+  ├─ Campaign report: renamed to campaign-report-v1.md (preserved)
+  ├─ Debug log: renamed to debug-v1.log (preserved)
+  ├─ Everything else: deleted (test-spec, prompts, context, memos, logs)
+  └─ Re-brainstorm: informed by self-verification-v1
+                              │
+                              ▼
+  → init → run → self-verification-v2 + campaign-report-v2 + debug-v2
+                              │
+                              ...
+```
+### Versioning Rules
+When re-running the same slug:
+1. **PRD** — single file (`prd-<slug>.md`), updated in place if needed.
+   No versioning. PRD is the single source of truth.
+2. **Versioned files (3 total)** — renamed with vN suffix before re-run:
+   - `self-verification-report.md` → `self-verification-v1.md`
+   - `campaign-report.md` → `campaign-report-v1.md`
+   - `debug.log` → `debug-v1.log`
+3. **Everything else** — deleted. Next run regenerates them automatically:
+   - `test-spec-<slug>.md`, `prompts/`, `context/`, `memos/`, `logs/<slug>/*`
+4. **Self-verification as the historical record** — each version's story
+   is told by its self-verification report. No need to preserve iteration
+   logs; SV summarizes what happened.
+### Re-execution Detection
+When brainstorm detects an existing slug:
+- Ask the user: "Improve based on previous results, or start fresh?"
+- If improve: version existing files, carry forward PRD, re-brainstorm with SV context
+- If start fresh: clean everything (equivalent to `clean` + new brainstorm)
+### Implementation: Shell Script
+Deterministic file operations (rename, delete, version detection) go in
+`init_ralph_desk.zsh`. AI handles judgment (PRD refinement, strategy changes).
+---
+## 3. Post-Run Reporting — Mandatory Completion Report
+### Purpose
+After `run` completes, the user MUST receive a comprehensive, templated report
+before deciding next steps. This is the **decision surface** for whether to
+re-brainstorm or accept the result.
+### Trigger
+Mandatory after every `run` completion (COMPLETE, BLOCKED, or TIMEOUT).
+Not optional. Not skippable. Applies regardless of SV flag.
+### Output
+- **File**: `logs/<slug>/campaign-report.md` (versioned on re-execution)
+- **Screen**: Full report displayed to user
+### Report Template
+```markdown
+# Campaign Report: <slug>
+## Objective
+<from PRD>
+## Execution Summary
+| Metric | Value |
+|--------|-------|
+| Total iterations | N |
+| Outcome | COMPLETE / BLOCKED / TIMEOUT |
+| Worker model | sonnet / opus |
+| Verifier model | opus |
+| Duration | Xm Ys |
+## User Stories Status
+| US | Description | Status | Iterations | Notes |
+|----|-------------|--------|------------|-------|
+| US-001 | ... | PASS | 2 | — |
+| US-002 | ... | PASS | 4 | 2 fix rounds |
+## Verification Results
+- L1 (Unit): PASS — N tests, N assertions
+- L2 (Integration): PASS / N/A
+- L3 (E2E): PASS — input/output comparison
+- L4 (Deploy): N/A
+## Issues Encountered
+<failures, fix loops, model upgrades, escalations>
+## Cost & Performance
+| Role | Model | Tokens | Duration | Source |
+|------|-------|--------|----------|--------|
+| Worker | sonnet | N | Xm Ys | measured/estimated |
+| Verifier | opus | N | Xm Ys | measured/estimated |
+| **Total** | | **N** | **Xm Ys** | |
+## Self-Verification Summary (if enabled)
+<from self-verification report — strengths, weaknesses, recommendations>
+## Files Changed
+<git diff --stat summary>
+```
+### Post-Report Flow
+```
+[All runs]
+  → Report displayed + saved to file
+[SV enabled only]
+  → "Would you like to re-brainstorm to improve the result?"
+  → Yes: trigger re-execution cycle (§2)
+  → No:  session ends
+[SV not enabled]
+  → Report displayed, session ends (no re-brainstorm question)
+```
+### Rules
+- Report content must reference actual data (status.json, iteration results,
+  self-verification if available) — no fabrication
+- Template is fixed; sections may show "N/A" but cannot be omitted
+- Re-brainstorm question only appears when SV is enabled
+---
+## 4. Workflow Generation — From Ad-hoc to Reproducible (P3, deferred)
+### Purpose
+After multiple self-verification cycles produce a final result,
+automatically generate a **formalized workflow** that captures the
+proven process as a reusable, enforceable process.
+### Status: DEFERRED
+P3 design will be determined after P0-P2 are working and tested through
+real usage. The user will iterate with the developer to find the right form.
+### Known Direction
+- Output: **skill + command composition** (not rlp-desk PRD format)
+- Invoking the command triggers a combination of skills
+- The command structure enforces feedback loops
+- Leverage existing skill ecosystem (`/find-skills`, etc.)
+- Trust AI models + well-structured skills + checklist-managed feedback loops
+### Success Criteria (confirmed)
+- Process must be reproducible: same workflow → same quality of results
+- Code implementation may differ, but behavior/quality must be equivalent
+- Feedback loop + template documents as structural components
+### Open Questions (to resolve after P0-P2)
+1. Separate subcommand (`/rlp-desk workflow <slug>`) or independent skill?
+2. Minimum self-verification versions required (2? 3?)
+3. How to validate generated workflow actually reproduces results?
+---
+## Feature Relationship
+```
+┌──────────────────────────────────────────────────────────────┐
+│                     rlp-desk execution                       │
+│                                                              │
+│  brainstorm → init → run ──┐                                 │
+│                             │                                │
+│              ┌──────────────┼──────────────────┐             │
+│              │              │                  │             │
+│           --debug    --with-self-verification   │             │
+│           (execution  (quality evaluation)      │             │
+│            trace)           │                   │             │
+│              │              ▼                   │             │
+│              │     self-verification report     │             │
+│              │              │                   │             │
+│              ▼              ▼                   │             │
+│         debug.log   Post-Run Report (mandatory) │             │
+│         (versioned)  + campaign-report.md        │             │
+│              │       (versioned)                 │             │
+│              │         │                        │             │
+│              │    [SV enabled?]                  │             │
+│              │      │         │                  │             │
+│              │     Yes        No                 │             │
+│              │      │         │                  │             │
+│              │      ▼         ▼                  │             │
+│              │  "Re-brainstorm?"  End            │             │
+│              │    │         │                    │             │
+│              │   Yes        No                   │             │
+│              │    │         │                    │             │
+│              │    ▼         ▼                    │             │
+│              │  Re-execute  Accept               │             │
+│              │  (vN+1)     result                │             │
+│              │    │         │                    │             │
+│              │    │    ┌────┘                    │             │
+│              │    ▼    ▼                         │             │
+│              │  Workflow Generation (P3)          │             │
+│              │  (deferred — after P0-P2)          │             │
+│              │                                   │             │
+│              └── Bug report (external users)     │             │
+└──────────────────────────────────────────────────────────────┘
+```
+---
+## Implementation Priority
+| Phase | Feature | Dependency | Scope |
+|-------|---------|------------|-------|
+| P0 | Debug refinement | None | rlp-desk.md, governance.md |
+| P1 | Post-Run Report | None | rlp-desk.md |
+| P2 | Self-Verification redesign | P1 | rlp-desk.md, init_ralph_desk.zsh, governance.md |
+| P3 | Workflow Generation | P2 + real usage | TBD after iteration |
+- P0 and P1 are independent, can be done in parallel.
+- P2 builds on P1 (report triggers re-brainstorm question).
+- P3 requires P2 (needs versioned self-verification data) + real-world testing.
+- Breaking changes allowed (0.x semver). Document in CHANGELOG.
+---
+## Design Decisions Log (from Deep Interview)
+| # | Decision | Rationale |
+|---|----------|-----------|
+| 1 | brainstorm auto-handles re-execution | Natural UX — same command, system detects context |
+| 2 | Mixed judgment (quantitative + user) | Pure metrics miss quality nuance; pure subjective misses patterns |
+| 3 | Breaking changes OK | 0.x semver, clean redesign over backward compat hacks |
+| 4 | P3 essential but deferred | Core vision requires it, but form emerges from P0-P2 usage |
+| 5 | Skill + command composition for P3 | Leverage existing ecosystem, not reinvent |
+| 6 | All features in rlp-desk | Connected UX > separate tools |
+| 7 | Shell script for deterministic ops | AI interpretation unreliable for file manipulation |
+| 8 | SV remains optional | Simple tasks don't need re-execution overhead |
+| 9 | Re-brainstorm only with SV | No SV = no improvement data = no point asking |
+| 10 | Report always mandatory | Users need decision surface regardless of SV |
+| 11 | PRD = single file, no versioning | Source of truth, updated in place |
+| 12 | Only 3 files versioned | SV report + campaign report + debug.log |
+| 13 | Report includes cost section | Token/time tracking for optimization decisions |
+| 14 | Ask user intent on slug reuse | "Improve?" vs "Start fresh?" — don't assume |
+---
+## Open Questions (P3 only)
+1. **Workflow as subcommand or separate skill?** — defer to P0-P2 experience
+2. **Minimum SV versions for generation** — need real data to determine
+3. **Reproducibility validation method** — how to test generated workflow works
+4. **Skill composition mechanics** — how feedback loop enforcement works in practice

package/docs/prompts/ralplan-codex-review.md ADDED Viewed

@@ -0,0 +1,55 @@
+# Ralplan + Codex Cross-Validation
+```
+/ralplan {{OBJECTIVE}}
+{{SCOPE}}
+Run codex cross-validation after consensus. Repeat revise -> consensus -> codex until 0 issues.
+If source documents are insufficient, identify gaps before proceeding.
+```
+---
+## Placeholders
+| Placeholder | Required | Description |
+|---|---|---|
+| `{{OBJECTIVE}}` | Yes | Planning goal. Naturally includes target documents, deliverable type, and context. |
+| `{{SCOPE}}` | No | Files or directories to change. Omit to target the entire current project. |
+## Prerequisites
+- [oh-my-claudecode](https://github.com/anthropics/oh-my-claudecode) installed and configured
+- Codex CLI installed and authenticated (`codex --version`)
+- Deep interview (`/deep-interview`) completed beforehand to clarify requirements
+## How It Works
+1. `/ralplan` runs the Planner -> Architect -> Critic consensus loop (built-in)
+2. After consensus APPROVE, codex independently reviews the plan
+3. If codex finds issues, the plan is revised and re-enters consensus
+4. Repeats until codex returns 0 issues
+## Why
+Iterating ralplan consensus with codex cross-validation produces plans robust enough
+to leverage rlp-desk's verification loop (self-verification, post-run report)
+effectively. Plans that survive both reviewers have fewer surprises during execution.
+## Examples
+### With scope
+```
+/ralplan Implementation plan based on blueprint-v0.4
+src/commands/rlp-desk.md, src/governance.md, src/scripts/init_ralph_desk.zsh
+Run codex cross-validation after consensus. Repeat revise -> consensus -> codex until 0 issues.
+If source documents are insufficient, identify gaps before proceeding.
+```
+### Without scope (defaults to current project)
+```
+/ralplan Auth module refactoring strategy
+Run codex cross-validation after consensus. Repeat revise -> consensus -> codex until 0 issues.
+If source documents are insufficient, identify gaps before proceeding.
+```

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "@ai-dev-methodologies/rlp-desk",
-  "version": "0.3.5",
+  "version": "0.4.0",
   "description": "Fresh-context iterative loops for Claude Code — autonomous task completion with independent verification",
   "scripts": {
     "postinstall": "node scripts/postinstall.js",

package/src/commands/rlp-desk.md CHANGED Viewed

@@ -62,7 +62,11 @@ After all items are confirmed:
    Present the score table to the user before proceeding.
 2. Present the full contract summary.
 3. **Self-Verification** — Ask: "Enable self-verification? Worker records step-by-step evidence, Verifier cross-validates process. Recommended for MEDIUM+ risk." Default: yes for HIGH/CRITICAL, no for LOW/MEDIUM.
-4. On approval, offer to run `init`.
+4. **Re-execution check**: After slug is confirmed, check if `.claude/ralph-desk/plans/prd-<slug>.md` already exists. If a PRD already exists for this slug, ask: "A PRD already exists for this slug. Improve the existing PRD or start fresh (delete and recreate)?"
+   - "improve" → pass `--mode improve` to init
+   - "start fresh" → pass `--mode fresh` to init
+   - If no PRD exists: standard first-run (no --mode needed)
+5. On approval, offer to run `init`.
 Do NOT create files during brainstorm.
 Do NOT auto-decide iteration unit — the user MUST explicitly choose.
@@ -71,7 +75,7 @@ Do NOT auto-decide iteration unit — the user MUST explicitly choose.
 ## `init <slug> [objective]`
-Run: `~/.claude/ralph-desk/init_ralph_desk.zsh <slug> "<objective>"`
+Run: `~/.claude/ralph-desk/init_ralph_desk.zsh <slug> "<objective>" [--mode fresh|improve]`
 If brainstorm was done, auto-fill PRD and test-spec with the results.
 **After init completes, STOP. Do NOT auto-run the loop.**
@@ -133,6 +137,8 @@ Options (parse from `$ARGUMENTS`):
 - `--consensus-scope all|final-only` — when consensus runs (default: `all`)
   - `all`: consensus runs on every verify (current behavior)
   - `final-only`: consensus only on final ALL verify
+- `--cb-threshold N` — circuit breaker threshold: consecutive failures before BLOCKED (default: 3). When `--verify-consensus` is active, effective threshold is automatically doubled (e.g., default becomes 6).
+- `--iter-timeout N` — per-iteration timeout in seconds (default: 600). Enforced in tmux mode only. Agent mode: not enforced (Agent() has no timeout API).
 - `--debug` — enable debug logging (writes to logs/<slug>/debug.log)
 - `--with-self-verification` — enable campaign-level self-verification analysis. After COMPLETE, Leader analyzes all iteration records (done-claims + verdicts) and generates a campaign self-verification summary with patterns and recommendations for next planning cycle. (Note: execution_steps and reasoning are ALWAYS recorded per governance §1f — this flag adds post-campaign analysis.)
@@ -164,7 +170,10 @@ VERIFIER_CODEX_REASONING=<--verifier-codex-reasoning value, default: high> \
 VERIFY_MODE=<--verify-mode value, default: per-us> \
 VERIFY_CONSENSUS=<1 if --verify-consensus, else 0> \
 CONSENSUS_SCOPE=<--consensus-scope value, default: all> \
+CB_THRESHOLD=<--cb-threshold value, default: 3> \
+ITER_TIMEOUT=<--iter-timeout value, default: 600> \
 DEBUG=<1 if --debug, else 0> \
+WITH_SELF_VERIFICATION=<1 if --with-self-verification, else 0> \
   zsh ~/.claude/ralph-desk/run_ralph_desk.zsh
 ```
 6. **If the script exits with error (exit code 1)** — report the error to the user and STOP. Do NOT attempt to work around it. Do NOT create tmux sessions yourself. Do NOT re-launch the script in a different way. Just tell the user what went wrong and suggest using Agent mode instead.
@@ -174,6 +183,7 @@ DEBUG=<1 if --debug, else 0> \
 - Tmux mode requires the user to already be inside a tmux session. If the runner script rejects because $TMUX is not set, do NOT try to create a tmux session yourself. Tell the user: "Start tmux first, then retry."
 - Do NOT run the script in background (`&`, `run_in_background`). The script must run in foreground so panes remain visible to the user. The user needs to see Worker/Verifier panes in real-time.
 - Do NOT kill panes after completion. Panes stay alive for inspection. User cleans up with `/rlp-desk clean <slug> --kill-session`.
+- `--with-self-verification` is accepted in tmux mode but SV report generation is Agent-mode only (requires AI analysis). In tmux mode, the flag is recorded in session-config for post-hoc analysis. Use Agent mode for full SV report generation.
 #### Agent Mode (`--mode agent` or default)
@@ -183,6 +193,10 @@ DEBUG=<1 if --debug, else 0> \
 3. Clean previous `done-claim.json`, `verify-verdict.json`.
 4. **Always**: write baseline log entry to `.claude/ralph-desk/logs/<slug>/baseline.log`: `[timestamp] iter=0 phase=start slug=<slug> worker_model=<model> verifier_model=<model>`. Baseline.log captures 1 line per iteration for lightweight post-mortem (always-on, no flag needed).
 5. If `--debug`: also create/clear `logs/<slug>/debug.log`. Define a helper: to "debug_log" means append a timestamped line to this file via `Bash("echo \"[$(date '+%Y-%m-%d %H:%M:%S')] $msg\" >> .claude/ralph-desk/logs/<slug>/debug.log")`. When `--debug` is active, debug.log contains all baseline.log fields plus detailed phase logs.
+   - **4-category log system**: all debug_log entries use exactly one of: `[GOV]` (governance checks: IL enforcement, CB triggers, scope lock, verdict evaluation), `[DECIDE]` (leader decisions: model selection, fix contracts, escalation), `[OPTION]` (configuration snapshot at loop start: thresholds, modes, models), `[FLOW]` (execution progress: worker/verifier dispatch, signal reads, phase transitions)
+   - **Re-execution versioning**: If `debug.log` already exists at `--debug` start, rename it to `debug-v{N}.log` (N = next available integer ≥ 1) before creating a fresh `debug.log`.
+   - **baseline.log lifecycle**: baseline.log is deleted on re-execution (when `init --mode improve` or `init --mode fresh` is run).
+6. Capture baseline commit: `Bash("git rev-parse HEAD 2>/dev/null || echo none")` → store as `BASELINE_COMMIT`. Include in the first `status.json` write as `baseline_commit` field.
 ### Leader Loop
@@ -194,7 +208,10 @@ DEBUG=<1 if --debug, else 0> \
 - **After every step result, IMMEDIATELY start the next step's tool call in the SAME response.** For example, after reading the verdict (⑦c), report via Bash("echo") AND start ⑧'s tool calls in one response.
 - If you output "Iter 1 complete, moving to iter 2" as plain text without a tool call, the turn terminates and the loop breaks. This is a platform constraint, not a compliance issue — no amount of "DO NOT STOP" text can override it.
-If `--debug`, at loop start debug_log: `[PLAN] slug=<slug> max_iter=<N> worker_engine=<engine> worker_model=<model> verifier_engine=<engine> verifier_model=<model> verify_mode=<mode> consensus=<0|1> consensus_scope=<scope>`
+If `--debug`, at loop start debug_log the following (3 [OPTION] entries):
+- `[OPTION] slug=<slug> max_iter=<N> verify_mode=<mode> consensus=<0|1> consensus_scope=<scope>`
+- `[OPTION] cb_threshold=<N> effective_cb_threshold=<N>`
+- `[OPTION] worker_engine=<engine> worker_model=<model> verifier_engine=<engine> verifier_model=<model>`
 For each iteration (1 to max_iter):
@@ -213,13 +230,13 @@ rm -f .claude/ralph-desk/memos/<slug>-verify-verdict.json
 **② Read memory.md** → Stop Status, Next Iteration Contract
 - Also read **Completed Stories** → verified work so far
 - Also read **Key Decisions** → settled architectural choices
-- If `--debug`: debug_log `[EXEC] iter=N phase=read_memory stop_status=<status> contract="<summary>"`
+- If `--debug`: debug_log `[FLOW] iter=N phase=read_memory stop_status=<status> contract="<summary>"`
 **③ Decide model** (§4 of governance.md)
 - Previous iteration failed → upgrade model
 - Simple task → downgrade
 - User specified → use that
-- If `--debug`: debug_log `[EXEC] iter=N phase=model_select worker_model=<model> reason=<reason>`
+- If `--debug`: debug_log `[DECIDE] iter=N phase=model_select worker_model=<model> reason=<reason>`
 **④ Build worker prompt (Prompt Assembly Protocol)**
 1. Capture `WORKING_DIR` once: use `$PWD` from when `/rlp-desk run` was invoked. Store for all prompt construction.
@@ -232,11 +249,11 @@ rm -f .claude/ralph-desk/memos/<slug>-verify-verdict.json
 **④½ Contract review** (agent mode only)
 - Before dispatching Worker, spawn a lightweight review: "Is this iteration contract sufficient to achieve the US's AC? Any missing steps?"
-- If `--debug`: debug_log `[EXEC] iter=N phase=contract_review result=<ok|issues>`
-- In tmux mode: skip (shell leader cannot reason). Log: `[EXEC] iter=N phase=contract_review skipped=tmux_mode`
+- If `--debug`: debug_log `[GOV] iter=N phase=contract_review scope_lock=<us_id|null> ac_count=<N> result=<ok|issues>`
+- In tmux mode: skip (shell leader cannot reason). Log: `[FLOW] iter=N phase=contract_review skipped=tmux_mode`
 **⑤ Execute Worker**
-- If `--debug`: debug_log `[EXEC] iter=N phase=worker engine=<engine> model=<model> dispatched=true`
+- If `--debug`: debug_log `[FLOW] iter=N phase=worker engine=<engine> model=<model> dispatched=true`
 If `--worker-engine claude` (default):
 ```
@@ -258,14 +275,13 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
 - Codex runs as a subprocess via Bash(), not Agent().
 - Each Bash() call = fresh context for codex.
-- If `--debug`: debug_log `[EXEC] iter=N phase=worker_done engine=<engine>`
 **⑥ Read memory.md again** (Worker updated it)
 - `stop=continue` → go to ⑧
 - `stop=verify` → go to ⑦
 - `stop=blocked` → write BLOCKED sentinel, stop
 - Also read `iter-signal.json` for `us_id` field (which US was just completed)
-- If `--debug`: debug_log `[EXEC] iter=N phase=worker_signal status=<stop_status> us_id=<us_id>`
+- If `--debug`: debug_log `[FLOW] iter=N phase=worker_done_signal engine=<engine> status=<stop_status> us_id=<us_id>`
 **CRITICAL: Immediately proceed to ⑦. Do NOT pause, do NOT ask the user, do NOT wait for confirmation. The loop is autonomous.**
@@ -291,7 +307,7 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
 **⑦a Dispatch Verifier**
 - Note: Verifier ALWAYS records reasoning in verify-verdict.json per governance §1f. No flag needed.
 - **Prompt Assembly Protocol (same as ④)**: Read verifier prompt file verbatim. Prepend `## WORKING_DIR: {absolute path}`. Do NOT rewrite paths.
-- If `--debug`: debug_log `[EXEC] iter=N phase=verifier engine=<engine> model=<model> scope=<us_id> dispatched=true`
+- If `--debug`: debug_log `[FLOW] iter=N phase=verifier engine=<engine> model=<model> scope=<us_id> dispatched=true`
 If `--verifier-engine claude` (default):
 ```
@@ -316,7 +332,7 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
 - Both produce `verify-verdict.json` (Leader renames to `verify-verdict-claude.json` and `verify-verdict-codex.json`)
 - **Both pass** → proceed (next US or COMPLETE)
 - **Either fails** → combine issues from both verdicts into a single fix contract → Worker retry
-- Max 3 consensus rounds per US. After 3 rounds → BLOCKED.
+- Max 6 consensus rounds per US. After 6 rounds → BLOCKED.
 **NO ENGINE PRIORITY (ABSOLUTE):** There is no primary or secondary engine. Claude and Codex have EQUAL weight. If one passes and the other fails, the verdict is FAIL — always. The Leader MUST NOT override, prioritize, or dismiss either engine's verdict. "Claude priority", "primary engine override", "infrastructure failure" (when a valid verdict file exists), or any similar rationalization = governance violation. Infrastructure failure means ONLY: CLI crash (exit ≠ 0), timeout, or verdict file not generated.
@@ -330,22 +346,26 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
     3. Include `fix_hint` values labeled `(suggestion, non-authoritative)` if present
     4. Include impacted tests from test-spec (so Worker can run them before and after the fix)
     5. Increment `consecutive_failures` in `status.json`
-    6. If `consecutive_failures >= 3` for same US → **Architecture Escalation** (governance §7¾): stop fixing, report to user
+    6. If `consecutive_failures >= cb_threshold` for same US → **Architecture Escalation** (governance §7¾): stop fixing, report to user
+       - If `--debug`: debug_log `[GOV] iter=N phase=CB_trigger consecutive_failures=<N> us_id=<us_id> action=architecture_escalation`
     7. Go to ⑧ with fix contract as next Worker contract
   - `request_info` → Leader reads Verifier's questions, decides outcome (or relays to Worker in next contract) → go to ⑧
   - `blocked` → write BLOCKED sentinel, stop
-- If `--debug`: debug_log `[EXEC] iter=N phase=verdict engine=<engine> verdict=<pass|fail|request_info> us_id=<us_id>`
-- If `--debug`: debug_log `[EXEC] iter=N phase=layer_check L1=<status> L2=<status> L3=<status> L4=<status>`
-- If `--debug`: debug_log `[EXEC] iter=N phase=sufficiency test_count=<N> ac_count=<N> ratio=<N> verdict=<pass|fail>`
-- If `--debug`: debug_log `[EXEC] iter=N phase=checkpoint level=<1|2> evidence=<summary>`
-- If `--debug` and consensus: debug_log `[EXEC] iter=N phase=consensus claude=<verdict> codex=<verdict> round=<N>`
+- If `--debug`: debug_log `[GOV] iter=N phase=verdict engine=<engine> verdict=<pass|fail|request_info> us_id=<us_id> L1=<status> L2=<status> L3=<status> L4=<status>`
+- If `--debug`: debug_log `[GOV] iter=N phase=sufficiency test_count=<N> ac_count=<N> ratio=<N> verdict=<pass|fail>`
+**⑦d Archive iteration artifacts** (always — independent of --debug)
+After reading the verdict, archive to `logs/<slug>/`:
+- `iter-NNN-done-claim.json` ← copy from `memos/<slug>-done-claim.json`
+- `iter-NNN-verify-verdict.json` ← copy from `memos/<slug>-verify-verdict.json`
+(Preserved across clean; data source for campaign report generation and SV analysis.)
 **CRITICAL: Immediately proceed to ⑧. Do NOT pause, do NOT ask the user. Continue the loop.**
 **⑧ Write result log and report to user, continue loop**
 - Write `logs/<slug>/iter-NNN.result.md`:
   - Result status `[leader-measured]`
-  - Files changed via `git diff --stat HEAD~1 HEAD` `[git-measured]`
+  - Files changed: cumulative working tree state via `git diff --stat HEAD` `[git-measured]` (note: cumulative in tmux mode, not per-iteration delta)
   - Verifier verdict `[leader-measured]`
 - **Record cost & performance per iteration**:
   - Agent mode: record `total_tokens` and `duration_ms` from Agent() return metadata for both Worker and Verifier
@@ -354,10 +374,10 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
 - Write `status.json`
 - Report via tool call: `Bash("echo 'Iter N | US-NNN | verdict | model | next_action'")` — NEVER plain text. This keeps the turn alive for the next iteration.
 - **Always**: append to baseline.log: `[timestamp] iter=N verdict=<pass|fail|continue> us=<us_id> model=<worker_model>`
-- If `--debug`: debug_log `[EXEC] iter=N phase=result status=<result> consecutive_failures=<N> verified_us=<list>`
+- If `--debug`: debug_log `[FLOW] iter=N phase=result status=<result> consecutive_failures=<N> verified_us=<list>`
 At loop end (COMPLETE, BLOCKED, or TIMEOUT):
-- If `--debug`: debug_log `[VALIDATE] result=<COMPLETE|BLOCKED|TIMEOUT> iterations=<N> verified_us=<list>`
+- If `--debug`: debug_log `[FLOW] result=<COMPLETE|BLOCKED|TIMEOUT> iterations=<N> verified_us=<list>`
 **⑨ Campaign Self-Verification** (when `--with-self-verification` is enabled):
@@ -421,6 +441,23 @@ PRD, and test-spec. Information from source code inspection that is not in these
 or explicitly marked as "[source-inspection]" with justification.
 ```
+**⑩ Campaign Report** (always — independent of `--debug` and `--with-self-verification`)
+After the loop ends (COMPLETE, BLOCKED, or TIMEOUT), generate `logs/<slug>/campaign-report.md`:
+1. If `campaign-report.md` already exists, rename it to `campaign-report-v{N}.md` (N = next available integer ≥ 1) before writing new.
+2. Generate report with 8 required sections:
+   - **Objective**: From PRD
+   - **Execution Summary**: Iterations run, terminal state (COMPLETE/BLOCKED/TIMEOUT), elapsed time
+   - **US Status**: Each US with final verified/failed/pending status (from `status.json`)
+   - **Verification Results**: Per-US and final verify outcomes (from archived iter artifacts)
+   - **Issues Encountered**: Fix contracts and failure verdicts from campaign
+   - **Cost & Performance**: Per-iter token/duration data from `status.json`
+   - **SV Summary**: If `--with-self-verification` ran, pointer to SV report file; otherwise "N/A — --with-self-verification not enabled"
+   - **Files Changed**: `git diff --stat <baseline_commit>` (working tree vs baseline, includes uncommitted changes and untracked files). Note: may include pre-existing uncommitted changes if the campaign started in a dirty worktree.
+3. Data sources: `status.json` (baseline_commit, per-iter data), archived `iter-NNN-done-claim.json` / `iter-NNN-verify-verdict.json`, PRD, git diff.
+4. If `--with-self-verification` was enabled: ⑨ SV report runs first, then ⑩ Campaign Report (which includes the SV Summary section pointing to the SV report file).
 ### Circuit Breaker
 - context-latest.md unchanged 3 iterations → BLOCKED
 - Same acceptance criterion fails 2 consecutive iterations → upgrade model, retry once, then BLOCKED
@@ -441,6 +478,7 @@ When `--verify-consensus` is enabled, also track in `status.json`:
 - YOU track iteration count
 - Write `status.json` after each iteration
 - Worker claim ≠ complete. Only YOU write COMPLETE sentinel after verifier pass.
+- **NEVER modify rlp-desk infrastructure files** (`~/.claude/ralph-desk/*`, `~/.claude/commands/rlp-desk.md`). If you or a Worker/Verifier discovers a bug in rlp-desk itself, write BLOCKED sentinel with reason `"rlp-desk bug: <description>"` and STOP. Do NOT attempt to fix rlp-desk — report the bug to the user.
 ---
@@ -463,11 +501,21 @@ Remove:
 - `.claude/ralph-desk/logs/<slug>/worker-heartbeat.json`
 - `.claude/ralph-desk/logs/<slug>/verifier-heartbeat.json`
 - `.claude/ralph-desk/memos/<slug>-escalation.md`
-Note: `logs/<slug>/self-verification-data.json` and `self-verification-report-NNN.md` are intentionally preserved across clean for historical comparison.
+Note: `logs/<slug>/self-verification-data.json`, `self-verification-report-NNN.md`, `campaign-report.md`, `campaign-report-v{N}.md`, `iter-NNN-done-claim.json`, and `iter-NNN-verify-verdict.json` are intentionally preserved across clean for historical comparison.
-If `--kill-session` is passed, also kill any tmux session matching `rlp-desk-<slug>-*`:
+If `--kill-session` is passed, clean up ALL tmux artifacts:
 ```bash
+# Kill rlp-desk tmux sessions
 tmux list-sessions -F '#{session_name}' 2>/dev/null | grep "^rlp-desk-<slug>-" | while read s; do tmux kill-session -t "$s"; done
+# Kill split panes in current window (Worker/Verifier panes from --mode tmux)
+# Find panes running claude/codex for this slug and kill them
+for pane_id in $(tmux list-panes -F '#{pane_id}:#{pane_current_command}' 2>/dev/null | grep -i 'claude\|codex' | cut -d: -f1); do
+  tmux kill-pane -t "$pane_id" 2>/dev/null
+done
+# Kill any remaining claude/codex processes for this campaign
+ps aux | grep -E "claude.*<slug>|codex.*<slug>" | grep -v grep | awk '{print $2}' | xargs kill 2>/dev/null
 ```
 ## No args or `help`
@@ -493,6 +541,8 @@ Run options:
   --verify-mode per-us|batch Verification strategy (default: per-us)
   --verify-consensus         Cross-engine consensus verification
   --consensus-scope SCOPE    When consensus runs: all|final-only (default: all)
+  --cb-threshold N           CB threshold: consecutive failures before BLOCKED (default: 3)
+  --iter-timeout N           Per-iteration timeout in seconds, tmux mode only (default: 600)
   --debug                    Debug logging (logs/<slug>/debug.log)
   --with-self-verification   Campaign self-verification analysis (post-loop report)
 ```