npm - @ai-dev-methodologies/rlp-desk - Versions diffs - 0.5.1 → 0.5.3 - Mend

@ai-dev-methodologies/rlp-desk 0.5.1 → 0.5.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9) hide show

package/README.md +41 -27
package/docs/plans/frolicking-churning-honey.md +253 -0
package/install.sh +18 -17
package/package.json +1 -1
package/scripts/postinstall.js +32 -13
package/src/commands/rlp-desk.md +130 -109
package/src/governance.md +74 -23
package/src/scripts/lib_ralph_desk.zsh +41 -11
package/src/scripts/run_ralph_desk.zsh +72 -42

package/src/commands/rlp-desk.md CHANGED Viewed

@@ -25,6 +25,7 @@ Ask about these items one by one (or in small groups):
 2. **Objective** — what the loop achieves
 3. **User Stories** — discrete units with testable acceptance criteria. Propose a breakdown, ask the user to confirm/modify.
    - Apply INVEST criteria: each US must be Independent, Negotiable, Valuable, Estimable, Small, Testable.
+   - **Task Sizing (governance §1c)**: Size each US within the Worker's comfortable zone — smaller than what the Worker can handle, not at its ceiling. Max 3-4 ACs, max 2 files. If a US feels "just barely doable" for the target model, split it further.
    - Each AC MUST use Given/When/Then format with **domain language only** (no class names, API paths, DB tables):
      ```
      Given [precondition in domain language]
@@ -53,39 +54,47 @@ Ask about these items one by one (or in small groups):
    | External dependencies | none | 1-2 | 3+ | distributed |
    | Existing code impact | new only | modify | refactor | architecture |
-   **Model mapping** (Worker / Verifier):
-   - LOW → haiku / sonnet
-   - MEDIUM → sonnet / opus
-   - HIGH → opus / opus
-   - CRITICAL → opus / opus + require human review
+   **Codex Detection** — check if codex CLI is installed (`command -v codex`).
-   Present complexity score with evidence to the user, e.g.: "I rate this MEDIUM because: US count=4 (MEDIUM), file scope=2 (MEDIUM), logic=conditionals (MEDIUM), deps=none (LOW), impact=modify (MEDIUM). Highest=MEDIUM → I suggest Worker: sonnet, Verifier: opus."
+   **Model mapping — Claude-only** (codex not installed):
-8. **Engine & Model** — For each role (Worker, Verifier):
-   - Engine: claude (default) or codex
-   - If claude: suggest model (haiku/sonnet/opus) based on task complexity
-   - If codex: suggest model (default: gpt-5.4) and reasoning effort (low/medium/high)
+   | Complexity | Worker | per-US Verifier | Final Verifier | Consensus |
+   |------------|--------|-----------------|----------------|-----------|
+   | LOW | haiku | sonnet | opus | off |
+   | MEDIUM | sonnet | opus | opus | off |
+   | HIGH | opus | opus | opus | off |
+   | CRITICAL | opus | opus | opus + human | off |
-   **Codex Detection** — check if codex CLI is installed (`command -v codex`):
+   **Model mapping — Cross-engine** (codex installed, recommended):
-   **If codex IS installed** — recommend cross-engine Worker:
-   - Suggest: `--worker-model gpt-5.4:high --verify-consensus` (cross-engine + consensus)
-   - Alternative: `--worker-model gpt-5.3-codex-spark:high` (spark preset — note: 100k output token limit per request, best for smaller scope PRDs)
-   - Say: "Codex is installed. I recommend it as Worker for cost savings (codex tokens are cheaper than claude tokens for bulk iteration) and cross-engine blind-spot coverage (claude Verifier catches issues codex Worker misses)."
+   | Complexity | Worker | per-US Verifier | Final Verifier | Consensus |
+   |------------|--------|-----------------|----------------|-----------|
+   | LOW | spark:high | sonnet | opus | final-only |
+   | MEDIUM | spark:high | opus | opus | final-only |
+   | HIGH | gpt-5.4:high | opus | opus | all |
+   | CRITICAL | gpt-5.4:high | opus | opus + human | all |
-   **If codex is NOT installed** — recommend claude-only + install suggestion:
-   - Defaulting to claude-only Worker (sonnet).
-   - Say: "Codex is not installed. Defaulting to claude-only Worker. Note: without a second engine, your Verifier shares the same perspective as the Worker — there is a risk of blind spots where both Worker and Verifier miss the same issue. To unlock cross-engine coverage: `npm install -g @openai/codex`"
+   **Worker model selection** (cross-engine):
+   - **spark:high** — default recommendation (Pro token pool = cost savings). PRD AC count <= 15
+   - **gpt-5.4:high** — fallback when spark 100k output limit exceeded. PRD AC count > 15
-   AI should recommend: "For this task complexity, I suggest Worker: sonnet, Verifier: opus"
-   If codex selected: "For codex Worker, I suggest gpt-5.4 with high reasoning"
+   Present complexity score with evidence to the user, e.g.: "I rate this MEDIUM because: US count=4 (MEDIUM), file scope=2 (MEDIUM), logic=conditionals (MEDIUM), deps=none (LOW), impact=modify (MEDIUM). Highest=MEDIUM."
+   **If codex IS installed** — say: "Codex is installed. I recommend cross-engine Worker for cost savings (Pro token pool separation) and cross-engine blind-spot coverage (claude Verifier catches issues codex Worker misses)."
+   **If codex is NOT installed** — say: "Codex is not installed. Defaulting to claude-only Worker. Note: without a second engine, your Verifier shares the same perspective as the Worker — there is a risk of blind spots where both Worker and Verifier miss the same issue. To unlock cross-engine coverage: `npm install -g @openai/codex`"
+8. **Batch Capacity Check** — when verify-mode is batch and PRD is large:
+   - batch + spark + AC > 10 → warn "spark 100k output limit — consider wave split or switch to gpt-5.4"
+   - batch + gpt-5.4 + AC > 15 → warn "too many ACs for single batch — consider wave split (3-4 US per wave)"
+   - per-us → no warning (US-level processing, no limit concern)
 9. **Verify Mode** — per-us (default) or batch. Ask: "Verify after each user story (per-us, recommended) or only after all stories are done (batch)?" Default recommendation: per-us for 2+ stories.
-10. **Verify Consensus** — Ask: "Use cross-engine consensus verification? (Both claude and codex verify independently, both must pass.) Requires codex CLI." Default: no.
-11. **Consensus Scope** — If consensus enabled, ask: "Consensus on every verify (all, default) or only on final verify (final-only)?" Default: all.
-12. **Max Iterations** — suggest based on story count, ask if OK.
+10. **Consensus** — Ask: "Use cross-engine consensus? off (single engine), final-only (cross-engine on final verify only), or all (cross-engine on every verify). Requires codex CLI." Default: off. Recommended: final-only when codex is installed.
+11. **Max Iterations** — suggest based on story count, ask if OK.
 After all items are confirmed:
+0. **SV Report Feedback** — If a prior campaign's self-verification report exists for this project (`~/.claude/ralph-desk/analytics/*/self-verification-report-*.md`), reference it to inform this brainstorm: which US types failed most, which model tiers underperformed, which AC patterns caused issues. Present relevant findings to the user. (governance §8½)
 1. **Ambiguity Gate (IL-2)** — score each AC per governance §1a IL-2 (6 dimensions, 0-12 points).
    If ANY AC scores below 6: **REJECT** — refine that AC before proceeding.
    If all ACs score 6-9: **WARN** — proceed with logged warning, show low-scoring dimensions.
@@ -123,27 +132,33 @@ Tell the user:
    ```
    Available run commands (copy the one you want):
-   # Recommended: cross-engine + final-consensus (cost savings + blind-spot coverage):
-   /rlp-desk run <actual-slug> --worker-model gpt-5.4:high --final-consensus --debug
+   # ★ Recommended: cross-engine + final-consensus (cost savings + blind-spot coverage):
+   /rlp-desk run <actual-slug> --mode tmux --worker-model spark:high --consensus final-only --debug
-   # Spark Pro preset (fast codex worker, lower cost):
-   /rlp-desk run <actual-slug> --worker-model gpt-5.3-codex-spark:high --debug
+   # Large PRD (AC > 15, exceeds spark 100k limit):
+   /rlp-desk run <actual-slug> --mode tmux --worker-model gpt-5.4:high --consensus final-only --debug
+   # Critical (full consensus on every verify):
+   /rlp-desk run <actual-slug> --mode tmux --worker-model gpt-5.4:high --consensus all --debug
    # Claude-only:
    /rlp-desk run <actual-slug> --debug
-   # Basic agent:
-   /rlp-desk run <actual-slug>
    # Full options reference:
-   #   --mode agent|tmux                (default: agent)
-   #   --worker-model MODEL             haiku|sonnet|opus or gpt-5.4:low|medium|high (default: sonnet)
-   #   --verifier-model MODEL           haiku|sonnet|opus (default: opus)
-   #   --verify-consensus               both claude+codex must pass
-   #   --verify-mode per-us|batch       (default: per-us)
-   #   --max-iter N                     (default: 100)
-   #   --debug                          enable debug logging
-   #   --with-self-verification         post-campaign analysis report
+   #   --mode agent|tmux                      (default: agent)
+   #   --worker-model MODEL                   haiku|sonnet|opus or spark:high|gpt-5.4:high (default: haiku)
+   #   --lock-worker-model                    disable auto model upgrade
+   #   --verifier-model MODEL                 per-US verifier (default: sonnet)
+   #   --final-verifier-model MODEL           final ALL verifier (default: opus)
+   #   --consensus off|all|final-only         cross-engine consensus (default: off)
+   #   --consensus-model MODEL                per-US cross-verifier (default: gpt-5.4:medium)
+   #   --final-consensus-model MODEL          final cross-verifier (default: gpt-5.4:high)
+   #   --verify-mode per-us|batch             (default: per-us)
+   #   --cb-threshold N                       (default: 6)
+   #   --max-iter N                           (default: 100)
+   #   --iter-timeout N                       tmux only (default: 600)
+   #   --debug                                debug logging
+   #   --with-self-verification               post-campaign SV report
    ```
    **If codex is NOT installed** — show claude-only presets + install recommendation:
@@ -151,7 +166,7 @@ Tell the user:
    ```
    Available run commands (copy the one you want):
-   # Recommended: tmux mode + claude-only (real-time visibility):
+   # ★ Recommended: tmux mode + claude-only (real-time visibility):
    /rlp-desk run <actual-slug> --mode tmux --debug
    # Agent mode:
@@ -161,13 +176,17 @@ Tell the user:
    npm install -g @openai/codex
    # Full options reference:
-   #   --mode agent|tmux                (default: agent)
-   #   --worker-model MODEL             haiku|sonnet|opus (default: sonnet)
-   #   --verifier-model MODEL           haiku|sonnet|opus (default: opus)
-   #   --verify-mode per-us|batch       (default: per-us)
-   #   --max-iter N                     (default: 100)
-   #   --debug                          enable debug logging
-   #   --with-self-verification         post-campaign analysis report
+   #   --mode agent|tmux                      (default: agent)
+   #   --worker-model MODEL                   haiku|sonnet|opus (default: haiku)
+   #   --lock-worker-model                    disable auto model upgrade
+   #   --verifier-model MODEL                 per-US verifier (default: sonnet)
+   #   --final-verifier-model MODEL           final ALL verifier (default: opus)
+   #   --verify-mode per-us|batch             (default: per-us)
+   #   --cb-threshold N                       (default: 6)
+   #   --max-iter N                           (default: 100)
+   #   --iter-timeout N                       tmux only (default: 600)
+   #   --debug                                debug logging
+   #   --with-self-verification               post-campaign SV report
    ```
    Replace `<actual-slug>` with the real slug from this init (e.g. `auth-refactor`).
@@ -182,24 +201,21 @@ Tell the user:
 Options (parse from `$ARGUMENTS`):
 - `--mode agent|tmux` (default: `agent`) — execution mode
-- `--max-iter N` (default: 100)
-- `--worker-model MODEL` (default: sonnet)
-- `--verifier-model MODEL` (default: opus)
-- `--worker-engine claude|codex` (default: `claude`) — engine for Worker
-- `--verifier-engine claude|codex` (default: `claude`) — engine for Verifier
-- `--worker-codex-model MODEL` (default: `gpt-5.4`) — codex model for Worker
-- `--worker-codex-reasoning low|medium|high` (default: `high`) — reasoning for Worker
-- `--verifier-codex-model MODEL` (default: `gpt-5.4`) — codex model for Verifier
-- `--verifier-codex-reasoning low|medium|high` (default: `high`) — reasoning for Verifier
+- `--worker-model MODEL` (default: `haiku`) — Worker model. Format: `model` = claude engine, `model:reasoning` = codex engine. Examples: `haiku`, `sonnet`, `opus`, `spark:high`, `gpt-5.4:high`. Parsed by `parse_model_flag()` which auto-splits engine/model/reasoning.
+- `--lock-worker-model` — disable automatic model upgrade on failure (check_model_upgrade). Worker stays on the specified model regardless of consecutive failures.
+- `--verifier-model MODEL` (default: `sonnet`) — per-US verification model. Campaign-fixed (no progressive upgrade). Lighter than final verifier.
+- `--final-verifier-model MODEL` (default: `opus`) — final ALL verification model. Independent from per-US verifier. Used only for the final full-AC verify pass.
+- `--consensus off|all|final-only` (default: `off`) — cross-engine consensus verification mode.
+  - `off`: single-engine verification only
+  - `all`: cross-engine consensus on every verify (per-US and final)
+  - `final-only`: cross-engine consensus only on the final ALL verify
+- `--consensus-model MODEL` (default: `gpt-5.4:medium`) — per-US cross-verifier model. Lighter weight for cost efficiency.
+- `--final-consensus-model MODEL` (default: `gpt-5.4:high`) — final cross-verifier model. Stricter. Note: spark is not allowed here (100k output limit).
 - `--verify-mode per-us|batch` (default: `per-us`) — verification strategy
   - `per-us`: verify after each US, then final full verify of all AC
   - `batch`: verify only after all US done (legacy behavior)
-- `--verify-consensus` — enable cross-engine consensus verification (both claude and codex verify independently; both must pass)
-- `--consensus-scope all|final-only` — when consensus runs (default: `all`)
-  - `all`: consensus runs on every verify (current behavior)
-  - `final-only`: consensus only on final ALL verify
-- `--cb-threshold N` — circuit breaker threshold: consecutive failures before BLOCKED (default: 3). When `--verify-consensus` is active, effective threshold is automatically doubled (e.g., default becomes 6).
-- `--consensus-fail-fast` — skip second verifier if first verifier fails (saves time/tokens in consensus mode)
+- `--cb-threshold N` — circuit breaker threshold: consecutive failures before BLOCKED (default: 6). When `--consensus` is not `off`, effective threshold is automatically doubled (e.g., default becomes 12).
+- `--max-iter N` (default: 100)
 - `--iter-timeout N` — per-iteration timeout in seconds (default: 600). Enforced in tmux mode only. Agent mode: not enforced (Agent() has no timeout API).
 - `--debug` — enable debug logging (writes to ~/.claude/ralph-desk/analytics/<slug>/debug.log)
 - `--with-self-verification` — enable campaign-level self-verification analysis. After COMPLETE, Leader analyzes all iteration records (done-claims + verdicts) and generates a campaign self-verification summary with patterns and recommendations for next planning cycle. (Note: execution_steps and reasoning are ALWAYS recorded per governance §1f — this flag adds post-campaign analysis.)
@@ -208,7 +224,7 @@ Options (parse from `$ARGUMENTS`):
 When `--debug` or `--with-self-verification` is active, analytics data is written to a user-level directory for cross-project aggregation. Contents:
 - `metadata.json` — campaign metadata: slug, project_root, campaign_status, start_time, end_time
 - `debug.log` — debug output (versioned: `debug-v{N}.log` on re-execution)
-- `campaign.jsonl` — per-iteration structured data (versioned: `campaign-v{N}.jsonl` on re-execution). Schema: iter, us_id, worker_model, worker_engine, verifier_engine, claude_verdict, codex_verdict, consensus, duration_worker_s, duration_verifier_s, project_root, slug, timestamp
+- `campaign.jsonl` — per-iteration structured data (versioned: `campaign-v{N}.jsonl` on re-execution). Schema: iter, us_id, worker_model, worker_engine, verifier_model, verifier_engine, consensus_mode, claude_verdict, codex_verdict, duration_worker_s, duration_verifier_s, project_root, slug, timestamp
 - `self-verification-data.json` — cumulative SV records (agent-mode only, when `--with-self-verification`)
 - `self-verification-report-NNN.md` — versioned SV reports (when `--with-self-verification`)
@@ -232,17 +248,14 @@ LOOP_NAME="<slug>" \
 ROOT="$PWD" \
 MAX_ITER=<--max-iter value> \
 WORKER_MODEL=<--worker-model value> \
-VERIFIER_MODEL=<--verifier-model value> \
-WORKER_ENGINE=<--worker-engine value, default: claude> \
-VERIFIER_ENGINE=<--verifier-engine value, default: claude> \
-WORKER_CODEX_MODEL=<--worker-codex-model value, default: gpt-5.4> \
-WORKER_CODEX_REASONING=<--worker-codex-reasoning value, default: high> \
-VERIFIER_CODEX_MODEL=<--verifier-codex-model value, default: gpt-5.4> \
-VERIFIER_CODEX_REASONING=<--verifier-codex-reasoning value, default: high> \
+LOCK_WORKER_MODEL=<1 if --lock-worker-model, else 0> \
+VERIFIER_MODEL=<--verifier-model value, default: sonnet> \
+FINAL_VERIFIER_MODEL=<--final-verifier-model value, default: opus> \
 VERIFY_MODE=<--verify-mode value, default: per-us> \
-VERIFY_CONSENSUS=<1 if --verify-consensus, else 0> \
-CONSENSUS_SCOPE=<--consensus-scope value, default: all> \
-CB_THRESHOLD=<--cb-threshold value, default: 3> \
+CONSENSUS_MODE=<--consensus value, default: off> \
+CONSENSUS_MODEL=<--consensus-model value, default: gpt-5.4:medium> \
+FINAL_CONSENSUS_MODEL=<--final-consensus-model value, default: gpt-5.4:high> \
+CB_THRESHOLD=<--cb-threshold value, default: 6> \
 ITER_TIMEOUT=<--iter-timeout value, default: 600> \
 DEBUG=<1 if --debug, else 0> \
 WITH_SELF_VERIFICATION=<1 if --with-self-verification, else 0> \
@@ -269,7 +282,7 @@ WITH_SELF_VERIFICATION=<1 if --with-self-verification, else 0> \
 ### Preparation
 1. Validate scaffold: `.claude/ralph-desk/prompts/<slug>.worker.prompt.md` etc.
-2. **Codex CLI pre-validation**: If `--verify-consensus` is enabled OR `--worker-engine codex` / `--verifier-engine codex` is set, check that `codex` CLI exists in PATH. If codex CLI not found → STOP immediately, print install instructions (`npm install -g @openai/codex`), do not start the loop.
+2. **Codex CLI pre-validation**: If `--consensus` is not `off` OR `--worker-model` uses codex format (contains `:`) OR `--verifier-model` / `--final-verifier-model` / `--consensus-model` / `--final-consensus-model` uses codex format, check that `codex` CLI exists in PATH. If codex CLI not found → STOP immediately, print install instructions (`npm install -g @openai/codex`), do not start the loop.
 3. Check sentinels (complete/blocked). Found → tell user `/rlp-desk clean <slug>`.
 4. Clean previous `done-claim.json`, `verify-verdict.json`.
 5. **Always**: write baseline log entry to `.claude/ralph-desk/logs/<slug>/baseline.log`: `[timestamp] iter=0 phase=start slug=<slug> worker_model=<model> verifier_model=<model>`. Baseline.log captures 1 line per iteration for lightweight post-mortem (always-on, no flag needed).
@@ -290,9 +303,9 @@ WITH_SELF_VERIFICATION=<1 if --with-self-verification, else 0> \
 - If you output "Iter 1 complete, moving to iter 2" as plain text without a tool call, the turn terminates and the loop breaks. This is a platform constraint, not a compliance issue — no amount of "DO NOT STOP" text can override it.
 If `--debug`, at loop start debug_log the following (3 [OPTION] entries):
-- `[OPTION] slug=<slug> max_iter=<N> verify_mode=<mode> consensus=<0|1> consensus_scope=<scope>`
-- `[OPTION] cb_threshold=<N> effective_cb_threshold=<N>`
-- `[OPTION] worker_engine=<engine> worker_model=<model> verifier_engine=<engine> verifier_model=<model>`
+- `[OPTION] slug=<slug> max_iter=<N> verify_mode=<mode> consensus_mode=<off|all|final-only>`
+- `[OPTION] cb_threshold=<N> effective_cb_threshold=<N> lock_worker_model=<0|1>`
+- `[OPTION] worker_model=<model> verifier_model=<model> final_verifier_model=<model> consensus_model=<model> final_consensus_model=<model>`
 For each iteration (1 to max_iter):
@@ -341,7 +354,9 @@ rm -f .claude/ralph-desk/memos/<slug>-verify-verdict.json
 **⑤ Execute Worker**
 - If `--debug`: debug_log `[FLOW] iter=N phase=worker engine=<engine> model=<model> dispatched=true`
-If `--worker-engine claude` (default):
+Determine engine from `--worker-model` format: plain name (e.g., `haiku`) = claude engine, `model:reasoning` format (e.g., `spark:high`) = codex engine. Use `parse_model_flag()` to split.
+If claude engine (default):
 ```
 Agent(
   description="rlp-desk worker iter-NNN",
@@ -354,9 +369,9 @@ Agent(
 - Agent returns synchronously. No polling needed.
 - Each Agent() = fresh context. Guaranteed.
-If `--worker-engine codex`:
+If codex engine:
 ```
-Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_reasoning> <full worker prompt text>")
+Bash("codex exec --model <codex_model> --reasoning-effort <codex_reasoning> <full worker prompt text>")
 ```
 - Codex runs as a subprocess via Bash(), not Agent().
 - Each Bash() call = fresh context for codex.
@@ -396,26 +411,35 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
 - **Prompt Assembly Protocol (same as ④)**: Read verifier prompt file verbatim. Prepend `## WORKING_DIR: {absolute path}`. Do NOT rewrite paths.
 - If `--debug`: debug_log `[FLOW] iter=N phase=verifier engine=<engine> model=<model> scope=<us_id> dispatched=true`
-If `--verifier-engine claude` (default):
+Determine which verifier model to use based on scope:
+- If `us_id` is a specific story (per-US verify) → use `--verifier-model` (default: sonnet)
+- If `us_id` is "ALL" (final verify) → use `--final-verifier-model` (default: opus)
+Determine engine from the selected verifier model format (same as Worker): plain name = claude, `model:reasoning` = codex.
+If claude engine (default):
 ```
 Agent(
   description="rlp-desk verifier iter-NNN (us_id)",
   subagent_type="executor",
-  model=<verifier_model>,
+  model=<selected_verifier_model>,
   mode="bypassPermissions",
   prompt=<full verifier prompt text with US scope>
 )
 ```
-If `--verifier-engine codex`:
+If codex engine:
 ```
-Bash("codex exec --model <verifier_codex_model> --reasoning-effort <verifier_codex_reasoning> <full verifier prompt text>")
+Bash("codex exec --model <codex_model> --reasoning-effort <codex_reasoning> <full verifier prompt text>")
 ```
-**⑦b Consensus Verification** (when `--verify-consensus` is enabled):
-After the primary verifier runs, run a second verifier with the OTHER engine:
-- If primary engine is claude → run codex verifier
-- If primary engine is codex → run claude verifier
+**⑦b Consensus Verification** (when `--consensus` is `all`, or `final-only` and scope is ALL):
+After the primary verifier runs, run a cross-engine second verifier:
+- Determine cross-verifier model based on scope:
+  - per-US verify → use `--consensus-model` (default: gpt-5.4:medium)
+  - final ALL verify → use `--final-consensus-model` (default: gpt-5.4:high)
+- If primary engine is claude → cross-verifier uses codex (the consensus model)
+- If primary engine is codex → cross-verifier uses claude `opus` (fixed)
 - Both produce `verify-verdict.json` (Leader renames to `verify-verdict-claude.json` and `verify-verdict-codex.json`)
 - **Both pass** → proceed (next US or COMPLETE)
 - **Either fails** → combine issues from both verdicts into a single fix contract → Worker retry
@@ -461,7 +485,7 @@ After reading the verdict, archive to `logs/<slug>/`:
 - Write `status.json`
 - Report via tool call: `Bash("echo 'Iter N | US-NNN | verdict | model | next_action'")` — NEVER plain text. This keeps the turn alive for the next iteration.
 - **Always**: append to baseline.log: `[timestamp] iter=N verdict=<pass|fail|continue> us=<us_id> model=<worker_model>`
-- **Always**: append JSONL to `~/.claude/ralph-desk/analytics/<slug>/campaign.jsonl`: `{"iter":N,"us_id":"US-NNN","verdict":"pass|fail","worker_model":"...","worker_engine":"...","verifier_model":"...","verifier_engine":"...","duration_worker_s":N,"duration_verifier_s":N,"timestamp":"ISO8601"}`
+- **Always**: append JSONL to `~/.claude/ralph-desk/analytics/<slug>/campaign.jsonl`: `{"iter":N,"us_id":"US-NNN","verdict":"pass|fail","worker_model":"...","worker_engine":"...","verifier_model":"...","verifier_engine":"...","consensus_mode":"off|all|final-only","duration_worker_s":N,"duration_verifier_s":N,"timestamp":"ISO8601"}`
 - If `--debug`: debug_log `[FLOW] iter=N phase=result status=<result> consecutive_failures=<N> verified_us=<list>`
 At loop end (COMPLETE, BLOCKED, or TIMEOUT):
@@ -556,7 +580,7 @@ Track `consecutive_failures` in `status.json` (increment on `fail`, reset on `pa
 Track `verified_us` (array of US IDs that passed verification) in `status.json` when using `--verify-mode per-us`.
-When `--verify-consensus` is enabled, also track in `status.json`:
+When `--consensus` is not `off`, also track in `status.json`:
 - `consensus_round`: current consensus round for this US (resets per US)
 - `claude_verdict`: latest claude verifier verdict for this US
 - `codex_verdict`: latest codex verifier verdict for this US
@@ -577,8 +601,8 @@ Read `.claude/ralph-desk/logs/<slug>/runtime/status.json` and display a detailed
 Campaign: <slug>
 Iteration: <iteration> / <max_iter>
 Phase: <phase> | Last Result: <last_result>
-Worker Model: <worker_model> (<worker_engine>) | Verifier Model: <verifier_model> (<verifier_engine>)
-Verify Mode: <verify_mode> | Consensus: <verify_consensus>
+Worker Model: <worker_model> | Verifier: <verifier_model> (per-US) / <final_verifier_model> (final)
+Verify Mode: <verify_mode> | Consensus: <consensus_mode>
 Consecutive Failures: <consecutive_failures>
 Verified US: <verified_us array, comma-separated>
 Updated: <updated_at_utc> (elapsed: now - updated_at)
@@ -652,7 +676,7 @@ Resume a previously interrupted campaign. Equivalent to `run <slug>` but explici
 1. Read `.claude/ralph-desk/logs/<slug>/runtime/status.json` for `verified_us`, `iteration`, `consecutive_failures`
 2. Read `.claude/ralph-desk/memos/<slug>-memory.md` for completed stories and next iteration contract
 3. Check for sentinels (`complete.md`, `blocked.md`) — if present, inform user and stop
-4. If no sentinels, invoke `run <slug>` with the same options from the previous session (stored in status.json fields: `worker_model`, `verifier_model`, `verify_mode`, `verify_consensus`)
+4. If no sentinels, invoke `run <slug>` with the same options from the previous session (stored in status.json fields: `worker_model`, `verifier_model`, `final_verifier_model`, `verify_mode`, `consensus_mode`)
 5. The runner automatically restores `verified_us` from memory or status.json on startup
 Example:
@@ -670,23 +694,20 @@ Example:
 /rlp-desk clean <slug> [--kill-session]     Reset for re-run (--kill-session kills tmux)
 Run options:
-  --mode agent|tmux          Execution mode (default: agent)
-  --max-iter N               Max iterations (default: 100)
-  --worker-model MODEL       Worker model (default: sonnet)
-  --verifier-model MODEL     Verifier model (default: opus)
-  --worker-engine claude|codex   Worker engine (default: claude)
-  --verifier-engine claude|codex Verifier engine (default: claude)
-  --worker-codex-model MODEL          Worker codex model (default: gpt-5.4)
-  --worker-codex-reasoning LEVEL      Worker codex reasoning (default: high)
-  --verifier-codex-model MODEL        Verifier codex model (default: gpt-5.4)
-  --verifier-codex-reasoning LEVEL    Verifier codex reasoning (default: high)
-  --verify-mode per-us|batch Verification strategy (default: per-us)
-  --verify-consensus         Cross-engine consensus verification
-  --consensus-scope SCOPE    When consensus runs: all|final-only (default: all)
-  --cb-threshold N           CB threshold: consecutive failures before BLOCKED (default: 3)
-  --iter-timeout N           Per-iteration timeout in seconds, tmux mode only (default: 600)
-  --debug                    Debug logging (~/.claude/ralph-desk/analytics/<slug>/debug.log)
-  --with-self-verification   Campaign self-verification analysis (post-loop report)
+  --mode agent|tmux                    Execution mode (default: agent)
+  --worker-model MODEL                 Worker model: haiku|sonnet|opus or spark:high|gpt-5.4:high (default: haiku)
+  --lock-worker-model                  Disable auto model upgrade on failure
+  --verifier-model MODEL               per-US verifier (default: sonnet)
+  --final-verifier-model MODEL         Final ALL verifier (default: opus)
+  --consensus off|all|final-only       Cross-engine consensus (default: off)
+  --consensus-model MODEL              per-US cross-verifier (default: gpt-5.4:medium)
+  --final-consensus-model MODEL        Final cross-verifier (default: gpt-5.4:high)
+  --verify-mode per-us|batch           Verification strategy (default: per-us)
+  --cb-threshold N                     Consecutive failures before BLOCKED (default: 6)
+  --max-iter N                         Max iterations (default: 100)
+  --iter-timeout N                     Per-iteration timeout, tmux only (default: 600)
+  --debug                              Debug logging (~/.claude/ralph-desk/analytics/<slug>/debug.log)
+  --with-self-verification             Campaign self-verification analysis (post-loop report)
 ```
 ## Architecture

package/src/governance.md CHANGED Viewed

@@ -14,7 +14,7 @@ The Leader orchestrates, while Worker/Verifier run in isolated fresh contexts ev
 - **Worker must NEVER modify Claude Code settings** (settings.json, settings.local.json). Permission prompts must be reported as blocked, not bypassed by editing settings.
 - **Verifier is independent**: The Verifier judges based on evidence alone, without knowledge of the Worker's reasoning process.
 - **Sentinels are Leader-owned**: Only the Leader writes COMPLETE/BLOCKED sentinels.
-- **Supported engines**: claude (default; models: haiku, sonnet, opus) and codex (opt-in via `--worker-engine codex` / `--verifier-engine codex`).
+- **Supported engines**: claude (default; models: haiku, sonnet, opus) and codex (opt-in via `--worker-model spark:high` or `--worker-model gpt-5.4:high`).
 ## 1a. Iron Laws
@@ -117,7 +117,21 @@ Skipping any step = invalid verification (IL-1 violation).
 - "Code inspection" as substitute for automated command execution
 - Citing cached/prior results instead of fresh execution
-## 1c. Risk Classification
+## 1c. Task Sizing Principle
+Tasks must be sized within the assigned Worker's comfortable zone — not at its ceiling.
+A task that pushes a Worker to its maximum capability will frequently fail in fresh-context execution,
+because context budget must also cover PRD reading, test writing, and evidence collection.
+Rules:
+- Each US: max 3-4 ACs, max 2 changed files, completable in 1-2 iterations.
+- If a task is at the edge of a Worker's capability, either split the task or upgrade the Worker model.
+- Leader model selection: choose a model that can succeed comfortably, not the minimum viable model.
+- During brainstorm: when proposing US splits, target "smaller than what the Worker can handle" not "as much as the Worker can handle."
+This aligns with the original Ralph Loop principle: small tasks succeed most of the time.
+## 1c½. Risk Classification
 Each US is classified by risk level during brainstorm. Higher risk = more verification layers.
@@ -207,6 +221,12 @@ Worker records what was done, in what order, with command evidence in `done-clai
 ### Verifier: reasoning in verify-verdict.json
 Verifier records WHY each judgment was made in `verify-verdict.json`:
 - Each check includes: what was checked, decision (pass/fail), and the specific evidence basis
+- **failure_category** (required on fail verdicts): Verifier classifies each issue's root cause as one of:
+  - `spec` — AC is ambiguous, contradictory, or untestable (suggests IL-2 re-assessment, not model upgrade)
+  - `implementation` — code logic error, missing case, wrong algorithm (model upgrade may help)
+  - `integration` — individual pieces work but interaction fails (suggests task split or architecture review)
+  - `flaky` — non-deterministic failure, timing, environment (suggests retry, not escalation)
+  Leader uses failure_category to decide between model upgrade, spec refinement, or architecture escalation.
 - Checks include: IL-1 Evidence Gate, Layer Enforcement, Test Sufficiency, Anti-Gaming, Worker Process Audit, Test Coverage Audit
 - This proves the Verifier actually performed each check rather than rubber-stamping
 - **Test Coverage Audit (mandatory)**: Verifier MUST check that tests cover ALL code paths, not just happy paths. Specifically:
@@ -263,25 +283,30 @@ RUNNING → DONE_CLAIMED → VERIFYING → COMPLETE | CONTINUE | BLOCKED
 | Role | Default Model | Override Criteria |
 |------|---------------|-------------------|
-| Worker (simple) | haiku | Single file, clear change |
-| Worker (standard) | sonnet | Most tasks (default) |
-| Worker (complex) | opus | Architecture changes, multi-file, prior iteration failure |
-| Verifier | opus | Independent verification requires thoroughness |
-| Verifier (lightweight) | sonnet | Simple, well-defined checks only |
+| Worker | haiku | Default; auto-upgrades on failure (sonnet → opus) |
+| Worker (locked) | haiku | `--lock-worker-model` disables auto-upgrade |
+| Verifier (per-US) | sonnet | Lightweight; campaign-fixed (no progressive upgrade) |
+| Verifier (final) | opus | Full rigor; independent of per-US model |
+**Worker auto-upgrade**: When a Worker fails, the Leader upgrades the model for the retry (haiku → sonnet → opus). This upgrade is Worker-only. Verifier model is campaign-fixed — it does not upgrade on failure.
+**Verifier model is campaign-fixed**: `--verifier-model` applies to all per-US verifications throughout the campaign. `--final-verifier-model` applies to the final ALL verification. Neither upgrades automatically.
 The Leader decides each iteration. Decision criteria:
-- Previous iteration failed → upgrade model
-- Simple repetitive task → downgrade model
+- Previous iteration failed → upgrade Worker model (unless `--lock-worker-model`)
+- Simple repetitive task → keep current Worker model
 - User explicitly specified → use as given
 ### Codex (opt-in engine)
-| Option | Default | Description |
-|--------|---------|-------------|
-| `--codex-model` | `gpt-5.4` | Model passed to the `codex` CLI |
-| `--codex-reasoning` | `high` | Reasoning effort: `low`, `medium`, or `high` |
+Model routing uses `--worker-model` and `--verifier-model` with codex format: `spark:high` or `gpt-5.4:high`.
+```
+--worker-model spark:high        # codex worker, spark model, high reasoning
+--verifier-model gpt-5.4:high    # codex verifier, gpt-5.4, high reasoning
+```
-Model routing is static when using codex: the same model and reasoning effort apply to both Worker and Verifier. There is no dynamic upgrade path. Claude is the default engine; codex is explicitly opt-in.
+`parse_model_flag()` auto-detects engine from the model name: plain names (haiku, sonnet, opus) = claude; `name:reasoning` format = codex. Claude is the default engine; codex is explicitly opt-in.
 ## 5a. Execution: Agent() Approach (default) — "Smart Mode"
@@ -305,7 +330,7 @@ Agent(
 )
 ```
-If `--worker-engine codex` or `--verifier-engine codex` (opt-in):
+If `--worker-model` or `--verifier-model` uses codex format (e.g., `spark:high`, `gpt-5.4:high`) (opt-in):
 ```
 # Worker or Verifier (codex engine)
 Bash("codex -m <codex_model> -c model_reasoning_effort=<codex_reasoning> --dangerously-bypass-approvals-and-sandbox <prompt>")
@@ -460,7 +485,7 @@ for iteration in 1..max_iter:
   ⑦ Execute Verifier (see §7a for per-US and §7b for consensus details)
      - Build prompt (scoped to us_id if per-us mode) → log
      - Agent(subagent_type="executor", model=selected, prompt=prompt)
-     - If --verify-consensus: run second verifier with alternate engine (see §7b)
+     - If --consensus is not off: run second verifier with alternate engine (see §7b)
      - Read verify-verdict.json:
        • pass + specific US → add to verified_us, Worker does next US
        • pass + us_id=ALL or complete → write COMPLETE sentinel, stop
@@ -507,26 +532,44 @@ Worker completes US-001 → signal verify (us_id: "US-001")
 ## 7b. Cross-Engine Consensus Verification
-When `--verify-consensus` is enabled, after the primary verifier runs, a second verifier runs with the alternate engine:
+Controlled by `--consensus off|all|final-only` (default: `off`).
+- `off`: single engine verification only
+- `all`: cross-engine consensus on every per-US verify and final ALL verify
+- `final-only`: cross-engine consensus only on final ALL verify
+When consensus is active, after the primary verifier runs, a second verifier runs with the alternate engine:
 ```
 Worker completes US → signal verify
-  → Claude Verifier runs (checks AC)
-  → Codex Verifier runs (checks AC)
+  → Primary Verifier runs (checks AC)
+  → Cross Verifier runs (checks AC)
   → Both pass → proceed (next US or COMPLETE)
   → Either fails → combined issues → fix contract → Worker retry
   → Max 6 consensus rounds per US → BLOCKED if still disagreeing
-**NO ENGINE PRIORITY:** Claude and Codex have equal weight. If one passes and the other fails, the verdict is FAIL. No engine may be prioritized or dismissed. Infrastructure failure = CLI crash, timeout, or verdict file not generated — NOT a valid verdict with verdict=fail.
 ```
+**NO ENGINE PRIORITY:** Both verifiers have equal weight. If one passes and the other fails, the verdict is FAIL. No engine may be prioritized or dismissed. Infrastructure failure = CLI crash, timeout, or verdict file not generated — NOT a valid verdict with verdict=fail.
+### Consensus Model Routing
+| Scenario | Primary verifier | Cross verifier |
+|----------|-----------------|----------------|
+| per-US, primary=claude | `--verifier-model` (sonnet) | `--consensus-model` (gpt-5.4:medium) |
+| per-US, primary=codex | `--verifier-model` | claude opus (fixed) |
+| final, primary=claude | `--final-verifier-model` (opus) | `--final-consensus-model` (gpt-5.4:high) |
+| final, primary=codex | `--final-verifier-model` | claude opus (fixed) |
+- Both must pass. No engine priority.
+- spark is not allowed as a consensus cross verifier (100k output limit).
 **Key rules:**
 - Both claude and codex CLI must be installed
 - Verifiers run sequentially in the same Verifier pane (tmux) or as sequential calls (Agent mode)
 - Verdicts are saved as `verify-verdict-claude.json` and `verify-verdict-codex.json`
 - Combined fix contracts include issues from both engines
 - `status.json` includes `consensus_round`, `claude_verdict`, and `codex_verdict` fields
-- Consensus can be combined with per-US verification (each US gets consensus-verified)
+- Consensus can be combined with per-US verification (`--consensus all`: each US gets consensus-verified)
 ## 7½. Fix Loop Protocol
@@ -574,7 +617,7 @@ In tmux mode: Leader writes `<slug>-escalation.md` with the report and sets BLOC
 |-----------|---------|
 | context-latest.md unchanged for 3 consecutive iterations | BLOCKED |
 | Same acceptance criterion fails 2 consecutive iterations | Upgrade model, retry once (Agent mode only; tmux: same model retry); if still failing → Architecture Escalation (§7¾) → BLOCKED |
-| `cb_threshold` (default: 6) consecutive **fail** verdicts on `cb_threshold` unique criterion IDs | Upgrade to opus, retry once; if still failing → BLOCKED (adjustable via `--cb-threshold`) |
+| `cb_threshold` (default: 6) consecutive **fail** verdicts on `cb_threshold` unique criterion IDs | Upgrade to opus, retry once; if still failing → BLOCKED (adjustable via `--cb-threshold`; when `--consensus` is not `off`, effective threshold doubles automatically: default 6 → 12) |
 | max_iter reached | TIMEOUT (report to user) |
 The Leader tracks `consecutive_failures` in `status.json`:
@@ -582,6 +625,14 @@ The Leader tracks `consecutive_failures` in `status.json`:
 - "Same error" = same acceptance criterion ID in two consecutive **fail** verdicts (`request_info` does not break or contribute to this chain).
 - "Diverse failures" = `cb_threshold` most recent `fail` verdicts each have a unique criterion ID.
+## 8½. Self-Verification Feedback Loop
+When `--with-self-verification` is enabled, the SV report feeds back into the next brainstorm cycle:
+- SV report identifies patterns: which US types fail most, which AC quality issues recur, which model tiers underperform.
+- Next brainstorm SHOULD reference the prior campaign's SV report (if available) to inform US sizing, model selection, and AC quality standards.
+- This creates an iterative improvement loop: campaign → SV report → next brainstorm → better campaign.
+- The loop operates whether the reviewer is human or system — readiness to iterate is what matters.
 ## 9. Change Policy
 - Changes to the shared workflow → modify this document