@ai-dev-methodologies/rlp-desk 0.5.1 → 0.5.3

@@ -25,6 +25,7 @@ Ask about these items one by one (or in small groups):
25
25
  2. **Objective** — what the loop achieves
26
26
  3. **User Stories** — discrete units with testable acceptance criteria. Propose a breakdown, ask the user to confirm/modify.
27
27
  - Apply INVEST criteria: each US must be Independent, Negotiable, Valuable, Estimable, Small, Testable.
28
+ - **Task Sizing (governance §1c)**: Size each US within the Worker's comfortable zone — smaller than what the Worker can handle, not at its ceiling. Max 3-4 ACs, max 2 files. If a US feels "just barely doable" for the target model, split it further.
28
29
  - Each AC MUST use Given/When/Then format with **domain language only** (no class names, API paths, DB tables):
29
30
  ```
30
31
  Given [precondition in domain language]
@@ -53,39 +54,47 @@ Ask about these items one by one (or in small groups):
53
54
  | External dependencies | none | 1-2 | 3+ | distributed |
54
55
  | Existing code impact | new only | modify | refactor | architecture |
55
56
 
56
- **Model mapping** (Worker / Verifier):
57
- - LOW → haiku / sonnet
58
- - MEDIUM → sonnet / opus
59
- - HIGH → opus / opus
60
- - CRITICAL → opus / opus + require human review
57
+ **Codex Detection** — check if codex CLI is installed (`command -v codex`).
61
58
 
62
- Present complexity score with evidence to the user, e.g.: "I rate this MEDIUM because: US count=4 (MEDIUM), file scope=2 (MEDIUM), logic=conditionals (MEDIUM), deps=none (LOW), impact=modify (MEDIUM). Highest=MEDIUM → I suggest Worker: sonnet, Verifier: opus."
59
+ **Model mapping: Claude-only** (codex not installed):
63
60
 
64
- 8. **Engine & Model**: For each role (Worker, Verifier):
65
- - Engine: claude (default) or codex
66
- - If claude: suggest model (haiku/sonnet/opus) based on task complexity
67
- - If codex: suggest model (default: gpt-5.4) and reasoning effort (low/medium/high)
61
+ | Complexity | Worker | per-US Verifier | Final Verifier | Consensus |
62
+ |------------|--------|-----------------|----------------|-----------|
63
+ | LOW | haiku | sonnet | opus | off |
64
+ | MEDIUM | sonnet | opus | opus | off |
65
+ | HIGH | opus | opus | opus | off |
66
+ | CRITICAL | opus | opus | opus + human | off |
68
67
 
69
- **Codex Detection**: check if codex CLI is installed (`command -v codex`):
68
+ **Model mapping: Cross-engine** (codex installed, recommended):
70
69
 
71
- **If codex IS installed**: recommend cross-engine Worker:
72
- - Suggest: `--worker-model gpt-5.4:high --verify-consensus` (cross-engine + consensus)
73
- - Alternative: `--worker-model gpt-5.3-codex-spark:high` (spark preset; note: 100k output token limit per request, best for smaller-scope PRDs)
74
- - Say: "Codex is installed. I recommend it as Worker for cost savings (codex tokens are cheaper than claude tokens for bulk iteration) and cross-engine blind-spot coverage (claude Verifier catches issues codex Worker misses)."
70
+ | Complexity | Worker | per-US Verifier | Final Verifier | Consensus |
71
+ |------------|--------|-----------------|----------------|-----------|
72
+ | LOW | spark:high | sonnet | opus | final-only |
73
+ | MEDIUM | spark:high | opus | opus | final-only |
74
+ | HIGH | gpt-5.4:high | opus | opus | all |
75
+ | CRITICAL | gpt-5.4:high | opus | opus + human | all |
75
76
 
76
- **If codex is NOT installed** — recommend claude-only + install suggestion:
77
- - Defaulting to claude-only Worker (sonnet).
78
- - Say: "Codex is not installed. Defaulting to claude-only Worker. Note: without a second engine, your Verifier shares the same perspective as the Worker; there is a risk of blind spots where both Worker and Verifier miss the same issue. To unlock cross-engine coverage: `npm install -g @openai/codex`"
77
+ **Worker model selection** (cross-engine):
78
+ - **spark:high**: default recommendation (Pro token pool = cost savings). PRD AC count <= 15
79
+ - **gpt-5.4:high**: fallback when the spark 100k output limit is exceeded. PRD AC count > 15
79
80
 
80
- AI should recommend: "For this task complexity, I suggest Worker: sonnet, Verifier: opus"
81
- If codex selected: "For codex Worker, I suggest gpt-5.4 with high reasoning"
81
+ Present complexity score with evidence to the user, e.g.: "I rate this MEDIUM because: US count=4 (MEDIUM), file scope=2 (MEDIUM), logic=conditionals (MEDIUM), deps=none (LOW), impact=modify (MEDIUM). Highest=MEDIUM."
82
+
83
+ **If codex IS installed** — say: "Codex is installed. I recommend cross-engine Worker for cost savings (Pro token pool separation) and cross-engine blind-spot coverage (claude Verifier catches issues codex Worker misses)."
84
+
85
+ **If codex is NOT installed** — say: "Codex is not installed. Defaulting to claude-only Worker. Note: without a second engine, your Verifier shares the same perspective as the Worker — there is a risk of blind spots where both Worker and Verifier miss the same issue. To unlock cross-engine coverage: `npm install -g @openai/codex`"
86
+
87
+ 8. **Batch Capacity Check** — when verify-mode is batch and PRD is large:
88
+ - batch + spark + AC > 10 → warn "spark 100k output limit — consider wave split or switch to gpt-5.4"
89
+ - batch + gpt-5.4 + AC > 15 → warn "too many ACs for single batch — consider wave split (3-4 US per wave)"
90
+ - per-us → no warning (US-level processing, no limit concern)
82
91
  9. **Verify Mode** — per-us (default) or batch. Ask: "Verify after each user story (per-us, recommended) or only after all stories are done (batch)?" Default recommendation: per-us for 2+ stories.
83
- 10. **Verify Consensus** — Ask: "Use cross-engine consensus verification? (Both claude and codex verify independently, both must pass.) Requires codex CLI." Default: no.
84
- 11. **Consensus Scope** — If consensus enabled, ask: "Consensus on every verify (all, default) or only on final verify (final-only)?" Default: all.
85
- 12. **Max Iterations** — suggest based on story count, ask if OK.
92
+ 10. **Consensus** — Ask: "Use cross-engine consensus? off (single engine), final-only (cross-engine on final verify only), or all (cross-engine on every verify). Requires codex CLI." Default: off. Recommended: final-only when codex is installed.
93
+ 11. **Max Iterations** — suggest based on story count, ask if OK.
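The two mapping tables and the Batch Capacity Check above can be sketched as a small lookup. This is a minimal illustration only, not the package's implementation: the function names (`overall_complexity`, `recommend`, `batch_warning`) are hypothetical, the warning wording is paraphrased, and the sketch assumes the overall rating is simply the highest rating across dimensions, as the "Highest=MEDIUM" example suggests.

```python
from typing import Dict, Optional, Tuple

LEVELS = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]

# Rows copied from the tables above:
# (Worker, per-US Verifier, Final Verifier, Consensus)
CLAUDE_ONLY = {
    "LOW": ("haiku", "sonnet", "opus", "off"),
    "MEDIUM": ("sonnet", "opus", "opus", "off"),
    "HIGH": ("opus", "opus", "opus", "off"),
    "CRITICAL": ("opus", "opus", "opus + human", "off"),
}
CROSS_ENGINE = {
    "LOW": ("spark:high", "sonnet", "opus", "final-only"),
    "MEDIUM": ("spark:high", "opus", "opus", "final-only"),
    "HIGH": ("gpt-5.4:high", "opus", "opus", "all"),
    "CRITICAL": ("gpt-5.4:high", "opus", "opus + human", "all"),
}


def overall_complexity(ratings: Dict[str, str]) -> str:
    """Assumption: the highest rating across all dimensions wins."""
    return max(ratings.values(), key=LEVELS.index)


def recommend(ratings: Dict[str, str], codex_installed: bool) -> Tuple[str, tuple]:
    """Pick the table by codex availability, then look up the row."""
    table = CROSS_ENGINE if codex_installed else CLAUDE_ONLY
    level = overall_complexity(ratings)
    return level, table[level]


def batch_warning(verify_mode: str, worker_model: str, ac_count: int) -> Optional[str]:
    """Return a warning string for oversized batch runs, else None."""
    if verify_mode != "batch":
        return None  # per-us: US-level processing, no limit concern
    if worker_model.startswith("spark") and ac_count > 10:
        return "spark 100k output limit: consider wave split or switch to gpt-5.4"
    if worker_model.startswith("gpt-5.4") and ac_count > 15:
        return "too many ACs for single batch: consider wave split (3-4 US per wave)"
    return None


ratings = {"us_count": "MEDIUM", "file_scope": "MEDIUM", "logic": "MEDIUM",
           "deps": "LOW", "impact": "MEDIUM"}
print(recommend(ratings, codex_installed=True))
```

The ratings dictionary mirrors the MEDIUM example quoted in the text; the dimension keys are illustrative names, not identifiers from the package.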
86
94
 
87
95
  After all items are confirmed:
88
96
 
97
+ 0. **SV Report Feedback** — If a prior campaign's self-verification report exists for this project (`~/.claude/ralph-desk/analytics/*/self-verification-report-*.md`), reference it to inform this brainstorm: which US types failed most, which model tiers underperformed, which AC patterns caused issues. Present relevant findings to the user. (governance §8½)
89
98
  1. **Ambiguity Gate (IL-2)** — score each AC per governance §1a IL-2 (6 dimensions, 0-12 points).
90
99
  If ANY AC scores below 6: **REJECT** — refine that AC before proceeding.
91
100
  If all ACs score 6-9: **WARN** — proceed with logged warning, show low-scoring dimensions.
@@ -123,27 +132,33 @@ Tell the user:
123
132
  ```
124
133
  Available run commands (copy the one you want):
125
134
 
126
- # Recommended: cross-engine + final-consensus (cost savings + blind-spot coverage):
127
- /rlp-desk run <actual-slug> --worker-model gpt-5.4:high --final-consensus --debug
135
+ # Recommended: cross-engine + final-consensus (cost savings + blind-spot coverage):
136
+ /rlp-desk run <actual-slug> --mode tmux --worker-model spark:high --consensus final-only --debug
128
137
 
129
- # Spark Pro preset (fast codex worker, lower cost):
130
- /rlp-desk run <actual-slug> --worker-model gpt-5.3-codex-spark:high --debug
138
+ # Large PRD (AC > 15, exceeds spark 100k limit):
139
+ /rlp-desk run <actual-slug> --mode tmux --worker-model gpt-5.4:high --consensus final-only --debug
140
+
141
+ # Critical (full consensus on every verify):
142
+ /rlp-desk run <actual-slug> --mode tmux --worker-model gpt-5.4:high --consensus all --debug
131
143
 
132
144
  # Claude-only:
133
145
  /rlp-desk run <actual-slug> --debug
134
146
 
135
- # Basic agent:
136
- /rlp-desk run <actual-slug>
137
-
138
147
  # Full options reference:
139
- # --mode agent|tmux (default: agent)
140
- # --worker-model MODEL haiku|sonnet|opus or gpt-5.4:low|medium|high (default: sonnet)
141
- # --verifier-model MODEL haiku|sonnet|opus (default: opus)
142
- # --verify-consensus both claude+codex must pass
143
- # --verify-mode per-us|batch (default: per-us)
144
- # --max-iter N (default: 100)
145
- # --debug enable debug logging
146
- # --with-self-verification post-campaign analysis report
148
+ # --mode agent|tmux (default: agent)
149
+ # --worker-model MODEL haiku|sonnet|opus or spark:high|gpt-5.4:high (default: haiku)
150
+ # --lock-worker-model disable auto model upgrade
151
+ # --verifier-model MODEL per-US verifier (default: sonnet)
152
+ # --final-verifier-model MODEL final ALL verifier (default: opus)
153
+ # --consensus off|all|final-only cross-engine consensus (default: off)
154
+ # --consensus-model MODEL per-US cross-verifier (default: gpt-5.4:medium)
155
+ # --final-consensus-model MODEL final cross-verifier (default: gpt-5.4:high)
156
+ # --verify-mode per-us|batch (default: per-us)
157
+ # --cb-threshold N (default: 6)
158
+ # --max-iter N (default: 100)
159
+ # --iter-timeout N tmux only (default: 600)
160
+ # --debug debug logging
161
+ # --with-self-verification post-campaign SV report
147
162
  ```
148
163
 
149
164
  **If codex is NOT installed** — show claude-only presets + install recommendation:
@@ -151,7 +166,7 @@ Tell the user:
151
166
  ```
152
167
  Available run commands (copy the one you want):
153
168
 
154
- # Recommended: tmux mode + claude-only (real-time visibility):
169
+ # Recommended: tmux mode + claude-only (real-time visibility):
155
170
  /rlp-desk run <actual-slug> --mode tmux --debug
156
171
 
157
172
  # Agent mode:
@@ -161,13 +176,17 @@ Tell the user:
161
176
  npm install -g @openai/codex
162
177
 
163
178
  # Full options reference:
164
- # --mode agent|tmux (default: agent)
165
- # --worker-model MODEL haiku|sonnet|opus (default: sonnet)
166
- # --verifier-model MODEL haiku|sonnet|opus (default: opus)
167
- # --verify-mode per-us|batch (default: per-us)
168
- # --max-iter N (default: 100)
169
- # --debug enable debug logging
170
- # --with-self-verification post-campaign analysis report
179
+ # --mode agent|tmux (default: agent)
180
+ # --worker-model MODEL haiku|sonnet|opus (default: haiku)
181
+ # --lock-worker-model disable auto model upgrade
182
+ # --verifier-model MODEL per-US verifier (default: sonnet)
183
+ # --final-verifier-model MODEL final ALL verifier (default: opus)
184
+ # --verify-mode per-us|batch (default: per-us)
185
+ # --cb-threshold N (default: 6)
186
+ # --max-iter N (default: 100)
187
+ # --iter-timeout N tmux only (default: 600)
188
+ # --debug debug logging
189
+ # --with-self-verification post-campaign SV report
171
190
  ```
172
191
 
173
192
  Replace `<actual-slug>` with the real slug from this init (e.g. `auth-refactor`).
@@ -182,24 +201,21 @@ Tell the user:
182
201
 
183
202
  Options (parse from `$ARGUMENTS`):
184
203
  - `--mode agent|tmux` (default: `agent`) — execution mode
185
- - `--max-iter N` (default: 100)
186
- - `--worker-model MODEL` (default: sonnet)
187
- - `--verifier-model MODEL` (default: opus)
188
- - `--worker-engine claude|codex` (default: `claude`) — engine for Worker
189
- - `--verifier-engine claude|codex` (default: `claude`) — engine for Verifier
190
- - `--worker-codex-model MODEL` (default: `gpt-5.4`): codex model for Worker
191
- - `--worker-codex-reasoning low|medium|high` (default: `high`): reasoning for Worker
192
- - `--verifier-codex-model MODEL` (default: `gpt-5.4`): codex model for Verifier
193
- - `--verifier-codex-reasoning low|medium|high` (default: `high`) — reasoning for Verifier
204
+ - `--worker-model MODEL` (default: `haiku`) — Worker model. Format: `model` = claude engine, `model:reasoning` = codex engine. Examples: `haiku`, `sonnet`, `opus`, `spark:high`, `gpt-5.4:high`. Parsed by `parse_model_flag()` which auto-splits engine/model/reasoning.
205
+ - `--lock-worker-model` — disable automatic model upgrade on failure (check_model_upgrade). Worker stays on the specified model regardless of consecutive failures.
206
+ - `--verifier-model MODEL` (default: `sonnet`) — per-US verification model. Campaign-fixed (no progressive upgrade). Lighter than final verifier.
207
+ - `--final-verifier-model MODEL` (default: `opus`) — final ALL verification model. Independent from per-US verifier. Used only for the final full-AC verify pass.
208
+ - `--consensus off|all|final-only` (default: `off`) — cross-engine consensus verification mode.
209
+ - `off`: single-engine verification only
210
+ - `all`: cross-engine consensus on every verify (per-US and final)
211
+ - `final-only`: cross-engine consensus only on the final ALL verify
212
+ - `--consensus-model MODEL` (default: `gpt-5.4:medium`) — per-US cross-verifier model. Lighter weight for cost efficiency.
213
+ - `--final-consensus-model MODEL` (default: `gpt-5.4:high`) — final cross-verifier model. Stricter. Note: spark is not allowed here (100k output limit).
194
214
  - `--verify-mode per-us|batch` (default: `per-us`) — verification strategy
195
215
  - `per-us`: verify after each US, then final full verify of all AC
196
216
  - `batch`: verify only after all US done (legacy behavior)
197
- - `--verify-consensus` — enable cross-engine consensus verification (both claude and codex verify independently; both must pass)
198
- - `--consensus-scope all|final-only` — when consensus runs (default: `all`)
199
- - `all`: consensus runs on every verify (current behavior)
200
- - `final-only`: consensus only on final ALL verify
201
- - `--cb-threshold N` — circuit breaker threshold: consecutive failures before BLOCKED (default: 3). When `--verify-consensus` is active, effective threshold is automatically doubled (e.g., default becomes 6).
202
- - `--consensus-fail-fast` — skip second verifier if first verifier fails (saves time/tokens in consensus mode)
217
+ - `--cb-threshold N` — circuit breaker threshold: consecutive failures before BLOCKED (default: 6). When `--consensus` is not `off`, effective threshold is automatically doubled (e.g., default becomes 12).
218
+ - `--max-iter N` (default: 100)
203
219
  - `--iter-timeout N` — per-iteration timeout in seconds (default: 600). Enforced in tmux mode only. Agent mode: not enforced (Agent() has no timeout API).
204
220
  - `--debug` — enable debug logging (writes to ~/.claude/ralph-desk/analytics/<slug>/debug.log)
205
221
  - `--with-self-verification` — enable campaign-level self-verification analysis. After COMPLETE, Leader analyzes all iteration records (done-claims + verdicts) and generates a campaign self-verification summary with patterns and recommendations for next planning cycle. (Note: execution_steps and reasoning are ALWAYS recorded per governance §1f — this flag adds post-campaign analysis.)
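The circuit-breaker rule in the `--cb-threshold` entry (default 6, doubled whenever `--consensus` is not `off`) can be illustrated as follows. This is a sketch under stated assumptions: the helper names are hypothetical, and the choice of `>=` for the BLOCKED comparison is a guess, since the text only says "consecutive failures before BLOCKED".

```python
VALID_CONSENSUS_MODES = ("off", "all", "final-only")


def effective_cb_threshold(cb_threshold: int = 6, consensus_mode: str = "off") -> int:
    """Double the configured threshold when cross-engine consensus is active."""
    if consensus_mode not in VALID_CONSENSUS_MODES:
        raise ValueError(f"unknown consensus mode: {consensus_mode!r}")
    return cb_threshold * 2 if consensus_mode != "off" else cb_threshold


def is_blocked(consecutive_failures: int, cb_threshold: int = 6,
               consensus_mode: str = "off") -> bool:
    """Assumption: the campaign is BLOCKED once failures reach the effective threshold."""
    return consecutive_failures >= effective_cb_threshold(cb_threshold, consensus_mode)


print(effective_cb_threshold())                      # default, consensus off
print(effective_cb_threshold(consensus_mode="all"))  # doubled under consensus
```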
@@ -208,7 +224,7 @@ Options (parse from `$ARGUMENTS`):
208
224
  When `--debug` or `--with-self-verification` is active, analytics data is written to a user-level directory for cross-project aggregation. Contents:
209
225
  - `metadata.json` — campaign metadata: slug, project_root, campaign_status, start_time, end_time
210
226
  - `debug.log` — debug output (versioned: `debug-v{N}.log` on re-execution)
211
- - `campaign.jsonl` — per-iteration structured data (versioned: `campaign-v{N}.jsonl` on re-execution). Schema: iter, us_id, worker_model, worker_engine, verifier_engine, claude_verdict, codex_verdict, consensus, duration_worker_s, duration_verifier_s, project_root, slug, timestamp
227
+ - `campaign.jsonl` — per-iteration structured data (versioned: `campaign-v{N}.jsonl` on re-execution). Schema: iter, us_id, worker_model, worker_engine, verifier_model, verifier_engine, consensus_mode, claude_verdict, codex_verdict, duration_worker_s, duration_verifier_s, project_root, slug, timestamp
212
228
  - `self-verification-data.json` — cumulative SV records (agent-mode only, when `--with-self-verification`)
213
229
  - `self-verification-report-NNN.md` — versioned SV reports (when `--with-self-verification`)
214
230
 
@@ -232,17 +248,14 @@ LOOP_NAME="<slug>" \
232
248
  ROOT="$PWD" \
233
249
  MAX_ITER=<--max-iter value> \
234
250
  WORKER_MODEL=<--worker-model value> \
235
- VERIFIER_MODEL=<--verifier-model value> \
236
- WORKER_ENGINE=<--worker-engine value, default: claude> \
237
- VERIFIER_ENGINE=<--verifier-engine value, default: claude> \
238
- WORKER_CODEX_MODEL=<--worker-codex-model value, default: gpt-5.4> \
239
- WORKER_CODEX_REASONING=<--worker-codex-reasoning value, default: high> \
240
- VERIFIER_CODEX_MODEL=<--verifier-codex-model value, default: gpt-5.4> \
241
- VERIFIER_CODEX_REASONING=<--verifier-codex-reasoning value, default: high> \
251
+ LOCK_WORKER_MODEL=<1 if --lock-worker-model, else 0> \
252
+ VERIFIER_MODEL=<--verifier-model value, default: sonnet> \
253
+ FINAL_VERIFIER_MODEL=<--final-verifier-model value, default: opus> \
242
254
  VERIFY_MODE=<--verify-mode value, default: per-us> \
243
- VERIFY_CONSENSUS=<1 if --verify-consensus, else 0> \
244
- CONSENSUS_SCOPE=<--consensus-scope value, default: all> \
245
- CB_THRESHOLD=<--cb-threshold value, default: 3> \
255
+ CONSENSUS_MODE=<--consensus value, default: off> \
256
+ CONSENSUS_MODEL=<--consensus-model value, default: gpt-5.4:medium> \
257
+ FINAL_CONSENSUS_MODEL=<--final-consensus-model value, default: gpt-5.4:high> \
258
+ CB_THRESHOLD=<--cb-threshold value, default: 6> \
246
259
  ITER_TIMEOUT=<--iter-timeout value, default: 600> \
247
260
  DEBUG=<1 if --debug, else 0> \
248
261
  WITH_SELF_VERIFICATION=<1 if --with-self-verification, else 0> \
@@ -269,7 +282,7 @@ WITH_SELF_VERIFICATION=<1 if --with-self-verification, else 0> \
269
282
 
270
283
  ### Preparation
271
284
  1. Validate scaffold: `.claude/ralph-desk/prompts/<slug>.worker.prompt.md` etc.
272
- 2. **Codex CLI pre-validation**: If `--verify-consensus` is enabled OR `--worker-engine codex` / `--verifier-engine codex` is set, check that `codex` CLI exists in PATH. If codex CLI not found → STOP immediately, print install instructions (`npm install -g @openai/codex`), do not start the loop.
285
+ 2. **Codex CLI pre-validation**: If `--consensus` is not `off` OR `--worker-model` uses codex format (contains `:`) OR `--verifier-model` / `--final-verifier-model` / `--consensus-model` / `--final-consensus-model` uses codex format, check that `codex` CLI exists in PATH. If codex CLI not found → STOP immediately, print install instructions (`npm install -g @openai/codex`), do not start the loop.
273
286
  3. Check sentinels (complete/blocked). Found → tell user `/rlp-desk clean <slug>`.
274
287
  4. Clean previous `done-claim.json`, `verify-verdict.json`.
275
288
  5. **Always**: write baseline log entry to `.claude/ralph-desk/logs/<slug>/baseline.log`: `[timestamp] iter=0 phase=start slug=<slug> worker_model=<model> verifier_model=<model>`. Baseline.log captures 1 line per iteration for lightweight post-mortem (always-on, no flag needed).
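The Codex CLI pre-validation in step 2 could look roughly like the sketch below. It is not the runner's code: `needs_codex` and `prevalidate` are hypothetical names, and the sketch assumes `options` holds only flags the user passed explicitly (otherwise the codex-format defaults of `--consensus-model` would always trigger the check). Only `shutil.which` is a real standard-library call.

```python
import shutil
from typing import Callable, Dict, Optional

# Flags whose value may use the codex "model:reasoning" format.
MODEL_KEYS = (
    "worker_model",
    "verifier_model",
    "final_verifier_model",
    "consensus_model",
    "final_consensus_model",
)


def needs_codex(options: Dict[str, str]) -> bool:
    """True if consensus is enabled or any explicitly passed model contains ':'."""
    if options.get("consensus", "off") != "off":
        return True
    return any(":" in options.get(key, "") for key in MODEL_KEYS)


def prevalidate(options: Dict[str, str],
                which: Callable[[str], Optional[str]] = shutil.which) -> Optional[str]:
    """Return an error message if codex is required but missing, else None."""
    if needs_codex(options) and which("codex") is None:
        return "codex CLI not found in PATH. Install with: npm install -g @openai/codex"
    return None


error = prevalidate({"worker_model": "haiku"})
print(error)  # a claude-only run never requires codex
```

Passing `which` as a parameter keeps the check testable without touching the real PATH.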
@@ -290,9 +303,9 @@ WITH_SELF_VERIFICATION=<1 if --with-self-verification, else 0> \
290
303
  - If you output "Iter 1 complete, moving to iter 2" as plain text without a tool call, the turn terminates and the loop breaks. This is a platform constraint, not a compliance issue — no amount of "DO NOT STOP" text can override it.
291
304
 
292
305
  If `--debug`, at loop start debug_log the following (3 [OPTION] entries):
293
- - `[OPTION] slug=<slug> max_iter=<N> verify_mode=<mode> consensus=<0|1> consensus_scope=<scope>`
294
- - `[OPTION] cb_threshold=<N> effective_cb_threshold=<N>`
295
- - `[OPTION] worker_engine=<engine> worker_model=<model> verifier_engine=<engine> verifier_model=<model>`
306
+ - `[OPTION] slug=<slug> max_iter=<N> verify_mode=<mode> consensus_mode=<off|all|final-only>`
307
+ - `[OPTION] cb_threshold=<N> effective_cb_threshold=<N> lock_worker_model=<0|1>`
308
+ - `[OPTION] worker_model=<model> verifier_model=<model> final_verifier_model=<model> consensus_model=<model> final_consensus_model=<model>`
296
309
 
297
310
  For each iteration (1 to max_iter):
298
311
 
@@ -341,7 +354,9 @@ rm -f .claude/ralph-desk/memos/<slug>-verify-verdict.json
341
354
  **⑤ Execute Worker**
342
355
  - If `--debug`: debug_log `[FLOW] iter=N phase=worker engine=<engine> model=<model> dispatched=true`
343
356
 
344
- If `--worker-engine claude` (default):
357
+ Determine engine from `--worker-model` format: plain name (e.g., `haiku`) = claude engine, `model:reasoning` format (e.g., `spark:high`) = codex engine. Use `parse_model_flag()` to split.
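The format rule above (plain name = claude, `model:reasoning` = codex) can be sketched in a few lines. The real `parse_model_flag()` lives in the runner and may behave differently; `split_model_flag` and `ModelSpec` below are hypothetical stand-ins, and rejecting an empty model or empty reasoning part is an assumption of this sketch.

```python
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    engine: str               # "claude" or "codex"
    model: str                # e.g. "haiku", "spark", "gpt-5.4"
    reasoning: Optional[str]  # e.g. "high"; None for the claude engine


def split_model_flag(value: str) -> ModelSpec:
    """Split a --worker-model style value into engine, model, and reasoning."""
    model, sep, reasoning = value.partition(":")
    if not model or (sep and not reasoning):
        raise ValueError(f"malformed model flag: {value!r}")
    if sep:
        return ModelSpec("codex", model, reasoning)
    return ModelSpec("claude", model, None)


for flag in ("haiku", "spark:high", "gpt-5.4:high"):
    print(flag, "->", split_model_flag(flag))
```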
358
+
359
+ If claude engine (default):
345
360
  ```
346
361
  Agent(
347
362
  description="rlp-desk worker iter-NNN",
@@ -354,9 +369,9 @@ Agent(
354
369
  - Agent returns synchronously. No polling needed.
355
370
  - Each Agent() = fresh context. Guaranteed.
356
371
 
357
- If `--worker-engine codex`:
372
+ If codex engine:
358
373
  ```
359
- Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_reasoning> <full worker prompt text>")
374
+ Bash("codex exec --model <codex_model> --reasoning-effort <codex_reasoning> <full worker prompt text>")
360
375
  ```
361
376
  - Codex runs as a subprocess via Bash(), not Agent().
362
377
  - Each Bash() call = fresh context for codex.
@@ -396,26 +411,35 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
396
411
  - **Prompt Assembly Protocol (same as ④)**: Read verifier prompt file verbatim. Prepend `## WORKING_DIR: {absolute path}`. Do NOT rewrite paths.
397
412
  - If `--debug`: debug_log `[FLOW] iter=N phase=verifier engine=<engine> model=<model> scope=<us_id> dispatched=true`
398
413
 
399
- If `--verifier-engine claude` (default):
414
+ Determine which verifier model to use based on scope:
415
+ - If `us_id` is a specific story (per-US verify) → use `--verifier-model` (default: sonnet)
416
+ - If `us_id` is "ALL" (final verify) → use `--final-verifier-model` (default: opus)
417
+
418
+ Determine engine from the selected verifier model format (same as Worker): plain name = claude, `model:reasoning` = codex.
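As a rough illustration of the scope rule above, the verifier choice reduces to one comparison on `us_id`. The function names are hypothetical and the defaults are the ones stated in the text; this is a sketch, not the Leader's actual logic.

```python
def select_verifier_model(us_id: str, verifier_model: str = "sonnet",
                          final_verifier_model: str = "opus") -> str:
    """Final ALL verify uses the final verifier; any specific story uses the per-US one."""
    return final_verifier_model if us_id == "ALL" else verifier_model


def verifier_engine(model: str) -> str:
    """Same convention as the Worker: 'model:reasoning' means codex."""
    return "codex" if ":" in model else "claude"


for scope in ("US-001", "ALL"):
    chosen = select_verifier_model(scope)
    print(scope, chosen, verifier_engine(chosen))
```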
419
+
420
+ If claude engine (default):
400
421
  ```
401
422
  Agent(
402
423
  description="rlp-desk verifier iter-NNN (us_id)",
403
424
  subagent_type="executor",
404
- model=<verifier_model>,
425
+ model=<selected_verifier_model>,
405
426
  mode="bypassPermissions",
406
427
  prompt=<full verifier prompt text with US scope>
407
428
  )
408
429
  ```
409
430
 
410
- If `--verifier-engine codex`:
431
+ If codex engine:
411
432
  ```
412
- Bash("codex exec --model <verifier_codex_model> --reasoning-effort <verifier_codex_reasoning> <full verifier prompt text>")
433
+ Bash("codex exec --model <codex_model> --reasoning-effort <codex_reasoning> <full verifier prompt text>")
413
434
  ```
414
435
 
415
- **⑦b Consensus Verification** (when `--verify-consensus` is enabled):
416
- After the primary verifier runs, run a second verifier with the OTHER engine:
417
- - If primary engine is claude → run codex verifier
418
- - If primary engine is codex → run claude verifier
436
+ **⑦b Consensus Verification** (when `--consensus` is `all`, or `final-only` and scope is ALL):
437
+ After the primary verifier runs, run a cross-engine second verifier:
438
+ - Determine cross-verifier model based on scope:
439
+ - per-US verify → use `--consensus-model` (default: gpt-5.4:medium)
440
+ - final ALL verify → use `--final-consensus-model` (default: gpt-5.4:high)
441
+ - If primary engine is claude → cross-verifier uses codex (the consensus model)
442
+ - If primary engine is codex → cross-verifier uses claude `opus` (fixed)
419
443
  - Both produce `verify-verdict.json` (Leader renames to `verify-verdict-claude.json` and `verify-verdict-codex.json`)
420
444
  - **Both pass** → proceed (next US or COMPLETE)
421
445
  - **Either fails** → combine issues from both verdicts into a single fix contract → Worker retry
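The consensus rules above can be sketched as three small functions: when consensus runs, which cross-verifier model is used, and how two verdicts combine. The function names are hypothetical, and the verdict shape (a dict with `verdict` and `issues` keys) is an assumption; the actual `verify-verdict.json` schema is not shown in this diff.

```python
from typing import Dict, List


def consensus_applies(consensus_mode: str, us_id: str) -> bool:
    """'all' runs on every verify; 'final-only' runs only when scope is ALL."""
    return consensus_mode == "all" or (consensus_mode == "final-only" and us_id == "ALL")


def cross_verifier_model(primary_engine: str, us_id: str,
                         consensus_model: str = "gpt-5.4:medium",
                         final_consensus_model: str = "gpt-5.4:high") -> str:
    """Pick the second verifier on the opposite engine."""
    if primary_engine == "codex":
        return "opus"  # claude cross-verifier is fixed
    return final_consensus_model if us_id == "ALL" else consensus_model


def merge_verdicts(claude: Dict, codex: Dict) -> Dict:
    """Both must pass; otherwise combine issues into one fix contract."""
    if claude["verdict"] == "pass" and codex["verdict"] == "pass":
        return {"verdict": "pass", "issues": []}
    issues: List[str] = list(claude.get("issues", [])) + list(codex.get("issues", []))
    return {"verdict": "fail", "issues": issues}


print(merge_verdicts({"verdict": "pass", "issues": []},
                     {"verdict": "fail", "issues": ["AC2 lacks a failing-path test"]}))
```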
@@ -461,7 +485,7 @@ After reading the verdict, archive to `logs/<slug>/`:
461
485
  - Write `status.json`
462
486
  - Report via tool call: `Bash("echo 'Iter N | US-NNN | verdict | model | next_action'")` — NEVER plain text. This keeps the turn alive for the next iteration.
463
487
  - **Always**: append to baseline.log: `[timestamp] iter=N verdict=<pass|fail|continue> us=<us_id> model=<worker_model>`
464
- - **Always**: append JSONL to `~/.claude/ralph-desk/analytics/<slug>/campaign.jsonl`: `{"iter":N,"us_id":"US-NNN","verdict":"pass|fail","worker_model":"...","worker_engine":"...","verifier_model":"...","verifier_engine":"...","duration_worker_s":N,"duration_verifier_s":N,"timestamp":"ISO8601"}`
488
+ - **Always**: append JSONL to `~/.claude/ralph-desk/analytics/<slug>/campaign.jsonl`: `{"iter":N,"us_id":"US-NNN","verdict":"pass|fail","worker_model":"...","worker_engine":"...","verifier_model":"...","verifier_engine":"...","consensus_mode":"off|all|final-only","duration_worker_s":N,"duration_verifier_s":N,"timestamp":"ISO8601"}`
465
489
  - If `--debug`: debug_log `[FLOW] iter=N phase=result status=<result> consecutive_failures=<N> verified_us=<list>`
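Appending one JSON object per line to `campaign.jsonl`, as described in the list above, might look like this. It is a minimal sketch: `append_campaign_record` is a hypothetical helper, and the record fields simply follow the JSONL example in the text, with the timestamp filled in at write time.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_campaign_record(path: Path, record: dict) -> dict:
    """Append one iteration record as a single JSON line and return the row written."""
    row = {**record, "timestamp": datetime.now(timezone.utc).isoformat()}
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(row) + "\n")
    return row


example = {
    "iter": 1, "us_id": "US-001", "verdict": "pass",
    "worker_model": "haiku", "worker_engine": "claude",
    "verifier_model": "sonnet", "verifier_engine": "claude",
    "consensus_mode": "off", "duration_worker_s": 42, "duration_verifier_s": 17,
}
# Usage (path layout taken from the text; <slug> stays a placeholder):
# append_campaign_record(Path.home() / ".claude/ralph-desk/analytics/<slug>/campaign.jsonl", example)
```

Append mode keeps earlier iterations intact, which is what makes the file usable for cross-iteration analysis.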
466
490
 
467
491
  At loop end (COMPLETE, BLOCKED, or TIMEOUT):
@@ -556,7 +580,7 @@ Track `consecutive_failures` in `status.json` (increment on `fail`, reset on `pa
556
580
 
557
581
  Track `verified_us` (array of US IDs that passed verification) in `status.json` when using `--verify-mode per-us`.
558
582
 
559
- When `--verify-consensus` is enabled, also track in `status.json`:
583
+ When `--consensus` is not `off`, also track in `status.json`:
560
584
  - `consensus_round`: current consensus round for this US (resets per US)
561
585
  - `claude_verdict`: latest claude verifier verdict for this US
562
586
  - `codex_verdict`: latest codex verifier verdict for this US
@@ -577,8 +601,8 @@ Read `.claude/ralph-desk/logs/<slug>/runtime/status.json` and display a detailed
577
601
  Campaign: <slug>
578
602
  Iteration: <iteration> / <max_iter>
579
603
  Phase: <phase> | Last Result: <last_result>
580
- Worker Model: <worker_model> (<worker_engine>) | Verifier Model: <verifier_model> (<verifier_engine>)
581
- Verify Mode: <verify_mode> | Consensus: <verify_consensus>
604
+ Worker Model: <worker_model> | Verifier: <verifier_model> (per-US) / <final_verifier_model> (final)
605
+ Verify Mode: <verify_mode> | Consensus: <consensus_mode>
582
606
  Consecutive Failures: <consecutive_failures>
583
607
  Verified US: <verified_us array, comma-separated>
584
608
  Updated: <updated_at_utc> (elapsed: now - updated_at)
@@ -652,7 +676,7 @@ Resume a previously interrupted campaign. Equivalent to `run <slug>` but explici
652
676
  1. Read `.claude/ralph-desk/logs/<slug>/runtime/status.json` for `verified_us`, `iteration`, `consecutive_failures`
653
677
  2. Read `.claude/ralph-desk/memos/<slug>-memory.md` for completed stories and next iteration contract
654
678
  3. Check for sentinels (`complete.md`, `blocked.md`) — if present, inform user and stop
655
- 4. If no sentinels, invoke `run <slug>` with the same options from the previous session (stored in status.json fields: `worker_model`, `verifier_model`, `verify_mode`, `verify_consensus`)
679
+ 4. If no sentinels, invoke `run <slug>` with the same options from the previous session (stored in status.json fields: `worker_model`, `verifier_model`, `final_verifier_model`, `verify_mode`, `consensus_mode`)
656
680
  5. The runner automatically restores `verified_us` from memory or status.json on startup
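Step 4 above, re-invoking `run <slug>` with the options stored in `status.json`, can be sketched as a field-to-flag mapping. `resume_args` is a hypothetical helper; the field names come from the text, and skipping missing or empty fields is an assumption of this sketch.

```python
from typing import Dict, List

# status.json field -> CLI flag, per the field list in step 4.
FIELD_TO_FLAG = {
    "worker_model": "--worker-model",
    "verifier_model": "--verifier-model",
    "final_verifier_model": "--final-verifier-model",
    "verify_mode": "--verify-mode",
    "consensus_mode": "--consensus",
}


def resume_args(slug: str, status: Dict[str, object]) -> List[str]:
    """Rebuild the run arguments from a previously saved status.json payload."""
    args = ["run", slug]
    for field, flag in FIELD_TO_FLAG.items():
        value = status.get(field)
        if value:  # skip fields that were never recorded
            args += [flag, str(value)]
    return args


status = {"worker_model": "spark:high", "verifier_model": "sonnet",
          "final_verifier_model": "opus", "verify_mode": "per-us",
          "consensus_mode": "final-only", "iteration": 7}
print(" ".join(resume_args("auth-refactor", status)))
```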
657
681
 
658
682
  Example:
@@ -670,23 +694,20 @@ Example:
670
694
  /rlp-desk clean <slug> [--kill-session] Reset for re-run (--kill-session kills tmux)
671
695
 
672
696
  Run options:
673
- --mode agent|tmux Execution mode (default: agent)
674
- --max-iter N Max iterations (default: 100)
675
- --worker-model MODEL Worker model (default: sonnet)
676
- --verifier-model MODEL Verifier model (default: opus)
677
- --worker-engine claude|codex Worker engine (default: claude)
678
- --verifier-engine claude|codex Verifier engine (default: claude)
679
- --worker-codex-model MODEL Worker codex model (default: gpt-5.4)
680
- --worker-codex-reasoning LEVEL Worker codex reasoning (default: high)
681
- --verifier-codex-model MODEL Verifier codex model (default: gpt-5.4)
682
- --verifier-codex-reasoning LEVEL Verifier codex reasoning (default: high)
683
- --verify-mode per-us|batch Verification strategy (default: per-us)
684
- --verify-consensus Cross-engine consensus verification
685
- --consensus-scope SCOPE When consensus runs: all|final-only (default: all)
686
- --cb-threshold N CB threshold: consecutive failures before BLOCKED (default: 3)
687
- --iter-timeout N Per-iteration timeout in seconds, tmux mode only (default: 600)
688
- --debug Debug logging (~/.claude/ralph-desk/analytics/<slug>/debug.log)
689
- --with-self-verification Campaign self-verification analysis (post-loop report)
697
+ --mode agent|tmux Execution mode (default: agent)
698
+ --worker-model MODEL Worker model: haiku|sonnet|opus or spark:high|gpt-5.4:high (default: haiku)
699
+ --lock-worker-model Disable auto model upgrade on failure
700
+ --verifier-model MODEL per-US verifier (default: sonnet)
701
+ --final-verifier-model MODEL Final ALL verifier (default: opus)
702
+ --consensus off|all|final-only Cross-engine consensus (default: off)
703
+ --consensus-model MODEL per-US cross-verifier (default: gpt-5.4:medium)
704
+ --final-consensus-model MODEL Final cross-verifier (default: gpt-5.4:high)
705
+ --verify-mode per-us|batch Verification strategy (default: per-us)
706
+ --cb-threshold N Consecutive failures before BLOCKED (default: 6)
707
+ --max-iter N Max iterations (default: 100)
708
+ --iter-timeout N Per-iteration timeout, tmux only (default: 600)
709
+ --debug Debug logging (~/.claude/ralph-desk/analytics/<slug>/debug.log)
710
+ --with-self-verification Campaign self-verification analysis (post-loop report)
690
711
  ```
691
712
 
692
713
  ## Architecture
package/src/governance.md CHANGED
@@ -14,7 +14,7 @@ The Leader orchestrates, while Worker/Verifier run in isolated fresh contexts ev
14
14
  - **Worker must NEVER modify Claude Code settings** (settings.json, settings.local.json). Permission prompts must be reported as blocked, not bypassed by editing settings.
15
15
  - **Verifier is independent**: The Verifier judges based on evidence alone, without knowledge of the Worker's reasoning process.
16
16
  - **Sentinels are Leader-owned**: Only the Leader writes COMPLETE/BLOCKED sentinels.
17
- - **Supported engines**: claude (default; models: haiku, sonnet, opus) and codex (opt-in via `--worker-engine codex` / `--verifier-engine codex`).
17
+ - **Supported engines**: claude (default; models: haiku, sonnet, opus) and codex (opt-in via `--worker-model spark:high` or `--worker-model gpt-5.4:high`).
18
18
 
19
19
  ## 1a. Iron Laws
20
20
 
@@ -117,7 +117,21 @@ Skipping any step = invalid verification (IL-1 violation).
117
117
  - "Code inspection" as substitute for automated command execution
118
118
  - Citing cached/prior results instead of fresh execution
119
119
 
120
- ## 1c. Risk Classification
120
+ ## 1c. Task Sizing Principle
121
+
122
+ Tasks must be sized within the assigned Worker's comfortable zone — not at its ceiling.
123
+ A task that pushes a Worker to its maximum capability will frequently fail in fresh-context execution,
124
+ because context budget must also cover PRD reading, test writing, and evidence collection.
125
+
126
+ Rules:
127
+ - Each US: max 3-4 ACs, max 2 changed files, completable in 1-2 iterations.
128
+ - If a task is at the edge of a Worker's capability, either split the task or upgrade the Worker model.
129
+ - Leader model selection: choose a model that can succeed comfortably, not the minimum viable model.
130
+ - During brainstorm: when proposing US splits, target "smaller than what the Worker can handle" not "as much as the Worker can handle."
131
+
132
+ This aligns with the original Ralph Loop principle: small tasks succeed most of the time.
133
+
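The sizing rules above can be checked mechanically before a campaign starts. The sketch below is illustrative only: `UserStory`, `sizing_violations`, and the two constants are hypothetical names, not part of rlp-desk, and it assumes the "max 3-4 ACs" rule is enforced with 4 as the hard cap.

```python
from dataclasses import dataclass, field

# Assumed hard caps, read from "max 3-4 ACs, max 2 changed files".
MAX_ACS = 4
MAX_FILES = 2


@dataclass
class UserStory:
    """Hypothetical container for one US proposed during brainstorm."""
    us_id: str
    acceptance_criteria: list[str] = field(default_factory=list)
    changed_files: list[str] = field(default_factory=list)


def sizing_violations(us: UserStory) -> list[str]:
    """Return human-readable reasons why a US should be split further."""
    problems = []
    if len(us.acceptance_criteria) > MAX_ACS:
        problems.append(
            f"{us.us_id}: {len(us.acceptance_criteria)} ACs exceeds max {MAX_ACS}"
        )
    if len(us.changed_files) > MAX_FILES:
        problems.append(
            f"{us.us_id}: {len(us.changed_files)} changed files exceeds max {MAX_FILES}"
        )
    return problems
```

An empty result means the US fits the stated limits. It does not prove the US sits in the Worker's comfortable zone, which remains a judgment call for the Leader.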
134
+ ## 1c½. Risk Classification
121
135
 
122
136
  Each US is classified by risk level during brainstorm. Higher risk = more verification layers.
123
137
 
@@ -207,6 +221,12 @@ Worker records what was done, in what order, with command evidence in `done-clai
207
221
  ### Verifier: reasoning in verify-verdict.json
208
222
  Verifier records WHY each judgment was made in `verify-verdict.json`:
209
223
  - Each check includes: what was checked, decision (pass/fail), and the specific evidence basis
224
+ - **failure_category** (required on fail verdicts): Verifier classifies each issue's root cause as one of:
225
+ - `spec` — AC is ambiguous, contradictory, or untestable (suggests IL-2 re-assessment, not model upgrade)
226
+ - `implementation` — code logic error, missing case, wrong algorithm (model upgrade may help)
227
+ - `integration` — individual pieces work but interaction fails (suggests task split or architecture review)
228
+ - `flaky` — non-deterministic failure, timing, environment (suggests retry, not escalation)
229
+ Leader uses failure_category to decide between model upgrade, spec refinement, or architecture escalation.
210
230
  - Checks include: IL-1 Evidence Gate, Layer Enforcement, Test Sufficiency, Anti-Gaming, Worker Process Audit, Test Coverage Audit
211
231
  - This proves the Verifier actually performed each check rather than rubber-stamping
212
232
  - **Test Coverage Audit (mandatory)**: Verifier MUST check that tests cover ALL code paths, not just happy paths. Specifically:
@@ -263,25 +283,30 @@ RUNNING → DONE_CLAIMED → VERIFYING → COMPLETE | CONTINUE | BLOCKED
263
283
 
264
284
  | Role | Default Model | Override Criteria |
265
285
  |------|---------------|-------------------|
266
- | Worker (simple) | haiku | Single file, clear change |
267
- | Worker (standard) | sonnet | Most tasks (default) |
268
- | Worker (complex) | opus | Architecture changes, multi-file, prior iteration failure |
269
- | Verifier | opus | Independent verification requires thoroughness |
270
- | Verifier (lightweight) | sonnet | Simple, well-defined checks only |
286
+ | Worker | haiku | Default; auto-upgrades on failure (sonnet → opus) |
287
+ | Worker (locked) | haiku | `--lock-worker-model` disables auto-upgrade |
288
+ | Verifier (per-US) | sonnet | Lightweight; campaign-fixed (no progressive upgrade) |
289
+ | Verifier (final) | opus | Full rigor; independent of per-US model |
290
+
291
+ **Worker auto-upgrade**: When a Worker fails, the Leader upgrades the model for the retry (haiku → sonnet → opus). This upgrade is Worker-only. Verifier model is campaign-fixed — it does not upgrade on failure.
292
+
293
+ **Verifier model is campaign-fixed**: `--verifier-model` applies to all per-US verifications throughout the campaign. `--final-verifier-model` applies to the final ALL verification. Neither upgrades automatically.
271
294
 
272
295
  The Leader decides each iteration. Decision criteria:
273
- - Previous iteration failed → upgrade model
274
- - Simple repetitive task → downgrade model
296
+ - Previous iteration failed → upgrade Worker model (unless `--lock-worker-model`)
297
+ - Simple repetitive task → keep current Worker model
275
298
  - User explicitly specified → use as given
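The decision criteria above can be read as a small selection function. This is a sketch, not the Leader's actual logic: `next_worker_model` and its parameters are hypothetical, and leaving a model outside the haiku/sonnet/opus path unchanged is an assumption, since the source does not state an upgrade path for other models.

```python
from typing import Optional

# Upgrade path stated in the text: haiku -> sonnet -> opus.
UPGRADE_PATH = ["haiku", "sonnet", "opus"]


def next_worker_model(
    current: str,
    previous_failed: bool,
    locked: bool = False,            # models --lock-worker-model
    user_choice: Optional[str] = None,
) -> str:
    """Pick the Worker model for the next iteration."""
    if user_choice is not None:
        return user_choice           # user explicitly specified: use as given
    if locked or not previous_failed:
        return current               # locked or no failure: keep current model
    if current not in UPGRADE_PATH:
        return current               # assumption: no upgrade path for other models
    idx = UPGRADE_PATH.index(current)
    return UPGRADE_PATH[min(idx + 1, len(UPGRADE_PATH) - 1)]
```

For example, `next_worker_model("haiku", previous_failed=True)` yields `"sonnet"`, while the same call with `locked=True` stays on `"haiku"`. The Verifier model never goes through this function because it is campaign-fixed.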
276
299
 
277
300
  ### Codex (opt-in engine)
278
301
 
279
- | Option | Default | Description |
280
- |--------|---------|-------------|
281
- | `--codex-model` | `gpt-5.4` | Model passed to the `codex` CLI |
282
- | `--codex-reasoning` | `high` | Reasoning effort: `low`, `medium`, or `high` |
302
+ Model routing uses `--worker-model` and `--verifier-model` with codex format: `spark:high` or `gpt-5.4:high`.
303
+
304
+ ```
305
+ --worker-model spark:high # codex worker, spark model, high reasoning
306
+ --verifier-model gpt-5.4:high # codex verifier, gpt-5.4, high reasoning
307
+ ```
283
308
 
284
- Model routing is static when using codex: the same model and reasoning effort apply to both Worker and Verifier. There is no dynamic upgrade path. Claude is the default engine; codex is explicitly opt-in.
309
+ `parse_model_flag()` auto-detects engine from the model name: plain names (haiku, sonnet, opus) = claude; `name:reasoning` format = codex. Claude is the default engine; codex is explicitly opt-in.
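The detection rule described above amounts to checking for a `name:reasoning` separator. The Python sketch below illustrates that rule only. It is not the package's `parse_model_flag()`; `ModelSpec` and `detect_engine` are hypothetical names, and rejecting an empty name or reasoning part is an assumption.

```python
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    engine: str               # "claude" or "codex"
    model: str                # e.g. "sonnet" or "gpt-5.4"
    reasoning: Optional[str]  # e.g. "high"; None for claude models


def detect_engine(flag_value: str) -> ModelSpec:
    """Infer the engine from a --worker-model / --verifier-model value."""
    if ":" in flag_value:
        # name:reasoning format selects codex
        model, reasoning = flag_value.split(":", 1)
        if not model or not reasoning:
            raise ValueError(f"malformed codex model flag: {flag_value!r}")
        return ModelSpec("codex", model, reasoning)
    # plain names (haiku, sonnet, opus) select claude
    return ModelSpec("claude", flag_value, None)
```

With this rule, `detect_engine("gpt-5.4:high")` selects codex with reasoning effort `high`, and `detect_engine("sonnet")` selects claude, so no separate engine flag is needed.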
285
310
 
286
311
  ## 5a. Execution: Agent() Approach (default) — "Smart Mode"
287
312
 
@@ -305,7 +330,7 @@ Agent(
305
330
  )
306
331
  ```
307
332
 
308
- If `--worker-engine codex` or `--verifier-engine codex` (opt-in):
333
+ If `--worker-model` or `--verifier-model` uses the codex format (opt-in; e.g., `spark:high`, `gpt-5.4:high`):
309
334
  ```
310
335
  # Worker or Verifier (codex engine)
311
336
  Bash("codex -m <codex_model> -c model_reasoning_effort=<codex_reasoning> --dangerously-bypass-approvals-and-sandbox <prompt>")
@@ -460,7 +485,7 @@ for iteration in 1..max_iter:
460
485
  ⑦ Execute Verifier (see §7a for per-US and §7b for consensus details)
461
486
  - Build prompt (scoped to us_id if per-us mode) → log
462
487
  - Agent(subagent_type="executor", model=selected, prompt=prompt)
463
- - If --verify-consensus: run second verifier with alternate engine (see §7b)
488
+ - If --consensus is not off: run second verifier with alternate engine (see §7b)
464
489
  - Read verify-verdict.json:
465
490
  • pass + specific US → add to verified_us, Worker does next US
466
491
  • pass + us_id=ALL or complete → write COMPLETE sentinel, stop
@@ -507,26 +532,44 @@ Worker completes US-001 → signal verify (us_id: "US-001")
507
532
 
508
533
  ## 7b. Cross-Engine Consensus Verification
509
534
 
510
- When `--verify-consensus` is enabled, after the primary verifier runs, a second verifier runs with the alternate engine:
535
+ Controlled by `--consensus off|all|final-only` (default: `off`).
536
+
537
+ - `off`: single engine verification only
538
+ - `all`: cross-engine consensus on every per-US verify and final ALL verify
539
+ - `final-only`: cross-engine consensus only on final ALL verify
540
+
541
+ When consensus is active, after the primary verifier runs, a second verifier runs with the alternate engine:
511
542
 
512
543
  ```
513
544
  Worker completes US → signal verify
514
- Claude Verifier runs (checks AC)
515
- Codex Verifier runs (checks AC)
545
+ Primary Verifier runs (checks AC)
546
+ Cross Verifier runs (checks AC)
516
547
  → Both pass → proceed (next US or COMPLETE)
517
548
  → Either fails → combined issues → fix contract → Worker retry
518
549
  → Max 6 consensus rounds per US → BLOCKED if still disagreeing
519
-
520
- **NO ENGINE PRIORITY:** Claude and Codex have equal weight. If one passes and the other fails, the verdict is FAIL. No engine may be prioritized or dismissed. Infrastructure failure = CLI crash, timeout, or verdict file not generated — NOT a valid verdict with verdict=fail.
521
550
  ```
522
551
 
552
+ **NO ENGINE PRIORITY:** Both verifiers have equal weight. If one passes and the other fails, the verdict is FAIL. No engine may be prioritized or dismissed. Infrastructure failure = CLI crash, timeout, or verdict file not generated — NOT a valid verdict with verdict=fail.
553
+
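The `--consensus` modes and the equal-weight rule can be sketched as two small functions. The names `consensus_applies` and `combined_verdict` are hypothetical, and representing an infrastructure failure as a missing (`None`) verdict that raises an error is an assumption. The source only says such a failure is not a valid `fail` verdict.

```python
from typing import Optional


def consensus_applies(mode: str, is_final: bool) -> bool:
    """Decide whether a cross verifier runs for this verification."""
    if mode == "off":
        return False
    if mode == "all":
        return True
    if mode == "final-only":
        return is_final
    raise ValueError(f"unknown --consensus mode: {mode!r}")


def combined_verdict(primary: Optional[str], cross: Optional[str]) -> str:
    """Combine two verdicts with equal weight: both must pass."""
    if primary is None or cross is None:
        # Assumption: a missing verdict file means CLI crash or timeout,
        # which is an infrastructure failure, not a fail verdict.
        raise RuntimeError("infrastructure failure: verdict missing")
    return "pass" if primary == "pass" and cross == "pass" else "fail"
```

Because neither argument is privileged, swapping the primary and cross verdicts never changes the result.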
554
+ ### Consensus Model Routing
555
+
556
+ | Scenario | Primary verifier | Cross verifier |
557
+ |----------|-----------------|----------------|
558
+ | per-US, primary=claude | `--verifier-model` (sonnet) | `--consensus-model` (gpt-5.4:medium) |
559
+ | per-US, primary=codex | `--verifier-model` | claude opus (fixed) |
560
+ | final, primary=claude | `--final-verifier-model` (opus) | `--final-consensus-model` (gpt-5.4:high) |
561
+ | final, primary=codex | `--final-verifier-model` | claude opus (fixed) |
562
+
563
+ - Both must pass. No engine priority.
564
+ - spark is not allowed as a consensus cross verifier (100k output limit).
565
+
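The routing table can be expressed as a lookup. The sketch below assumes the parenthesized values in the table (`sonnet`, `opus`, `gpt-5.4:medium`, `gpt-5.4:high`) are the flag defaults, and that the primary engine is inferred from the `name:reasoning` format. `route_consensus` is a hypothetical name, not a function in the package.

```python
def route_consensus(
    stage: str,                                   # "per-us" or "final"
    verifier_model: str = "sonnet",               # --verifier-model
    final_verifier_model: str = "opus",           # --final-verifier-model
    consensus_model: str = "gpt-5.4:medium",      # --consensus-model
    final_consensus_model: str = "gpt-5.4:high",  # --final-consensus-model
) -> tuple[str, str]:
    """Return (primary verifier model, cross verifier model)."""
    if stage not in ("per-us", "final"):
        raise ValueError(f"unknown stage: {stage!r}")
    primary = final_verifier_model if stage == "final" else verifier_model
    if ":" in primary:
        # Primary is codex: the cross verifier is fixed to claude opus.
        return primary, "opus"
    cross = final_consensus_model if stage == "final" else consensus_model
    if cross.split(":", 1)[0] == "spark":
        raise ValueError("spark is not allowed as a consensus cross verifier")
    return primary, cross
```

Calling `route_consensus("per-us")` with the assumed defaults returns `("sonnet", "gpt-5.4:medium")`, matching the first table row.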
523
566
  **Key rules:**
524
567
  - Both claude and codex CLI must be installed
525
568
  - Verifiers run sequentially in the same Verifier pane (tmux) or as sequential calls (Agent mode)
526
569
  - Verdicts are saved as `verify-verdict-claude.json` and `verify-verdict-codex.json`
527
570
  - Combined fix contracts include issues from both engines
528
571
  - `status.json` includes `consensus_round`, `claude_verdict`, and `codex_verdict` fields
529
- - Consensus can be combined with per-US verification (each US gets consensus-verified)
572
+ - Consensus can be combined with per-US verification (`--consensus all`: each US gets consensus-verified)
530
573
 
531
574
  ## 7½. Fix Loop Protocol
532
575
 
@@ -574,7 +617,7 @@ In tmux mode: Leader writes `<slug>-escalation.md` with the report and sets BLOC
574
617
  |-----------|---------|
575
618
  | context-latest.md unchanged for 3 consecutive iterations | BLOCKED |
576
619
  | Same acceptance criterion fails 2 consecutive iterations | Upgrade model, retry once (Agent mode only; tmux: same model retry); if still failing → Architecture Escalation (§7¾) → BLOCKED |
577
- | `cb_threshold` (default: 6) consecutive **fail** verdicts on `cb_threshold` unique criterion IDs | Upgrade to opus, retry once; if still failing → BLOCKED (adjustable via `--cb-threshold`) |
620
+ | `cb_threshold` (default: 6) consecutive **fail** verdicts on `cb_threshold` unique criterion IDs | Upgrade to opus, retry once; if still failing → BLOCKED (adjustable via `--cb-threshold`; when `--consensus` is not `off`, effective threshold doubles automatically: default 6 → 12) |
578
621
  | max_iter reached | TIMEOUT (report to user) |
579
622
 
580
623
  The Leader tracks `consecutive_failures` in `status.json`:
@@ -582,6 +625,14 @@ The Leader tracks `consecutive_failures` in `status.json`:
582
625
  - "Same error" = same acceptance criterion ID in two consecutive **fail** verdicts (`request_info` does not break or contribute to this chain).
583
626
  - "Diverse failures" = `cb_threshold` most recent `fail` verdicts each have a unique criterion ID.
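The two chain definitions and the consensus doubling rule can be sketched as follows. The verdict history is modeled as a list of `(verdict, criterion_id)` tuples, which is an assumed shape rather than the actual `status.json` layout. Treating a `pass` verdict as ending the consecutive chain is also an assumption. All function names are hypothetical.

```python
from typing import Optional

Verdict = tuple[str, Optional[str]]  # (verdict, criterion_id)


def effective_threshold(cb_threshold: int = 6, consensus: str = "off") -> int:
    """The threshold doubles whenever --consensus is not off."""
    return cb_threshold if consensus == "off" else cb_threshold * 2


def trailing_fail_ids(history: list[Verdict]) -> list[Optional[str]]:
    """Criterion IDs of the current run of fail verdicts, most recent first."""
    ids = []
    for verdict, criterion_id in reversed(history):
        if verdict == "request_info":
            continue  # neither breaks nor contributes to the chain
        if verdict != "fail":
            break     # assumption: any other verdict ends the chain
        ids.append(criterion_id)
    return ids


def same_error(history: list[Verdict]) -> bool:
    """Same criterion ID in two consecutive fail verdicts."""
    ids = trailing_fail_ids(history)
    return len(ids) >= 2 and ids[0] == ids[1]


def diverse_failures(history: list[Verdict], threshold: int) -> bool:
    """The `threshold` most recent fail verdicts each have a unique ID."""
    ids = trailing_fail_ids(history)[:threshold]
    return len(ids) == threshold and len(set(ids)) == threshold
```

In this sketch the Leader would call `diverse_failures(history, effective_threshold(6, "all"))`, so a consensus campaign needs 12 unique consecutive failures before the circuit breaker fires.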
584
627
 
628
+ ## 8½. Self-Verification Feedback Loop
629
+
630
+ When `--with-self-verification` is enabled, the SV report feeds back into the next brainstorm cycle:
631
+ - SV report identifies patterns: which US types fail most, which AC quality issues recur, which model tiers underperform.
632
+ - Next brainstorm SHOULD reference the prior campaign's SV report (if available) to inform US sizing, model selection, and AC quality standards.
633
+ - This creates an iterative improvement loop: campaign → SV report → next brainstorm → better campaign.
634
+ - The loop operates whether the reviewer is human or system — readiness to iterate is what matters.
635
+
585
636
  ## 9. Change Policy
586
637
 
587
638
  - Changes to the shared workflow → modify this document