@ai-dev-methodologies/rlp-desk 0.1.2 → 0.2.1

package/README.md CHANGED
@@ -134,6 +134,12 @@ for iteration in 1..max_iter:
134
134
  | `--worker-model MODEL` | sonnet | Worker model (haiku/sonnet/opus) |
135
135
  | `--verifier-model MODEL` | opus | Verifier model (haiku/sonnet/opus) |
136
136
  | `--mode agent\|tmux` | agent | Execution mode (see below) |
137
+ | `--worker-engine claude\|codex` | claude | Engine for Worker (claude uses Agent(), codex uses Bash CLI) |
138
+ | `--verifier-engine claude\|codex` | claude | Engine for Verifier |
139
+ | `--codex-model MODEL` | gpt-5.4 | Model passed to the Codex CLI (when engine=codex) |
140
+ | `--codex-reasoning low\|medium\|high` | high | Reasoning effort for Codex |
141
+ | `--verify-mode per-us\|batch` | per-us | Verification strategy (see below) |
142
+ | `--verify-consensus` | off | Cross-engine consensus verification (see below) |
137
143
 
138
144
  ## Execution Modes
139
145
 
@@ -193,6 +199,98 @@ To clean up tmux artifacts:
193
199
  /rlp-desk clean calculator --kill-session
194
200
  ```
195
201
 
202
+ ## Engine Support
203
+
204
+ RLP Desk supports two execution engines for Worker and Verifier. **Claude is the default.** Codex is opt-in.
205
+
206
+ ### Claude (default)
207
+
208
+ ```
209
+ /rlp-desk run calculator
210
+ /rlp-desk run calculator --worker-engine claude --verifier-engine claude
211
+ ```
212
+
213
+ Uses Claude Code's `Agent()` tool (agent mode) or `claude -p` CLI (tmux mode). Supports dynamic model routing (haiku/sonnet/opus).
214
+
215
+ ### Codex (opt-in)
216
+
217
+ ```bash
218
+ # Install codex CLI first
219
+ npm install -g @openai/codex
220
+
221
+ # Run with codex worker
222
+ /rlp-desk run calculator --worker-engine codex
223
+
224
+ # Customize model and reasoning effort
225
+ /rlp-desk run calculator --worker-engine codex --codex-model gpt-5.4 --codex-reasoning high
226
+
227
+ # Mix engines: codex worker, claude verifier
228
+ /rlp-desk run calculator --worker-engine codex --verifier-engine claude
229
+ ```
230
+
231
+ Uses the `codex` CLI via `Bash()` (agent mode) or as an interactive TUI (tmux mode). The `codex` binary is only required when an engine is set to `codex`.
232
+
233
+ | Engine | Agent Mode | Tmux Mode | Dynamic Routing |
234
+ |--------|-----------|-----------|-----------------|
235
+ | claude | `Agent()` tool | `claude -p` TUI | Yes (haiku/sonnet/opus) |
236
+ | codex | `Bash("codex ...")` | `codex` TUI | No (static model) |
237
+
238
+ ## Verification Modes
239
+
240
+ RLP Desk supports two verification strategies. **Per-US is the default.**
241
+
242
+ ### Per-US Verification (default)
243
+
244
+ ```
245
+ /rlp-desk run calculator
246
+ /rlp-desk run calculator --verify-mode per-us
247
+ ```
248
+
249
+ Each user story is verified independently after completion, then a final full verification runs after all stories pass:
250
+
251
+ ```
252
+ Worker: US-001 → Verifier: US-001 AC only → pass
253
+ Worker: US-002 → Verifier: US-002 AC only → pass
254
+ Worker: US-003 → Verifier: US-003 AC only → pass
255
+ Final full verify: ALL AC → pass → COMPLETE
256
+ ```
257
+
258
+ Benefits:
259
+ - Catch issues early, before later stories build on broken foundations
260
+ - Smaller verification scope = faster, more accurate checks
261
+ - Failed verification retries only the specific US
262
+
263
+ ### Batch Verification
264
+
265
+ ```
266
+ /rlp-desk run calculator --verify-mode batch
267
+ ```
268
+
269
+ Legacy behavior: Worker completes all stories, then a single verification checks all acceptance criteria at once.
270
+
271
+ ### Cross-Engine Consensus Verification
272
+
273
+ ```
274
+ /rlp-desk run calculator --verify-consensus
275
+ ```
276
+
277
+ When enabled, **both claude and codex verify independently**. Both must pass for verification to succeed.
278
+
279
+ ```
280
+ Worker completes US → Claude verifies → Codex verifies
281
+ Both pass → proceed
282
+ Either fails → combined fix contract → Worker retry
283
+ 3 rounds without consensus → BLOCKED
284
+ ```
285
+
286
+ Consensus can be combined with per-US mode for maximum rigor:
287
+
288
+ ```
289
+ /rlp-desk run calculator --verify-mode per-us --verify-consensus
290
+ ```
291
+
292
+ Prerequisites: Both `claude` and `codex` CLIs must be installed.
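
The consensus flow added in this hunk can be sketched as a small loop. This is a minimal illustration, not the package's implementation: `run_consensus`, the callable parameters, and the `issues` key are hypothetical names, and treating the third failed round as BLOCKED without a further Worker retry is one reading of "3 rounds without consensus → BLOCKED".

```python
from typing import Any, Callable, Dict, List, Tuple

Verdict = Dict[str, Any]  # assumed shape: {"verdict": "pass" | "fail" | ..., "issues": [...]}

MAX_CONSENSUS_ROUNDS = 3  # from the README: 3 rounds without consensus -> BLOCKED


def run_consensus(
    verify_claude: Callable[[], Verdict],
    verify_codex: Callable[[], Verdict],
    worker_retry: Callable[[List[Any]], None],
) -> Tuple[str, int]:
    """Return ("proceed" | "blocked", rounds_used) for one user story."""
    for round_no in range(1, MAX_CONSENSUS_ROUNDS + 1):
        claude = verify_claude()
        codex = verify_codex()
        # Both engines must pass; any other verdict counts as "no consensus".
        if claude["verdict"] == "pass" and codex["verdict"] == "pass":
            return "proceed", round_no
        if round_no == MAX_CONSENSUS_ROUNDS:
            break  # assumption: no Worker retry after the final round
        # Combined fix contract: issues from both verdicts go back to the Worker.
        combined = list(claude.get("issues", [])) + list(codex.get("issues", []))
        worker_retry(combined)
    return "blocked", MAX_CONSENSUS_ROUNDS
```

Passing stub callables (for example, a Codex stub that fails once and then passes) is enough to exercise the retry path without either CLI installed.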
293
+
196
294
  ## Project Structure
197
295
 
198
296
  After `init`, your project gets this scaffold:
@@ -109,6 +109,7 @@ Written by the Worker at the end of every iteration. Provides a structured JSON
109
109
  {
110
110
  "iteration": 3,
111
111
  "status": "continue|verify|blocked",
112
+ "us_id": "US-001",
112
113
  "summary": "Completed US-001, other stories remain",
113
114
  "timestamp": "2025-01-15T10:30:00Z"
114
115
  }
@@ -118,12 +119,13 @@ Written by the Worker at the end of every iteration. Provides a structured JSON
118
119
  |-------|------|-------------|
119
120
  | `iteration` | number | Current iteration number |
120
121
  | `status` | string | One of: `continue`, `verify`, `blocked` |
122
+ | `us_id` | string\|null | US being verified: `"US-001"`, `"ALL"` (final full verify), or null (batch mode) |
121
123
  | `summary` | string | Brief description of what was accomplished |
122
124
  | `timestamp` | string | ISO 8601 UTC timestamp |
123
125
 
124
126
  **Status values:**
125
127
  - `continue` -- Current action done but more work remains. Leader proceeds to next iteration.
126
- - `verify` -- All work complete and done-claim written. Leader dispatches Verifier.
128
+ - `verify` -- Current US complete (per-US mode) or all work complete (batch mode). Leader dispatches Verifier scoped to `us_id`.
127
129
  - `blocked` -- Autonomous blocker encountered. Leader writes BLOCKED sentinel.
128
130
 
129
131
  **Usage by mode:**
@@ -160,6 +162,7 @@ Written by the Verifier after independent verification.
160
162
  ```json
161
163
  {
162
164
  "verdict": "pass|fail|request_info",
165
+ "us_id": "US-001",
163
166
  "verified_at_utc": "2025-01-15T10:35:00Z",
164
167
  "summary": "All criteria verified with fresh evidence",
165
168
  "criteria_results": [
@@ -185,7 +188,7 @@ Written by the Verifier after independent verification.
185
188
  ```
186
189
 
187
190
  **Verdict values:**
188
- - `pass`: all criteria met — Leader may write COMPLETE sentinel
191
+ - `pass`: all criteria met — Leader may write COMPLETE sentinel (or add US to `verified_us` in per-US mode)
189
192
  - `fail`: one or more criteria not met — Leader reads issues, builds next contract
190
193
  - `request_info`: Verifier cannot determine pass/fail without more information — summary contains specific questions; Leader decides outcome and may relay questions to Worker
191
194
 
@@ -363,14 +366,28 @@ Updated by the Leader after each iteration:
363
366
  "phase": "worker|verifier|complete|blocked|timeout",
364
367
  "worker_model": "sonnet",
365
368
  "verifier_model": "opus",
369
+ "worker_engine": "claude",
370
+ "verifier_engine": "claude",
371
+ "verify_mode": "per-us",
372
+ "verify_consensus": 0,
366
373
  "last_result": "continue|verify|pass|fail|blocked",
367
374
  "consecutive_failures": 0,
375
+ "verified_us": ["US-001", "US-002"],
376
+ "consensus_round": 0,
377
+ "claude_verdict": "",
378
+ "codex_verdict": "",
368
379
  "updated_at_utc": "2025-01-15T10:30:00Z"
369
380
  }
370
381
  ```
371
382
 
372
383
  - `consecutive_failures`: number of consecutive Verifier `fail` verdicts since the last `pass`. Reset to 0 on any `pass`. Unchanged by `request_info`. Used by the Circuit Breaker (see above).
373
384
  - `last_failing_criteria`: (optional) array of criterion IDs from recent `fail` verdicts, used by Leader to detect same-criterion and diverse-failure CB patterns. Leaders may add additional tracking fields as needed.
385
+ - `verified_us`: array of US IDs that have individually passed verification (per-US mode only). Empty in batch mode.
386
+ - `verify_mode`: `per-us` or `batch`. Controls the verification strategy.
387
+ - `verify_consensus`: `0` or `1`. Whether cross-engine consensus verification is enabled.
388
+ - `consensus_round`: current consensus round for the active US (resets per US). Only present when `verify_consensus=1`.
389
+ - `claude_verdict`: latest claude verifier verdict. Only present when `verify_consensus=1`.
390
+ - `codex_verdict`: latest codex verifier verdict. Only present when `verify_consensus=1`.
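
As a rough sketch of how the field rules above fit together, the function below applies one Verifier verdict to a `status.json` dictionary. The name `apply_verdict` and the choice to return a copy are assumptions made for illustration; only the field names and the reset/increment rules come from the documentation in this hunk.

```python
from typing import Any, Dict


def apply_verdict(status: Dict[str, Any], verdict: Dict[str, Any]) -> Dict[str, Any]:
    """Return an updated copy of status after one Verifier verdict."""
    updated = dict(status)
    updated["verified_us"] = list(status.get("verified_us", []))
    result = verdict["verdict"]
    us_id = verdict.get("us_id")

    if result == "pass":
        # Any pass resets the Circuit Breaker counter.
        updated["consecutive_failures"] = 0
        updated["last_result"] = "pass"
        # Only per-US mode tracks individually verified stories;
        # "ALL" (final full verify) and null (batch) are not story IDs.
        if (
            status.get("verify_mode") == "per-us"
            and us_id not in (None, "ALL")
            and us_id not in updated["verified_us"]
        ):
            updated["verified_us"].append(us_id)
    elif result == "fail":
        updated["consecutive_failures"] = status.get("consecutive_failures", 0) + 1
        updated["last_result"] = "fail"
    # request_info: counters and verified_us stay unchanged.
    return updated
```

In batch mode `verified_us` stays empty under this sketch, which matches the field description above.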
374
391
 
375
392
  ## Project Plans Files
376
393
 
@@ -390,7 +407,7 @@ The `quality-spec` file is not generated by `init`. Create it manually when a pr
390
407
  |---------|-----------|-------------|
391
408
  | `brainstorm` | `<description>` | Interactive planning before init |
392
409
  | `init` | `<slug> [objective]` | Create project scaffold |
393
- | `run` | `<slug> [--max-iter N] [--worker-model M] [--verifier-model M] [--mode agent\|tmux]` | Run the leader loop |
410
+ | `run` | `<slug> [--max-iter N] [--worker-model M] [--verifier-model M] [--mode agent\|tmux] [--worker-engine claude\|codex] [--verifier-engine claude\|codex] [--codex-model MODEL] [--codex-reasoning low\|medium\|high] [--verify-mode per-us\|batch] [--verify-consensus]` | Run the leader loop |
394
411
  | `status` | `<slug>` | Display current loop status |
395
412
  | `logs` | `<slug> [N]` | Show iteration logs |
396
413
  | `clean` | `<slug> [--kill-session]` | Remove runtime artifacts for re-run |
@@ -402,6 +419,76 @@ The `run` command accepts `--mode agent|tmux` (default: `agent`).
402
419
  - **`--mode agent`** (default): The current Claude Code session acts as the Leader, dispatching Workers and Verifiers via `Agent()`. Synchronous, no tmux required.
403
420
  - **`--mode tmux`**: Validates the scaffold, checks prerequisites (`tmux`, `jq`), then launches `run_ralph_desk.zsh` as the Leader. The LLM session exits after launching the script. The shell script runs independently in a tmux session.
404
421
 
422
+ ### Engine Options
423
+
424
+ The `run` command accepts engine flags to control which CLI executes Worker and Verifier prompts. **Claude is the default engine.**
425
+
426
+ | Flag | Default | Description |
427
+ |------|---------|-------------|
428
+ | `--worker-engine claude\|codex` | `claude` | Engine for Worker |
429
+ | `--verifier-engine claude\|codex` | `claude` | Engine for Verifier |
430
+ | `--codex-model MODEL` | `gpt-5.4` | Model passed to the `codex` CLI (when engine=codex) |
431
+ | `--codex-reasoning low\|medium\|high` | `high` | Reasoning effort for the `codex` CLI |
432
+
433
+ **Claude engine** (default): uses `Agent()` in agent mode, `claude -p` with `--dangerously-skip-permissions` in tmux mode.
434
+
435
+ **Codex engine** (opt-in): uses `Bash("codex ...")` in agent mode, interactive `codex` TUI in tmux mode. The `codex` binary must be installed separately (`npm install -g @openai/codex`) and is only required when an engine is set to `codex`.
436
+
437
+ Engine flags are passed to tmux mode via environment variables: `WORKER_ENGINE`, `VERIFIER_ENGINE`, `CODEX_MODEL`, `CODEX_REASONING`.
438
+
439
+ ### Verify Mode Options
440
+
441
+ The `run` command accepts `--verify-mode` to control the verification strategy. **Per-US is the default.**
442
+
443
+ | Flag | Default | Description |
444
+ |------|---------|-------------|
445
+ | `--verify-mode per-us\|batch` | `per-us` | Verification strategy |
446
+ | `--verify-consensus` | off | Cross-engine consensus verification |
447
+
448
+ **Per-US mode** (default): After each user story is completed, the Verifier checks only that story's acceptance criteria. After all stories individually pass, a final full verify checks all AC. The Leader tracks `verified_us` in `status.json`.
449
+
450
+ **Batch mode**: Legacy behavior. Worker completes all stories, then a single verification checks all AC at once.
451
+
452
+ **Consensus verification** (`--verify-consensus`): After the primary verifier runs, a second verifier runs with the alternate engine (claude or codex). Both must pass. If either fails, combined issues form a fix contract. Max 3 consensus rounds per US before BLOCKED. Requires both `claude` and `codex` CLIs.
453
+
454
+ Verify mode flags are passed to tmux mode via environment variables: `VERIFY_MODE`, `VERIFY_CONSENSUS`.
455
+
456
+ ### Iteration Signal (`us_id` field)
457
+
458
+ When using per-US verification, the Worker includes a `us_id` field in `iter-signal.json`:
459
+
460
+ ```json
461
+ {
462
+ "iteration": 3,
463
+ "status": "verify",
464
+ "us_id": "US-001",
465
+ "summary": "Completed US-001",
466
+ "timestamp": "2025-01-15T10:30:00Z"
467
+ }
468
+ ```
469
+
470
+ | Value | Meaning |
471
+ |-------|---------|
472
+ | `"US-001"` (specific) | Verify only this story's AC |
473
+ | `"ALL"` | Final full verify — check all AC |
474
+ | absent/null | Legacy batch mode — check all AC |
475
+
476
+ The Verifier's verdict JSON also includes a `us_id` field to confirm which scope was verified.
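
The value table above maps directly onto a three-way branch. The sketch below parses the text of an `iter-signal.json` file and returns the verification scope; `verification_scope` and the scope labels `"story"`, `"all"`, and `"batch"` are hypothetical names chosen for this example, not identifiers from the package.

```python
import json
from typing import Optional, Tuple


def verification_scope(signal_text: str) -> Tuple[str, Optional[str]]:
    """Map the us_id field of an iteration signal to a verification scope."""
    signal = json.loads(signal_text)
    us_id = signal.get("us_id")  # missing key and JSON null both yield None
    if us_id is None:
        return ("batch", None)   # legacy batch mode: check all AC
    if us_id == "ALL":
        return ("all", None)     # final full verify: check all AC
    return ("story", us_id)      # verify only this story's AC
```

A Leader could use the returned pair to choose between the "Verify ONLY ..." and "Verify ALL ..." prompt wording.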
477
+
478
+ ### Consensus Verdict Fields
479
+
480
+ When `--verify-consensus` is enabled, `status.json` includes additional fields:
481
+
482
+ ```json
483
+ {
484
+ "consensus_round": 1,
485
+ "claude_verdict": "pass",
486
+ "codex_verdict": "pass"
487
+ }
488
+ ```
489
+
490
+ Individual engine verdicts are saved as `verify-verdict-claude.json` and `verify-verdict-codex.json` in the logs directory.
491
+
405
492
  ### `--kill-session` Flag
406
493
 
407
494
  The `clean` command accepts `--kill-session` to kill any tmux sessions matching the slug pattern (`rlp-desk-<slug>-*`) in addition to removing runtime files.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@ai-dev-methodologies/rlp-desk",
3
- "version": "0.1.2",
3
+ "version": "0.2.1",
4
4
  "description": "Fresh-context iterative loops for Claude Code — autonomous task completion with independent verification",
5
5
  "scripts": {
6
6
  "postinstall": "node scripts/postinstall.js",
@@ -31,7 +31,16 @@ Ask about these items one by one (or in small groups):
31
31
  5. **Verification Commands** — build, test, lint commands
32
32
  6. **Completion / Blocked Criteria**
33
33
  7. **Worker / Verifier Model** — haiku, sonnet, opus. Suggest defaults (worker: sonnet, verifier: opus), ask if OK.
34
- 8. **Max Iterations** — suggest based on story count, ask if OK.
34
+ 8. **Engine & Model** — For each role (Worker, Verifier):
35
+ - Engine: claude (default) or codex
36
+ - If claude: suggest model (haiku/sonnet/opus) based on task complexity
37
+ - If codex: suggest model (default: gpt-5.4) and reasoning effort (low/medium/high)
38
+ - AI should recommend: "For this task complexity, I suggest Worker: sonnet, Verifier: opus"
39
+ - If codex selected: "For codex Worker, I suggest gpt-5.4 with high reasoning"
40
+ 9. **Verify Mode** — per-us (default) or batch. Ask: "Verify after each user story (per-us, recommended) or only after all stories are done (batch)?" Default recommendation: per-us for 2+ stories.
41
+ 10. **Verify Consensus** — Ask: "Use cross-engine consensus verification? (Both claude and codex verify independently, both must pass.) Requires codex CLI." Default: no.
42
+ 11. **Consensus Scope** — If consensus enabled, ask: "Consensus on every verify (all, default) or only on final verify (final-only)?" Default: all.
43
+ 12. **Max Iterations** — suggest based on story count, ask if OK.
35
44
 
36
45
  After all items are confirmed, present the full contract summary.
37
46
  On approval, offer to run `init`.
@@ -56,6 +65,19 @@ Options (parse from `$ARGUMENTS`):
56
65
  - `--max-iter N` (default: 100)
57
66
  - `--worker-model MODEL` (default: sonnet)
58
67
  - `--verifier-model MODEL` (default: opus)
68
+ - `--worker-engine claude|codex` (default: `claude`) — engine for Worker
69
+ - `--verifier-engine claude|codex` (default: `claude`) — engine for Verifier
70
+ - `--worker-codex-model MODEL` (default: `gpt-5.4`) — codex model for Worker
71
+ - `--worker-codex-reasoning low|medium|high` (default: `high`) — reasoning for Worker
72
+ - `--verifier-codex-model MODEL` (default: `gpt-5.4`) — codex model for Verifier
73
+ - `--verifier-codex-reasoning low|medium|high` (default: `high`) — reasoning for Verifier
74
+ - `--verify-mode per-us|batch` (default: `per-us`) — verification strategy
75
+ - `per-us`: verify after each US, then final full verify of all AC
76
+ - `batch`: verify only after all US done (legacy behavior)
77
+ - `--verify-consensus` — enable cross-engine consensus verification (both claude and codex verify independently; both must pass)
78
+ - `--consensus-scope all|final-only` — when consensus runs (default: `all`)
79
+ - `all`: consensus runs on every verify (current behavior)
80
+ - `final-only`: consensus only on final ALL verify
59
81
  - `--debug` — enable debug logging (tmux mode only, writes to logs/<slug>/debug.log)
60
82
 
61
83
  ### Mode Selection
@@ -77,13 +99,25 @@ ROOT="$PWD" \
77
99
  MAX_ITER=<--max-iter value> \
78
100
  WORKER_MODEL=<--worker-model value> \
79
101
  VERIFIER_MODEL=<--verifier-model value> \
102
+ WORKER_ENGINE=<--worker-engine value, default: claude> \
103
+ VERIFIER_ENGINE=<--verifier-engine value, default: claude> \
104
+ WORKER_CODEX_MODEL=<--worker-codex-model value, default: gpt-5.4> \
105
+ WORKER_CODEX_REASONING=<--worker-codex-reasoning value, default: high> \
106
+ VERIFIER_CODEX_MODEL=<--verifier-codex-model value, default: gpt-5.4> \
107
+ VERIFIER_CODEX_REASONING=<--verifier-codex-reasoning value, default: high> \
108
+ VERIFY_MODE=<--verify-mode value, default: per-us> \
109
+ VERIFY_CONSENSUS=<1 if --verify-consensus, else 0> \
110
+ CONSENSUS_SCOPE=<--consensus-scope value, default: all> \
80
111
  DEBUG=<1 if --debug, else 0> \
81
112
  zsh ~/.claude/ralph-desk/run_ralph_desk.zsh
82
113
  ```
83
114
  6. **If the script exits with error (exit code 1)** — report the error to the user and STOP. Do NOT attempt to work around it. Do NOT create tmux sessions yourself. Do NOT re-launch the script in a different way. Just tell the user what went wrong and suggest using Agent mode instead.
84
115
  7. **If successful** — tell the user the tmux session has been started. The shell script takes over as the deterministic Leader. No Agent() calls are made in tmux mode.
85
116
 
86
- **IMPORTANT:** Tmux mode requires the user to already be inside a tmux session. If the runner script rejects because $TMUX is not set, do NOT try to create a tmux session yourself. Tell the user: "Start tmux first, then retry."
117
+ **IMPORTANT RULES:**
118
+ - Tmux mode requires the user to already be inside a tmux session. If the runner script rejects because $TMUX is not set, do NOT try to create a tmux session yourself. Tell the user: "Start tmux first, then retry."
119
+ - Do NOT run the script in background (`&`, `run_in_background`). The script must run in foreground so panes remain visible to the user. The user needs to see Worker/Verifier panes in real-time.
120
+ - Do NOT kill panes after completion. Panes stay alive for inspection. User cleans up with `/rlp-desk clean <slug> --kill-session`.
87
121
 
88
122
  #### Agent Mode (`--mode agent` or default)
89
123
 
@@ -124,7 +158,9 @@ rm -f .claude/ralph-desk/memos/<slug>-verify-verdict.json
124
158
  - Combine with iteration number + memory contract
125
159
  - Write to `.claude/ralph-desk/logs/<slug>/iter-NNN.worker-prompt.md` (audit trail)
126
160
 
127
- **⑤ Execute Worker via Agent()**
161
+ **⑤ Execute Worker**
162
+
163
+ If `--worker-engine claude` (default):
128
164
  ```
129
165
  Agent(
130
166
  description="rlp-desk worker iter-NNN",
@@ -137,24 +173,69 @@ Agent(
137
173
  - Agent returns synchronously. No polling needed.
138
174
  - Each Agent() = fresh context. Guaranteed.
139
175
 
176
+ If `--worker-engine codex`:
177
+ ```
178
+ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_reasoning> <full worker prompt text>")
179
+ ```
180
+ - Codex runs as a subprocess via Bash(), not Agent().
181
+ - Each Bash() call = fresh context for codex.
182
+
140
183
  **⑥ Read memory.md again** (Worker updated it)
141
184
  - `stop=continue` → go to ⑧
142
185
  - `stop=verify` → go to ⑦
143
186
  - `stop=blocked` → write BLOCKED sentinel, stop
144
-
145
- **⑦ Execute Verifier via Agent()**
146
- - Build verifier prompt, write to `iter-NNN.verifier-prompt.md`
187
+ - Also read `iter-signal.json` for `us_id` field (which US was just completed)
188
+
189
+ **⑦ Execute Verifier**
190
+
191
+ **Per-US mode** (default, `--verify-mode per-us`):
192
+ - Read `us_id` from `iter-signal.json` (e.g., "US-001" or "ALL")
193
+ - Build verifier prompt scoped to `us_id`:
194
+ - If `us_id` is a specific story: "Verify ONLY the acceptance criteria for {us_id}"
195
+ - If `us_id` is "ALL": "Verify ALL acceptance criteria (final full verify)"
196
+ - Write to `iter-NNN.verifier-prompt.md`
197
+ - Track verified US in `status.json` field `verified_us` (array)
198
+ - After verifier passes a specific US:
199
+ - Add that US to `verified_us` in status.json
200
+ - If more US remain → Worker does next US → verify → ...
201
+ - If all US individually passed → signal final full verify (us_id=ALL)
202
+ - After final full verify passes → COMPLETE
203
+
204
+ **Batch mode** (`--verify-mode batch`):
205
+ - Legacy behavior: verify only when Worker signals all work is done
206
+ - Verifier checks all AC at once
207
+
208
+ **⑦a Dispatch Verifier**
209
+
210
+ If `--verifier-engine claude` (default):
147
211
  ```
148
212
  Agent(
149
- description="rlp-desk verifier iter-NNN",
213
+ description="rlp-desk verifier iter-NNN (us_id)",
150
214
  subagent_type="executor",
151
215
  model=<verifier_model>,
152
216
  mode="bypassPermissions",
153
- prompt=<full verifier prompt text>
217
+ prompt=<full verifier prompt text with US scope>
154
218
  )
155
219
  ```
156
- - Read `verify-verdict.json`:
220
+
221
+ If `--verifier-engine codex`:
222
+ ```
223
+ Bash("codex exec --model <verifier_codex_model> --reasoning-effort <verifier_codex_reasoning> <full verifier prompt text>")
224
+ ```
225
+
226
+ **⑦b Consensus Verification** (when `--verify-consensus` is enabled):
227
+ After the primary verifier runs, run a second verifier with the OTHER engine:
228
+ - If primary engine is claude → run codex verifier
229
+ - If primary engine is codex → run claude verifier
230
+ - Both produce `verify-verdict.json` (Leader renames to `verify-verdict-claude.json` and `verify-verdict-codex.json`)
231
+ - **Both pass** → proceed (next US or COMPLETE)
232
+ - **Either fails** → combine issues from both verdicts into a single fix contract → Worker retry
233
+ - Max 3 consensus rounds per US. After 3 rounds → BLOCKED.
234
+
235
+ **⑦c Read verdict(s)**
236
+ - Read `verify-verdict.json` (or both `-claude.json` and `-codex.json` if consensus):
157
237
  - `pass` + `complete` → write COMPLETE sentinel, report done!
238
+ - `pass` + specific US → add to `verified_us`, Worker does next US
158
239
  - `fail` + `continue` → **run Fix Loop** (governance.md §7½):
159
240
  1. Read `issues` array, sort by severity (`critical` → `major` → `minor`)
160
241
  2. Build structured fix contract with traceability rule
@@ -180,6 +261,13 @@ Agent(
180
261
 
181
262
  Track `consecutive_failures` in `status.json` (increment on `fail`, reset on `pass`, unchanged by `request_info`). Only **fail** verdicts count for CB chains — `request_info` does not break or contribute.
182
263
 
264
+ Track `verified_us` (array of US IDs that passed verification) in `status.json` when using `--verify-mode per-us`.
265
+
266
+ When `--verify-consensus` is enabled, also track in `status.json`:
267
+ - `consensus_round`: current consensus round for this US (resets per US)
268
+ - `claude_verdict`: latest claude verifier verdict for this US
269
+ - `codex_verdict`: latest codex verifier verdict for this US
270
+
183
271
  ### Important Rules
184
272
  - Each Agent() = new process = fresh context
185
273
  - YOU track iteration count
@@ -216,10 +304,26 @@ tmux list-sessions -F '#{session_name}' 2>/dev/null | grep "^rlp-desk-<slug>-" |
216
304
  ```
217
305
  /rlp-desk brainstorm <description> Plan before init (interactive)
218
306
  /rlp-desk init <slug> [objective] Create project scaffold
219
- /rlp-desk run <slug> [--mode agent|tmux] Run loop (agent=LLM leader, tmux=shell leader)
307
+ /rlp-desk run <slug> [options] Run loop (agent=LLM leader, tmux=shell leader)
220
308
  /rlp-desk status <slug> Show loop status
221
309
  /rlp-desk logs <slug> [N] Show iteration log
222
310
  /rlp-desk clean <slug> [--kill-session] Reset for re-run (--kill-session kills tmux)
311
+
312
+ Run options:
313
+ --mode agent|tmux Execution mode (default: agent)
314
+ --max-iter N Max iterations (default: 100)
315
+ --worker-model MODEL Worker model (default: sonnet)
316
+ --verifier-model MODEL Verifier model (default: opus)
317
+ --worker-engine claude|codex Worker engine (default: claude)
318
+ --verifier-engine claude|codex Verifier engine (default: claude)
319
+ --worker-codex-model MODEL Worker codex model (default: gpt-5.4)
320
+ --worker-codex-reasoning LEVEL Worker codex reasoning (default: high)
321
+ --verifier-codex-model MODEL Verifier codex model (default: gpt-5.4)
322
+ --verifier-codex-reasoning LEVEL Verifier codex reasoning (default: high)
323
+ --verify-mode per-us|batch Verification strategy (default: per-us)
324
+ --verify-consensus Cross-engine consensus verification
325
+ --consensus-scope SCOPE When consensus runs: all|final-only (default: all)
326
+ --debug Debug logging (tmux mode only)
223
327
  ```
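
The run options listed above, together with the environment variables shown in the tmux launch step, can be modeled with a short `argparse` sketch. This is illustrative only: the real command is a slash command whose options are parsed from `$ARGUMENTS`, `build_parser` and `tmux_env` are hypothetical helpers, and only the variables visible in this diff are mapped.

```python
import argparse
from typing import Dict

LEVELS = ["low", "medium", "high"]


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented run options and their defaults."""
    p = argparse.ArgumentParser(prog="rlp-desk run")
    p.add_argument("slug")
    p.add_argument("--mode", choices=["agent", "tmux"], default="agent")
    p.add_argument("--max-iter", type=int, default=100)
    p.add_argument("--worker-model", default="sonnet")
    p.add_argument("--verifier-model", default="opus")
    p.add_argument("--worker-engine", choices=["claude", "codex"], default="claude")
    p.add_argument("--verifier-engine", choices=["claude", "codex"], default="claude")
    p.add_argument("--worker-codex-model", default="gpt-5.4")
    p.add_argument("--worker-codex-reasoning", choices=LEVELS, default="high")
    p.add_argument("--verifier-codex-model", default="gpt-5.4")
    p.add_argument("--verifier-codex-reasoning", choices=LEVELS, default="high")
    p.add_argument("--verify-mode", choices=["per-us", "batch"], default="per-us")
    p.add_argument("--verify-consensus", action="store_true")
    p.add_argument("--consensus-scope", choices=["all", "final-only"], default="all")
    p.add_argument("--debug", action="store_true")
    return p


def tmux_env(args: argparse.Namespace) -> Dict[str, str]:
    """Build the environment passed to the tmux runner (visible variables only)."""
    return {
        "MAX_ITER": str(args.max_iter),
        "WORKER_MODEL": args.worker_model,
        "VERIFIER_MODEL": args.verifier_model,
        "WORKER_ENGINE": args.worker_engine,
        "VERIFIER_ENGINE": args.verifier_engine,
        "WORKER_CODEX_MODEL": args.worker_codex_model,
        "WORKER_CODEX_REASONING": args.worker_codex_reasoning,
        "VERIFIER_CODEX_MODEL": args.verifier_codex_model,
        "VERIFIER_CODEX_REASONING": args.verifier_codex_reasoning,
        "VERIFY_MODE": args.verify_mode,
        "VERIFY_CONSENSUS": "1" if args.verify_consensus else "0",
        "CONSENSUS_SCOPE": args.consensus_scope,
        "DEBUG": "1" if args.debug else "0",
    }
```

For example, `tmux_env(build_parser().parse_args(["calculator", "--worker-engine", "codex"]))` keeps every other variable at its documented default.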
224
328
 
225
329
  ## Architecture
package/src/governance.md CHANGED
@@ -12,7 +12,7 @@ The Leader orchestrates, while Worker/Verifier run in isolated fresh contexts ev
12
12
  - **Worker claim ≠ complete**: A Worker's DONE is merely a claim. The Verifier must independently verify before it's confirmed.
13
13
  - **Verifier is independent**: The Verifier judges based on evidence alone, without knowledge of the Worker's reasoning process.
14
14
  - **Sentinels are Leader-owned**: Only the Leader writes COMPLETE/BLOCKED sentinels.
15
- - **Claude models only**: haiku, sonnet, opus.
15
+ - **Supported engines**: claude (default; models: haiku, sonnet, opus) and codex (opt-in via `--worker-engine codex` / `--verifier-engine codex`).
16
16
 
17
17
  ## 2. Roles
18
18
 
@@ -43,7 +43,9 @@ The Leader orchestrates, while Worker/Verifier run in isolated fresh contexts ev
43
43
  RUNNING → DONE_CLAIMED → VERIFYING → COMPLETE | CONTINUE | BLOCKED
44
44
  ```
45
45
 
46
- ## 4. Model Routing (Claude only)
46
+ ## 4. Model Routing
47
+
48
+ ### Claude (default engine)
47
49
 
48
50
  | Role | Default Model | Override Criteria |
49
51
  |------|---------------|-------------------|
@@ -58,12 +60,21 @@ The Leader decides each iteration. Decision criteria:
58
60
  - Simple repetitive task → downgrade model
59
61
  - User explicitly specified → use as given
60
62
 
63
+ ### Codex (opt-in engine)
64
+
65
+ | Option | Default | Description |
66
+ |--------|---------|-------------|
67
+ | `--codex-model` | `gpt-5.4` | Model passed to the `codex` CLI |
68
+ | `--codex-reasoning` | `high` | Reasoning effort: `low`, `medium`, or `high` |
69
+
70
+ Model routing is static when using codex: the same model and reasoning effort apply to both Worker and Verifier. There is no dynamic upgrade path. Claude is the default engine; codex is explicitly opt-in.
71
+
61
72
  ## 5a. Execution: Agent() Approach (default) — "Smart Mode"
62
73
 
63
74
  All environments (Claude Code, OpenCode) use the same Agent tool.
64
75
 
65
76
  ```
66
- # Worker
77
+ # Worker (claude engine, default)
67
78
  Agent(
68
79
  subagent_type="executor",
69
80
  model="sonnet",
@@ -71,7 +82,7 @@ Agent(
71
82
  mode="bypassPermissions"
72
83
  )
73
84
 
74
- # Verifier
85
+ # Verifier (claude engine, default)
75
86
  Agent(
76
87
  subagent_type="executor",
77
88
  model="sonnet",
@@ -80,6 +91,15 @@ Agent(
80
91
  )
81
92
  ```
82
93
 
94
+ If `--worker-engine codex` or `--verifier-engine codex` (opt-in):
95
+ ```
96
+ # Worker or Verifier (codex engine)
97
+ Bash("codex -m <codex_model> -c model_reasoning_effort=<codex_reasoning> --dangerously-bypass-approvals-and-sandbox <prompt>")
98
+ ```
99
+ - Codex runs as a subprocess via `Bash()`, not `Agent()` — the Agent tool is Claude-specific.
100
+ - Each `Bash()` call = fresh context for codex.
101
+ - Claude is the default engine. Codex is explicitly opt-in.
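
When the prompt is long or contains quotes, assembling the `codex` invocation as an argument list avoids shell-quoting problems. The sketch below is an assumption-laden illustration: `codex_argv` is a hypothetical helper, and the flag spellings are copied verbatim from the snippet above rather than checked against any particular `codex` CLI release.

```python
import shlex
from typing import List

ALLOWED_REASONING = ("low", "medium", "high")


def codex_argv(prompt: str, model: str = "gpt-5.4", reasoning: str = "high") -> List[str]:
    """Build the argument list for one fresh-context codex invocation."""
    if reasoning not in ALLOWED_REASONING:
        raise ValueError(f"reasoning must be one of {ALLOWED_REASONING}, got {reasoning!r}")
    return [
        "codex",
        "-m", model,
        "-c", f"model_reasoning_effort={reasoning}",
        "--dangerously-bypass-approvals-and-sandbox",  # flag name as shown in the docs above
        prompt,
    ]


def codex_shell_command(prompt: str, model: str = "gpt-5.4", reasoning: str = "high") -> str:
    """Render the same invocation as a safely quoted shell string."""
    return shlex.join(codex_argv(prompt, model, reasoning))
```

The list form can be handed to `subprocess.run` directly; the string form is only needed where a single shell command line is required, as in a `Bash(...)` call.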
102
+
83
103
  Characteristics:
84
104
  - Each call = fresh context (new subprocess)
85
105
  - Synchronous return. No polling or signal files needed.
@@ -106,14 +126,25 @@ The tmux runner (`run_ralph_desk.zsh`) creates a tmux session with three panes:
106
126
  - **Worker pane** — receives `claude -p` invocations via trigger scripts
107
127
  - **Verifier pane** — receives `claude -p` invocations via trigger scripts
108
128
 
109
- All `claude` CLI calls use `--dangerously-skip-permissions`:
129
+ By default, `claude` CLI calls use `--dangerously-skip-permissions`:
110
130
  ```bash
131
+ # claude engine (default)
111
132
  claude -p "$(cat /path/to/prompt.md)" \
112
133
  --model sonnet \
113
134
  --dangerously-skip-permissions
114
135
  ```
115
136
 
116
- **Security implication:** `--dangerously-skip-permissions` allows the CLI to execute code without user confirmation. The tmux runner requires this because there is no interactive user to approve each action. Only run tmux mode in trusted environments with trusted prompts.
137
+ When `WORKER_ENGINE=codex` or `VERIFIER_ENGINE=codex`, the `codex` CLI is used instead:
138
+ ```bash
139
+ # codex engine (opt-in)
140
+ codex -m gpt-5.4 \
141
+ -c model_reasoning_effort="high" \
142
+ --dangerously-bypass-approvals-and-sandbox \
143
+ "$(cat /path/to/prompt.md)"
144
+ ```
145
+ The codex CLI is only required when an engine is set to `codex`. Claude remains the default engine throughout.
146
+
147
+ **Security implication:** Both `--dangerously-skip-permissions` (claude) and `--dangerously-bypass-approvals-and-sandbox` (codex) allow the CLI to execute code without user confirmation. The tmux runner requires this because there is no interactive user to approve each action. Only run tmux mode in trusted environments with trusted prompts.
117
148
 
118
149
  Characteristics:
119
150
  - Leader is a shell script, not an LLM — zero tokens consumed for orchestration.
@@ -193,17 +224,19 @@ for iteration in 1..max_iter:
193
224
 
194
225
  ⑥ Read memory.md again → check Worker's updated state
195
226
  - "continue" → go to ⑧
196
- - "verify" → go to ⑦
227
+ - "verify" → go to ⑦ (also read iter-signal.json for us_id)
197
228
  - "blocked" → write BLOCKED sentinel, stop
198
229
  Note: In tmux mode, the Leader polls `<slug>-iter-signal.json` instead of
199
230
  parsing memory.md. In Agent() mode, the Leader MAY read iter-signal.json
200
231
  as a structured alternative to parsing the Stop Status from memory.md.
201
232
 
202
- ⑦ Execute Verifier
203
- - Build prompt log to logs/<slug>/iter-NNN.verifier-prompt.md
233
+ ⑦ Execute Verifier (see §7a for per-US and §7b for consensus details)
234
+ - Build prompt (scoped to us_id if per-us mode) → log
204
235
  - Agent(subagent_type="executor", model=selected, prompt=prompt)
236
+ - If --verify-consensus: run second verifier with alternate engine (see §7b)
205
237
  - Read verify-verdict.json:
206
- • pass + complete → write COMPLETE sentinel, stop
238
+ • pass + specific US → add to verified_us, Worker does next US
239
+ • pass + us_id=ALL or complete → write COMPLETE sentinel, stop
207
240
  • fail + continue → go to ⑧
208
241
  • blocked → write BLOCKED sentinel, stop
209
242
 
@@ -211,6 +244,50 @@ for iteration in 1..max_iter:
211
244
  Update status.json, report to user, continue to next iteration
212
245
  ```
213
246
 
247
+ ## 7a. Per-US Verification
248
+
249
+ By default (`--verify-mode per-us`), each user story is verified independently before proceeding to the next:
250
+
251
+ ```
252
+ Worker completes US-001 → signal verify (us_id: "US-001")
253
+ → Verifier checks ONLY US-001 AC → pass
254
+ → Worker completes US-002 → signal verify (us_id: "US-002")
255
+ → Verifier checks ONLY US-002 AC → pass
256
+ → ...
257
+ → All US individually pass → signal verify (us_id: "ALL")
258
+ → Verifier runs FINAL FULL VERIFY (all AC) → pass → COMPLETE
259
+ ```
260
+
261
+ **Key rules:**
262
+ - Worker signals `verify` after each US with `us_id` set in `iter-signal.json`
263
+ - Verifier checks only the scoped US acceptance criteria (or all if us_id=ALL)
264
+ - Leader tracks `verified_us` array in `status.json`
265
+ - If a per-US verify fails, the Worker retries that specific US (fix loop)
266
+ - Final full verify ensures nothing was broken by later changes
267
+
268
+ **Batch mode** (`--verify-mode batch`) preserves legacy behavior: Worker signals `verify` only after all work is done, and the Verifier checks all AC at once.
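
The per-US ordering described in this section reduces to one decision: which scope to verify next. The sketch below assumes the Leader knows the ordered list of story IDs; `next_verify_target` and the `final_passed` flag are hypothetical names used only for this example.

```python
from typing import List, Optional


def next_verify_target(
    stories: List[str],
    verified_us: List[str],
    final_passed: bool = False,
) -> Optional[str]:
    """Return the next us_id to verify, "ALL" for the final pass, or None when complete."""
    for us_id in stories:
        if us_id not in verified_us:
            return us_id  # first story that has not individually passed
    # Every story passed on its own: run the final full verify once.
    return None if final_passed else "ALL"
```

With stories `US-001` and `US-002`, the sequence of targets is `US-001`, then `US-002`, then `ALL`, and finally `None` once the full verify has passed.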
269
+
270
+ ## 7b. Cross-Engine Consensus Verification
271
+
272
+ When `--verify-consensus` is enabled, after the primary verifier runs, a second verifier runs with the alternate engine:
273
+
274
+ ```
275
+ Worker completes US → signal verify
276
+ → Claude Verifier runs (checks AC)
277
+ → Codex Verifier runs (checks AC)
278
+ → Both pass → proceed (next US or COMPLETE)
279
+ → Either fails → combined issues → fix contract → Worker retry
280
+ → Max 3 consensus rounds per US → BLOCKED if still disagreeing
281
+ ```
282
+
283
+ **Key rules:**
284
+ - Both claude and codex CLI must be installed
285
+ - Verifiers run sequentially in the same Verifier pane (tmux) or as sequential calls (Agent mode)
286
+ - Verdicts are saved as `verify-verdict-claude.json` and `verify-verdict-codex.json`
287
+ - Combined fix contracts include issues from both engines
288
+ - `status.json` includes `consensus_round`, `claude_verdict`, and `codex_verdict` fields
289
+ - Consensus can be combined with per-US verification (each US gets consensus-verified)
290
+
214
291
  ## 7½. Fix Loop Protocol
215
292
 
216
293
  When the Verifier returns `fail`, the Leader runs the Fix Loop before issuing the next Worker contract: