@ai-dev-methodologies/rlp-desk 0.5.1 → 0.5.3
- package/README.md +41 -27
- package/docs/plans/frolicking-churning-honey.md +253 -0
- package/install.sh +18 -17
- package/package.json +1 -1
- package/scripts/postinstall.js +32 -13
- package/src/commands/rlp-desk.md +130 -109
- package/src/governance.md +74 -23
- package/src/scripts/lib_ralph_desk.zsh +41 -11
- package/src/scripts/run_ralph_desk.zsh +72 -42
package/src/commands/rlp-desk.md
CHANGED
|
@@ -25,6 +25,7 @@ Ask about these items one by one (or in small groups):
|
|
|
25
25
|
2. **Objective** — what the loop achieves
|
|
26
26
|
3. **User Stories** — discrete units with testable acceptance criteria. Propose a breakdown, ask the user to confirm/modify.
|
|
27
27
|
- Apply INVEST criteria: each US must be Independent, Negotiable, Valuable, Estimable, Small, Testable.
|
|
28
|
+
- **Task Sizing (governance §1c)**: Size each US within the Worker's comfortable zone — smaller than what the Worker can handle, not at its ceiling. Max 3-4 ACs, max 2 files. If a US feels "just barely doable" for the target model, split it further.
|
|
28
29
|
- Each AC MUST use Given/When/Then format with **domain language only** (no class names, API paths, DB tables):
|
|
29
30
|
```
|
|
30
31
|
Given [precondition in domain language]
|
|
@@ -53,39 +54,47 @@ Ask about these items one by one (or in small groups):
|
|
|
53
54
|
| External dependencies | none | 1-2 | 3+ | distributed |
|
|
54
55
|
| Existing code impact | new only | modify | refactor | architecture |
|
|
55
56
|
|
|
56
|
-
**
|
|
57
|
-
- LOW → haiku / sonnet
|
|
58
|
-
- MEDIUM → sonnet / opus
|
|
59
|
-
- HIGH → opus / opus
|
|
60
|
-
- CRITICAL → opus / opus + require human review
|
|
57
|
+
**Codex Detection** — check if codex CLI is installed (`command -v codex`).
|
|
61
58
|
|
|
62
|
-
|
|
59
|
+
**Model mapping — Claude-only** (codex not installed):
|
|
63
60
|
|
|
64
|
-
|
|
65
|
-
|
|
66
|
-
|
|
67
|
-
|
|
61
|
+
| Complexity | Worker | per-US Verifier | Final Verifier | Consensus |
|
|
62
|
+
|------------|--------|-----------------|----------------|-----------|
|
|
63
|
+
| LOW | haiku | sonnet | opus | off |
|
|
64
|
+
| MEDIUM | sonnet | opus | opus | off |
|
|
65
|
+
| HIGH | opus | opus | opus | off |
|
|
66
|
+
| CRITICAL | opus | opus | opus + human | off |
|
|
68
67
|
|
|
69
|
-
**
|
|
68
|
+
**Model mapping — Cross-engine** (codex installed, recommended):
|
|
70
69
|
|
|
71
|
-
|
|
72
|
-
|
|
73
|
-
|
|
74
|
-
|
|
70
|
+
| Complexity | Worker | per-US Verifier | Final Verifier | Consensus |
|
|
71
|
+
|------------|--------|-----------------|----------------|-----------|
|
|
72
|
+
| LOW | spark:high | sonnet | opus | final-only |
|
|
73
|
+
| MEDIUM | spark:high | opus | opus | final-only |
|
|
74
|
+
| HIGH | gpt-5.4:high | opus | opus | all |
|
|
75
|
+
| CRITICAL | gpt-5.4:high | opus | opus + human | all |
|
|
75
76
|
|
|
76
|
-
**
|
|
77
|
-
-
|
|
78
|
-
-
|
|
77
|
+
**Worker model selection** (cross-engine):
|
|
78
|
+
- **spark:high** — default recommendation (Pro token pool = cost savings). PRD AC count <= 15
|
|
79
|
+
- **gpt-5.4:high** — fallback when spark 100k output limit exceeded. PRD AC count > 15
|
|
79
80
|
|
|
80
|
-
|
|
81
|
-
|
|
81
|
+
Present complexity score with evidence to the user, e.g.: "I rate this MEDIUM because: US count=4 (MEDIUM), file scope=2 (MEDIUM), logic=conditionals (MEDIUM), deps=none (LOW), impact=modify (MEDIUM). Highest=MEDIUM."
|
|
82
|
+
|
|
83
|
+
**If codex IS installed** — say: "Codex is installed. I recommend cross-engine Worker for cost savings (Pro token pool separation) and cross-engine blind-spot coverage (claude Verifier catches issues codex Worker misses)."
|
|
84
|
+
|
|
85
|
+
**If codex is NOT installed** — say: "Codex is not installed. Defaulting to claude-only Worker. Note: without a second engine, your Verifier shares the same perspective as the Worker — there is a risk of blind spots where both Worker and Verifier miss the same issue. To unlock cross-engine coverage: `npm install -g @openai/codex`"
|
|
86
|
+
|
|
87
|
+
8. **Batch Capacity Check** — when verify-mode is batch and PRD is large:
|
|
88
|
+
- batch + spark + AC > 10 → warn "spark 100k output limit — consider wave split or switch to gpt-5.4"
|
|
89
|
+
- batch + gpt-5.4 + AC > 15 → warn "too many ACs for single batch — consider wave split (3-4 US per wave)"
|
|
90
|
+
- per-us → no warning (US-level processing, no limit concern)
|
|
82
91
|
9. **Verify Mode** — per-us (default) or batch. Ask: "Verify after each user story (per-us, recommended) or only after all stories are done (batch)?" Default recommendation: per-us for 2+ stories.
|
|
83
|
-
10. **
|
|
84
|
-
11. **
|
|
85
|
-
12. **Max Iterations** — suggest based on story count, ask if OK.
|
|
92
|
+
10. **Consensus** — Ask: "Use cross-engine consensus? off (single engine), final-only (cross-engine on final verify only), or all (cross-engine on every verify). Requires codex CLI." Default: off. Recommended: final-only when codex is installed.
|
|
93
|
+
11. **Max Iterations** — suggest based on story count, ask if OK.
|
|
86
94
|
|
|
87
95
|
After all items are confirmed:
|
|
88
96
|
|
|
97
|
+
0. **SV Report Feedback** — If a prior campaign's self-verification report exists for this project (`~/.claude/ralph-desk/analytics/*/self-verification-report-*.md`), reference it to inform this brainstorm: which US types failed most, which model tiers underperformed, which AC patterns caused issues. Present relevant findings to the user. (governance §8½)
|
|
89
98
|
1. **Ambiguity Gate (IL-2)** — score each AC per governance §1a IL-2 (6 dimensions, 0-12 points).
|
|
90
99
|
If ANY AC scores below 6: **REJECT** — refine that AC before proceeding.
|
|
91
100
|
If all ACs score 6-9: **WARN** — proceed with logged warning, show low-scoring dimensions.
|
|
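As an illustration of the two model-mapping tables and the Batch Capacity Check added in this hunk, the lookup could be sketched in POSIX shell roughly as follows. This is a sketch under assumptions, not code from the package: `recommend_models` and `batch_capacity_warning` are hypothetical names, and `opus+human` is only a compact stand-in for "opus + human review".

```shell
#!/bin/sh
# Hypothetical helper: recommend_models COMPLEXITY HAS_CODEX
# Prints "worker per_us_verifier final_verifier consensus" from the tables.
recommend_models() {
  complexity=$1
  has_codex=$2   # 1 when `command -v codex` succeeds, else 0
  case "$complexity" in
    LOW)      claude_worker=haiku;  codex_worker=spark:high;   verifier=sonnet; final=opus;       cross=final-only ;;
    MEDIUM)   claude_worker=sonnet; codex_worker=spark:high;   verifier=opus;   final=opus;       cross=final-only ;;
    HIGH)     claude_worker=opus;   codex_worker=gpt-5.4:high; verifier=opus;   final=opus;       cross=all ;;
    CRITICAL) claude_worker=opus;   codex_worker=gpt-5.4:high; verifier=opus;   final=opus+human; cross=all ;;
    *) echo "unknown complexity: $complexity" >&2; return 1 ;;
  esac
  if [ "$has_codex" = 1 ]; then
    echo "$codex_worker $verifier $final $cross"   # cross-engine table
  else
    echo "$claude_worker $verifier $final off"     # Claude-only table
  fi
}

# Hypothetical helper: batch_capacity_warning VERIFY_MODE WORKER_MODEL AC_COUNT
# Prints a warning only for the batch cases listed in the Batch Capacity Check.
batch_capacity_warning() {
  verify_mode=$1; worker_model=$2; ac_count=$3
  [ "$verify_mode" = batch ] || return 0   # per-us: no warning
  case "$worker_model" in
    spark:*)   [ "$ac_count" -gt 10 ] && echo "spark 100k output limit: consider wave split or switch to gpt-5.4" ;;
    gpt-5.4:*) [ "$ac_count" -gt 15 ] && echo "too many ACs for single batch: consider wave split (3-4 US per wave)" ;;
  esac
  return 0
}

# Codex detection as described in the brainstorm step.
if command -v codex >/dev/null 2>&1; then has_codex=1; else has_codex=0; fi
recommend_models MEDIUM "$has_codex"
batch_capacity_warning batch spark:high 12
```

The last two lines show typical use: the first prints a recommendation that depends on whether `codex` is on the PATH, the second prints the spark warning because 12 ACs exceeds the batch limit of 10.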
@@ -123,27 +132,33 @@ Tell the user:
|
|
|
123
132
|
```
|
|
124
133
|
Available run commands (copy the one you want):
|
|
125
134
|
|
|
126
|
-
# Recommended: cross-engine + final-consensus (cost savings + blind-spot coverage):
|
|
127
|
-
/rlp-desk run <actual-slug> --worker-model
|
|
135
|
+
# ★ Recommended: cross-engine + final-consensus (cost savings + blind-spot coverage):
|
|
136
|
+
/rlp-desk run <actual-slug> --mode tmux --worker-model spark:high --consensus final-only --debug
|
|
128
137
|
|
|
129
|
-
#
|
|
130
|
-
/rlp-desk run <actual-slug> --worker-model gpt-5.
|
|
138
|
+
# Large PRD (AC > 15, exceeds spark 100k limit):
|
|
139
|
+
/rlp-desk run <actual-slug> --mode tmux --worker-model gpt-5.4:high --consensus final-only --debug
|
|
140
|
+
|
|
141
|
+
# Critical (full consensus on every verify):
|
|
142
|
+
/rlp-desk run <actual-slug> --mode tmux --worker-model gpt-5.4:high --consensus all --debug
|
|
131
143
|
|
|
132
144
|
# Claude-only:
|
|
133
145
|
/rlp-desk run <actual-slug> --debug
|
|
134
146
|
|
|
135
|
-
# Basic agent:
|
|
136
|
-
/rlp-desk run <actual-slug>
|
|
137
|
-
|
|
138
147
|
# Full options reference:
|
|
139
|
-
# --mode agent|tmux
|
|
140
|
-
# --worker-model MODEL
|
|
141
|
-
# --
|
|
142
|
-
# --
|
|
143
|
-
# --
|
|
144
|
-
# --
|
|
145
|
-
# --
|
|
146
|
-
# --
|
|
148
|
+
# --mode agent|tmux (default: agent)
|
|
149
|
+
# --worker-model MODEL haiku|sonnet|opus or spark:high|gpt-5.4:high (default: haiku)
|
|
150
|
+
# --lock-worker-model disable auto model upgrade
|
|
151
|
+
# --verifier-model MODEL per-US verifier (default: sonnet)
|
|
152
|
+
# --final-verifier-model MODEL final ALL verifier (default: opus)
|
|
153
|
+
# --consensus off|all|final-only cross-engine consensus (default: off)
|
|
154
|
+
# --consensus-model MODEL per-US cross-verifier (default: gpt-5.4:medium)
|
|
155
|
+
# --final-consensus-model MODEL final cross-verifier (default: gpt-5.4:high)
|
|
156
|
+
# --verify-mode per-us|batch (default: per-us)
|
|
157
|
+
# --cb-threshold N (default: 6)
|
|
158
|
+
# --max-iter N (default: 100)
|
|
159
|
+
# --iter-timeout N tmux only (default: 600)
|
|
160
|
+
# --debug debug logging
|
|
161
|
+
# --with-self-verification post-campaign SV report
|
|
147
162
|
```
|
|
148
163
|
|
|
149
164
|
**If codex is NOT installed** — show claude-only presets + install recommendation:
|
|
@@ -151,7 +166,7 @@ Tell the user:
|
|
|
151
166
|
```
|
|
152
167
|
Available run commands (copy the one you want):
|
|
153
168
|
|
|
154
|
-
# Recommended: tmux mode + claude-only (real-time visibility):
|
|
169
|
+
# ★ Recommended: tmux mode + claude-only (real-time visibility):
|
|
155
170
|
/rlp-desk run <actual-slug> --mode tmux --debug
|
|
156
171
|
|
|
157
172
|
# Agent mode:
|
|
@@ -161,13 +176,17 @@ Tell the user:
|
|
|
161
176
|
npm install -g @openai/codex
|
|
162
177
|
|
|
163
178
|
# Full options reference:
|
|
164
|
-
# --mode agent|tmux
|
|
165
|
-
# --worker-model MODEL
|
|
166
|
-
# --
|
|
167
|
-
# --
|
|
168
|
-
# --
|
|
169
|
-
# --
|
|
170
|
-
# --
|
|
179
|
+
# --mode agent|tmux (default: agent)
|
|
180
|
+
# --worker-model MODEL haiku|sonnet|opus (default: haiku)
|
|
181
|
+
# --lock-worker-model disable auto model upgrade
|
|
182
|
+
# --verifier-model MODEL per-US verifier (default: sonnet)
|
|
183
|
+
# --final-verifier-model MODEL final ALL verifier (default: opus)
|
|
184
|
+
# --verify-mode per-us|batch (default: per-us)
|
|
185
|
+
# --cb-threshold N (default: 6)
|
|
186
|
+
# --max-iter N (default: 100)
|
|
187
|
+
# --iter-timeout N tmux only (default: 600)
|
|
188
|
+
# --debug debug logging
|
|
189
|
+
# --with-self-verification post-campaign SV report
|
|
171
190
|
```
|
|
172
191
|
|
|
173
192
|
Replace `<actual-slug>` with the real slug from this init (e.g. `auth-refactor`).
|
|
@@ -182,24 +201,21 @@ Tell the user:
|
|
|
182
201
|
|
|
183
202
|
Options (parse from `$ARGUMENTS`):
|
|
184
203
|
- `--mode agent|tmux` (default: `agent`) — execution mode
|
|
185
|
-
- `--
|
|
186
|
-
- `--worker-model
|
|
187
|
-
- `--verifier-model MODEL` (default:
|
|
188
|
-
- `--
|
|
189
|
-
- `--
|
|
190
|
-
-
|
|
191
|
-
-
|
|
192
|
-
-
|
|
193
|
-
- `--
|
|
204
|
+
- `--worker-model MODEL` (default: `haiku`) — Worker model. Format: `model` = claude engine, `model:reasoning` = codex engine. Examples: `haiku`, `sonnet`, `opus`, `spark:high`, `gpt-5.4:high`. Parsed by `parse_model_flag()` which auto-splits engine/model/reasoning.
|
|
205
|
+
- `--lock-worker-model` — disable automatic model upgrade on failure (check_model_upgrade). Worker stays on the specified model regardless of consecutive failures.
|
|
206
|
+
- `--verifier-model MODEL` (default: `sonnet`) — per-US verification model. Campaign-fixed (no progressive upgrade). Lighter than final verifier.
|
|
207
|
+
- `--final-verifier-model MODEL` (default: `opus`) — final ALL verification model. Independent from per-US verifier. Used only for the final full-AC verify pass.
|
|
208
|
+
- `--consensus off|all|final-only` (default: `off`) — cross-engine consensus verification mode.
|
|
209
|
+
- `off`: single-engine verification only
|
|
210
|
+
- `all`: cross-engine consensus on every verify (per-US and final)
|
|
211
|
+
- `final-only`: cross-engine consensus only on the final ALL verify
|
|
212
|
+
- `--consensus-model MODEL` (default: `gpt-5.4:medium`) — per-US cross-verifier model. Lighter weight for cost efficiency.
|
|
213
|
+
- `--final-consensus-model MODEL` (default: `gpt-5.4:high`) — final cross-verifier model. Stricter. Note: spark is not allowed here (100k output limit).
|
|
194
214
|
- `--verify-mode per-us|batch` (default: `per-us`) — verification strategy
|
|
195
215
|
- `per-us`: verify after each US, then final full verify of all AC
|
|
196
216
|
- `batch`: verify only after all US done (legacy behavior)
|
|
197
|
-
- `--
|
|
198
|
-
- `--
|
|
199
|
-
- `all`: consensus runs on every verify (current behavior)
|
|
200
|
-
- `final-only`: consensus only on final ALL verify
|
|
201
|
-
- `--cb-threshold N` — circuit breaker threshold: consecutive failures before BLOCKED (default: 3). When `--verify-consensus` is active, effective threshold is automatically doubled (e.g., default becomes 6).
|
|
202
|
-
- `--consensus-fail-fast` — skip second verifier if first verifier fails (saves time/tokens in consensus mode)
|
|
217
|
+
- `--cb-threshold N` — circuit breaker threshold: consecutive failures before BLOCKED (default: 6). When `--consensus` is not `off`, effective threshold is automatically doubled (e.g., default becomes 12).
|
|
218
|
+
- `--max-iter N` (default: 100)
|
|
203
219
|
- `--iter-timeout N` — per-iteration timeout in seconds (default: 600). Enforced in tmux mode only. Agent mode: not enforced (Agent() has no timeout API).
|
|
204
220
|
- `--debug` — enable debug logging (writes to ~/.claude/ralph-desk/analytics/<slug>/debug.log)
|
|
205
221
|
- `--with-self-verification` — enable campaign-level self-verification analysis. After COMPLETE, Leader analyzes all iteration records (done-claims + verdicts) and generates a campaign self-verification summary with patterns and recommendations for next planning cycle. (Note: execution_steps and reasoning are ALWAYS recorded per governance §1f — this flag adds post-campaign analysis.)
|
|
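The `--worker-model` entry above says `parse_model_flag()` splits a value into engine, model, and reasoning. The diff does not show that function's body, so the following is only a minimal sketch of the described rule (plain name = claude, `model:reasoning` = codex). The name `parse_model_flag_sketch` and the `-` placeholder for "no reasoning level" are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical re-implementation for illustration only; the real
# parse_model_flag() lives in the package's zsh library and may differ.
# Prints "engine model reasoning"; reasoning is "-" for claude models.
parse_model_flag_sketch() {
  case "$1" in
    *:*) echo "codex ${1%%:*} ${1#*:}" ;;   # text before / after the first ':'
    *)   echo "claude $1 -" ;;
  esac
}

parse_model_flag_sketch haiku          # prints: claude haiku -
parse_model_flag_sketch spark:high     # prints: codex spark high
parse_model_flag_sketch gpt-5.4:high   # prints: codex gpt-5.4 high
```

Keying the engine on the presence of `:` keeps a single flag per role, which is presumably why the separate `--worker-codex-model` and `--worker-codex-reasoning` variables are removed elsewhere in this diff.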
@@ -208,7 +224,7 @@ Options (parse from `$ARGUMENTS`):
|
|
|
208
224
|
When `--debug` or `--with-self-verification` is active, analytics data is written to a user-level directory for cross-project aggregation. Contents:
|
|
209
225
|
- `metadata.json` — campaign metadata: slug, project_root, campaign_status, start_time, end_time
|
|
210
226
|
- `debug.log` — debug output (versioned: `debug-v{N}.log` on re-execution)
|
|
211
|
-
- `campaign.jsonl` — per-iteration structured data (versioned: `campaign-v{N}.jsonl` on re-execution). Schema: iter, us_id, worker_model, worker_engine, verifier_engine, claude_verdict, codex_verdict,
|
|
227
|
+
- `campaign.jsonl` — per-iteration structured data (versioned: `campaign-v{N}.jsonl` on re-execution). Schema: iter, us_id, worker_model, worker_engine, verifier_model, verifier_engine, consensus_mode, claude_verdict, codex_verdict, duration_worker_s, duration_verifier_s, project_root, slug, timestamp
|
|
212
228
|
- `self-verification-data.json` — cumulative SV records (agent-mode only, when `--with-self-verification`)
|
|
213
229
|
- `self-verification-report-NNN.md` — versioned SV reports (when `--with-self-verification`)
|
|
214
230
|
|
|
@@ -232,17 +248,14 @@ LOOP_NAME="<slug>" \
|
|
|
232
248
|
ROOT="$PWD" \
|
|
233
249
|
MAX_ITER=<--max-iter value> \
|
|
234
250
|
WORKER_MODEL=<--worker-model value> \
|
|
235
|
-
|
|
236
|
-
|
|
237
|
-
|
|
238
|
-
WORKER_CODEX_MODEL=<--worker-codex-model value, default: gpt-5.4> \
|
|
239
|
-
WORKER_CODEX_REASONING=<--worker-codex-reasoning value, default: high> \
|
|
240
|
-
VERIFIER_CODEX_MODEL=<--verifier-codex-model value, default: gpt-5.4> \
|
|
241
|
-
VERIFIER_CODEX_REASONING=<--verifier-codex-reasoning value, default: high> \
|
|
251
|
+
LOCK_WORKER_MODEL=<1 if --lock-worker-model, else 0> \
|
|
252
|
+
VERIFIER_MODEL=<--verifier-model value, default: sonnet> \
|
|
253
|
+
FINAL_VERIFIER_MODEL=<--final-verifier-model value, default: opus> \
|
|
242
254
|
VERIFY_MODE=<--verify-mode value, default: per-us> \
|
|
243
|
-
|
|
244
|
-
|
|
245
|
-
|
|
255
|
+
CONSENSUS_MODE=<--consensus value, default: off> \
|
|
256
|
+
CONSENSUS_MODEL=<--consensus-model value, default: gpt-5.4:medium> \
|
|
257
|
+
FINAL_CONSENSUS_MODEL=<--final-consensus-model value, default: gpt-5.4:high> \
|
|
258
|
+
CB_THRESHOLD=<--cb-threshold value, default: 6> \
|
|
246
259
|
ITER_TIMEOUT=<--iter-timeout value, default: 600> \
|
|
247
260
|
DEBUG=<1 if --debug, else 0> \
|
|
248
261
|
WITH_SELF_VERIFICATION=<1 if --with-self-verification, else 0> \
|
|
@@ -269,7 +282,7 @@ WITH_SELF_VERIFICATION=<1 if --with-self-verification, else 0> \
|
|
|
269
282
|
|
|
270
283
|
### Preparation
|
|
271
284
|
1. Validate scaffold: `.claude/ralph-desk/prompts/<slug>.worker.prompt.md` etc.
|
|
272
|
-
2. **Codex CLI pre-validation**: If `--
|
|
285
|
+
2. **Codex CLI pre-validation**: If `--consensus` is not `off` OR `--worker-model` uses codex format (contains `:`) OR `--verifier-model` / `--final-verifier-model` / `--consensus-model` / `--final-consensus-model` uses codex format, check that `codex` CLI exists in PATH. If codex CLI not found → STOP immediately, print install instructions (`npm install -g @openai/codex`), do not start the loop.
|
|
273
286
|
3. Check sentinels (complete/blocked). Found → tell user `/rlp-desk clean <slug>`.
|
|
274
287
|
4. Clean previous `done-claim.json`, `verify-verdict.json`.
|
|
275
288
|
5. **Always**: write baseline log entry to `.claude/ralph-desk/logs/<slug>/baseline.log`: `[timestamp] iter=0 phase=start slug=<slug> worker_model=<model> verifier_model=<model>`. Baseline.log captures 1 line per iteration for lightweight post-mortem (always-on, no flag needed).
|
|
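The Codex CLI pre-validation in step 2 above could look roughly like this in POSIX shell. It is a sketch, not the package's code: `needs_codex` and `prevalidate_codex` are hypothetical names, and it assumes the caller passes only the model values the user actually set, so that the codex-format defaults of the consensus models do not force the check when consensus is `off`.

```shell
#!/bin/sh
# Hypothetical helper: needs_codex CONSENSUS_MODE MODEL...
# Succeeds when consensus is enabled or any model uses "model:reasoning".
needs_codex() {
  [ "$1" != off ] && return 0
  shift
  for m in "$@"; do
    case "$m" in *:*) return 0 ;; esac
  done
  return 1
}

# Hypothetical helper: prevalidate_codex CONSENSUS_MODE MODEL...
# Fails with install instructions when codex is required but missing.
prevalidate_codex() {
  if needs_codex "$@" && ! command -v codex >/dev/null 2>&1; then
    echo "codex CLI not found in PATH. Install with: npm install -g @openai/codex" >&2
    return 1
  fi
  return 0
}

# Claude-only run: never requires codex, so this always succeeds.
prevalidate_codex off haiku sonnet opus && echo "ok to start loop"
```

A cross-engine run such as `prevalidate_codex final-only spark:high sonnet opus` would return non-zero on a machine without `codex`, and the caller would then stop before the loop starts.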
@@ -290,9 +303,9 @@ WITH_SELF_VERIFICATION=<1 if --with-self-verification, else 0> \
|
|
|
290
303
|
- If you output "Iter 1 complete, moving to iter 2" as plain text without a tool call, the turn terminates and the loop breaks. This is a platform constraint, not a compliance issue — no amount of "DO NOT STOP" text can override it.
|
|
291
304
|
|
|
292
305
|
If `--debug`, at loop start debug_log the following (3 [OPTION] entries):
|
|
293
|
-
- `[OPTION] slug=<slug> max_iter=<N> verify_mode=<mode>
|
|
294
|
-
- `[OPTION] cb_threshold=<N> effective_cb_threshold=<N>`
|
|
295
|
-
- `[OPTION]
|
|
306
|
+
- `[OPTION] slug=<slug> max_iter=<N> verify_mode=<mode> consensus_mode=<off|all|final-only>`
|
|
307
|
+
- `[OPTION] cb_threshold=<N> effective_cb_threshold=<N> lock_worker_model=<0|1>`
|
|
308
|
+
- `[OPTION] worker_model=<model> verifier_model=<model> final_verifier_model=<model> consensus_model=<model> final_consensus_model=<model>`
|
|
296
309
|
|
|
297
310
|
For each iteration (1 to max_iter):
|
|
298
311
|
|
|
@@ -341,7 +354,9 @@ rm -f .claude/ralph-desk/memos/<slug>-verify-verdict.json
|
|
|
341
354
|
**⑤ Execute Worker**
|
|
342
355
|
- If `--debug`: debug_log `[FLOW] iter=N phase=worker engine=<engine> model=<model> dispatched=true`
|
|
343
356
|
|
|
344
|
-
|
|
357
|
+
Determine engine from `--worker-model` format: plain name (e.g., `haiku`) = claude engine, `model:reasoning` format (e.g., `spark:high`) = codex engine. Use `parse_model_flag()` to split.
|
|
358
|
+
|
|
359
|
+
If claude engine (default):
|
|
345
360
|
```
|
|
346
361
|
Agent(
|
|
347
362
|
description="rlp-desk worker iter-NNN",
|
|
@@ -354,9 +369,9 @@ Agent(
|
|
|
354
369
|
- Agent returns synchronously. No polling needed.
|
|
355
370
|
- Each Agent() = fresh context. Guaranteed.
|
|
356
371
|
|
|
357
|
-
If
|
|
372
|
+
If codex engine:
|
|
358
373
|
```
|
|
359
|
-
Bash("codex exec --model <
|
|
374
|
+
Bash("codex exec --model <codex_model> --reasoning-effort <codex_reasoning> <full worker prompt text>")
|
|
360
375
|
```
|
|
361
376
|
- Codex runs as a subprocess via Bash(), not Agent().
|
|
362
377
|
- Each Bash() call = fresh context for codex.
|
|
@@ -396,26 +411,35 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
|
|
|
396
411
|
- **Prompt Assembly Protocol (same as ④)**: Read verifier prompt file verbatim. Prepend `## WORKING_DIR: {absolute path}`. Do NOT rewrite paths.
|
|
397
412
|
- If `--debug`: debug_log `[FLOW] iter=N phase=verifier engine=<engine> model=<model> scope=<us_id> dispatched=true`
|
|
398
413
|
|
|
399
|
-
|
|
414
|
+
Determine which verifier model to use based on scope:
|
|
415
|
+
- If `us_id` is a specific story (per-US verify) → use `--verifier-model` (default: sonnet)
|
|
416
|
+
- If `us_id` is "ALL" (final verify) → use `--final-verifier-model` (default: opus)
|
|
417
|
+
|
|
418
|
+
Determine engine from the selected verifier model format (same as Worker): plain name = claude, `model:reasoning` = codex.
|
|
419
|
+
|
|
420
|
+
If claude engine (default):
|
|
400
421
|
```
|
|
401
422
|
Agent(
|
|
402
423
|
description="rlp-desk verifier iter-NNN (us_id)",
|
|
403
424
|
subagent_type="executor",
|
|
404
|
-
model=<
|
|
425
|
+
model=<selected_verifier_model>,
|
|
405
426
|
mode="bypassPermissions",
|
|
406
427
|
prompt=<full verifier prompt text with US scope>
|
|
407
428
|
)
|
|
408
429
|
```
|
|
409
430
|
|
|
410
|
-
If
|
|
431
|
+
If codex engine:
|
|
411
432
|
```
|
|
412
|
-
Bash("codex exec --model <
|
|
433
|
+
Bash("codex exec --model <codex_model> --reasoning-effort <codex_reasoning> <full verifier prompt text>")
|
|
413
434
|
```
|
|
414
435
|
|
|
415
|
-
**⑦b Consensus Verification** (when `--
|
|
416
|
-
After the primary verifier runs, run a second verifier
|
|
417
|
-
-
|
|
418
|
-
-
|
|
436
|
+
**⑦b Consensus Verification** (when `--consensus` is `all`, or `final-only` and scope is ALL):
|
|
437
|
+
After the primary verifier runs, run a cross-engine second verifier:
|
|
438
|
+
- Determine cross-verifier model based on scope:
|
|
439
|
+
- per-US verify → use `--consensus-model` (default: gpt-5.4:medium)
|
|
440
|
+
- final ALL verify → use `--final-consensus-model` (default: gpt-5.4:high)
|
|
441
|
+
- If primary engine is claude → cross-verifier uses codex (the consensus model)
|
|
442
|
+
- If primary engine is codex → cross-verifier uses claude `opus` (fixed)
|
|
419
443
|
- Both produce `verify-verdict.json` (Leader renames to `verify-verdict-claude.json` and `verify-verdict-codex.json`)
|
|
420
444
|
- **Both pass** → proceed (next US or COMPLETE)
|
|
421
445
|
- **Either fails** → combine issues from both verdicts into a single fix contract → Worker retry
|
|
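The verifier selection and consensus rules in steps ⑦ and ⑦b can be summarised in a short POSIX shell sketch. The environment variable names (`VERIFIER_MODEL`, `FINAL_VERIFIER_MODEL`, `CONSENSUS_MODEL`, `FINAL_CONSENSUS_MODEL`) and their defaults come from the run command shown earlier in this diff; the function names are hypothetical and the package may organise this logic differently.

```shell
#!/bin/sh
# select_verifier_model SCOPE: "ALL" means the final verify, anything else is a US id.
select_verifier_model() {
  if [ "$1" = ALL ]; then
    echo "${FINAL_VERIFIER_MODEL:-opus}"
  else
    echo "${VERIFIER_MODEL:-sonnet}"
  fi
}

# consensus_applies MODE SCOPE: succeeds when a second verifier should run.
consensus_applies() {
  case "$1" in
    all)        return 0 ;;
    final-only) [ "$2" = ALL ] ;;
    *)          return 1 ;;
  esac
}

# select_cross_verifier SCOPE PRIMARY_MODEL: picks the opposite engine.
select_cross_verifier() {
  case "$2" in
    *:*) echo opus ;;   # primary is codex, so the cross-verifier is claude opus (fixed)
    *)   if [ "$1" = ALL ]; then
           echo "${FINAL_CONSENSUS_MODEL:-gpt-5.4:high}"
         else
           echo "${CONSENSUS_MODEL:-gpt-5.4:medium}"
         fi ;;
  esac
}

# consensus_result CLAUDE_VERDICT CODEX_VERDICT: both must pass.
consensus_result() {
  if [ "$1" = pass ] && [ "$2" = pass ]; then echo pass; else echo fail; fi
}

primary=$(select_verifier_model ALL)
if consensus_applies final-only ALL; then
  echo "primary=$primary cross=$(select_cross_verifier ALL "$primary")"
fi
```

With the defaults and `--consensus final-only`, a per-US verify runs only the primary verifier, while the final ALL verify pairs `opus` with `gpt-5.4:high`.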
@@ -461,7 +485,7 @@ After reading the verdict, archive to `logs/<slug>/`:
|
|
|
461
485
|
- Write `status.json`
|
|
462
486
|
- Report via tool call: `Bash("echo 'Iter N | US-NNN | verdict | model | next_action'")` — NEVER plain text. This keeps the turn alive for the next iteration.
|
|
463
487
|
- **Always**: append to baseline.log: `[timestamp] iter=N verdict=<pass|fail|continue> us=<us_id> model=<worker_model>`
|
|
464
|
-
- **Always**: append JSONL to `~/.claude/ralph-desk/analytics/<slug>/campaign.jsonl`: `{"iter":N,"us_id":"US-NNN","verdict":"pass|fail","worker_model":"...","worker_engine":"...","verifier_model":"...","verifier_engine":"...","duration_worker_s":N,"duration_verifier_s":N,"timestamp":"ISO8601"}`
|
|
488
|
+
- **Always**: append JSONL to `~/.claude/ralph-desk/analytics/<slug>/campaign.jsonl`: `{"iter":N,"us_id":"US-NNN","verdict":"pass|fail","worker_model":"...","worker_engine":"...","verifier_model":"...","verifier_engine":"...","consensus_mode":"off|all|final-only","duration_worker_s":N,"duration_verifier_s":N,"timestamp":"ISO8601"}`
|
|
465
489
|
- If `--debug`: debug_log `[FLOW] iter=N phase=result status=<result> consecutive_failures=<N> verified_us=<list>`
|
|
466
490
|
|
|
467
491
|
At loop end (COMPLETE, BLOCKED, or TIMEOUT):
|
|
@@ -556,7 +580,7 @@ Track `consecutive_failures` in `status.json` (increment on `fail`, reset on `pa
|
|
|
556
580
|
|
|
557
581
|
Track `verified_us` (array of US IDs that passed verification) in `status.json` when using `--verify-mode per-us`.
|
|
558
582
|
|
|
559
|
-
When `--
|
|
583
|
+
When `--consensus` is not `off`, also track in `status.json`:
|
|
560
584
|
- `consensus_round`: current consensus round for this US (resets per US)
|
|
561
585
|
- `claude_verdict`: latest claude verifier verdict for this US
|
|
562
586
|
- `codex_verdict`: latest codex verifier verdict for this US
|
|
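The `consecutive_failures` counter tracked here feeds the circuit breaker described under `--cb-threshold`: the default is 6, and the effective threshold doubles when `--consensus` is not `off`. A minimal sketch of that arithmetic follows. The function names are hypothetical, and treating "failures greater than or equal to the threshold" as the BLOCKED condition is an assumption, since the diff only says "consecutive failures before BLOCKED".

```shell
#!/bin/sh
# effective_cb_threshold CB_THRESHOLD CONSENSUS_MODE
effective_cb_threshold() {
  if [ "$2" = off ]; then
    echo "$1"
  else
    echo $(( $1 * 2 ))   # doubled whenever consensus is all or final-only
  fi
}

# next_failures CURRENT_COUNT VERDICT: increment on fail, reset on pass.
next_failures() {
  if [ "$2" = fail ]; then echo $(( $1 + 1 )); else echo 0; fi
}

# is_blocked CONSECUTIVE_FAILURES CB_THRESHOLD CONSENSUS_MODE
# Assumption: BLOCKED once failures reach the effective threshold.
is_blocked() {
  [ "$1" -ge "$(effective_cb_threshold "$2" "$3")" ]
}

effective_cb_threshold 6 off          # prints 6
effective_cb_threshold 6 final-only   # prints 12
```

Under this reading, six straight failures would block a single-engine campaign but not a consensus campaign, which tolerates up to eleven before the twelfth trips the breaker.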
@@ -577,8 +601,8 @@ Read `.claude/ralph-desk/logs/<slug>/runtime/status.json` and display a detailed
|
|
|
577
601
|
Campaign: <slug>
|
|
578
602
|
Iteration: <iteration> / <max_iter>
|
|
579
603
|
Phase: <phase> | Last Result: <last_result>
|
|
580
|
-
Worker Model: <worker_model>
|
|
581
|
-
Verify Mode: <verify_mode> | Consensus: <
|
|
604
|
+
Worker Model: <worker_model> | Verifier: <verifier_model> (per-US) / <final_verifier_model> (final)
|
|
605
|
+
Verify Mode: <verify_mode> | Consensus: <consensus_mode>
|
|
582
606
|
Consecutive Failures: <consecutive_failures>
|
|
583
607
|
Verified US: <verified_us array, comma-separated>
|
|
584
608
|
Updated: <updated_at_utc> (elapsed: now - updated_at)
|
|
@@ -652,7 +676,7 @@ Resume a previously interrupted campaign. Equivalent to `run <slug>` but explici
|
|
|
652
676
|
1. Read `.claude/ralph-desk/logs/<slug>/runtime/status.json` for `verified_us`, `iteration`, `consecutive_failures`
|
|
653
677
|
2. Read `.claude/ralph-desk/memos/<slug>-memory.md` for completed stories and next iteration contract
|
|
654
678
|
3. Check for sentinels (`complete.md`, `blocked.md`) — if present, inform user and stop
|
|
655
|
-
4. If no sentinels, invoke `run <slug>` with the same options from the previous session (stored in status.json fields: `worker_model`, `verifier_model`, `verify_mode`, `
|
|
679
|
+
4. If no sentinels, invoke `run <slug>` with the same options from the previous session (stored in status.json fields: `worker_model`, `verifier_model`, `final_verifier_model`, `verify_mode`, `consensus_mode`)
|
|
656
680
|
5. The runner automatically restores `verified_us` from memory or status.json on startup
|
|
657
681
|
|
|
658
682
|
Example:
|
|
@@ -670,23 +694,20 @@ Example:
|
|
|
670
694
|
/rlp-desk clean <slug> [--kill-session] Reset for re-run (--kill-session kills tmux)
|
|
671
695
|
|
|
672
696
|
Run options:
|
|
673
|
-
--mode agent|tmux
|
|
674
|
-
--
|
|
675
|
-
--worker-model
|
|
676
|
-
--verifier-model MODEL
|
|
677
|
-
--
|
|
678
|
-
--
|
|
679
|
-
--
|
|
680
|
-
--
|
|
681
|
-
--
|
|
682
|
-
--
|
|
683
|
-
--
|
|
684
|
-
--
|
|
685
|
-
--
|
|
686
|
-
--
|
|
687
|
-
--iter-timeout N Per-iteration timeout in seconds, tmux mode only (default: 600)
|
|
688
|
-
--debug Debug logging (~/.claude/ralph-desk/analytics/<slug>/debug.log)
|
|
689
|
-
--with-self-verification Campaign self-verification analysis (post-loop report)
|
|
697
|
+
--mode agent|tmux Execution mode (default: agent)
|
|
698
|
+
--worker-model MODEL Worker model: haiku|sonnet|opus or spark:high|gpt-5.4:high (default: haiku)
|
|
699
|
+
--lock-worker-model Disable auto model upgrade on failure
|
|
700
|
+
--verifier-model MODEL per-US verifier (default: sonnet)
|
|
701
|
+
--final-verifier-model MODEL Final ALL verifier (default: opus)
|
|
702
|
+
--consensus off|all|final-only Cross-engine consensus (default: off)
|
|
703
|
+
--consensus-model MODEL per-US cross-verifier (default: gpt-5.4:medium)
|
|
704
|
+
--final-consensus-model MODEL Final cross-verifier (default: gpt-5.4:high)
|
|
705
|
+
--verify-mode per-us|batch Verification strategy (default: per-us)
|
|
706
|
+
--cb-threshold N Consecutive failures before BLOCKED (default: 6)
|
|
707
|
+
--max-iter N Max iterations (default: 100)
|
|
708
|
+
--iter-timeout N Per-iteration timeout, tmux only (default: 600)
|
|
709
|
+
--debug Debug logging (~/.claude/ralph-desk/analytics/<slug>/debug.log)
|
|
710
|
+
--with-self-verification Campaign self-verification analysis (post-loop report)
|
|
690
711
|
```
|
|
691
712
|
|
|
692
713
|
## Architecture
|
package/src/governance.md
CHANGED
|
@@ -14,7 +14,7 @@ The Leader orchestrates, while Worker/Verifier run in isolated fresh contexts ev
|
|
|
14
14
|
- **Worker must NEVER modify Claude Code settings** (settings.json, settings.local.json). Permission prompts must be reported as blocked, not bypassed by editing settings.
|
|
15
15
|
- **Verifier is independent**: The Verifier judges based on evidence alone, without knowledge of the Worker's reasoning process.
|
|
16
16
|
- **Sentinels are Leader-owned**: Only the Leader writes COMPLETE/BLOCKED sentinels.
|
|
17
|
-
- **Supported engines**: claude (default; models: haiku, sonnet, opus) and codex (opt-in via `--worker-
|
|
17
|
+
- **Supported engines**: claude (default; models: haiku, sonnet, opus) and codex (opt-in via `--worker-model spark:high` or `--worker-model gpt-5.4:high`).
|
|
18
18
|
|
|
19
19
|
## 1a. Iron Laws
|
|
20
20
|
|
|
@@ -117,7 +117,21 @@ Skipping any step = invalid verification (IL-1 violation).
|
|
|
117
117
|
- "Code inspection" as substitute for automated command execution
|
|
118
118
|
- Citing cached/prior results instead of fresh execution
|
|
119
119
|
|
|
120
|
-
## 1c.
|
|
120
|
+
## 1c. Task Sizing Principle
|
|
121
|
+
|
|
122
|
+
Tasks must be sized within the assigned Worker's comfortable zone — not at its ceiling.
|
|
123
|
+
A task that pushes a Worker to its maximum capability will frequently fail in fresh-context execution,
|
|
124
|
+
because context budget must also cover PRD reading, test writing, and evidence collection.
|
|
125
|
+
|
|
126
|
+
Rules:
|
|
127
|
+
- Each US: max 3-4 ACs, max 2 changed files, completable in 1-2 iterations.
|
|
128
|
+
- If a task is at the edge of a Worker's capability, either split the task or upgrade the Worker model.
|
|
129
|
+
- Leader model selection: choose a model that can succeed comfortably, not the minimum viable model.
|
|
130
|
+
- During brainstorm: when proposing US splits, target "smaller than what the Worker can handle" not "as much as the Worker can handle."
|
|
131
|
+
|
|
132
|
+
This aligns with the original Ralph Loop principle: small tasks succeed most of the time.
|
|
133
|
+
|
|
134
|
+
## 1c½. Risk Classification
|
|
121
135
|
|
|
122
136
|
Each US is classified by risk level during brainstorm. Higher risk = more verification layers.
|
|
123
137
|
|
|
@@ -207,6 +221,12 @@ Worker records what was done, in what order, with command evidence in `done-clai
|
|
|
207
221
|
### Verifier: reasoning in verify-verdict.json
|
|
208
222
|
Verifier records WHY each judgment was made in `verify-verdict.json`:
|
|
209
223
|
- Each check includes: what was checked, decision (pass/fail), and the specific evidence basis
|
|
224
|
+
- **failure_category** (required on fail verdicts): Verifier classifies each issue's root cause as one of:
|
|
225
|
+
- `spec` — AC is ambiguous, contradictory, or untestable (suggests IL-2 re-assessment, not model upgrade)
|
|
226
|
+
- `implementation` — code logic error, missing case, wrong algorithm (model upgrade may help)
|
|
227
|
+
- `integration` — individual pieces work but interaction fails (suggests task split or architecture review)
|
|
228
|
+
- `flaky` — non-deterministic failure, timing, environment (suggests retry, not escalation)
|
|
229
|
+
Leader uses failure_category to decide between model upgrade, spec refinement, or architecture escalation.
|
|
210
230
|
- Checks include: IL-1 Evidence Gate, Layer Enforcement, Test Sufficiency, Anti-Gaming, Worker Process Audit, Test Coverage Audit
|
|
211
231
|
- This proves the Verifier actually performed each check rather than rubber-stamping
|
|
212
232
|
- **Test Coverage Audit (mandatory)**: Verifier MUST check that tests cover ALL code paths, not just happy paths. Specifically:
|
|
@@ -263,25 +283,30 @@ RUNNING → DONE_CLAIMED → VERIFYING → COMPLETE | CONTINUE | BLOCKED
 
 | Role | Default Model | Override Criteria |
 |------|---------------|-------------------|
-| Worker
-| Worker (
-|
-| Verifier | opus |
-
+| Worker | haiku | Default; auto-upgrades on failure (sonnet → opus) |
+| Worker (locked) | haiku | `--lock-worker-model` disables auto-upgrade |
+| Verifier (per-US) | sonnet | Lightweight; campaign-fixed (no progressive upgrade) |
+| Verifier (final) | opus | Full rigor; independent of per-US model |
+
+**Worker auto-upgrade**: When a Worker fails, the Leader upgrades the model for the retry (haiku → sonnet → opus). This upgrade is Worker-only. Verifier model is campaign-fixed — it does not upgrade on failure.
+
+**Verifier model is campaign-fixed**: `--verifier-model` applies to all per-US verifications throughout the campaign. `--final-verifier-model` applies to the final ALL verification. Neither upgrades automatically.
 
 The Leader decides each iteration. Decision criteria:
-- Previous iteration failed → upgrade model
-- Simple repetitive task →
+- Previous iteration failed → upgrade Worker model (unless `--lock-worker-model`)
+- Simple repetitive task → keep current Worker model
 - User explicitly specified → use as given
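
The Worker decision criteria above can be sketched in Python as follows. The ladder haiku → sonnet → opus and the lock behaviour are from the text. The function name, its parameters, and the precedence of an explicit user choice over the other rules are assumptions made for this sketch.

```python
from typing import Optional

# Upgrade ladder taken from the text: haiku -> sonnet -> opus.
LADDER = ["haiku", "sonnet", "opus"]


def choose_worker_model(
    current: str,
    previous_failed: bool,
    locked: bool = False,                  # models --lock-worker-model
    user_specified: Optional[str] = None,  # hypothetical: an explicit user choice
) -> str:
    """Pick the Worker model for the next iteration (illustrative only)."""
    if user_specified is not None:
        return user_specified              # "User explicitly specified -> use as given"
    if locked or not previous_failed:
        return current                     # locked, or nothing failed: keep the model
    if current not in LADDER:
        return current                     # assumption: unknown models are left alone
    step = min(LADDER.index(current) + 1, len(LADDER) - 1)
    return LADDER[step]                    # one step up, capped at opus


print(choose_worker_model("haiku", previous_failed=True))
```

The Verifier is deliberately absent from this function: per the text its model is campaign-fixed and never moves along the ladder.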
 
 ### Codex (opt-in engine)
 
-
-
-
-
+Model routing uses `--worker-model` and `--verifier-model` with codex format: `spark:high` or `gpt-5.4:high`.
+
+```
+--worker-model spark:high # codex worker, spark model, high reasoning
+--verifier-model gpt-5.4:high # codex verifier, gpt-5.4, high reasoning
+```
 
-
+`parse_model_flag()` auto-detects engine from the model name: plain names (haiku, sonnet, opus) = claude; `name:reasoning` format = codex. Claude is the default engine; codex is explicitly opt-in.
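
The detection rule can be illustrated with a small Python sketch. It is not a port of the package's `parse_model_flag()` (which lives in the zsh scripts and is not shown in this diff). `detect_engine` and `ModelSpec` are hypothetical names that only mirror the stated rule: a `name:reasoning` value selects codex, a plain name selects claude.

```python
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    engine: str               # "claude" or "codex"
    model: str
    reasoning: Optional[str]  # only set for codex-format values


def detect_engine(value: str) -> ModelSpec:
    """Apply the rule from the text: a colon means codex, a plain name means claude."""
    model, sep, reasoning = value.partition(":")
    if sep:
        return ModelSpec("codex", model, reasoning)
    return ModelSpec("claude", model, None)


print(detect_engine("spark:high"))
print(detect_engine("sonnet"))
```

Keying the engine off the value's shape is what makes codex opt-in: a user who never writes a colon never leaves the claude engine.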
 
 ## 5a. Execution: Agent() Approach (default) — "Smart Mode"
 
@@ -305,7 +330,7 @@ Agent(
 )
 ```
 
-If `--worker-
+If `--worker-model` or `--verifier-model` uses codex format (e.g., `spark:high`, `gpt-5.4:high`) (opt-in):
 ```
 # Worker or Verifier (codex engine)
 Bash("codex -m <codex_model> -c model_reasoning_effort=<codex_reasoning> --dangerously-bypass-approvals-and-sandbox <prompt>")
@@ -460,7 +485,7 @@ for iteration in 1..max_iter:
 ⑦ Execute Verifier (see §7a for per-US and §7b for consensus details)
 - Build prompt (scoped to us_id if per-us mode) → log
 - Agent(subagent_type="executor", model=selected, prompt=prompt)
-- If --
+- If --consensus is not off: run second verifier with alternate engine (see §7b)
 - Read verify-verdict.json:
 • pass + specific US → add to verified_us, Worker does next US
 • pass + us_id=ALL or complete → write COMPLETE sentinel, stop
@@ -507,26 +532,44 @@ Worker completes US-001 → signal verify (us_id: "US-001")
 
 ## 7b. Cross-Engine Consensus Verification
 
-
+Controlled by `--consensus off|all|final-only` (default: `off`).
+
+- `off`: single engine verification only
+- `all`: cross-engine consensus on every per-US verify and final ALL verify
+- `final-only`: cross-engine consensus only on final ALL verify
+
+When consensus is active, after the primary verifier runs, a second verifier runs with the alternate engine:
 
 ```
 Worker completes US → signal verify
-→
-→
+→ Primary Verifier runs (checks AC)
+→ Cross Verifier runs (checks AC)
 → Both pass → proceed (next US or COMPLETE)
 → Either fails → combined issues → fix contract → Worker retry
 → Max 6 consensus rounds per US → BLOCKED if still disagreeing
-
-**NO ENGINE PRIORITY:** Claude and Codex have equal weight. If one passes and the other fails, the verdict is FAIL. No engine may be prioritized or dismissed. Infrastructure failure = CLI crash, timeout, or verdict file not generated — NOT a valid verdict with verdict=fail.
 ```
 
+**NO ENGINE PRIORITY:** Both verifiers have equal weight. If one passes and the other fails, the verdict is FAIL. No engine may be prioritized or dismissed. Infrastructure failure = CLI crash, timeout, or verdict file not generated — NOT a valid verdict with verdict=fail.
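
A minimal Python sketch of the consensus rules follows, covering when the second verifier runs and how the two verdicts combine. The `--consensus` values and the equal-weight rule are from the text. The function names, the use of `None` for an infrastructure failure, and the `"infra_failure"` return label are assumptions. Only `pass` and `fail` verdicts are handled; other verdict values are out of scope for this sketch.

```python
from typing import Optional


def consensus_active(mode: str, is_final: bool) -> bool:
    """Decide whether a cross verifier runs for this verification."""
    if mode == "off":
        return False
    if mode == "all":
        return True
    if mode == "final-only":
        return is_final
    raise ValueError(f"unknown --consensus mode: {mode!r}")


def consensus_verdict(primary: Optional[str], cross: Optional[str]) -> str:
    """Combine two verdicts with equal weight.

    None stands for an infrastructure failure (CLI crash, timeout, or missing
    verdict file). Reporting it separately is an assumption; the text only
    says it must not be treated as a valid verdict=fail.
    """
    if primary is None or cross is None:
        return "infra_failure"
    for verdict in (primary, cross):
        if verdict not in ("pass", "fail"):
            raise ValueError(f"unsupported verdict in this sketch: {verdict!r}")
    if primary == "pass" and cross == "pass":
        return "pass"
    return "fail"  # either verifier failing fails the round; neither is dismissed


print(consensus_verdict("pass", "fail"))
```

Keeping infrastructure failures out of the pass/fail path matters because a crashed CLI would otherwise count against the Worker and consume one of the consensus rounds.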
+
+### Consensus Model Routing
+
+| Scenario | Primary verifier | Cross verifier |
+|----------|-----------------|----------------|
+| per-US, primary=claude | `--verifier-model` (sonnet) | `--consensus-model` (gpt-5.4:medium) |
+| per-US, primary=codex | `--verifier-model` | claude opus (fixed) |
+| final, primary=claude | `--final-verifier-model` (opus) | `--final-consensus-model` (gpt-5.4:high) |
+| final, primary=codex | `--final-verifier-model` | claude opus (fixed) |
+
+- Both must pass. No engine priority.
+- spark is not allowed as a consensus cross verifier (100k output limit).
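
The routing table can be read as a small lookup, sketched here in Python. The four scenarios and the spark restriction are from the table and rules above. The function name is hypothetical, and treating the parenthesised values (`gpt-5.4:medium`, `gpt-5.4:high`) as flag defaults is an assumption.

```python
def cross_verifier_model(
    primary_engine: str,
    is_final: bool,
    consensus_model: str = "gpt-5.4:medium",      # assumed default for --consensus-model
    final_consensus_model: str = "gpt-5.4:high",  # assumed default for --final-consensus-model
) -> str:
    """Return the model the cross verifier would use for one scenario."""
    if primary_engine == "codex":
        return "opus"  # claude opus, fixed, for both per-US and final
    if primary_engine != "claude":
        raise ValueError(f"unknown engine: {primary_engine!r}")
    chosen = final_consensus_model if is_final else consensus_model
    if chosen.split(":", 1)[0] == "spark":
        raise ValueError("spark is not allowed as a consensus cross verifier")
    return chosen


print(cross_verifier_model("claude", is_final=False))
```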
+
 **Key rules:**
 - Both claude and codex CLI must be installed
 - Verifiers run sequentially in the same Verifier pane (tmux) or as sequential calls (Agent mode)
 - Verdicts are saved as `verify-verdict-claude.json` and `verify-verdict-codex.json`
 - Combined fix contracts include issues from both engines
 - `status.json` includes `consensus_round`, `claude_verdict`, and `codex_verdict` fields
-- Consensus can be combined with per-US verification (each US gets consensus-verified)
+- Consensus can be combined with per-US verification (`--consensus all`: each US gets consensus-verified)
 
 ## 7½. Fix Loop Protocol
 
@@ -574,7 +617,7 @@ In tmux mode: Leader writes `<slug>-escalation.md` with the report and sets BLOC
 |-----------|---------|
 | context-latest.md unchanged for 3 consecutive iterations | BLOCKED |
 | Same acceptance criterion fails 2 consecutive iterations | Upgrade model, retry once (Agent mode only; tmux: same model retry); if still failing → Architecture Escalation (§7¾) → BLOCKED |
-| `cb_threshold` (default: 6) consecutive **fail** verdicts on `cb_threshold` unique criterion IDs | Upgrade to opus, retry once; if still failing → BLOCKED (adjustable via `--cb-threshold`) |
+| `cb_threshold` (default: 6) consecutive **fail** verdicts on `cb_threshold` unique criterion IDs | Upgrade to opus, retry once; if still failing → BLOCKED (adjustable via `--cb-threshold`; when `--consensus` is not `off`, effective threshold doubles automatically: default 6 → 12) |
 | max_iter reached | TIMEOUT (report to user) |
 
 The Leader tracks `consecutive_failures` in `status.json`:
@@ -582,6 +625,14 @@ The Leader tracks `consecutive_failures` in `status.json`:
 - "Same error" = same acceptance criterion ID in two consecutive **fail** verdicts (`request_info` does not break or contribute to this chain).
 - "Diverse failures" = `cb_threshold` most recent `fail` verdicts each have a unique criterion ID.
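
The circuit-breaker definitions above can be sketched in Python. The doubling rule and the two definitions are from the text. The history shape (one criterion ID per verdict, oldest first), the helper names, and the choice that any verdict other than `fail` or `request_info` ends the consecutive chain are assumptions.

```python
from typing import List, Tuple

Verdict = Tuple[str, str]  # (verdict, criterion_id), oldest first; assumed shape


def effective_threshold(cb_threshold: int = 6, consensus: str = "off") -> int:
    """Double the threshold whenever --consensus is not off."""
    return cb_threshold if consensus == "off" else cb_threshold * 2


def trailing_fail_ids(history: List[Verdict]) -> List[str]:
    """Criterion IDs of the unbroken run of fail verdicts at the end, newest first."""
    ids = []
    for verdict, criterion_id in reversed(history):
        if verdict == "request_info":
            continue  # neither breaks nor contributes to the chain
        if verdict != "fail":
            break     # assumption: any other verdict ends the chain
        ids.append(criterion_id)
    return ids


def same_error(history: List[Verdict]) -> bool:
    """Same criterion ID in the two most recent consecutive fail verdicts."""
    ids = trailing_fail_ids(history)
    return len(ids) >= 2 and ids[0] == ids[1]


def diverse_failures(history: List[Verdict], threshold: int) -> bool:
    """The `threshold` most recent fail verdicts each have a unique criterion ID."""
    recent = trailing_fail_ids(history)[:threshold]
    return len(recent) == threshold and len(set(recent)) == threshold


history = [("fail", "AC-1"), ("request_info", "AC-9"), ("fail", "AC-1")]
print(same_error(history), effective_threshold(consensus="all"))
```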
 
+## 8½. Self-Verification Feedback Loop
+
+When `--with-self-verification` is enabled, the SV report feeds back into the next brainstorm cycle:
+- SV report identifies patterns: which US types fail most, which AC quality issues recur, which model tiers underperform.
+- Next brainstorm SHOULD reference the prior campaign's SV report (if available) to inform US sizing, model selection, and AC quality standards.
+- This creates an iterative improvement loop: campaign → SV report → next brainstorm → better campaign.
+- The loop operates whether the reviewer is human or system — readiness to iterate is what matters.
+
 ## 9. Change Policy
 
 - Changes to the shared workflow → modify this document