@ai-dev-methodologies/rlp-desk 0.4.0 → 0.5.1
- package/README.md +145 -69
- package/docs/plans/cozy-gliding-trinket.md +53 -0
- package/docs/plans/keen-sauteeing-snowflake.md +245 -0
- package/docs/plans/toasty-whistling-diffie-agent-a6814625642e956da.md +201 -0
- package/docs/plans/toasty-whistling-diffie.md +117 -0
- package/docs/prompts/ralplan-codex-review.md +1 -1
- package/install.sh +5 -0
- package/package.json +1 -1
- package/scripts/postinstall.js +5 -0
- package/scripts/uninstall.js +1 -0
- package/src/commands/rlp-desk.md +193 -51
- package/src/governance.md +28 -10
- package/src/model-upgrade-table.md +50 -0
- package/src/scripts/init_ralph_desk.zsh +200 -19
- package/src/scripts/lib_ralph_desk.zsh +838 -0
- package/src/scripts/run_ralph_desk.zsh +821 -608
package/src/commands/rlp-desk.md
CHANGED
|
@@ -41,13 +41,44 @@ Ask about these items one by one (or in small groups):
|
|
|
41
41
|
- Default recommendation: one US per iteration for 3+ stories.
|
|
42
42
|
5. **Verification Commands** — build, test, lint commands
|
|
43
43
|
6. **Completion / Blocked Criteria**
|
|
44
|
-
7. **Worker / Verifier Model** —
|
|
44
|
+
7. **Worker / Verifier Model** — Evaluate PRD complexity using 5 factors (overall = highest factor), then recommend model.
|
|
45
|
+
|
|
46
|
+
**Complexity Evaluation Table**:
|
|
47
|
+
|
|
48
|
+
| Factor | LOW | MEDIUM | HIGH | CRITICAL |
|
|
49
|
+
|--------|-----|--------|------|----------|
|
|
50
|
+
| US count | 1-2 | 3-5 | 6-10 | 10+ |
|
|
51
|
+
| File change scope | single | 2-5 files | 6+ files | cross-repo |
|
|
52
|
+
| Logic complexity | simple | conditionals | algorithms | security |
|
|
53
|
+
| External dependencies | none | 1-2 | 3+ | distributed |
|
|
54
|
+
| Existing code impact | new only | modify | refactor | architecture |
|
|
55
|
+
|
|
56
|
+
**Model mapping** (Worker / Verifier):
|
|
57
|
+
- LOW → haiku / sonnet
|
|
58
|
+
- MEDIUM → sonnet / opus
|
|
59
|
+
- HIGH → opus / opus
|
|
60
|
+
- CRITICAL → opus / opus + require human review
|
|
61
|
+
|
|
62
|
+
Present complexity score with evidence to the user, e.g.: "I rate this MEDIUM because: US count=4 (MEDIUM), file scope=2 (MEDIUM), logic=conditionals (MEDIUM), deps=none (LOW), impact=modify (MEDIUM). Highest=MEDIUM → I suggest Worker: sonnet, Verifier: opus."
|
|
63
|
+
|
|
45
64
|
8. **Engine & Model** — For each role (Worker, Verifier):
|
|
46
65
|
- Engine: claude (default) or codex
|
|
47
66
|
- If claude: suggest model (haiku/sonnet/opus) based on task complexity
|
|
48
67
|
- If codex: suggest model (default: gpt-5.4) and reasoning effort (low/medium/high)
|
|
49
|
-
|
|
50
|
-
|
|
68
|
+
|
|
69
|
+
**Codex Detection** — check if codex CLI is installed (`command -v codex`):
|
|
70
|
+
|
|
71
|
+
**If codex IS installed** — recommend cross-engine Worker:
|
|
72
|
+
- Suggest: `--worker-model gpt-5.4:high --verify-consensus` (cross-engine + consensus)
|
|
73
|
+
- Alternative: `--worker-model gpt-5.3-codex-spark:high` (spark preset — note: 100k output token limit per request, best for smaller scope PRDs)
|
|
74
|
+
- Say: "Codex is installed. I recommend it as Worker for cost savings (codex tokens are cheaper than claude tokens for bulk iteration) and cross-engine blind-spot coverage (claude Verifier catches issues codex Worker misses)."
|
|
75
|
+
|
|
76
|
+
**If codex is NOT installed** — recommend claude-only + install suggestion:
|
|
77
|
+
- Default to a claude-only Worker (sonnet).
|
|
78
|
+
- Say: "Codex is not installed. Defaulting to claude-only Worker. Note: without a second engine, your Verifier shares the same perspective as the Worker — there is a risk of blind spots where both Worker and Verifier miss the same issue. To unlock cross-engine coverage: `npm install -g @openai/codex`"
|
|
79
|
+
|
|
80
|
+
AI should recommend: "For this task complexity, I suggest Worker: sonnet, Verifier: opus"
|
|
81
|
+
If codex selected: "For codex Worker, I suggest gpt-5.4 with high reasoning"
|
|
51
82
|
9. **Verify Mode** — per-us (default) or batch. Ask: "Verify after each user story (per-us, recommended) or only after all stories are done (batch)?" Default recommendation: per-us for 2+ stories.
|
|
52
83
|
10. **Verify Consensus** — Ask: "Use cross-engine consensus verification? (Both claude and codex verify independently, both must pass.) Requires codex CLI." Default: no.
|
|
53
84
|
11. **Consensus Scope** — If consensus enabled, ask: "Consensus on every verify (all, default) or only on final verify (final-only)?" Default: all.
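The complexity rule in item 7 (overall = highest factor, then map to a Worker / Verifier pair) can be sketched as follows. This is a minimal illustration, not the package's implementation: the `Level` enum, the `recommend` function, and the factor key names are hypothetical, while the model pairs are copied from the mapping in item 7.

```python
from enum import IntEnum


class Level(IntEnum):
    """Complexity levels, ordered so that max() picks the highest factor."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


# (Worker, Verifier) pairs taken from the "Model mapping" list in item 7.
MODEL_MAP = {
    Level.LOW: ("haiku", "sonnet"),
    Level.MEDIUM: ("sonnet", "opus"),
    Level.HIGH: ("opus", "opus"),
    Level.CRITICAL: ("opus", "opus"),
}


def recommend(factors: dict) -> dict:
    """Hypothetical helper: overall complexity is the highest of the five factors."""
    overall = max(factors.values())
    worker, verifier = MODEL_MAP[overall]
    return {
        "overall": overall.name,
        "worker": worker,
        "verifier": verifier,
        # CRITICAL additionally requires human review per the mapping.
        "human_review": overall == Level.CRITICAL,
    }


# The worked example from item 7: four MEDIUM factors and one LOW factor.
example = {
    "us_count": Level.MEDIUM,
    "file_scope": Level.MEDIUM,
    "logic": Level.MEDIUM,
    "external_deps": Level.LOW,
    "existing_impact": Level.MEDIUM,
}
print(recommend(example))
```

Using an ordered enum keeps the "highest factor wins" rule to a single `max()` call, so a single CRITICAL factor is enough to trigger the human-review flag.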
|
|
@@ -83,33 +114,63 @@ If brainstorm was done, auto-fill PRD and test-spec with the results.
|
|
|
83
114
|
Tell the user:
|
|
84
115
|
1. The scaffold has been created — list the generated files
|
|
85
116
|
2. Ask them to review/edit the PRD and test-spec if needed
|
|
86
|
-
3. Present run options with explanations and ONE recommendation. The user MUST copy and paste the command themselves
|
|
117
|
+
3. Present run options with explanations and ONE recommendation. The user MUST copy and paste the command themselves.
|
|
87
118
|
|
|
88
|
-
|
|
89
|
-
Available run commands (copy the one you want):
|
|
119
|
+
Check if codex CLI is installed: run `command -v codex` in shell or check if the binary exists.
|
|
90
120
|
|
|
91
|
-
|
|
92
|
-
/rlp-desk run <slug> --debug
|
|
121
|
+
**If codex IS installed** — show cross-engine presets first:
|
|
93
122
|
|
|
94
|
-
|
|
95
|
-
|
|
123
|
+
```
|
|
124
|
+
Available run commands (copy the one you want):
|
|
96
125
|
|
|
97
|
-
#
|
|
98
|
-
/rlp-desk run <slug> --
|
|
126
|
+
# Recommended: cross-engine + final-consensus (cost savings + blind-spot coverage):
|
|
127
|
+
/rlp-desk run <actual-slug> --worker-model gpt-5.4:high --final-consensus --debug
|
|
99
128
|
|
|
100
|
-
#
|
|
101
|
-
/rlp-desk run <slug> --
|
|
129
|
+
# Spark Pro preset (fast codex worker, lower cost):
|
|
130
|
+
/rlp-desk run <actual-slug> --worker-model gpt-5.3-codex-spark:high --debug
|
|
102
131
|
|
|
103
|
-
#
|
|
104
|
-
|
|
105
|
-
|
|
106
|
-
#
|
|
107
|
-
|
|
108
|
-
|
|
109
|
-
#
|
|
110
|
-
# --
|
|
111
|
-
# --
|
|
112
|
-
|
|
132
|
+
# Claude-only:
|
|
133
|
+
/rlp-desk run <actual-slug> --debug
|
|
134
|
+
|
|
135
|
+
# Basic agent:
|
|
136
|
+
/rlp-desk run <actual-slug>
|
|
137
|
+
|
|
138
|
+
# Full options reference:
|
|
139
|
+
# --mode agent|tmux (default: agent)
|
|
140
|
+
# --worker-model MODEL haiku|sonnet|opus or gpt-5.4:low|medium|high (default: sonnet)
|
|
141
|
+
# --verifier-model MODEL haiku|sonnet|opus (default: opus)
|
|
142
|
+
# --verify-consensus both claude+codex must pass
|
|
143
|
+
# --verify-mode per-us|batch (default: per-us)
|
|
144
|
+
# --max-iter N (default: 100)
|
|
145
|
+
# --debug enable debug logging
|
|
146
|
+
# --with-self-verification post-campaign analysis report
|
|
147
|
+
```
|
|
148
|
+
|
|
149
|
+
**If codex is NOT installed** — show claude-only presets + install recommendation:
|
|
150
|
+
|
|
151
|
+
```
|
|
152
|
+
Available run commands (copy the one you want):
|
|
153
|
+
|
|
154
|
+
# Recommended: tmux mode + claude-only (real-time visibility):
|
|
155
|
+
/rlp-desk run <actual-slug> --mode tmux --debug
|
|
156
|
+
|
|
157
|
+
# Agent mode:
|
|
158
|
+
/rlp-desk run <actual-slug> --debug
|
|
159
|
+
|
|
160
|
+
# Install codex for cost savings + cross-engine blind-spot coverage:
|
|
161
|
+
npm install -g @openai/codex
|
|
162
|
+
|
|
163
|
+
# Full options reference:
|
|
164
|
+
# --mode agent|tmux (default: agent)
|
|
165
|
+
# --worker-model MODEL haiku|sonnet|opus (default: sonnet)
|
|
166
|
+
# --verifier-model MODEL haiku|sonnet|opus (default: opus)
|
|
167
|
+
# --verify-mode per-us|batch (default: per-us)
|
|
168
|
+
# --max-iter N (default: 100)
|
|
169
|
+
# --debug enable debug logging
|
|
170
|
+
# --with-self-verification post-campaign analysis report
|
|
171
|
+
```
|
|
172
|
+
|
|
173
|
+
Replace `<actual-slug>` with the real slug from this init (e.g. `auth-refactor`).
|
|
113
174
|
|
|
114
175
|
**CRITICAL: Do NOT offer to run for the user. Do NOT ask "shall I run?" or offer to execute. The user MUST type the run command themselves. Just present the options, recommend one, and STOP.**
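As a rough sketch of the branching above, the preset list can be chosen from a single codex availability check. The function names `codex_available` and `run_presets` are hypothetical; the command strings are copied from the two preset blocks above, and the sketch only prints commands, it never runs them.

```python
import shutil


def codex_available() -> bool:
    """Python equivalent of `command -v codex`: True if the binary is on PATH."""
    return shutil.which("codex") is not None


def run_presets(slug: str, codex_installed: bool) -> list:
    """Hypothetical helper: return run commands, recommended one first."""
    if codex_installed:
        return [
            f"/rlp-desk run {slug} --worker-model gpt-5.4:high --final-consensus --debug",
            f"/rlp-desk run {slug} --worker-model gpt-5.3-codex-spark:high --debug",
            f"/rlp-desk run {slug} --debug",
            f"/rlp-desk run {slug}",
        ]
    return [
        f"/rlp-desk run {slug} --mode tmux --debug",
        f"/rlp-desk run {slug} --debug",
    ]


if __name__ == "__main__":
    # Present the options only; the user copies and runs the command themselves.
    for command in run_presets("auth-refactor", codex_available()):
        print(command)
```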
|
|
115
176
|
|
|
@@ -138,10 +199,21 @@ Options (parse from `$ARGUMENTS`):
|
|
|
138
199
|
- `all`: consensus runs on every verify (current behavior)
|
|
139
200
|
- `final-only`: consensus only on final ALL verify
|
|
140
201
|
- `--cb-threshold N` — circuit breaker threshold: consecutive failures before BLOCKED (default: 3). When `--verify-consensus` is active, effective threshold is automatically doubled (e.g., default becomes 6).
|
|
202
|
+
- `--consensus-fail-fast` — skip second verifier if first verifier fails (saves time/tokens in consensus mode)
|
|
141
203
|
- `--iter-timeout N` — per-iteration timeout in seconds (default: 600). Enforced in tmux mode only. Agent mode: not enforced (Agent() has no timeout API).
|
|
142
|
-
- `--debug` — enable debug logging (writes to
|
|
204
|
+
- `--debug` — enable debug logging (writes to ~/.claude/ralph-desk/analytics/<slug>/debug.log)
|
|
143
205
|
- `--with-self-verification` — enable campaign-level self-verification analysis. After COMPLETE, Leader analyzes all iteration records (done-claims + verdicts) and generates a campaign self-verification summary with patterns and recommendations for next planning cycle. (Note: execution_steps and reasoning are ALWAYS recorded per governance §1f — this flag adds post-campaign analysis.)
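The `--cb-threshold` doubling rule above amounts to a one-line calculation. The sketch below is illustrative only: the function names are hypothetical, and treating "consecutive failures before BLOCKED" as a `>=` comparison is an assumption about how the runner counts.

```python
def effective_cb_threshold(cb_threshold: int = 3, verify_consensus: bool = False) -> int:
    """Double the circuit-breaker threshold when consensus verification is active."""
    return cb_threshold * 2 if verify_consensus else cb_threshold


def is_blocked(consecutive_failures: int, cb_threshold: int = 3,
               verify_consensus: bool = False) -> bool:
    """Assumption: the campaign becomes BLOCKED once failures reach the threshold."""
    return consecutive_failures >= effective_cb_threshold(cb_threshold, verify_consensus)


print(effective_cb_threshold())                       # default threshold
print(effective_cb_threshold(verify_consensus=True))  # doubled under consensus
```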
|
|
144
206
|
|
|
207
|
+
### Analytics Directory (`~/.claude/ralph-desk/analytics/<slug>/`)
|
|
208
|
+
When `--debug` or `--with-self-verification` is active, analytics data is written to a user-level directory for cross-project aggregation. Contents:
|
|
209
|
+
- `metadata.json` — campaign metadata: slug, project_root, campaign_status, start_time, end_time
|
|
210
|
+
- `debug.log` — debug output (versioned: `debug-v{N}.log` on re-execution)
|
|
211
|
+
- `campaign.jsonl` — per-iteration structured data (versioned: `campaign-v{N}.jsonl` on re-execution). Schema: iter, us_id, worker_model, worker_engine, verifier_engine, claude_verdict, codex_verdict, consensus, duration_worker_s, duration_verifier_s, project_root, slug, timestamp
|
|
212
|
+
- `self-verification-data.json` — cumulative SV records (agent-mode only, when `--with-self-verification`)
|
|
213
|
+
- `self-verification-report-NNN.md` — versioned SV reports (when `--with-self-verification`)
|
|
214
|
+
|
|
215
|
+
Cross-project aggregation: scan `~/.claude/ralph-desk/analytics/` and read each slug's `metadata.json` to discover project_root, campaign_status, and timestamps. Slug directories use `<slug>--<root_hash>` format to prevent collision across projects.
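The aggregation step described above (scan the analytics root, read each directory's `metadata.json`) could look roughly like this. It is a sketch under stated assumptions: `scan_analytics` is a hypothetical name, the metadata keys are the ones listed above, and how `<root_hash>` is derived is not specified here, so the directory name is reported as-is rather than parsed.

```python
import json
from pathlib import Path


def scan_analytics(root: Path) -> list:
    """Collect campaign metadata from every <dir>/metadata.json under root."""
    if not root.is_dir():
        return []
    campaigns = []
    for meta_path in sorted(root.glob("*/metadata.json")):
        try:
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed metadata files
        campaigns.append({
            "dir": meta_path.parent.name,  # e.g. "<slug>--<root_hash>"
            "slug": meta.get("slug"),
            "project_root": meta.get("project_root"),
            "campaign_status": meta.get("campaign_status"),
            "start_time": meta.get("start_time"),
            "end_time": meta.get("end_time"),
        })
    return campaigns


# Usage: scan_analytics(Path.home() / ".claude" / "ralph-desk" / "analytics")
```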
|
|
216
|
+
|
|
145
217
|
### Mode Selection
|
|
146
218
|
|
|
147
219
|
Parse the `--mode` flag. If absent or `agent`, use the Agent() path below. If `tmux`, use the Tmux path.
|
|
@@ -181,22 +253,31 @@ WITH_SELF_VERIFICATION=<1 if --with-self-verification, else 0> \
|
|
|
181
253
|
|
|
182
254
|
**IMPORTANT RULES:**
|
|
183
255
|
- Tmux mode requires the user to already be inside a tmux session. If the runner script rejects because $TMUX is not set, do NOT try to create a tmux session yourself. Tell the user: "Start tmux first, then retry."
|
|
184
|
-
-
|
|
256
|
+
- MUST launch the runner with `run_in_background: true` so `/rlp-desk` returns control immediately while preserving live tmux visibility.
|
|
257
|
+
- Running in the background keeps the launch command visible and the pane layout stable for status checks and the completion flow.
|
|
185
258
|
- Do NOT kill panes after completion. Panes stay alive for inspection. User cleans up with `/rlp-desk clean <slug> --kill-session`.
|
|
186
|
-
- `--with-self-verification` is accepted in tmux mode
|
|
259
|
+
- `--with-self-verification` is accepted in tmux mode. After campaign completion, `run_ralph_desk.zsh` spawns the `claude` CLI to generate the SV report from campaign artifacts (done-claims, verify-verdicts, campaign-report). SV reports are written to `~/.claude/ralph-desk/analytics/<slug>/`. Requires the `claude` CLI in PATH; if it is not found, an error is appended to the campaign report.
|
|
260
|
+
|
|
261
|
+
**tmux UX model (5 items):**
|
|
262
|
+
- The session returns immediately after launch (`run_in_background: true`) so the command returns control to the parent CLI.
|
|
263
|
+
- Worker/Verifier panes remain visible to the user during execution.
|
|
264
|
+
- Users check progress with the **status command**: `/rlp-desk status <slug>`.
|
|
265
|
+
- On completion, the command returns a completion notification before the loop ends.
|
|
266
|
+
- Agent mode remains unchanged, and no tmux-specific behavior is mixed into Agent mode.
|
|
187
267
|
|
|
188
268
|
#### Agent Mode (`--mode agent` or default)
|
|
189
269
|
|
|
190
270
|
### Preparation
|
|
191
271
|
1. Validate scaffold: `.claude/ralph-desk/prompts/<slug>.worker.prompt.md` etc.
|
|
192
|
-
2.
|
|
193
|
-
3.
|
|
194
|
-
4.
|
|
195
|
-
5.
|
|
272
|
+
2. **Codex CLI pre-validation**: If `--verify-consensus` is enabled OR `--worker-engine codex` / `--verifier-engine codex` is set, check that the `codex` CLI exists in PATH. If it is not found → STOP immediately, print install instructions (`npm install -g @openai/codex`), and do not start the loop.
|
|
273
|
+
3. Check sentinels (complete/blocked). Found → tell user `/rlp-desk clean <slug>`.
|
|
274
|
+
4. Clean previous `done-claim.json`, `verify-verdict.json`.
|
|
275
|
+
5. **Always**: write baseline log entry to `.claude/ralph-desk/logs/<slug>/baseline.log`: `[timestamp] iter=0 phase=start slug=<slug> worker_model=<model> verifier_model=<model>`. `baseline.log` captures one line per iteration for a lightweight post-mortem (always on, no flag needed).
|
|
276
|
+
6. If `--debug`: also create/clear `~/.claude/ralph-desk/analytics/<slug>/debug.log`. Define a helper: to "debug_log" means append a timestamped line to this file via `Bash("echo \"[$(date '+%Y-%m-%d %H:%M:%S')] $msg\" >> ~/.claude/ralph-desk/analytics/<slug>/debug.log")`. When `--debug` is active, debug.log contains all baseline.log fields plus detailed phase logs.
|
|
196
277
|
- **4-category log system**: all debug_log entries use exactly one of: `[GOV]` (governance checks: IL enforcement, CB triggers, scope lock, verdict evaluation), `[DECIDE]` (leader decisions: model selection, fix contracts, escalation), `[OPTION]` (configuration snapshot at loop start: thresholds, modes, models), `[FLOW]` (execution progress: worker/verifier dispatch, signal reads, phase transitions)
|
|
197
278
|
- **Re-execution versioning**: If `debug.log` already exists at `--debug` start, rename it to `debug-v{N}.log` (N = next available integer ≥ 1) before creating a fresh `debug.log`.
|
|
198
279
|
- **baseline.log lifecycle**: baseline.log is deleted on re-execution (when `init --mode improve` or `init --mode fresh` is run).
|
|
199
|
-
|
|
280
|
+
7. Capture baseline commit: `Bash("git rev-parse HEAD 2>/dev/null || echo none")` → store as `BASELINE_COMMIT`. Include in the first `status.json` write as `baseline_commit` field.
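The re-execution versioning rule in step 6 (rename an existing `debug.log` to `debug-v{N}.log` before starting fresh) can be sketched as below. The helper name `rotate_log` is hypothetical, and reading "next available integer ≥ 1" as the smallest unused N is an assumption.

```python
from pathlib import Path
from typing import Optional


def rotate_log(log_path: Path) -> Optional[Path]:
    """Rename log_path to <stem>-v{N}<suffix>, choosing the smallest unused N >= 1.

    Returns the archived path, or None when there was nothing to rotate.
    """
    if not log_path.exists():
        return None
    n = 1
    while True:
        candidate = log_path.with_name(f"{log_path.stem}-v{n}{log_path.suffix}")
        if not candidate.exists():
            break
        n += 1
    log_path.rename(candidate)
    return candidate


# Usage: rotate_log(Path("debug.log")) archives the old file; the caller then
# creates a fresh debug.log for the new run.
```

The same helper would also cover `campaign.jsonl`, which follows the same `-v{N}` naming.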
|
|
200
281
|
|
|
201
282
|
### Leader Loop
|
|
202
283
|
|
|
@@ -241,6 +322,11 @@ rm -f .claude/ralph-desk/memos/<slug>-verify-verdict.json
|
|
|
241
322
|
**④ Build worker prompt (Prompt Assembly Protocol)**
|
|
242
323
|
1. Capture `WORKING_DIR` once: use `$PWD` from when `/rlp-desk run` was invoked. Store for all prompt construction.
|
|
243
324
|
2. Read `.claude/ralph-desk/prompts/<slug>.worker.prompt.md` — use its content **verbatim**. Do NOT rewrite, paraphrase, or regenerate paths. The prompt file contains correct absolute paths from init.
|
|
325
|
+
2a. **Per-US PRD injection** (when targeting a specific `us_id`, not "ALL"):
|
|
326
|
+
- Check if `.claude/ralph-desk/plans/prd-<slug>-{us_id}.md` exists (created by init split)
|
|
327
|
+
- If yes: in the assembled prompt text, replace the full PRD reference (`prd-<slug>.md`) with the per-US file path (`prd-<slug>-{us_id}.md`) — so Worker reads only the relevant US section
|
|
328
|
+
- If no per-US file: fall back to full PRD (`prd-<slug>.md`) with no change needed
|
|
329
|
+
- Note: this absolute-path substitution is permitted — only absolute→relative rewrites are forbidden.
|
|
244
330
|
3. Prepend meta comment: `## WORKING_DIR: {absolute path}` — Worker must use this as its working directory.
|
|
245
331
|
4. Append iteration number + memory contract.
|
|
246
332
|
5. Write to `.claude/ralph-desk/logs/<slug>/iter-NNN.worker-prompt.md` (audit trail).
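Step 2a can be read as a guarded string substitution on the assembled prompt. The sketch below assumes the prompt references the full PRD by its path inside a known plans directory; `inject_per_us_prd` and its parameters are hypothetical names, not part of the package.

```python
from pathlib import Path


def inject_per_us_prd(prompt: str, plans_dir: Path, slug: str, us_id: str) -> str:
    """Swap the full-PRD path for the per-US PRD path when that file exists."""
    if us_id == "ALL":
        return prompt  # final verify targets the whole PRD
    per_us = plans_dir / f"prd-{slug}-{us_id}.md"
    if not per_us.is_file():
        return prompt  # no split file: fall back to the full PRD unchanged
    full = plans_dir / f"prd-{slug}.md"
    # Path-for-path substitution; the rest of the prompt stays verbatim.
    return prompt.replace(str(full), str(per_us))
```

Because the substitution only ever replaces one full path with another, it stays within the "use the prompt verbatim" rule from step 2.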
|
|
@@ -298,7 +384,8 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
|
|
|
298
384
|
- Add that US to `verified_us` in status.json
|
|
299
385
|
- If more US remain → Worker does next US → verify → ...
|
|
300
386
|
- If all US individually passed → signal final full verify (us_id=ALL)
|
|
301
|
-
-
|
|
387
|
+
- **Sequential final verify** (timeout prevention): Instead of one big ALL verify, loop through each US individually with scoped verifier. After all per-US pass, run the project's test suite as a cross-US integration check. Only COMPLETE if both per-US checks and integration check pass.
|
|
388
|
+
- After sequential final verify passes → COMPLETE
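The sequential final verify described above can be sketched as a loop with an integration gate at the end. The callables `verify_us` and `run_test_suite` stand in for the scoped verifier and the project's test command, and the returned status strings are hypothetical labels chosen for this illustration.

```python
from typing import Callable, Iterable


def sequential_final_verify(
    us_ids: Iterable[str],
    verify_us: Callable[[str], bool],
    run_test_suite: Callable[[], bool],
) -> str:
    """Verify each US with a scoped check, then run a cross-US integration check."""
    for us_id in us_ids:
        if not verify_us(us_id):
            return f"FAIL:{us_id}"  # stop at the first failing story
    if not run_test_suite():
        return "FAIL:integration"
    return "COMPLETE"  # only when per-US checks and the integration check pass


# Example with stubbed checks: every story passes, and so does the suite.
print(sequential_final_verify(["US-001", "US-002"], lambda us: True, lambda: True))
```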
|
|
302
389
|
|
|
303
390
|
**Batch mode** (`--verify-mode batch`):
|
|
304
391
|
- Legacy behavior: verify only when Worker signals all work is done
|
|
@@ -374,6 +461,7 @@ After reading the verdict, archive to `logs/<slug>/`:
|
|
|
374
461
|
- Write `status.json`
|
|
375
462
|
- Report via tool call: `Bash("echo 'Iter N | US-NNN | verdict | model | next_action'")` — NEVER plain text. This keeps the turn alive for the next iteration.
|
|
376
463
|
- **Always**: append to baseline.log: `[timestamp] iter=N verdict=<pass|fail|continue> us=<us_id> model=<worker_model>`
|
|
464
|
+
- **Always**: append JSONL to `~/.claude/ralph-desk/analytics/<slug>/campaign.jsonl`: `{"iter":N,"us_id":"US-NNN","verdict":"pass|fail","worker_model":"...","worker_engine":"...","verifier_model":"...","verifier_engine":"...","duration_worker_s":N,"duration_verifier_s":N,"timestamp":"ISO8601"}`
|
|
377
465
|
- If `--debug`: debug_log `[FLOW] iter=N phase=result status=<result> consecutive_failures=<N> verified_us=<list>`
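Appending one JSON object per line to `campaign.jsonl`, as the list above requires, might look like the following. `append_campaign_record` is a hypothetical helper; the field names passed in the usage comment come from the JSONL example above, and the UTC ISO 8601 timestamp is one reasonable reading of `"timestamp":"ISO8601"`.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_campaign_record(path: Path, **fields) -> dict:
    """Append one iteration record as a single JSON line and return it."""
    record = {**fields, "timestamp": datetime.now(timezone.utc).isoformat()}
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


# Usage (values are illustrative):
# append_campaign_record(path, iter=3, us_id="US-002", verdict="pass",
#                        worker_model="sonnet", worker_engine="claude",
#                        verifier_model="opus", verifier_engine="claude",
#                        duration_worker_s=120, duration_verifier_s=45)
```

Opening the file in append mode means an interrupted run leaves earlier records intact, which suits a log that is read back line by line.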
|
|
378
466
|
|
|
379
467
|
At loop end (COMPLETE, BLOCKED, or TIMEOUT):
|
|
@@ -384,8 +472,8 @@ At loop end (COMPLETE, BLOCKED, or TIMEOUT):
|
|
|
384
472
|
After the loop ends, the Leader performs post-campaign analysis:
|
|
385
473
|
|
|
386
474
|
1. **Collect data**: Read all archived `iter-NNN.result.md`, done-claim.json (with execution_steps), and verify-verdict.json (with reasoning) from `logs/<slug>/`
|
|
387
|
-
2. **Write cumulative data**:
|
|
388
|
-
3. **Generate versioned report**:
|
|
475
|
+
2. **Write cumulative data**: `~/.claude/ralph-desk/analytics/<slug>/self-verification-data.json` — normalized iteration records (agent-mode only artifact)
|
|
476
|
+
3. **Generate versioned report**: `~/.claude/ralph-desk/analytics/<slug>/self-verification-report-NNN.md` (NNN = auto-increment from existing reports)
|
|
389
477
|
4. **Report to user**: Display the full report content
|
|
390
478
|
|
|
391
479
|
Report template (10 sections):
|
|
@@ -483,7 +571,22 @@ When `--verify-consensus` is enabled, also track in `status.json`:
|
|
|
483
571
|
---
|
|
484
572
|
|
|
485
573
|
## `status <slug>`
|
|
486
|
-
Read `.claude/ralph-desk/logs/<slug>/status.json` and display
|
|
574
|
+
Read `.claude/ralph-desk/logs/<slug>/runtime/status.json` and display a detailed report:
|
|
575
|
+
|
|
576
|
+
```
|
|
577
|
+
Campaign: <slug>
|
|
578
|
+
Iteration: <iteration> / <max_iter>
|
|
579
|
+
Phase: <phase> | Last Result: <last_result>
|
|
580
|
+
Worker Model: <worker_model> (<worker_engine>) | Verifier Model: <verifier_model> (<verifier_engine>)
|
|
581
|
+
Verify Mode: <verify_mode> | Consensus: <verify_consensus>
|
|
582
|
+
Consecutive Failures: <consecutive_failures>
|
|
583
|
+
Verified US: <verified_us array, comma-separated>
|
|
584
|
+
Updated: <updated_at_utc> (elapsed: now - updated_at)
|
|
585
|
+
```
|
|
586
|
+
|
|
587
|
+
If `status.json` does not exist, display "No active campaign for <slug>."
|
|
588
|
+
If the campaign has a `complete` or `blocked` sentinel, show that status prominently.
|
|
589
|
+
Read the last `verify-verdict.json` to show the most recent verdict summary and any failure issues.
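The report layout above can be produced from a parsed `status.json` roughly as follows. The key names are taken from the placeholders in the template; `format_status` is a hypothetical helper, and the elapsed-time suffix is left out because the timestamp format of `updated_at_utc` is not stated here.

```python
from typing import Optional


def format_status(slug: str, status: Optional[dict]) -> str:
    """Render the status report, or the fallback message when status.json is missing."""
    if status is None:
        return f"No active campaign for {slug}."
    verified = ", ".join(status.get("verified_us", []))
    lines = [
        f"Campaign: {slug}",
        f"Iteration: {status.get('iteration')} / {status.get('max_iter')}",
        f"Phase: {status.get('phase')} | Last Result: {status.get('last_result')}",
        f"Worker Model: {status.get('worker_model')} ({status.get('worker_engine')}) | "
        f"Verifier Model: {status.get('verifier_model')} ({status.get('verifier_engine')})",
        f"Verify Mode: {status.get('verify_mode')} | Consensus: {status.get('verify_consensus')}",
        f"Consecutive Failures: {status.get('consecutive_failures')}",
        f"Verified US: {verified}",
        f"Updated: {status.get('updated_at_utc')}",
    ]
    return "\n".join(lines)
```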
|
|
487
590
|
|
|
488
591
|
## `logs <slug> [N]`
|
|
489
592
|
- No N: show latest `iter-*.worker-prompt.md` summary
|
|
@@ -497,25 +600,64 @@ Remove:
|
|
|
497
600
|
- `.claude/ralph-desk/memos/<slug>-verify-verdict.json`
|
|
498
601
|
- `.claude/ralph-desk/memos/<slug>-iter-signal.json`
|
|
499
602
|
- `.claude/ralph-desk/logs/<slug>/circuit-breaker.json`
|
|
500
|
-
- `.claude/ralph-desk/logs/<slug>/session-config.json`
|
|
501
|
-
- `.claude/ralph-desk/logs/<slug>/worker-heartbeat.json`
|
|
502
|
-
- `.claude/ralph-desk/logs/<slug>/verifier-heartbeat.json`
|
|
603
|
+
- `.claude/ralph-desk/logs/<slug>/runtime/session-config.json`
|
|
604
|
+
- `.claude/ralph-desk/logs/<slug>/runtime/worker-heartbeat.json`
|
|
605
|
+
- `.claude/ralph-desk/logs/<slug>/runtime/verifier-heartbeat.json`
|
|
503
606
|
- `.claude/ralph-desk/memos/<slug>-escalation.md`
|
|
504
|
-
Note: `
|
|
607
|
+
Note: `campaign-report.md`, `campaign-report-v{N}.md`, `iter-NNN-done-claim.json`, and `iter-NNN-verify-verdict.json` are intentionally preserved across clean for historical comparison. Analytics files (`debug.log`, `campaign.jsonl`, `self-verification-data.json`, `self-verification-report-NNN.md`) at `~/.claude/ralph-desk/analytics/<slug>/` are NOT affected by project-level clean.
|
|
505
608
|
|
|
506
|
-
If `--kill-session` is passed, clean up
|
|
609
|
+
If `--kill-session` is passed, clean up Worker/Verifier tmux panes using session-config.json:
|
|
507
610
|
```bash
|
|
508
|
-
#
|
|
509
|
-
|
|
611
|
+
# Read pane IDs from session-config.json (safe — targets only Worker/Verifier panes)
|
|
612
|
+
SESSION_CONFIG=".claude/ralph-desk/logs/<slug>/runtime/session-config.json"
|
|
613
|
+
if [ -f "$SESSION_CONFIG" ] && command -v jq &>/dev/null; then
|
|
614
|
+
WORKER_PANE=$(jq -r '.panes.worker // empty' "$SESSION_CONFIG")
|
|
615
|
+
VERIFIER_PANE=$(jq -r '.panes.verifier // empty' "$SESSION_CONFIG")
|
|
616
|
+
|
|
617
|
+
for pane_id in "$WORKER_PANE" "$VERIFIER_PANE"; do
|
|
618
|
+
if [ -n "$pane_id" ]; then
|
|
619
|
+
tmux send-keys -t "$pane_id" C-c 2>/dev/null
|
|
620
|
+
tmux send-keys -t "$pane_id" "/exit" Enter 2>/dev/null
|
|
621
|
+
fi
|
|
622
|
+
done
|
|
623
|
+
sleep 2
|
|
624
|
+
for pane_id in "$WORKER_PANE" "$VERIFIER_PANE"; do
|
|
625
|
+
if [ -n "$pane_id" ]; then
|
|
626
|
+
tmux kill-pane -t "$pane_id" 2>/dev/null
|
|
627
|
+
fi
|
|
628
|
+
done
|
|
629
|
+
else
|
|
630
|
+
echo "WARNING: session-config.json not found or jq not installed."
|
|
631
|
+
echo "Cannot safely identify Worker/Verifier panes. Kill them manually."
|
|
632
|
+
fi
|
|
633
|
+
```
|
|
634
|
+
**CRITICAL: NEVER use `grep -i 'claude\|codex'` to find panes to kill.** The user's own Claude Code session matches those patterns. Always use the specific pane IDs from session-config.json.
|
|
635
|
+
|
|
636
|
+
## `analytics [slug]`
|
|
637
|
+
|
|
638
|
+
Cross-project analytics dashboard. Scans `~/.claude/ralph-desk/analytics/` for all campaign data.
|
|
639
|
+
|
|
640
|
+
- No slug: show summary across all projects (total campaigns, pass/fail rate, average iterations, total cost)
|
|
641
|
+
- With slug: show detailed analytics for that project (per-US pass rate, model upgrade frequency, iteration distribution, cost per US)
|
|
510
642
|
|
|
511
|
-
|
|
512
|
-
|
|
513
|
-
|
|
514
|
-
|
|
515
|
-
done
|
|
643
|
+
Data sources:
|
|
644
|
+
- `campaign.jsonl` — per-iteration structured records
|
|
645
|
+
- `metadata.json` — project root, campaign status, timestamps
|
|
646
|
+
- `self-verification-data.json` — campaign-level quality metrics
|
|
516
647
|
|
|
517
|
-
|
|
518
|
-
|
|
648
|
+
## `resume <slug>`
|
|
649
|
+
|
|
650
|
+
Resume a previously interrupted campaign. Equivalent to `run <slug>` but explicitly restores state:
|
|
651
|
+
|
|
652
|
+
1. Read `.claude/ralph-desk/logs/<slug>/runtime/status.json` for `verified_us`, `iteration`, `consecutive_failures`
|
|
653
|
+
2. Read `.claude/ralph-desk/memos/<slug>-memory.md` for completed stories and next iteration contract
|
|
654
|
+
3. Check for sentinels (`complete.md`, `blocked.md`) — if present, inform user and stop
|
|
655
|
+
4. If no sentinels, invoke `run <slug>` with the same options from the previous session (stored in status.json fields: `worker_model`, `verifier_model`, `verify_mode`, `verify_consensus`)
|
|
656
|
+
5. The runner automatically restores `verified_us` from memory or status.json on startup
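The resume steps above reduce to a small decision: stop if a sentinel exists, otherwise re-run with the stored options. The sketch below models that decision as a pure function; `resume_plan`, its return shape, and the way sentinels are passed in are all hypothetical, while the option keys are the `status.json` fields named in step 4.

```python
RESTORED_OPTIONS = ("worker_model", "verifier_model", "verify_mode", "verify_consensus")


def resume_plan(slug: str, status: dict, sentinels: set) -> dict:
    """Decide what `resume <slug>` should do from status.json and sentinel names."""
    if sentinels:
        # A complete/blocked sentinel means there is nothing to resume.
        return {"action": "stop", "reason": sorted(sentinels)}
    options = {key: status[key] for key in RESTORED_OPTIONS if key in status}
    return {
        "action": "run",
        "slug": slug,
        "options": options,
        "verified_us": status.get("verified_us", []),
        "iteration": status.get("iteration", 0),
    }


print(resume_plan("my-feature", {"worker_model": "sonnet", "verified_us": ["US-001"]}, set()))
```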
|
|
657
|
+
|
|
658
|
+
Example:
|
|
659
|
+
```
|
|
660
|
+
/rlp-desk resume my-feature
|
|
519
661
|
```
|
|
520
662
|
|
|
521
663
|
## No args or `help`
|
|
@@ -543,7 +685,7 @@ Run options:
|
|
|
543
685
|
--consensus-scope SCOPE When consensus runs: all|final-only (default: all)
|
|
544
686
|
--cb-threshold N CB threshold: consecutive failures before BLOCKED (default: 3)
|
|
545
687
|
--iter-timeout N Per-iteration timeout in seconds, tmux mode only (default: 600)
|
|
546
|
-
--debug Debug logging (
|
|
688
|
+
--debug Debug logging (~/.claude/ralph-desk/analytics/<slug>/debug.log)
|
|
547
689
|
--with-self-verification Campaign self-verification analysis (post-loop report)
|
|
548
690
|
```
|
|
549
691
|
|
package/src/governance.md
CHANGED
|
@@ -11,6 +11,7 @@ The Leader orchestrates, while Worker/Verifier run in isolated fresh contexts ev
|
|
|
11
11
|
- **Filesystem = memory**: State exists only on the filesystem (PRD, memory, context, memos).
|
|
12
12
|
- **Worker claim ≠ complete**: A Worker's DONE is merely a claim. The Verifier must independently verify before it's confirmed.
|
|
13
13
|
- **Worker scope is bounded**: Worker implements only the contracted US per iteration (Scope Lock). Out-of-scope changes are flagged by the Verifier.
|
|
14
|
+
- **Worker must NEVER modify Claude Code settings** (settings.json, settings.local.json). Permission prompts must be reported as blocked, not bypassed by editing settings.
|
|
14
15
|
- **Verifier is independent**: The Verifier judges based on evidence alone, without knowledge of the Worker's reasoning process.
|
|
15
16
|
- **Sentinels are Leader-owned**: Only the Leader writes COMPLETE/BLOCKED sentinels.
|
|
16
17
|
- **Supported engines**: claude (default; models: haiku, sonnet, opus) and codex (opt-in via `--worker-engine codex` / `--verifier-engine codex`).
|
|
@@ -184,7 +185,7 @@ Verification occurs at two boundaries, not as a single final event.
|
|
|
184
185
|
### Checkpoint 2: Release Readiness (us_id=ALL)
|
|
185
186
|
- Trigger: all individual US pass Checkpoint 1 → Worker signals verify with us_id = "ALL"
|
|
186
187
|
- Scope: all AC + L2 integration (if applicable) + L3 E2E Simulation + L4 deploy (if applicable) + mutation score (if CRITICAL, when mutation testing tool is configured in test-spec)
|
|
187
|
-
- On fail: fix loop; escalation to user if
|
|
188
|
+
- On fail: fix loop; escalation to user if 6 consecutive failures (default cb_threshold)
|
|
188
189
|
|
|
189
190
|
### Relationship to Existing Flow
|
|
190
191
|
- Checkpoint 1 = existing per-US verify (§7a). No change.
|
|
@@ -199,14 +200,27 @@ This is the default behavior, not an optional flag. Without it, IL-1 (Evidence M
|
|
|
199
200
|
### Worker: execution_steps in done-claim.json
|
|
200
201
|
Worker records what was done, in what order, with command evidence in `done-claim.json`:
|
|
201
202
|
- Each step includes: what action, which AC, command executed, exit code, summary
|
|
202
|
-
- Step types: `write_test`, `verify_red`, `implement`, `verify_green`, `refactor`, `commit`, `verify`
|
|
203
|
+
- Step types: `write_test`, `verify_red`, `implement`, `verify_green`, `refactor`, `commit`, `verify`, `verify_existing`
|
|
203
204
|
- This proves the Worker followed test-first approach and did not skip steps
|
|
205
|
+
- **Existing implementation rule**: When code already exists from a prior iteration/campaign, Worker MAY use `verify_existing` instead of `write_test → verify_red → implement → verify_green`. `verify_existing` requires: run all existing tests, record exit codes, confirm all AC are covered by passing tests. Worker MUST NOT skip recording evidence — `verify_existing` is evidence that existing code satisfies AC, not a shortcut to skip verification.
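One way to picture the process check implied above is a validator over the recorded step types. This is only a sketch: the step dictionary keys `type` and `exit_code` are hypothetical (the done-claim schema is not shown here), and requiring exit code 0 for `verify_existing` steps is an assumption drawn from "passing tests".

```python
TEST_FIRST = ["write_test", "verify_red", "implement", "verify_green"]


def follows_process(steps: list) -> bool:
    """True if steps show test-first order, or a passing verify_existing path."""
    types = [step["type"] for step in steps]
    if "verify_existing" in types:
        # Existing-implementation path: every verify_existing step must have passed.
        return all(
            step.get("exit_code") == 0
            for step in steps
            if step["type"] == "verify_existing"
        )
    # Test-first path: TEST_FIRST must appear in order (other steps may interleave).
    remaining = iter(types)
    return all(required in remaining for required in TEST_FIRST)


claim = [{"type": t} for t in ["write_test", "verify_red", "implement", "verify_green", "commit"]]
print(follows_process(claim))
```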
|
|
204
206
|
|
|
205
207
|
### Verifier: reasoning in verify-verdict.json
|
|
206
208
|
Verifier records WHY each judgment was made in `verify-verdict.json`:
|
|
207
209
|
- Each check includes: what was checked, decision (pass/fail), and the specific evidence basis
|
|
208
|
-
- Checks include: IL-1 Evidence Gate, Layer Enforcement, Test Sufficiency, Anti-Gaming, Worker Process Audit
|
|
210
|
+
- Checks include: IL-1 Evidence Gate, Layer Enforcement, Test Sufficiency, Anti-Gaming, Worker Process Audit, Test Coverage Audit
|
|
209
211
|
- This proves the Verifier actually performed each check rather than rubber-stamping
|
|
212
|
+
- **Test Coverage Audit (mandatory)**: Verifier MUST check that tests cover ALL code paths, not just happy paths. Specifically:
|
|
213
|
+
- Every branch in `case` statements must have a test (e.g., all model types in get_next_model)
|
|
214
|
+
- Every engine/model combination must be tested (claude, codex 5.4, spark — not just 1-2)
|
|
215
|
+
- Every ceiling/boundary must be tested (not just "opus ceiling" — also spark ceiling, 5.4 ceiling)
|
|
216
|
+
- If Worker's tests cover only 2 of 3 engine paths, verdict MUST be fail with "test coverage gap" issue
|
|
217
|
+
- "Tests pass" is NOT sufficient — "tests cover all code paths" is required
|
|
218
|
+
- **Integration Test (mandatory when functions call other functions)**: Verifier MUST check that function call chains produce correct end-to-end results, not just that each function works in isolation. Specifically:
|
|
219
|
+
- If function A's output is function B's input, there MUST be a test that runs A→B together and verifies the result
|
|
220
|
+
- Example: `get_model_string()` returns "gpt-5.3-codex-spark:medium" — `get_next_model()` must accept that exact value and return the correct upgrade. A test must verify this chain.
|
|
221
|
+
- Unit tests (extract_fn + isolated run) are necessary but NOT sufficient for refactored code
|
|
222
|
+
- Structural tests (grep for function existence) are necessary but NOT sufficient
|
|
223
|
+
- "All unit tests pass" does NOT prove the system works — integration tests prove it
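To make the A→B requirement concrete, here is a toy chain test. The two functions are Python stand-ins named after the functions in the example above, and the three-rung ladder is invented purely for illustration; it is not the package's model upgrade table. The point is the shape of the test: feed the exact output of one function into the next and assert on the end result.

```python
# Toy upgrade ladder, for illustration only (NOT the real model-upgrade-table).
LADDER = [
    "gpt-5.3-codex-spark:low",
    "gpt-5.3-codex-spark:medium",
    "gpt-5.3-codex-spark:high",
]


def get_model_string(model: str, effort: str) -> str:
    """Stand-in for function A: build the 'model:effort' string."""
    return f"{model}:{effort}"


def get_next_model(current: str) -> str:
    """Stand-in for function B: next rung, staying at the ceiling once reached."""
    index = LADDER.index(current)  # raises ValueError on a format B does not accept
    return LADDER[min(index + 1, len(LADDER) - 1)]


def test_chain_a_to_b() -> None:
    # Integration check: B must accept exactly what A produces.
    produced = get_model_string("gpt-5.3-codex-spark", "medium")
    assert get_next_model(produced) == "gpt-5.3-codex-spark:high"
    # Ceiling check: the top rung upgrades to itself.
    assert get_next_model("gpt-5.3-codex-spark:high") == "gpt-5.3-codex-spark:high"


test_chain_a_to_b()
```

A unit test of either function alone would still pass if A started emitting a different separator; only the chained test would catch that mismatch.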
|
|
210
224
|
|
|
211
225
|
### Why This Is Default (Not Optional)
|
|
212
226
|
- IL-1 says "no claims without evidence" — this applies to Worker AND Verifier
|
|
@@ -388,19 +402,23 @@ Characteristics:
|
|
|
388
402
|
├── plans/
|
|
389
403
|
│ ├── prd-<slug>.md # PRD (in-place: --mode improve | deleted: --mode fresh)
|
|
390
404
|
│ └── test-spec-<slug>.md # Verification criteria (regenerated on re-execution)
|
|
391
|
-
└── logs/<slug>/
|
|
392
|
-
├── debug.log # Debug output (versioned: debug-v{N}.log on re-execution)
|
|
405
|
+
└── logs/<slug>/ # Project-level operational data
|
|
393
406
|
├── campaign-report.md # Campaign summary (versioned: campaign-report-v{N}.md on re-execution)
|
|
394
407
|
├── iter-NNN.worker-prompt.md # Audit trail prompt copy (deleted on re-execution)
|
|
395
408
|
├── iter-NNN.verifier-prompt.md # Audit trail prompt copy (deleted on re-execution)
|
|
396
409
|
├── iter-NNN.result.md # Iteration result (deleted on re-execution)
|
|
397
410
|
├── iter-NNN-done-claim.json # Archived done-claim per iteration (deleted on re-execution)
|
|
398
411
|
├── iter-NNN-verify-verdict.json # Archived verdict per iteration (deleted on re-execution)
|
|
399
|
-
├── self-verification-data.json # Cumulative SV data (--with-self-verification; deleted on re-execution)
|
|
400
|
-
├── self-verification-report-NNN.md # Versioned SV report (-NNN auto-increment; NOT versioned via version_file)
|
|
401
412
|
├── status.json # Leader's loop state (deleted on re-execution)
|
|
402
413
|
├── baseline.log # Baseline capture (deleted on re-execution)
|
|
403
414
|
└── cost-log.jsonl # Per-iteration cost log (deleted on re-execution)
|
|
415
|
+
|
|
416
|
+
~/.claude/ralph-desk/analytics/<slug>--<root_hash>/ # User-level cross-project analytics
|
|
417
|
+
├── metadata.json # Campaign metadata (slug, project_root, status, times)
|
|
418
|
+
├── debug.log # Debug output (versioned: debug-v{N}.log on re-execution)
|
|
419
|
+
├── campaign.jsonl # Per-iteration structured data (versioned: campaign-v{N}.jsonl)
|
|
420
|
+
├── self-verification-data.json # Cumulative SV data (agent-mode only, --with-self-verification)
|
|
421
|
+
└── self-verification-report-NNN.md # Versioned SV report (--with-self-verification; NNN auto-increment)
|
|
404
422
|
```
|
|
405
423
|
|
|
406
424
|
## 7. Leader Loop Protocol
|
|
@@ -555,14 +573,14 @@ In tmux mode: Leader writes `<slug>-escalation.md` with the report and sets BLOC
|
|
|
555
573
|
| Condition | Verdict |
|
|
556
574
|
|-----------|---------|
|
|
557
575
|
| context-latest.md unchanged for 3 consecutive iterations | BLOCKED |
|
|
558
|
-
| Same acceptance criterion fails 2 consecutive iterations | Upgrade model, retry once; if still failing → Architecture Escalation (§7¾) → BLOCKED |
|
|
559
|
-
|
|
|
576
|
+
| Same acceptance criterion fails 2 consecutive iterations | Upgrade model, retry once (Agent mode only; tmux: same model retry); if still failing → Architecture Escalation (§7¾) → BLOCKED |
|
|
577
|
+
| `cb_threshold` (default: 6) consecutive **fail** verdicts on `cb_threshold` unique criterion IDs | Upgrade to opus, retry once; if still failing → BLOCKED (adjustable via `--cb-threshold`) |
|
|
560
578
|
| max_iter reached | TIMEOUT (report to user) |
|
|
561
579
|
|
|
562
580
|
The Leader tracks `consecutive_failures` in `status.json`:
|
|
563
581
|
- Increments on `fail`, resets on `pass`, **unchanged by `request_info`**.
|
|
564
582
|
- "Same error" = same acceptance criterion ID in two consecutive **fail** verdicts (`request_info` does not break or contribute to this chain).
|
|
565
|
-
- "Diverse failures" =
|
|
583
|
+
- "Diverse failures" = `cb_threshold` most recent `fail` verdicts each have a unique criterion ID.
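The three tracking rules above can be sketched as a small state object. This is an illustrative Python model, not the Leader's actual code: the `FailureTracker` class and its method names are hypothetical, and it assumes that a `pass` also clears the recorded criterion IDs because the streak must be consecutive.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureTracker:
    """Hypothetical model of the consecutive_failures bookkeeping."""

    cb_threshold: int = 6
    consecutive_failures: int = 0
    # Criterion IDs of the current fail streak, oldest first.
    fail_ids: List[str] = field(default_factory=list)

    def record(self, verdict: str, criterion_id: str = "") -> None:
        if verdict == "pass":
            self.consecutive_failures = 0
            self.fail_ids.clear()
        elif verdict == "fail":
            self.consecutive_failures += 1
            self.fail_ids.append(criterion_id)
        # "request_info" leaves all state unchanged.

    def same_error(self) -> bool:
        """Same criterion ID in the two most recent fail verdicts."""
        return len(self.fail_ids) >= 2 and self.fail_ids[-1] == self.fail_ids[-2]

    def diverse_failures(self) -> bool:
        """The cb_threshold most recent fails each have a unique criterion ID."""
        recent = self.fail_ids[-self.cb_threshold:]
        return (
            len(recent) == self.cb_threshold
            and len(set(recent)) == self.cb_threshold
        )


tracker = FailureTracker(cb_threshold=3)
tracker.record("fail", "AC1")
tracker.record("request_info")  # neither breaks nor extends the chain
tracker.record("fail", "AC1")
print(tracker.consecutive_failures, tracker.same_error())
```

Because `request_info` is ignored entirely, two `fail` verdicts on the same criterion still count as "same error" when a `request_info` sits between them.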
|
|
566
584
|
|
|
567
585
|
## 9. Change Policy
|
|
568
586
|
|
|
@@ -0,0 +1,50 @@
|
|
|
1
|
+
# Model Upgrade Table
|
|
2
|
+
|
|
3
|
+
Progressive Worker model upgrade on consecutive failure per US.
|
|
4
|
+
CB default: 6. Override: `--cb-threshold N`. Worker only — Verifier fixed at campaign start.
|
|
5
|
+
|
|
6
|
+
## Rules
|
|
7
|
+
- Each row = 2-attempt window (same model for 2 consecutive fails)
|
|
8
|
+
- Ceiling reached → repeat same model until CB
|
|
9
|
+
- CB < table columns → BLOCKED at that column
|
|
10
|
+
- CB > 6 → repeat ceiling model beyond column 6
|
|
11
|
+
|
|
12
|
+
## GPT Pro (spark — separate token limit)
|
|
13
|
+
|
|
14
|
+
| Complexity | 1-2 | 3-4 | 5-6 | 7+ |
|
|
15
|
+
|------------|-----|-----|-----|-----|
|
|
16
|
+
| LOW | spark:low | spark:medium | spark:high | BLOCKED |
|
|
17
|
+
| MEDIUM | spark:medium | spark:high | spark:xhigh | BLOCKED |
|
|
18
|
+
| HIGH | spark:high | spark:xhigh | spark:xhigh | BLOCKED |
|
|
19
|
+
| CRITICAL | spark:xhigh | spark:xhigh | spark:xhigh | BLOCKED |
|
|
20
|
+
|
|
21
|
+
## Non-Pro (gpt-5.4)
|
|
22
|
+
|
|
23
|
+
| Complexity | 1-2 | 3-4 | 5-6 | 7+ |
|
|
24
|
+
|------------|-----|-----|-----|-----|
|
|
25
|
+
| LOW | 5.4:low | 5.4:medium | 5.4:high | BLOCKED |
|
|
26
|
+
| MEDIUM | 5.4:medium | 5.4:high | 5.4:xhigh | BLOCKED |
|
|
27
|
+
| HIGH | 5.4:high | 5.4:xhigh | 5.4:xhigh | BLOCKED |
|
|
28
|
+
| CRITICAL | 5.4:xhigh | 5.4:xhigh | 5.4:xhigh | BLOCKED |
|
|
29
|
+
|
|
30
|
+
## Claude-only
|
|
31
|
+
|
|
32
|
+
| Complexity | 1-2 | 3-4 | 5-6 | 7+ |
|
|
33
|
+
|------------|-----|-----|-----|-----|
|
|
34
|
+
| LOW | haiku | sonnet | opus | BLOCKED |
|
|
35
|
+
| MEDIUM | sonnet | opus | opus | BLOCKED |
|
|
36
|
+
| HIGH | sonnet | opus | opus | BLOCKED |
|
|
37
|
+
| CRITICAL | opus | opus | opus | BLOCKED |
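One possible reading of the rules and the Claude-only table, sketched in Python. The function name `worker_model` is hypothetical, and the sketch assumes that the column headers (1-2, 3-4, 5-6, 7+) count attempts on the same US and that any attempt beyond the circuit-breaker threshold is BLOCKED; the package's scripts may resolve these cases differently.

```python
# Data copied from the Claude-only table above; the BLOCKED column is
# derived from cb_threshold rather than stored.
CLAUDE_TABLE = {
    "LOW": ["haiku", "sonnet", "opus"],
    "MEDIUM": ["sonnet", "opus", "opus"],
    "HIGH": ["sonnet", "opus", "opus"],
    "CRITICAL": ["opus", "opus", "opus"],
}


def worker_model(complexity: str, attempt: int, cb_threshold: int = 6) -> str:
    """Pick the Worker model for a 1-based attempt number on one US."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if attempt > cb_threshold:
        # Circuit breaker reached, whether before or after the last column.
        return "BLOCKED"
    row = CLAUDE_TABLE[complexity]
    # Two attempts per column; past the last column, repeat the ceiling model.
    column = min((attempt - 1) // 2, len(row) - 1)
    return row[column]


for attempt in range(1, 8):
    print(attempt, worker_model("LOW", attempt))
```

With the default threshold this walks haiku, haiku, sonnet, sonnet, opus, opus, BLOCKED for a LOW campaign. Passing `cb_threshold=8` keeps attempts 7 and 8 on the ceiling model, and `cb_threshold=4` blocks from attempt 5.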
|
|
38
|
+
|
|
39
|
+
## Complexity Evaluation (brainstorm determines this)
|
|
40
|
+
|
|
41
|
+
| Factor | LOW | MEDIUM | HIGH | CRITICAL |
|
|
42
|
+
|--------|-----|--------|------|----------|
|
|
43
|
+
| US count | 1-2 | 3-5 | 6-10 | 10+ |
|
|
44
|
+
| File scope | single | 2-5 | 6+ | cross-repo |
|
|
45
|
+
| Logic | simple CRUD | conditionals | algorithms | security/crypto |
|
|
46
|
+
| Dependencies | none | 1-2 | 3+ API/DB | distributed |
|
|
47
|
+
| Code impact | new only | modify existing | refactor | architecture change |
|
|
48
|
+
|
|
49
|
+
Overall complexity = highest factor level.
|
|
50
|
+
Campaign starting model = lowest US risk level (progressive upgrade handles harder US).
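The two closing rules reduce to a maximum and a minimum over an ordered scale. A minimal Python sketch, assuming each factor and each US has already been rated with one of the four level names; the function names and the sample ratings are hypothetical.

```python
from typing import Dict, List

# Ordered from lowest to highest, matching the table columns above.
LEVELS = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def overall_complexity(factors: Dict[str, str]) -> str:
    """Overall complexity is the highest level among the rated factors."""
    return max(factors.values(), key=LEVELS.index)


def campaign_start_level(us_levels: List[str]) -> str:
    """The campaign starts at the lowest per-US level; upgrades handle the rest."""
    return min(us_levels, key=LEVELS.index)


ratings = {
    "US count": "MEDIUM",
    "File scope": "LOW",
    "Logic": "HIGH",
    "Dependencies": "LOW",
    "Code impact": "MEDIUM",
}
print(overall_complexity(ratings))
print(campaign_start_level(["HIGH", "LOW", "MEDIUM"]))
```

Here a single HIGH factor makes the whole PRD HIGH, while a campaign whose stories are rated HIGH, LOW, and MEDIUM would start from the LOW row of the upgrade table.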
|