@ai-dev-methodologies/rlp-desk 0.3.6 → 0.5.0

package/README.md CHANGED
@@ -99,6 +99,22 @@ for iteration in 1..max_iter:
99
99
  8. Update status, report to user, continue or stop
100
100
  ```
101
101
 
102
+ ### Live PRD Update
103
+
104
+ The Leader computes an `md5` hash of `prd-<slug>.md` at startup and again at each iteration.
105
+
106
+ When the hash changes, it:
107
+
108
+ - Logs `prd_changed=true` with `prd_hash`, previous/new US counts, and `new_us`
109
+ - Splits the PRD into per-US files (`prd-<slug>-US-<id>.md`)
110
+ - Splits the test-spec into per-US files (`test-spec-<slug>-US-<id>.md`)
111
+ - Updates the in-memory PRD US list used for per-US dispatch
112
+ - Adds `NOTE: PRD was updated since last iteration. New/changed US may exist.` to the Worker prompt
113
+
114
+ If the PRD hash is unchanged, `prd_changed=false` is logged and no re-split is triggered.
115
+
116
+ If the PRD file is missing, the process degrades gracefully and continues without failing the campaign loop.
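
The change-detection step described above can be sketched roughly as follows. This is a minimal illustration, not the Leader's actual code: `prd_hash` and `prd_changed` are hypothetical helper names, the fallback between `md5sum` and `md5 -q` is an assumption about the host platform, and treating a missing file as "unchanged" is only one possible reading of the graceful-degradation rule.

```bash
# Hypothetical helper (not from the package): print the MD5 digest of a file,
# or nothing at all when the file does not exist.
prd_hash() {
  local file="$1"
  if [ ! -f "$file" ]; then
    return 0                            # missing PRD: empty output, no error
  fi
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$file" | awk '{print $1}'   # GNU coreutils
  else
    md5 -q "$file"                      # BSD/macOS
  fi
}

# Hypothetical helper: compare the previous and current digests and print the
# log field. An empty current digest (missing file) is treated as unchanged.
prd_changed() {
  local prev="$1" curr="$2"
  if [ -n "$curr" ] && [ "$prev" != "$curr" ]; then
    echo "prd_changed=true"
  else
    echo "prd_changed=false"
  fi
}
```

A loop would call `prd_hash` once at startup, call it again at the top of each iteration, and trigger the re-split only when `prd_changed` prints `prd_changed=true`.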
117
+
102
118
  ### Verification Policy (v0.3.0)
103
119
 
104
120
  RLP Desk enforces a comprehensive verification policy defined in `governance.md`:
@@ -133,15 +149,75 @@ RLP Desk enforces a comprehensive verification policy defined in `governance.md`
133
149
  | 3 consecutive failures | Architecture Escalation (§7¾) → report to user |
134
150
  | Max iterations reached | TIMEOUT |
135
151
 
136
- ### Model Routing
152
+ ### Verification Strategy (v0.5)
153
+
154
+ **Core principle: Worker and Verifier use different AI engines whenever possible.**
155
+
156
+ - Per-US: lightweight verification after each user story (catches issues early)
157
+ - Final: top-tier consensus gate before COMPLETE (quality guarantee)
158
+ - Progressive upgrade: auto-upgrade models on consecutive failure (2-attempt windows)
159
+ - Verifier minimum: claude sonnet (haiku cannot verify)
160
+
161
+ #### 1. Claude-only (codex not installed)
162
+
163
+ Verifier is always +1 tier above Worker. Same-engine shares blind spots — install codex for improved detection.
164
+
165
+ | Risk | Worker | Per-US Verifier | Worker upgrade path | Verifier upgrade path |
166
+ |------|--------|-----------------|--------------------|-----------------------|
167
+ | LOW | haiku | sonnet | sonnet → opus | sonnet → opus |
168
+ | MEDIUM | sonnet | sonnet | opus | sonnet → opus |
169
+ | HIGH | sonnet | opus | opus | opus (ceiling) |
170
+ | CRITICAL | opus | opus ⚠ | (ceiling) | (ceiling) |
171
+
172
+ Final: **opus solo** ⚠ same-engine warning displayed
173
+
174
+ #### 2. Cross-engine: GPT Pro (spark + 5.4)
175
+
176
+ Spark is speed-optimized for coding. Use as Worker for LOW-HIGH; 5.4 for CRITICAL.
177
+
178
+ | Risk | Worker (codex) | Per-US Verifier (claude) | Worker upgrade path | Verifier upgrade path |
179
+ |------|---------------|--------------------------|--------------------|-----------------------|
180
+ | LOW | spark medium | sonnet | spark high → xhigh | sonnet → opus |
181
+ | MEDIUM | spark high | sonnet | spark xhigh → 5.4 medium | sonnet → opus |
182
+ | HIGH | spark xhigh | opus | 5.4 high → 5.4 xhigh | opus (ceiling) |
183
+ | CRITICAL | 5.4 high | opus | 5.4 xhigh | opus (ceiling) |
184
+
185
+ Final: **opus + 5.4 high** (both must PASS)
186
+
187
+ #### 3. Cross-engine: Non-Pro (5.4 only)
188
+
189
+ | Risk | Worker (codex) | Per-US Verifier (claude) | Worker upgrade path | Verifier upgrade path |
190
+ |------|---------------|--------------------------|--------------------|-----------------------|
191
+ | LOW | 5.4 low | sonnet | 5.4 medium → high | sonnet → opus |
192
+ | MEDIUM | 5.4 medium | sonnet | 5.4 high → xhigh | sonnet → opus |
193
+ | HIGH | 5.4 high | opus | 5.4 xhigh | opus (ceiling) |
194
+ | CRITICAL | 5.4 xhigh | opus | (ceiling) | opus (ceiling) |
195
+
196
+ Final: **opus + 5.4 high** (both must PASS)
197
+
198
+ #### Final Verify
199
+
200
+ | Environment | Engine 1 | Engine 2 | Rule |
201
+ |-------------|----------|----------|------|
202
+ | Claude-only | opus | — | Solo ⚠ |
203
+ | Cross-engine | opus | 5.4 high | Both must PASS → COMPLETE |
204
+
205
+ #### Progressive Upgrade (Worker Only)
206
+
207
+ Worker auto-upgrades on consecutive same-US failure. Verifier is fixed at campaign start. Circuit breaker (CB) threshold default: 6.
208
+
209
+ ```
210
+ fail 1-2: keep current model (2-attempt window)
211
+ fail 3-4: upgrade 1 step (e.g., haiku → sonnet)
212
+ fail 5-6: upgrade 2 steps (e.g., haiku → opus)
213
+ fail 7+: ceiling reached → BLOCKED
214
+ ```
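
The failure windows above amount to a small lookup from failure count to upgrade steps. The sketch below is illustrative only: `upgrade_steps` is a hypothetical name, and it assumes the count passed in is the number of consecutive failures on the same US.

```bash
# Hypothetical helper: map a consecutive same-US failure count to the number
# of upgrade steps from the starting Worker model, following the windows above.
upgrade_steps() {
  local fails="$1"
  if [ "$fails" -le 2 ]; then
    echo 0          # fail 1-2: keep current model
  elif [ "$fails" -le 4 ]; then
    echo 1          # fail 3-4: upgrade 1 step
  elif [ "$fails" -le 6 ]; then
    echo 2          # fail 5-6: upgrade 2 steps
  else
    echo BLOCKED    # fail 7+: ceiling reached
  fi
}
```

For example, `upgrade_steps 5` prints `2`, and `upgrade_steps 7` prints `BLOCKED`.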
137
215
 
138
- | Scenario | Model |
139
- |----------|-------|
140
- | Simple, single-file changes | `haiku` |
141
- | Standard work (default) | `sonnet` |
142
- | Architecture changes, multi-file, prior failure | `opus` |
143
- | Verification (default) | `opus` |
144
- | Lightweight verification | `sonnet` |
216
+ See `src/model-upgrade-table.md` for full upgrade paths per engine and complexity level.
217
+
218
+ #### Sequential Final Verify
219
+
220
+ When all US pass individually, the final ALL verify runs **sequentially per-US** instead of one big check. This prevents verifier timeout on large PRDs. After all per-US checks pass, the project's test suite runs once as a cross-US integration check.
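
One way to picture the sequential final verify is the loop below. It is a sketch under stated assumptions: `verify_us`, `run_test_suite`, and `final_verify` are hypothetical names, the two stubs stand in for the real Verifier call and the project's test command, and stopping at the first failing US is an assumption rather than documented behavior.

```bash
# Placeholder checks; a real run would invoke the Verifier and the project's
# test command here. Both are stubs that always succeed.
verify_us() { return 0; }
run_test_suite() { return 0; }

# Hypothetical driver: verify each US in order, stop at the first failure,
# then run the test suite once as the cross-US integration check.
final_verify() {
  local us
  for us in "$@"; do
    if ! verify_us "$us"; then
      echo "FAIL $us"
      return 1
    fi
  done
  if ! run_test_suite; then
    echo "FAIL integration"
    return 1
  fi
  echo "COMPLETE"
}
```

Keeping each check scoped to one US bounds the Verifier's work per call, which is the timeout concern the paragraph above describes.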
145
221
 
146
222
  ## Commands
147
223
 
@@ -159,18 +235,29 @@ RLP Desk enforces a comprehensive verification policy defined in `governance.md`
159
235
  | Flag | Default | Description |
160
236
  |------|---------|-------------|
161
237
  | `--max-iter N` | 100 | Maximum iterations before timeout |
162
- | `--worker-model MODEL` | sonnet | Worker model (haiku/sonnet/opus) |
163
- | `--verifier-model MODEL` | opus | Verifier model (haiku/sonnet/opus) |
164
238
  | `--mode agent\|tmux` | agent | Execution mode (see below) |
165
- | `--worker-engine claude\|codex` | claude | Engine for Worker (claude uses Agent(), codex uses Bash CLI) |
166
- | `--verifier-engine claude\|codex` | claude | Engine for Verifier |
167
- | `--codex-model MODEL` | gpt-5.4 | Model passed to the Codex CLI (when engine=codex) |
168
- | `--codex-reasoning low\|medium\|high` | high | Reasoning effort for Codex |
239
+ | `--worker-model MODEL` | sonnet | Claude worker model (haiku/sonnet/opus) |
240
+ | `--worker-engine claude\|codex` | claude | Worker engine |
241
+ | `--verifier-model MODEL` | auto | Auto-selected: +1 tier (same-engine) or cross-engine |
242
+ | `--verifier-engine claude\|codex` | auto | Opposite of worker engine if codex available |
243
+ | `--codex-model MODEL` | gpt-5.4 | Codex model (spark requires GPT Pro) |
244
+ | `--codex-reasoning LEVEL` | medium | low/medium/high/xhigh |
169
245
  | `--verify-mode per-us\|batch` | per-us | Verification strategy (see below) |
170
- | `--verify-consensus` | off | Cross-engine consensus verification (see below) |
246
+ | `--lock-worker-model` | off | Disable progressive model upgrade on failure |
171
247
  | `--debug` | off | Debug logging to `logs/<slug>/debug.log` |
172
248
  | `--with-self-verification` | off | Campaign-level post-loop analysis report |
173
249
 
250
+ ### Init Presets
251
+
252
+ After `brainstorm`, `init` detects your environment and presents run command presets:
253
+
254
+ - **Codex detected** → recommends cross-engine mode (`--worker-model gpt-5.4:high --verify-consensus`)
255
+ - **GPT Pro (spark)** → offers spark preset (`--worker-model gpt-5.3-codex-spark:high`)
256
+ - **Claude-only** → defaults to `--worker-model sonnet` with opus verifier
257
+ - **Basic** → minimal flags for quick iteration
258
+
259
+ The brainstorm phase evaluates complexity (US count, file scope, logic, dependencies, code impact) and recommends a starting model. You can override any recommendation.
260
+
174
261
  ## Execution Modes
175
262
 
176
263
  RLP Desk supports two execution modes. Both honor the same governance protocol.
@@ -277,28 +364,18 @@ Uses the `codex` CLI via `Bash()` (agent mode) or as an interactive TUI (tmux mo
277
364
 
278
365
  ## Verification Modes
279
366
 
280
- RLP Desk supports two verification strategies. **Per-US is the default.**
281
-
282
367
  ### Per-US Verification (default)
283
368
 
284
- ```
285
- /rlp-desk run calculator
286
- /rlp-desk run calculator --verify-mode per-us
287
- ```
288
-
289
- Each user story is verified independently after completion, then a final full verification runs after all stories pass:
369
+ Each user story is verified independently, then a final full verification runs:
290
370
 
291
371
  ```
292
- Worker: US-001 → Verifier: US-001 AC only → pass
293
- Worker: US-002 → Verifier: US-002 AC only → pass
294
- Worker: US-003 → Verifier: US-003 AC only → pass
295
- Final full verify: ALL AC → pass → COMPLETE
372
+ Worker: US-001 → Verifier(per-US): US-001 only → pass
373
+ Worker: US-002 → Verifier(per-US): US-002 only → pass
374
+ ...
375
+ Final Verify: opus + 5.4 high both pass → COMPLETE
296
376
  ```
297
377
 
298
- Benefits:
299
- - Catch issues early, before later stories build on broken foundations
300
- - Smaller verification scope = faster, more accurate checks
301
- - Failed verification retries only the specific US
378
+ Per-US catches issues early before later stories build on broken foundations.
302
379
 
303
380
  ### Batch Verification
304
381
 
@@ -306,30 +383,7 @@ Benefits:
306
383
  /rlp-desk run calculator --verify-mode batch
307
384
  ```
308
385
 
309
- Legacy behavior: Worker completes all stories, then a single verification checks all acceptance criteria at once.
310
-
311
- ### Cross-Engine Consensus Verification
312
-
313
- ```
314
- /rlp-desk run calculator --verify-consensus
315
- ```
316
-
317
- When enabled, **both claude and codex verify independently**. Both must pass for verification to succeed.
318
-
319
- ```
320
- Worker completes US → Claude verifies → Codex verifies
321
- Both pass → proceed
322
- Either fails → combined fix contract → Worker retry
323
- 3 rounds without consensus → BLOCKED
324
- ```
325
-
326
- Consensus can be combined with per-US mode for maximum rigor:
327
-
328
- ```
329
- /rlp-desk run calculator --verify-mode per-us --verify-consensus
330
- ```
331
-
332
- Prerequisites: Both `claude` and `codex` CLIs must be installed.
386
+ Worker completes all stories, then a single verification checks all AC at once. Final verify still applies.
333
387
 
334
388
  ## Project Structure
335
389
 
@@ -337,20 +391,42 @@ After `init`, your project gets this scaffold:
337
391
 
338
392
  ```
339
393
  your-project/
340
- └── .claude/ralph-desk/
341
- ├── prompts/
342
- ├── <slug>.worker.prompt.md
343
- └── <slug>.verifier.prompt.md
344
- ├── context/
345
- │ └── <slug>-latest.md
346
- ├── memos/
347
- │ └── <slug>-memory.md
348
- ├── plans/
349
- ├── prd-<slug>.md
350
- └── test-spec-<slug>.md
351
- └── logs/<slug>/
352
- └── status.json
353
- ```
394
+ ├── .claude/
395
+ │   ├── settings.local.json  # rlp-desk permissions (auto-added by init)
396
+ │   └── ralph-desk/
397
+ │       ├── prompts/
398
+ │       │   ├── <slug>.worker.prompt.md
399
+ │       │   └── <slug>.verifier.prompt.md
400
+ │       ├── context/
401
+ │       │   └── <slug>-latest.md
402
+ │       ├── memos/
403
+ │       │   └── <slug>-memory.md
404
+ │       ├── plans/
405
+ │       │   ├── prd-<slug>.md
406
+ │       │   └── test-spec-<slug>.md
407
+ │       └── logs/<slug>/
408
+ │           └── status.json
409
+ ```
410
+
411
+ ### Local Settings
412
+
413
+ `init` automatically adds the following permissions to `.claude/settings.local.json`:
414
+
415
+ ```json
416
+ {
417
+   "permissions": {
418
+     "allow": [
419
+       "Read(.claude/ralph-desk/**)",
420
+       "Edit(.claude/ralph-desk/**)",
421
+       "Write(.claude/ralph-desk/**)"
422
+     ]
423
+   }
424
+ }
425
+ ```
426
+
427
+ **Why:** Claude Code treats `.claude/` files as sensitive and prompts for confirmation on each access, even with `--dangerously-skip-permissions`. Without these permissions, Worker and Verifier agents are blocked by interactive prompts during automated loop execution.
428
+
429
+ **Note:** `settings.local.json` is local to your machine and is not committed to git. If the file already exists, permissions are merged without overwriting your existing settings.
354
430
 
355
431
  ## Example: Calculator
356
432
 
@@ -0,0 +1,347 @@
1
+ # Blueprint: rlp-desk v0.4 Evolution
2
+
3
+ > Design blueprint for rlp-desk's next major direction.
4
+ > Status: CONFIRMED (Deep Interview 4.9% ambiguity) | Author: kyjin | Date: 2026-03-26
5
+
6
+ ---
7
+
8
+ ## Vision
9
+
10
+ rlp-desk is both a **task execution tool** and a **workflow generator**.
11
+
12
+ Users start with unstructured work, iterate through self-verification cycles,
13
+ and the accumulated process naturally becomes a reusable, formalized workflow.
14
+
15
+ ```
16
+ [Unstructured] [Structured]
17
+ brainstorm → run → verify Workflow (skill + command composition)
18
+ → re-brainstorm → run → verify Feedback loop enforcement
19
+ → run → verify Reproducible process
20
+ → final result ──▶ (P3: determined after P0-P2 iteration)
21
+ ```
22
+
23
+ ---
24
+
25
+ ## 1. Debug (`--debug`) — Execution Trace
26
+
27
+ ### Purpose
28
+
29
+ Trace rlp-desk's execution process. Not verbose data dumps — focused logging
30
+ of whether rules were followed and options behaved as configured.
31
+
32
+ ### Two audiences
33
+
34
+ | Audience | Use case |
35
+ |----------|----------|
36
+ | Developer (self) | Verify governance compliance, catch erroneous execution (e.g., codex consensus FAIL treated as PASS) |
37
+ | External users (npm) | Run with `--debug`, attach debug.log + version to bug report |
38
+
39
+ ### Scope
40
+
41
+ - Governance rule compliance (IL-1 through IL-5, checkpoint enforcement)
42
+ - Option behavior verification (consensus, per-us, model routing)
43
+ - Decision points (model upgrades, circuit breaker triggers)
44
+ - NOT implementation details of Worker/Verifier content
45
+
46
+ ### Current state
47
+
48
+ Basic implementation exists in v0.3.6 (debug.log with phase-level entries).
49
+ Refine to match the scoped purpose above — no expansion needed, possibly trimming.
50
+
51
+ ### Versioning
52
+
53
+ debug.log is versioned on re-execution: `debug-v1.log`, `debug-v2.log`, ...
54
+ Preserved for bug tracking across versions.
55
+
56
+ ---
57
+
58
+ ## 2. Self-Verification (`--with-self-verification`) — Quality Feedback Loop
59
+
60
+ ### Purpose
61
+
62
+ Evaluate whether the AI implementation meets quality expectations.
63
+ When the user is unsatisfied, self-verification becomes the **input for the next
64
+ execution cycle** — same goal, improved strategy.
65
+
66
+ ### Status
67
+
68
+ Remains an **optional flag** (`--with-self-verification`). Not always-on.
69
+ Simple tasks don't need the re-execution cycle.
70
+
71
+ ### Current vs. New
72
+
73
+ | Aspect | Current (v0.3.x) | New vision |
74
+ |--------|-------------------|------------|
75
+ | Timing | Post-campaign report only | Input for re-execution cycle |
76
+ | Output | Static report | Living document that drives next iteration |
77
+ | Scope | Analysis of what happened | Analysis + recommendations that reshape execution plan |
78
+ | PRD | Not touched | May be refined based on self-verification findings |
79
+
80
+ ### Re-execution Cycle (same slug)
81
+
82
+ ```
83
+ brainstorm("auth-refactor")
84
+ → init → run → self-verification-v1 + campaign-report-v1 + debug-v1
85
+
86
+ user: "not satisfied, re-run"
87
+
88
+
89
+ re-brainstorm("auth-refactor")
90
+
91
+ ├─ PRD: single file, updated in place if needed (no versioning)
92
+ ├─ SV report: renamed to self-verification-v1.md (preserved)
93
+ ├─ Campaign report: renamed to campaign-report-v1.md (preserved)
94
+ ├─ Debug log: renamed to debug-v1.log (preserved)
95
+ ├─ Everything else: deleted (test-spec, prompts, context, memos, logs)
96
+ └─ Re-brainstorm: informed by self-verification-v1
97
+
98
+
99
+ → init → run → self-verification-v2 + campaign-report-v2 + debug-v2
100
+
101
+ ...
102
+ ```
103
+
104
+ ### Versioning Rules
105
+
106
+ When re-running the same slug:
107
+
108
+ 1. **PRD** — single file (`prd-<slug>.md`), updated in place if needed.
109
+ No versioning. PRD is the single source of truth.
110
+
111
+ 2. **Versioned files (3 total)** — renamed with vN suffix before re-run:
112
+ - `self-verification-report.md` → `self-verification-v1.md`
113
+ - `campaign-report.md` → `campaign-report-v1.md`
114
+ - `debug.log` → `debug-v1.log`
115
+
116
+ 3. **Everything else** — deleted. Next run regenerates them automatically:
117
+ - `test-spec-<slug>.md`, `prompts/`, `context/`, `memos/`, `logs/<slug>/*`
118
+
119
+ 4. **Self-verification as the historical record** — each version's story
120
+ is told by its self-verification report. No need to preserve iteration
121
+ logs; SV summarizes what happened.
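
The rename step in rule 2 could look roughly like the sketch below. It is not the contents of `init_ralph_desk.zsh`: `next_version` and `version_artifacts` are hypothetical names, and the sketch assumes for simplicity that all three versioned files sit in one directory.

```bash
# Hypothetical helper: find the first vN suffix not yet used in a directory.
next_version() {
  local dir="$1" n=1
  while [ -e "$dir/self-verification-v$n.md" ] ||
        [ -e "$dir/campaign-report-v$n.md" ] ||
        [ -e "$dir/debug-v$n.log" ]; do
    n=$((n + 1))
  done
  echo "$n"
}

# Hypothetical helper: rename the three versioned artifacts with that suffix.
# Files that do not exist are skipped without error.
version_artifacts() {
  local dir="$1" n
  n="$(next_version "$dir")"
  if [ -f "$dir/self-verification-report.md" ]; then
    mv "$dir/self-verification-report.md" "$dir/self-verification-v$n.md"
  fi
  if [ -f "$dir/campaign-report.md" ]; then
    mv "$dir/campaign-report.md" "$dir/campaign-report-v$n.md"
  fi
  if [ -f "$dir/debug.log" ]; then
    mv "$dir/debug.log" "$dir/debug-v$n.log"
  fi
}
```

Using one shared suffix for all three files keeps the versions aligned even when one artifact is absent in a given run, for example when `--debug` was off.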
122
+
123
+ ### Re-execution Detection
124
+
125
+ When brainstorm detects an existing slug:
126
+ - Ask the user: "Improve based on previous results, or start fresh?"
127
+ - If improve: version existing files, carry forward PRD, re-brainstorm with SV context
128
+ - If start fresh: clean everything (equivalent to `clean` + new brainstorm)
129
+
130
+ ### Implementation: Shell Script
131
+
132
+ Deterministic file operations (rename, delete, version detection) go in
133
+ `init_ralph_desk.zsh`. AI handles judgment (PRD refinement, strategy changes).
134
+
135
+ ---
136
+
137
+ ## 3. Post-Run Reporting — Mandatory Completion Report
138
+
139
+ ### Purpose
140
+
141
+ After `run` completes, the user MUST receive a comprehensive, templated report
142
+ before deciding next steps. This is the **decision surface** for whether to
143
+ re-brainstorm or accept the result.
144
+
145
+ ### Trigger
146
+
147
+ Mandatory after every `run` completion (COMPLETE, BLOCKED, or TIMEOUT).
148
+ Not optional. Not skippable. Applies regardless of SV flag.
149
+
150
+ ### Output
151
+
152
+ - **File**: `logs/<slug>/campaign-report.md` (versioned on re-execution)
153
+ - **Screen**: Full report displayed to user
154
+
155
+ ### Report Template
156
+
157
+ ```markdown
158
+ # Campaign Report: <slug>
159
+
160
+ ## Objective
161
+ <from PRD>
162
+
163
+ ## Execution Summary
164
+ | Metric | Value |
165
+ |--------|-------|
166
+ | Total iterations | N |
167
+ | Outcome | COMPLETE / BLOCKED / TIMEOUT |
168
+ | Worker model | sonnet / opus |
169
+ | Verifier model | opus |
170
+ | Duration | Xm Ys |
171
+
172
+ ## User Stories Status
173
+ | US | Description | Status | Iterations | Notes |
174
+ |----|-------------|--------|------------|-------|
175
+ | US-001 | ... | PASS | 2 | — |
176
+ | US-002 | ... | PASS | 4 | 2 fix rounds |
177
+
178
+ ## Verification Results
179
+ - L1 (Unit): PASS — N tests, N assertions
180
+ - L2 (Integration): PASS / N/A
181
+ - L3 (E2E): PASS — input/output comparison
182
+ - L4 (Deploy): N/A
183
+
184
+ ## Issues Encountered
185
+ <failures, fix loops, model upgrades, escalations>
186
+
187
+ ## Cost & Performance
188
+ | Role | Model | Tokens | Duration | Source |
189
+ |------|-------|--------|----------|--------|
190
+ | Worker | sonnet | N | Xm Ys | measured/estimated |
191
+ | Verifier | opus | N | Xm Ys | measured/estimated |
192
+ | **Total** | | **N** | **Xm Ys** | |
193
+
194
+ ## Self-Verification Summary (if enabled)
195
+ <from self-verification report — strengths, weaknesses, recommendations>
196
+
197
+ ## Files Changed
198
+ <git diff --stat summary>
199
+ ```
200
+
201
+ ### Post-Report Flow
202
+
203
+ ```
204
+ [All runs]
205
+ → Report displayed + saved to file
206
+
207
+ [SV enabled only]
208
+ → "Would you like to re-brainstorm to improve the result?"
209
+ → Yes: trigger re-execution cycle (§2)
210
+ → No: session ends
211
+
212
+ [SV not enabled]
213
+ → Report displayed, session ends (no re-brainstorm question)
214
+ ```
215
+
216
+ ### Rules
217
+
218
+ - Report content must reference actual data (status.json, iteration results,
219
+ self-verification if available) — no fabrication
220
+ - Template is fixed; sections may show "N/A" but cannot be omitted
221
+ - Re-brainstorm question only appears when SV is enabled
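
The "sections cannot be omitted" rule lends itself to a mechanical check. The sketch below is an illustration only: `check_report_sections` is a hypothetical name, and it assumes the saved report uses the `## ` headings from the template above.

```bash
# Hypothetical check: list every template section missing from a report file.
# Returns 0 when all sections are present, 1 otherwise.
check_report_sections() {
  local report="$1" missing=0 heading
  for heading in "Objective" "Execution Summary" "User Stories Status" \
                 "Verification Results" "Issues Encountered" \
                 "Cost & Performance" "Self-Verification Summary" \
                 "Files Changed"; do
    # Prefix match, so "Self-Verification Summary (if enabled)" also counts.
    if ! grep -q "^## $heading" "$report"; then
      echo "missing: $heading"
      missing=1
    fi
  done
  return "$missing"
}
```

A section whose body is just "N/A" still passes, which matches the rule that sections may be empty but not absent.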
222
+
223
+ ---
224
+
225
+ ## 4. Workflow Generation — From Ad-hoc to Reproducible (P3, deferred)
226
+
227
+ ### Purpose
228
+
229
+ After multiple self-verification cycles produce a final result,
230
+ automatically generate a **formalized workflow** that captures the
231
+ proven process in a reusable, enforceable form.
232
+
233
+ ### Status: DEFERRED
234
+
235
+ P3 design will be determined after P0-P2 are working and tested through
236
+ real usage. The user will iterate with the developer to find the right form.
237
+
238
+ ### Known Direction
239
+
240
+ - Output: **skill + command composition** (not rlp-desk PRD format)
241
+ - Invoking the command triggers a combination of skills
242
+ - The command structure enforces feedback loops
243
+ - Leverage existing skill ecosystem (`/find-skills`, etc.)
244
+ - Trust AI models + well-structured skills + checklist-managed feedback loops
245
+
246
+ ### Success Criteria (confirmed)
247
+
248
+ - Process must be reproducible: same workflow → same quality of results
249
+ - Code implementation may differ, but behavior/quality must be equivalent
250
+ - Feedback loop + template documents as structural components
251
+
252
+ ### Open Questions (to resolve after P0-P2)
253
+
254
+ 1. Separate subcommand (`/rlp-desk workflow <slug>`) or independent skill?
255
+ 2. Minimum self-verification versions required (2? 3?)
256
+ 3. How to validate generated workflow actually reproduces results?
257
+
258
+ ---
259
+
260
+ ## Feature Relationship
261
+
262
+ ```
263
+ ┌──────────────────────────────────────────────────────────────┐
264
+ │ rlp-desk execution │
265
+ │ │
266
+ │ brainstorm → init → run ──┐ │
267
+ │ │ │
268
+ │ ┌──────────────┼──────────────────┐ │
269
+ │ │ │ │ │
270
+ │ --debug --with-self-verification │ │
271
+ │ (execution (quality evaluation) │ │
272
+ │ trace) │ │ │
273
+ │ │ ▼ │ │
274
+ │ │ self-verification report │ │
275
+ │ │ │ │ │
276
+ │ ▼ ▼ │ │
277
+ │ debug.log Post-Run Report (mandatory) │ │
278
+ │ (versioned) + campaign-report.md │ │
279
+ │ │ (versioned) │ │
280
+ │ │ │ │ │
281
+ │ │ [SV enabled?] │ │
282
+ │ │ │ │ │ │
283
+ │ │ Yes No │ │
284
+ │ │ │ │ │ │
285
+ │ │ ▼ ▼ │ │
286
+ │ │ "Re-brainstorm?" End │ │
287
+ │ │ │ │ │ │
288
+ │ │ Yes No │ │
289
+ │ │ │ │ │ │
290
+ │ │ ▼ ▼ │ │
291
+ │ │ Re-execute Accept │ │
292
+ │ │ (vN+1) result │ │
293
+ │ │ │ │ │ │
294
+ │ │ │ ┌────┘ │ │
295
+ │ │ ▼ ▼ │ │
296
+ │ │ Workflow Generation (P3) │ │
297
+ │ │ (deferred — after P0-P2) │ │
298
+ │ │ │ │
299
+ │ └── Bug report (external users) │ │
300
+ └──────────────────────────────────────────────────────────────┘
301
+ ```
302
+
303
+ ---
304
+
305
+ ## Implementation Priority
306
+
307
+ | Phase | Feature | Dependency | Scope |
308
+ |-------|---------|------------|-------|
309
+ | P0 | Debug refinement | None | rlp-desk.md, governance.md |
310
+ | P1 | Post-Run Report | None | rlp-desk.md |
311
+ | P2 | Self-Verification redesign | P1 | rlp-desk.md, init_ralph_desk.zsh, governance.md |
312
+ | P3 | Workflow Generation | P2 + real usage | TBD after iteration |
313
+
314
+ - P0 and P1 are independent, can be done in parallel.
315
+ - P2 builds on P1 (report triggers re-brainstorm question).
316
+ - P3 requires P2 (needs versioned self-verification data) + real-world testing.
317
+ - Breaking changes allowed (0.x semver). Document in CHANGELOG.
318
+
319
+ ---
320
+
321
+ ## Design Decisions Log (from Deep Interview)
322
+
323
+ | # | Decision | Rationale |
324
+ |---|----------|-----------|
325
+ | 1 | brainstorm auto-handles re-execution | Natural UX — same command, system detects context |
326
+ | 2 | Mixed judgment (quantitative + user) | Pure metrics miss quality nuance; pure subjective misses patterns |
327
+ | 3 | Breaking changes OK | 0.x semver, clean redesign over backward compat hacks |
328
+ | 4 | P3 essential but deferred | Core vision requires it, but form emerges from P0-P2 usage |
329
+ | 5 | Skill + command composition for P3 | Leverage existing ecosystem, not reinvent |
330
+ | 6 | All features in rlp-desk | Connected UX > separate tools |
331
+ | 7 | Shell script for deterministic ops | AI interpretation unreliable for file manipulation |
332
+ | 8 | SV remains optional | Simple tasks don't need re-execution overhead |
333
+ | 9 | Re-brainstorm only with SV | No SV = no improvement data = no point asking |
334
+ | 10 | Report always mandatory | Users need decision surface regardless of SV |
335
+ | 11 | PRD = single file, no versioning | Source of truth, updated in place |
336
+ | 12 | Only 3 files versioned | SV report + campaign report + debug.log |
337
+ | 13 | Report includes cost section | Token/time tracking for optimization decisions |
338
+ | 14 | Ask user intent on slug reuse | "Improve?" vs "Start fresh?" — don't assume |
339
+
340
+ ---
341
+
342
+ ## Open Questions (P3 only)
343
+
344
+ 1. **Workflow as subcommand or separate skill?** — defer to P0-P2 experience
345
+ 2. **Minimum SV versions for generation** — need real data to determine
346
+ 3. **Reproducibility validation method** — how to test generated workflow works
347
+ 4. **Skill composition mechanics** — how feedback loop enforcement works in practice
@@ -0,0 +1,53 @@
1
+ # Plan: Refactoring Execution Verification + v05-remaining Restart
2
+
3
+ ## Context
4
+ Engine path refactoring Phases 0-7 are complete (38 TDD structural tests pass).
5
+ However, **verification with an actual tmux run** has not been done yet. We need to confirm that the refactoring works correctly in a real campaign.
6
+
7
+ ## Verification Steps
8
+
9
+ ### Step 1: Clean up zombie runners + sentinels
10
+ ```bash
11
+ ps aux | grep run_ralph_desk | grep -v grep | awk '{print $2}' | xargs kill 2>/dev/null
12
+ for p in $(tmux list-panes -F '#{pane_id}' | grep -v '%360'); do tmux kill-pane -t "$p" 2>/dev/null; done
13
+ rm -f .claude/ralph-desk/memos/v05-remaining-blocked.md
14
+ rm -f .claude/ralph-desk/memos/v05-remaining-complete.md
15
+ rm -f .claude/ralph-desk/memos/v05-remaining-done-claim.json
16
+ rm -f .claude/ralph-desk/memos/v05-remaining-verify-verdict.json
17
+ rm -f .claude/ralph-desk/memos/v05-remaining-iter-signal.json
18
+ rm -f .claude/ralph-desk/logs/v05-remaining/session-config.json
19
+ ```
20
+
21
+ ### Step 2: Run the v05-remaining campaign (spark worker)
22
+ ```bash
23
+ LOOP_NAME="v05-remaining" ROOT="$PWD" MAX_ITER=15 \
24
+ WORKER_MODEL=gpt-5.3-codex-spark WORKER_ENGINE=codex \
25
+ WORKER_CODEX_MODEL=gpt-5.3-codex-spark WORKER_CODEX_REASONING=medium \
26
+ VERIFIER_MODEL=sonnet VERIFIER_ENGINE=claude \
27
+ VERIFY_MODE=per-us VERIFY_CONSENSUS=0 CB_THRESHOLD=6 \
28
+ ITER_TIMEOUT=600 DEBUG=1 WITH_SELF_VERIFICATION=1 \
29
+ zsh ~/.claude/ralph-desk/run_ralph_desk.zsh
30
+ ```
31
+ (run_in_background=true)
32
+
33
+ ### Step 3: Verification checklist
34
+ - [ ] 3 panes created (leader + worker + verifier)
35
+ - [ ] `codex exec` runs in the Worker pane (bash trigger, no false dead-pane detection)
36
+ - [ ] After the Worker finishes, heartbeat reports exited → signal is auto-generated
37
+ - [ ] Verifier (sonnet) starts normally + writes a verdict
38
+ - [ ] Progress reaches US-002 or beyond (US-001 was already verified earlier)
39
+ - [ ] No zombie runners (check with `ps`)
40
+
41
+ ### Step 4: On failure
42
+ - codex worker fails to start → inspect the trigger script contents + test by running it manually
43
+ - verifier timeout → tail the runner log + check pane state
44
+ - BLOCKED → analyze the sentinel for the cause + fix and retry
45
+
46
+ ### Step 5: On success
47
+ - Monitor campaign progress (check status)
48
+ - Wait for completion or hand off to the next session
49
+
50
+ ## Files
51
+ - `src/scripts/run_ralph_desk.zsh` — refactored runner
52
+ - `~/.claude/ralph-desk/run_ralph_desk.zsh` — locally synced copy
53
+ - `.claude/ralph-desk/logs/v05-remaining/` — campaign artifacts