@ai-dev-methodologies/rlp-desk 0.3.6 → 0.4.0
- package/docs/blueprints/blueprint-v0.4-evolution.md +347 -0
- package/docs/prompts/ralplan-codex-review.md +55 -0
- package/package.json +1 -1
- package/src/commands/rlp-desk.md +62 -22
- package/src/governance.md +39 -22
- package/src/scripts/init_ralph_desk.zsh +139 -4
- package/src/scripts/run_ralph_desk.zsh +358 -80
@@ -0,0 +1,347 @@
+# Blueprint: rlp-desk v0.4 Evolution
+
+> Design blueprint for rlp-desk's next major direction.
+> Status: CONFIRMED (Deep Interview 4.9% ambiguity) | Author: kyjin | Date: 2026-03-26
+
+---
+
+## Vision
+
+rlp-desk is both a **task execution tool** and a **workflow generator**.
+
+Users start with unstructured work, iterate through self-verification cycles,
+and the accumulated process naturally becomes a reusable, formalized workflow.
+
+```
+[Unstructured] [Structured]
+brainstorm → run → verify Workflow (skill + command composition)
+→ re-brainstorm → run → verify Feedback loop enforcement
+→ run → verify Reproducible process
+→ final result ──▶ (P3: determined after P0-P2 iteration)
+```
+
+---
+
+## 1. Debug (`--debug`) — Execution Trace
+
+### Purpose
+
+Trace rlp-desk's execution process. Not verbose data dumps — focused logging
+of whether rules were followed and options behaved as configured.
+
+### Two audiences
+
+| Audience | Use case |
+|----------|----------|
+| Developer (self) | Verify governance compliance, catch erroneous execution (e.g., codex consensus FAIL treated as PASS) |
+| External users (npm) | Run with `--debug`, attach debug.log + version to bug report |
+
+### Scope
+
+- Governance rule compliance (IL-1 through IL-5, checkpoint enforcement)
+- Option behavior verification (consensus, per-us, model routing)
+- Decision points (model upgrades, circuit breaker triggers)
+- NOT implementation details of Worker/Verifier content
+
+### Current state
+
+Basic implementation exists in v0.3.6 (debug.log with phase-level entries).
+Refine to match the scoped purpose above — no expansion needed, possibly trimming.
+
+### Versioning
+
+debug.log is versioned on re-execution: `debug-v1.log`, `debug-v2.log`, ...
+Preserved for bug tracking across versions.
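
The naming scheme above can be sketched as a small helper. This is an illustrative Python sketch, not the package's zsh implementation; the function name `next_debug_log_name` is hypothetical, and picking "highest existing version plus one" is an assumption about how the next suffix is chosen.

```python
import re

# Matches archived logs such as debug-v1.log, debug-v12.log
_VERSIONED = re.compile(r"^debug-v(\d+)\.log$")


def next_debug_log_name(existing_names):
    """Return the archive name to give the current debug.log.

    Assumption (not confirmed by the source): the next version is one
    more than the highest version already present, starting at 1.
    """
    versions = []
    for name in existing_names:
        match = _VERSIONED.match(name)
        if match:
            versions.append(int(match.group(1)))
    return f"debug-v{max(versions, default=0) + 1}.log"
```

Names that do not match the `debug-vN.log` pattern, including the live `debug.log`, are ignored when computing the next suffix.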
+
+---
+
+## 2. Self-Verification (`--with-self-verification`) — Quality Feedback Loop
+
+### Purpose
+
+Evaluate whether the AI implementation meets quality expectations.
+When the user is unsatisfied, self-verification becomes the **input for the next
+execution cycle** — same goal, improved strategy.
+
+### Status
+
+Remains an **optional flag** (`--with-self-verification`). Not always-on.
+Simple tasks don't need the re-execution cycle.
+
+### Current vs. New
+
+| Aspect | Current (v0.3.x) | New vision |
+|--------|-------------------|------------|
+| Timing | Post-campaign report only | Input for re-execution cycle |
+| Output | Static report | Living document that drives next iteration |
+| Scope | Analysis of what happened | Analysis + recommendations that reshape execution plan |
+| PRD | Not touched | May be refined based on self-verification findings |
+
+### Re-execution Cycle (same slug)
+
+```
+brainstorm("auth-refactor")
+→ init → run → self-verification-v1 + campaign-report-v1 + debug-v1
+│
+user: "not satisfied, re-run"
+│
+▼
+re-brainstorm("auth-refactor")
+│
+├─ PRD: single file, updated in place if needed (no versioning)
+├─ SV report: renamed to self-verification-v1.md (preserved)
+├─ Campaign report: renamed to campaign-report-v1.md (preserved)
+├─ Debug log: renamed to debug-v1.log (preserved)
+├─ Everything else: deleted (test-spec, prompts, context, memos, logs)
+└─ Re-brainstorm: informed by self-verification-v1
+│
+▼
+→ init → run → self-verification-v2 + campaign-report-v2 + debug-v2
+│
+...
+```
+
+### Versioning Rules
+
+When re-running the same slug:
+
+1. **PRD** — single file (`prd-<slug>.md`), updated in place if needed.
+No versioning. PRD is the single source of truth.
+
+2. **Versioned files (3 total)** — renamed with vN suffix before re-run:
+- `self-verification-report.md` → `self-verification-v1.md`
+- `campaign-report.md` → `campaign-report-v1.md`
+- `debug.log` → `debug-v1.log`
+
+3. **Everything else** — deleted. Next run regenerates them automatically:
+- `test-spec-<slug>.md`, `prompts/`, `context/`, `memos/`, `logs/<slug>/*`
+
+4. **Self-verification as the historical record** — each version's story
+is told by its self-verification report. No need to preserve iteration
+logs; SV summarizes what happened.
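
The four rules above can be summarized as a plan of file operations. The following Python sketch only describes the operations and performs no I/O; the function name `plan_reexecution` and the returned dictionary shape are hypothetical, and the file names are used exactly as listed above without assuming their directories.

```python
# Source name -> versioned name template (the three preserved files)
VERSIONED = {
    "self-verification-report.md": "self-verification-v{n}.md",
    "campaign-report.md": "campaign-report-v{n}.md",
    "debug.log": "debug-v{n}.log",
}


def plan_reexecution(slug, n):
    """Describe what happens to each artifact before re-run number n + 1.

    Pure function: it returns a plan and touches no files.
    """
    if n < 1:
        raise ValueError("version numbers start at 1")
    return {
        # Rule 1: the PRD is kept as a single, unversioned file
        "keep": [f"prd-{slug}.md"],
        # Rule 2: exactly three files are renamed with a vN suffix
        "rename": {src: dst.format(n=n) for src, dst in VERSIONED.items()},
        # Rule 3: everything else is deleted and regenerated by the next run
        "delete": [
            f"test-spec-{slug}.md",
            "prompts/",
            "context/",
            "memos/",
            f"logs/{slug}/*",
        ],
    }
```

A caller would presumably apply the renames before the deletes, so that the preserved files are not caught by the `logs/<slug>/*` cleanup; that ordering is an assumption, not something the rules state.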
+
+### Re-execution Detection
+
+When brainstorm detects an existing slug:
+- Ask the user: "Improve based on previous results, or start fresh?"
+- If improve: version existing files, carry forward PRD, re-brainstorm with SV context
+- If start fresh: clean everything (equivalent to `clean` + new brainstorm)
+
+### Implementation: Shell Script
+
+Deterministic file operations (rename, delete, version detection) go in
+`init_ralph_desk.zsh`. AI handles judgment (PRD refinement, strategy changes).
+
+---
+
+## 3. Post-Run Reporting — Mandatory Completion Report
+
+### Purpose
+
+After `run` completes, the user MUST receive a comprehensive, templated report
+before deciding next steps. This is the **decision surface** for whether to
+re-brainstorm or accept the result.
+
+### Trigger
+
+Mandatory after every `run` completion (COMPLETE, BLOCKED, or TIMEOUT).
+Not optional. Not skippable. Applies regardless of SV flag.
+
+### Output
+
+- **File**: `logs/<slug>/campaign-report.md` (versioned on re-execution)
+- **Screen**: Full report displayed to user
+
+### Report Template
+
+```markdown
+# Campaign Report: <slug>
+
+## Objective
+<from PRD>
+
+## Execution Summary
+| Metric | Value |
+|--------|-------|
+| Total iterations | N |
+| Outcome | COMPLETE / BLOCKED / TIMEOUT |
+| Worker model | sonnet / opus |
+| Verifier model | opus |
+| Duration | Xm Ys |
+
+## User Stories Status
+| US | Description | Status | Iterations | Notes |
+|----|-------------|--------|------------|-------|
+| US-001 | ... | PASS | 2 | — |
+| US-002 | ... | PASS | 4 | 2 fix rounds |
+
+## Verification Results
+- L1 (Unit): PASS — N tests, N assertions
+- L2 (Integration): PASS / N/A
+- L3 (E2E): PASS — input/output comparison
+- L4 (Deploy): N/A
+
+## Issues Encountered
+<failures, fix loops, model upgrades, escalations>
+
+## Cost & Performance
+| Role | Model | Tokens | Duration | Source |
+|------|-------|--------|----------|--------|
+| Worker | sonnet | N | Xm Ys | measured/estimated |
+| Verifier | opus | N | Xm Ys | measured/estimated |
+| **Total** | | **N** | **Xm Ys** | |
+
+## Self-Verification Summary (if enabled)
+<from self-verification report — strengths, weaknesses, recommendations>
+
+## Files Changed
+<git diff --stat summary>
+```
+
+### Post-Report Flow
+
+```
+[All runs]
+→ Report displayed + saved to file
+
+[SV enabled only]
+→ "Would you like to re-brainstorm to improve the result?"
+→ Yes: trigger re-execution cycle (§2)
+→ No: session ends
+
+[SV not enabled]
+→ Report displayed, session ends (no re-brainstorm question)
+```
+
+### Rules
+
+- Report content must reference actual data (status.json, iteration results,
+self-verification if available) — no fabrication
+- Template is fixed; sections may show "N/A" but cannot be omitted
+- Re-brainstorm question only appears when SV is enabled
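
The "sections cannot be omitted" rule lends itself to a mechanical check. Below is a minimal Python sketch, assuming a report is plain markdown whose sections are `## ` headings named as in the template above; the function name `missing_sections` is hypothetical and not part of rlp-desk.

```python
# Section titles taken from the report template, in template order
REQUIRED_SECTIONS = [
    "Objective",
    "Execution Summary",
    "User Stories Status",
    "Verification Results",
    "Issues Encountered",
    "Cost & Performance",
    "Self-Verification Summary",
    "Files Changed",
]


def missing_sections(report_text):
    """Return the required section titles that have no '## ' heading."""
    headings = [
        line[3:].strip()
        for line in report_text.splitlines()
        if line.startswith("## ")
    ]
    # Prefix match so "Self-Verification Summary (if enabled)" still counts
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(heading.startswith(section) for heading in headings)
    ]
```

A section whose body is just "N/A" passes this check, which matches the rule that a section may be empty of data but must still be present.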
+
+---
+
+## 4. Workflow Generation — From Ad-hoc to Reproducible (P3, deferred)
+
+### Purpose
+
+After multiple self-verification cycles produce a final result,
+automatically generate a **formalized workflow** that captures the
+proven process as a reusable, enforceable process.
+
+### Status: DEFERRED
+
+P3 design will be determined after P0-P2 are working and tested through
+real usage. The user will iterate with the developer to find the right form.
+
+### Known Direction
+
+- Output: **skill + command composition** (not rlp-desk PRD format)
+- Invoking the command triggers a combination of skills
+- The command structure enforces feedback loops
+- Leverage existing skill ecosystem (`/find-skills`, etc.)
+- Trust AI models + well-structured skills + checklist-managed feedback loops
+
+### Success Criteria (confirmed)
+
+- Process must be reproducible: same workflow → same quality of results
+- Code implementation may differ, but behavior/quality must be equivalent
+- Feedback loop + template documents as structural components
+
+### Open Questions (to resolve after P0-P2)
+
+1. Separate subcommand (`/rlp-desk workflow <slug>`) or independent skill?
+2. Minimum self-verification versions required (2? 3?)
+3. How to validate generated workflow actually reproduces results?
+
+---
+
+## Feature Relationship
+
+```
+┌──────────────────────────────────────────────────────────────┐
+│ rlp-desk execution │
+│ │
+│ brainstorm → init → run ──┐ │
+│ │ │
+│ ┌──────────────┼──────────────────┐ │
+│ │ │ │ │
+│ --debug --with-self-verification │ │
+│ (execution (quality evaluation) │ │
+│ trace) │ │ │
+│ │ ▼ │ │
+│ │ self-verification report │ │
+│ │ │ │ │
+│ ▼ ▼ │ │
+│ debug.log Post-Run Report (mandatory) │ │
+│ (versioned) + campaign-report.md │ │
+│ │ (versioned) │ │
+│ │ │ │ │
+│ │ [SV enabled?] │ │
+│ │ │ │ │ │
+│ │ Yes No │ │
+│ │ │ │ │ │
+│ │ ▼ ▼ │ │
+│ │ "Re-brainstorm?" End │ │
+│ │ │ │ │ │
+│ │ Yes No │ │
+│ │ │ │ │ │
+│ │ ▼ ▼ │ │
+│ │ Re-execute Accept │ │
+│ │ (vN+1) result │ │
+│ │ │ │ │ │
+│ │ │ ┌────┘ │ │
+│ │ ▼ ▼ │ │
+│ │ Workflow Generation (P3) │ │
+│ │ (deferred — after P0-P2) │ │
+│ │ │ │
+│ └── Bug report (external users) │ │
+└──────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Implementation Priority
+
+| Phase | Feature | Dependency | Scope |
+|-------|---------|------------|-------|
+| P0 | Debug refinement | None | rlp-desk.md, governance.md |
+| P1 | Post-Run Report | None | rlp-desk.md |
+| P2 | Self-Verification redesign | P1 | rlp-desk.md, init_ralph_desk.zsh, governance.md |
+| P3 | Workflow Generation | P2 + real usage | TBD after iteration |
+
+- P0 and P1 are independent, can be done in parallel.
+- P2 builds on P1 (report triggers re-brainstorm question).
+- P3 requires P2 (needs versioned self-verification data) + real-world testing.
+- Breaking changes allowed (0.x semver). Document in CHANGELOG.
+
+---
+
+## Design Decisions Log (from Deep Interview)
+
+| # | Decision | Rationale |
+|---|----------|-----------|
+| 1 | brainstorm auto-handles re-execution | Natural UX — same command, system detects context |
+| 2 | Mixed judgment (quantitative + user) | Pure metrics miss quality nuance; pure subjective misses patterns |
+| 3 | Breaking changes OK | 0.x semver, clean redesign over backward compat hacks |
+| 4 | P3 essential but deferred | Core vision requires it, but form emerges from P0-P2 usage |
+| 5 | Skill + command composition for P3 | Leverage existing ecosystem, not reinvent |
+| 6 | All features in rlp-desk | Connected UX > separate tools |
+| 7 | Shell script for deterministic ops | AI interpretation unreliable for file manipulation |
+| 8 | SV remains optional | Simple tasks don't need re-execution overhead |
+| 9 | Re-brainstorm only with SV | No SV = no improvement data = no point asking |
+| 10 | Report always mandatory | Users need decision surface regardless of SV |
+| 11 | PRD = single file, no versioning | Source of truth, updated in place |
+| 12 | Only 3 files versioned | SV report + campaign report + debug.log |
+| 13 | Report includes cost section | Token/time tracking for optimization decisions |
+| 14 | Ask user intent on slug reuse | "Improve?" vs "Start fresh?" — don't assume |
+
+---
+
+## Open Questions (P3 only)
+
+1. **Workflow as subcommand or separate skill?** — defer to P0-P2 experience
+2. **Minimum SV versions for generation** — need real data to determine
+3. **Reproducibility validation method** — how to test generated workflow works
+4. **Skill composition mechanics** — how feedback loop enforcement works in practice
@@ -0,0 +1,55 @@
+# Ralplan + Codex Cross-Validation
+
+```
+/ralplan {{OBJECTIVE}}
+{{SCOPE}}
+Run codex cross-validation after consensus. Repeat revise -> consensus -> codex until 0 issues.
+If source documents are insufficient, identify gaps before proceeding.
+```
+
+---
+
+## Placeholders
+
+| Placeholder | Required | Description |
+|---|---|---|
+| `{{OBJECTIVE}}` | Yes | Planning goal. Naturally includes target documents, deliverable type, and context. |
+| `{{SCOPE}}` | No | Files or directories to change. Omit to target the entire current project. |
+
+## Prerequisites
+
+- [oh-my-claudecode](https://github.com/anthropics/oh-my-claudecode) installed and configured
+- Codex CLI installed and authenticated (`codex --version`)
+- Deep interview (`/deep-interview`) completed beforehand to clarify requirements
+
+## How It Works
+
+1. `/ralplan` runs the Planner -> Architect -> Critic consensus loop (built-in)
+2. After consensus APPROVE, codex independently reviews the plan
+3. If codex finds issues, the plan is revised and re-enters consensus
+4. Repeats until codex returns 0 issues
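
The four steps above form a loop that can be sketched as follows. This is a hedged Python illustration of the control flow only: `run_consensus`, `run_codex_review`, and `revise` are hypothetical callables standing in for the `/ralplan` consensus loop, the codex review, and the revision step, and the `max_rounds` cap is an added safety assumption that the source does not mention.

```python
def plan_until_clean(plan, run_consensus, run_codex_review, revise, max_rounds=10):
    """Repeat consensus -> codex review -> revise until codex reports 0 issues.

    Returns the approved plan and the number of rounds it took.
    """
    for round_no in range(1, max_rounds + 1):
        # Steps 1-2: consensus first, then an independent codex review
        plan = run_consensus(plan)
        issues = run_codex_review(plan)
        if not issues:
            # Step 4: stop once codex returns 0 issues
            return plan, round_no
        # Step 3: revise, then re-enter consensus on the next pass
        plan = revise(plan, issues)
    raise RuntimeError(f"codex still reported issues after {max_rounds} rounds")
```

Passing the three steps in as callables keeps the sketch independent of how the real tools are invoked.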
+
+## Why
+
+Iterating ralplan consensus with codex cross-validation produces plans robust enough
+to leverage rlp-desk's verification loop (self-verification, post-run report)
+effectively. Plans that survive both reviewers have fewer surprises during execution.
+
+## Examples
+
+### With scope
+
+```
+/ralplan Implementation plan based on blueprint-v0.4
+src/commands/rlp-desk.md, src/governance.md, src/scripts/init_ralph_desk.zsh
+Run codex cross-validation after consensus. Repeat revise -> consensus -> codex until 0 issues.
+If source documents are insufficient, identify gaps before proceeding.
+```
+
+### Without scope (defaults to current project)
+
+```
+/ralplan Auth module refactoring strategy
+Run codex cross-validation after consensus. Repeat revise -> consensus -> codex until 0 issues.
+If source documents are insufficient, identify gaps before proceeding.
+```
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
 "name": "@ai-dev-methodologies/rlp-desk",
-"version": "0.
+"version": "0.4.0",
 "description": "Fresh-context iterative loops for Claude Code — autonomous task completion with independent verification",
 "scripts": {
 "postinstall": "node scripts/postinstall.js",
package/src/commands/rlp-desk.md
CHANGED
@@ -62,7 +62,11 @@ After all items are confirmed:
 Present the score table to the user before proceeding.
 2. Present the full contract summary.
 3. **Self-Verification** — Ask: "Enable self-verification? Worker records step-by-step evidence, Verifier cross-validates process. Recommended for MEDIUM+ risk." Default: yes for HIGH/CRITICAL, no for LOW/MEDIUM.
-4.
+4. **Re-execution check**: After slug is confirmed, check if `.claude/ralph-desk/plans/prd-<slug>.md` already exists. If a PRD already exists for this slug, ask: "A PRD already exists for this slug. Improve the existing PRD or start fresh (delete and recreate)?"
+- "improve" → pass `--mode improve` to init
+- "start fresh" → pass `--mode fresh` to init
+- If no PRD exists: standard first-run (no --mode needed)
+5. On approval, offer to run `init`.
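
The re-execution check in step 4 maps the user's answer onto the `--mode` argument of the init script. A minimal Python sketch of that mapping follows; the function name `init_command` is hypothetical, the script path is the one given in the `init` section of this file, and the sketch only builds an argument list without running anything.

```python
INIT_SCRIPT = "~/.claude/ralph-desk/init_ralph_desk.zsh"

# User answer -> value passed to --mode
ANSWER_TO_MODE = {"improve": "improve", "start fresh": "fresh"}


def init_command(slug, objective, prd_exists, answer=None):
    """Build the init argument list for a first run or a re-execution."""
    command = [INIT_SCRIPT, slug, objective]
    if not prd_exists:
        # Standard first run: no --mode needed
        return command
    if answer not in ANSWER_TO_MODE:
        # Never assume intent when a PRD already exists for the slug
        raise ValueError("expected 'improve' or 'start fresh'")
    return command + ["--mode", ANSWER_TO_MODE[answer]]
```

The leading `~` would still need to be expanded by a shell or by `os.path.expanduser` before the list is executed.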
 
 Do NOT create files during brainstorm.
 Do NOT auto-decide iteration unit — the user MUST explicitly choose.
@@ -71,7 +75,7 @@ Do NOT auto-decide iteration unit — the user MUST explicitly choose.
 
 ## `init <slug> [objective]`
 
-Run: `~/.claude/ralph-desk/init_ralph_desk.zsh <slug> "<objective>"`
+Run: `~/.claude/ralph-desk/init_ralph_desk.zsh <slug> "<objective>" [--mode fresh|improve]`
 If brainstorm was done, auto-fill PRD and test-spec with the results.
 
 **After init completes, STOP. Do NOT auto-run the loop.**
@@ -133,6 +137,8 @@ Options (parse from `$ARGUMENTS`):
 - `--consensus-scope all|final-only` — when consensus runs (default: `all`)
 - `all`: consensus runs on every verify (current behavior)
 - `final-only`: consensus only on final ALL verify
+- `--cb-threshold N` — circuit breaker threshold: consecutive failures before BLOCKED (default: 3). When `--verify-consensus` is active, effective threshold is automatically doubled (e.g., default becomes 6).
+- `--iter-timeout N` — per-iteration timeout in seconds (default: 600). Enforced in tmux mode only. Agent mode: not enforced (Agent() has no timeout API).
 - `--debug` — enable debug logging (writes to logs/<slug>/debug.log)
 - `--with-self-verification` — enable campaign-level self-verification analysis. After COMPLETE, Leader analyzes all iteration records (done-claims + verdicts) and generates a campaign self-verification summary with patterns and recommendations for next planning cycle. (Note: execution_steps and reasoning are ALWAYS recorded per governance §1f — this flag adds post-campaign analysis.)
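
The threshold doubling described for `--cb-threshold` can be made concrete with a short worked example. This Python sketch is illustrative only; the function names are hypothetical, and the comparison `consecutive_failures >= threshold` is assumed from the phrase "consecutive failures before BLOCKED".

```python
def effective_cb_threshold(cb_threshold=3, verify_consensus=False):
    """Return the threshold actually applied to consecutive failures."""
    if cb_threshold < 1:
        raise ValueError("cb_threshold must be a positive integer")
    # With consensus verification active, the threshold is doubled
    return cb_threshold * 2 if verify_consensus else cb_threshold


def breaker_tripped(consecutive_failures, cb_threshold=3, verify_consensus=False):
    """True once the failure streak reaches the effective threshold."""
    return consecutive_failures >= effective_cb_threshold(cb_threshold, verify_consensus)
```

With the defaults, the breaker trips at 3 consecutive failures, or at 6 when `--verify-consensus` is active.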
 
@@ -164,7 +170,10 @@ VERIFIER_CODEX_REASONING=<--verifier-codex-reasoning value, default: high> \
 VERIFY_MODE=<--verify-mode value, default: per-us> \
 VERIFY_CONSENSUS=<1 if --verify-consensus, else 0> \
 CONSENSUS_SCOPE=<--consensus-scope value, default: all> \
+CB_THRESHOLD=<--cb-threshold value, default: 3> \
+ITER_TIMEOUT=<--iter-timeout value, default: 600> \
 DEBUG=<1 if --debug, else 0> \
+WITH_SELF_VERIFICATION=<1 if --with-self-verification, else 0> \
 zsh ~/.claude/ralph-desk/run_ralph_desk.zsh
 ```
 6. **If the script exits with error (exit code 1)** — report the error to the user and STOP. Do NOT attempt to work around it. Do NOT create tmux sessions yourself. Do NOT re-launch the script in a different way. Just tell the user what went wrong and suggest using Agent mode instead.
@@ -174,6 +183,7 @@ DEBUG=<1 if --debug, else 0> \
 - Tmux mode requires the user to already be inside a tmux session. If the runner script rejects because $TMUX is not set, do NOT try to create a tmux session yourself. Tell the user: "Start tmux first, then retry."
 - Do NOT run the script in background (`&`, `run_in_background`). The script must run in foreground so panes remain visible to the user. The user needs to see Worker/Verifier panes in real-time.
 - Do NOT kill panes after completion. Panes stay alive for inspection. User cleans up with `/rlp-desk clean <slug> --kill-session`.
+- `--with-self-verification` is accepted in tmux mode but SV report generation is Agent-mode only (requires AI analysis). In tmux mode, the flag is recorded in session-config for post-hoc analysis. Use Agent mode for full SV report generation.
 
 #### Agent Mode (`--mode agent` or default)
 
@@ -183,6 +193,10 @@ DEBUG=<1 if --debug, else 0> \
 3. Clean previous `done-claim.json`, `verify-verdict.json`.
 4. **Always**: write baseline log entry to `.claude/ralph-desk/logs/<slug>/baseline.log`: `[timestamp] iter=0 phase=start slug=<slug> worker_model=<model> verifier_model=<model>`. Baseline.log captures 1 line per iteration for lightweight post-mortem (always-on, no flag needed).
 5. If `--debug`: also create/clear `logs/<slug>/debug.log`. Define a helper: to "debug_log" means append a timestamped line to this file via `Bash("echo \"[$(date '+%Y-%m-%d %H:%M:%S')] $msg\" >> .claude/ralph-desk/logs/<slug>/debug.log")`. When `--debug` is active, debug.log contains all baseline.log fields plus detailed phase logs.
+- **4-category log system**: all debug_log entries use exactly one of: `[GOV]` (governance checks: IL enforcement, CB triggers, scope lock, verdict evaluation), `[DECIDE]` (leader decisions: model selection, fix contracts, escalation), `[OPTION]` (configuration snapshot at loop start: thresholds, modes, models), `[FLOW]` (execution progress: worker/verifier dispatch, signal reads, phase transitions)
+- **Re-execution versioning**: If `debug.log` already exists at `--debug` start, rename it to `debug-v{N}.log` (N = next available integer ≥ 1) before creating a fresh `debug.log`.
+- **baseline.log lifecycle**: baseline.log is deleted on re-execution (when `init --mode improve` or `init --mode fresh` is run).
+6. Capture baseline commit: `Bash("git rev-parse HEAD 2>/dev/null || echo none")` → store as `BASELINE_COMMIT`. Include in the first `status.json` write as `baseline_commit` field.
 
 ### Leader Loop
 
@@ -194,7 +208,10 @@ DEBUG=<1 if --debug, else 0> \
 - **After every step result, IMMEDIATELY start the next step's tool call in the SAME response.** For example, after reading the verdict (⑦c), report via Bash("echo") AND start ⑧'s tool calls in one response.
 - If you output "Iter 1 complete, moving to iter 2" as plain text without a tool call, the turn terminates and the loop breaks. This is a platform constraint, not a compliance issue — no amount of "DO NOT STOP" text can override it.
 
-If `--debug`, at loop start debug_log
+If `--debug`, at loop start debug_log the following (3 [OPTION] entries):
+- `[OPTION] slug=<slug> max_iter=<N> verify_mode=<mode> consensus=<0|1> consensus_scope=<scope>`
+- `[OPTION] cb_threshold=<N> effective_cb_threshold=<N>`
+- `[OPTION] worker_engine=<engine> worker_model=<model> verifier_engine=<engine> verifier_model=<model>`
 
 For each iteration (1 to max_iter):
 
@@ -213,13 +230,13 @@ rm -f .claude/ralph-desk/memos/<slug>-verify-verdict.json
 **② Read memory.md** → Stop Status, Next Iteration Contract
 - Also read **Completed Stories** → verified work so far
 - Also read **Key Decisions** → settled architectural choices
-- If `--debug`: debug_log `[
+- If `--debug`: debug_log `[FLOW] iter=N phase=read_memory stop_status=<status> contract="<summary>"`
 
 **③ Decide model** (§4 of governance.md)
 - Previous iteration failed → upgrade model
 - Simple task → downgrade
 - User specified → use that
-- If `--debug`: debug_log `[
+- If `--debug`: debug_log `[DECIDE] iter=N phase=model_select worker_model=<model> reason=<reason>`
 
 **④ Build worker prompt (Prompt Assembly Protocol)**
 1. Capture `WORKING_DIR` once: use `$PWD` from when `/rlp-desk run` was invoked. Store for all prompt construction.
@@ -232,11 +249,11 @@ rm -f .claude/ralph-desk/memos/<slug>-verify-verdict.json
 
 **④½ Contract review** (agent mode only)
 - Before dispatching Worker, spawn a lightweight review: "Is this iteration contract sufficient to achieve the US's AC? Any missing steps?"
-- If `--debug`: debug_log `[
-- In tmux mode: skip (shell leader cannot reason). Log: `[
+- If `--debug`: debug_log `[GOV] iter=N phase=contract_review scope_lock=<us_id|null> ac_count=<N> result=<ok|issues>`
+- In tmux mode: skip (shell leader cannot reason). Log: `[FLOW] iter=N phase=contract_review skipped=tmux_mode`
 
 **⑤ Execute Worker**
-- If `--debug`: debug_log `[
+- If `--debug`: debug_log `[FLOW] iter=N phase=worker engine=<engine> model=<model> dispatched=true`
 
 If `--worker-engine claude` (default):
 ```
@@ -258,14 +275,13 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
 - Codex runs as a subprocess via Bash(), not Agent().
 - Each Bash() call = fresh context for codex.
 
-- If `--debug`: debug_log `[EXEC] iter=N phase=worker_done engine=<engine>`
 
 **⑥ Read memory.md again** (Worker updated it)
 - `stop=continue` → go to ⑧
 - `stop=verify` → go to ⑦
 - `stop=blocked` → write BLOCKED sentinel, stop
 - Also read `iter-signal.json` for `us_id` field (which US was just completed)
-- If `--debug`: debug_log `[
+- If `--debug`: debug_log `[FLOW] iter=N phase=worker_done_signal engine=<engine> status=<stop_status> us_id=<us_id>`
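
The routing in step ⑥ amounts to a three-way lookup on the stop status. The Python sketch below is illustrative; the function name `route_after_worker` is hypothetical, and rejecting unknown statuses with an error is an assumption, since the source does not say how an unexpected value is handled.

```python
# stop status read from memory.md -> next step of the Leader loop
ROUTES = {
    "continue": "⑧",        # skip verification, write the result log
    "verify": "⑦",          # dispatch the Verifier
    "blocked": "BLOCKED",   # write the BLOCKED sentinel and stop
}


def route_after_worker(stop_status):
    """Return the next step for a given stop status."""
    try:
        return ROUTES[stop_status]
    except KeyError:
        # Assumption: an unrecognized status is treated as an error
        raise ValueError(f"unknown stop status: {stop_status!r}") from None
```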
|
|
269
285
|
|
|
270
286
|
**CRITICAL: Immediately proceed to ⑦. Do NOT pause, do NOT ask the user, do NOT wait for confirmation. The loop is autonomous.**
|
|
271
287
|
|
|
@@ -291,7 +307,7 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
|
|
|
291
307
|
**⑦a Dispatch Verifier**
|
|
292
308
|
- Note: Verifier ALWAYS records reasoning in verify-verdict.json per governance §1f. No flag needed.
|
|
293
309
|
- **Prompt Assembly Protocol (same as ④)**: Read verifier prompt file verbatim. Prepend `## WORKING_DIR: {absolute path}`. Do NOT rewrite paths.
|
|
294
|
-
- If `--debug`: debug_log `[
|
|
310
|
+
- If `--debug`: debug_log `[FLOW] iter=N phase=verifier engine=<engine> model=<model> scope=<us_id> dispatched=true`
|
|
295
311
|
|
|
296
312
|
If `--verifier-engine claude` (default):
|
|
297
313
|
```
|
|
@@ -316,7 +332,7 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
|
|
|
316
332
|
- Both produce `verify-verdict.json` (Leader renames to `verify-verdict-claude.json` and `verify-verdict-codex.json`)
|
|
317
333
|
- **Both pass** → proceed (next US or COMPLETE)
|
|
318
334
|
- **Either fails** → combine issues from both verdicts into a single fix contract → Worker retry
|
|
319
|
-
- Max
|
|
335
|
+
- Max 6 consensus rounds per US. After 6 rounds → BLOCKED.
|
|
320
336
|
|
|
321
337
|
**NO ENGINE PRIORITY (ABSOLUTE):** There is no primary or secondary engine. Claude and Codex have EQUAL weight. If one passes and the other fails, the verdict is FAIL — always. The Leader MUST NOT override, prioritize, or dismiss either engine's verdict. "Claude priority", "primary engine override", "infrastructure failure" (when a valid verdict file exists), or any similar rationalization = governance violation. Infrastructure failure means ONLY: CLI crash (exit ≠ 0), timeout, or verdict file not generated.
|
|
322
338
|
|
|
@@ -330,22 +346,26 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
|
|
|
330
346
|
3. Include `fix_hint` values labeled `(suggestion, non-authoritative)` if present
|
|
331
347
|
4. Include impacted tests from test-spec (so Worker can run them before and after the fix)
|
|
332
348
|
5. Increment `consecutive_failures` in `status.json`
|
|
333
|
-
6. If `consecutive_failures >=
|
|
349
|
+
6. If `consecutive_failures >= cb_threshold` for same US → **Architecture Escalation** (governance §7¾): stop fixing, report to user
|
|
350
|
+
- If `--debug`: debug_log `[GOV] iter=N phase=CB_trigger consecutive_failures=<N> us_id=<us_id> action=architecture_escalation`
|
|
334
351
|
7. Go to ⑧ with fix contract as next Worker contract
|
|
335
352
|
- `request_info` → Leader reads Verifier's questions, decides outcome (or relays to Worker in next contract) → go to ⑧
|
|
336
353
|
- `blocked` → write BLOCKED sentinel, stop
|
|
337
|
-
- If `--debug`: debug_log `[
|
|
338
|
-
- If `--debug`: debug_log `[
|
|
339
|
-
|
|
340
|
-
|
|
341
|
-
|
|
354
|
+
- If `--debug`: debug_log `[GOV] iter=N phase=verdict engine=<engine> verdict=<pass|fail|request_info> us_id=<us_id> L1=<status> L2=<status> L3=<status> L4=<status>`
|
|
355
|
+
- If `--debug`: debug_log `[GOV] iter=N phase=sufficiency test_count=<N> ac_count=<N> ratio=<N> verdict=<pass|fail>`
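The failure-handling steps above (increment `consecutive_failures`, compare against `cb_threshold`, emit the `CB_trigger` debug line) can be sketched as below. This is a hedged illustration only: `record_failure` is a hypothetical helper, `status` is assumed to be the parsed `status.json` as a dict, and the counter is assumed to be reset elsewhere when the US changes.

```python
def record_failure(status, us_id, iteration, cb_threshold=3):
    """Increment the failure counter and decide the next action.

    Returns (action, debug_line). debug_line is None unless the circuit
    breaker trips; the caller would write it only when --debug is set.
    The default of 3 follows the --cb-threshold help text.
    """
    status["consecutive_failures"] = status.get("consecutive_failures", 0) + 1
    n = status["consecutive_failures"]
    if n >= cb_threshold:
        line = (
            f"[GOV] iter={iteration} phase=CB_trigger "
            f"consecutive_failures={n} us_id={us_id} "
            "action=architecture_escalation"
        )
        return "architecture_escalation", line
    # Below the threshold: continue with a fix contract for the Worker.
    return "fix_contract", None
```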
|
|
356
|
+
|
|
357
|
+
**⑦d Archive iteration artifacts** (always — independent of --debug)
|
|
358
|
+
After reading the verdict, archive to `logs/<slug>/`:
|
|
359
|
+
- `iter-NNN-done-claim.json` ← copy from `memos/<slug>-done-claim.json`
|
|
360
|
+
- `iter-NNN-verify-verdict.json` ← copy from `memos/<slug>-verify-verdict.json`
|
|
361
|
+
(Preserved across clean; data source for campaign report generation and SV analysis.)
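The archive step ⑦d amounts to two file copies per iteration. A minimal sketch, assuming `desk` points at the ralph-desk directory that holds `memos/` and `logs/`, that `NNN` is a zero-padded three-digit iteration number, and that a missing memo is skipped rather than treated as an error (none of these details are confirmed by the spec; `archive_iteration` is a hypothetical name):

```python
import shutil
from pathlib import Path


def archive_iteration(desk: Path, slug: str, iteration: int) -> list:
    """Copy the done-claim and verdict memos into logs/<slug>/ as iter-NNN-*.json."""
    logs = desk / "logs" / slug
    logs.mkdir(parents=True, exist_ok=True)
    archived = []
    for kind in ("done-claim", "verify-verdict"):
        src = desk / "memos" / f"{slug}-{kind}.json"
        if src.exists():  # assumption: skip silently when the memo is absent
            dst = logs / f"iter-{iteration:03d}-{kind}.json"
            shutil.copy2(src, dst)  # copy, not move: the memo stays in place
            archived.append(dst)
    return archived
```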
|
|
342
362
|
|
|
343
363
|
**CRITICAL: Immediately proceed to ⑧. Do NOT pause, do NOT ask the user. Continue the loop.**
|
|
344
364
|
|
|
345
365
|
**⑧ Write result log and report to user, continue loop**
|
|
346
366
|
- Write `logs/<slug>/iter-NNN.result.md`:
|
|
347
367
|
- Result status `[leader-measured]`
|
|
348
|
-
- Files changed via `git diff --stat HEAD
|
|
368
|
+
- Files changed: cumulative working tree state via `git diff --stat HEAD` `[git-measured]` (note: cumulative in tmux mode, not per-iteration delta)
|
|
349
369
|
- Verifier verdict `[leader-measured]`
|
|
350
370
|
- **Record cost & performance per iteration**:
|
|
351
371
|
- Agent mode: record `total_tokens` and `duration_ms` from Agent() return metadata for both Worker and Verifier
|
|
@@ -354,10 +374,10 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
|
|
|
354
374
|
- Write `status.json`
|
|
355
375
|
- Report via tool call: `Bash("echo 'Iter N | US-NNN | verdict | model | next_action'")` — NEVER plain text. This keeps the turn alive for the next iteration.
|
|
356
376
|
- **Always**: append to baseline.log: `[timestamp] iter=N verdict=<pass|fail|continue> us=<us_id> model=<worker_model>`
|
|
357
|
-
- If `--debug`: debug_log `[
|
|
377
|
+
- If `--debug`: debug_log `[FLOW] iter=N phase=result status=<result> consecutive_failures=<N> verified_us=<list>`
|
|
358
378
|
|
|
359
379
|
At loop end (COMPLETE, BLOCKED, or TIMEOUT):
|
|
360
|
-
- If `--debug`: debug_log `[
|
|
380
|
+
- If `--debug`: debug_log `[FLOW] result=<COMPLETE|BLOCKED|TIMEOUT> iterations=<N> verified_us=<list>`
|
|
361
381
|
|
|
362
382
|
**⑨ Campaign Self-Verification** (when `--with-self-verification` is enabled):
|
|
363
383
|
|
|
@@ -421,6 +441,23 @@ PRD, and test-spec. Information from source code inspection that is not in these
|
|
|
421
441
|
or explicitly marked as "[source-inspection]" with justification.
|
|
422
442
|
```
|
|
423
443
|
|
|
444
|
+
**⑩ Campaign Report** (always — independent of `--debug` and `--with-self-verification`)
|
|
445
|
+
|
|
446
|
+
After the loop ends (COMPLETE, BLOCKED, or TIMEOUT), generate `logs/<slug>/campaign-report.md`:
|
|
447
|
+
|
|
448
|
+
1. If `campaign-report.md` already exists, rename it to `campaign-report-v{N}.md` (N = next available integer ≥ 1) before writing the new report.
|
|
449
|
+
2. Generate report with 8 required sections:
|
|
450
|
+
- **Objective**: From PRD
|
|
451
|
+
- **Execution Summary**: Iterations run, terminal state (COMPLETE/BLOCKED/TIMEOUT), elapsed time
|
|
452
|
+
- **US Status**: Each US with final verified/failed/pending status (from `status.json`)
|
|
453
|
+
- **Verification Results**: Per-US and final verify outcomes (from archived iter artifacts)
|
|
454
|
+
- **Issues Encountered**: Fix contracts and failure verdicts from campaign
|
|
455
|
+
- **Cost & Performance**: Per-iter token/duration data from `status.json`
|
|
456
|
+
- **SV Summary**: If `--with-self-verification` ran, pointer to SV report file; otherwise "N/A — --with-self-verification not enabled"
|
|
457
|
+
- **Files Changed**: `git diff --stat <baseline_commit>` (working tree vs baseline, includes uncommitted changes and untracked files). Note: may include pre-existing uncommitted changes if the campaign started in a dirty worktree.
|
|
458
|
+
3. Data sources: `status.json` (baseline_commit, per-iter data), archived `iter-NNN-done-claim.json` / `iter-NNN-verify-verdict.json`, PRD, git diff.
|
|
459
|
+
4. If `--with-self-verification` was enabled: ⑨ SV report runs first, then ⑩ Campaign Report (which includes the SV Summary section pointing to the SV report file).
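Step 1 of ⑩ (versioning an existing report before writing a new one) can be sketched as below. This assumes "next available integer ≥ 1" means the smallest `N` for which `campaign-report-v{N}.md` does not yet exist; `rotate_campaign_report` is a hypothetical helper name, not part of rlp-desk.

```python
from pathlib import Path


def rotate_campaign_report(log_dir: Path) -> Path:
    """Rename an existing campaign-report.md out of the way.

    Returns the path where the new report should be written.
    """
    current = log_dir / "campaign-report.md"
    if current.exists():
        n = 1
        # Find the first unused version suffix, starting at 1.
        while (log_dir / f"campaign-report-v{n}.md").exists():
            n += 1
        current.rename(log_dir / f"campaign-report-v{n}.md")
    return current
```

Because the versioned files are preserved across clean, repeated campaigns on the same slug accumulate `-v1`, `-v2`, and so on under this reading.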
|
|
460
|
+
|
|
424
461
|
### Circuit Breaker
|
|
425
462
|
- context-latest.md unchanged 3 iterations → BLOCKED
|
|
426
463
|
- Same acceptance criterion fails 2 consecutive iterations → upgrade model, retry once, then BLOCKED
|
|
@@ -441,6 +478,7 @@ When `--verify-consensus` is enabled, also track in `status.json`:
|
|
|
441
478
|
- YOU track iteration count
|
|
442
479
|
- Write `status.json` after each iteration
|
|
443
480
|
- Worker claim ≠ complete. Only YOU write COMPLETE sentinel after verifier pass.
|
|
481
|
+
- **NEVER modify rlp-desk infrastructure files** (`~/.claude/ralph-desk/*`, `~/.claude/commands/rlp-desk.md`). If you or a Worker/Verifier discovers a bug in rlp-desk itself, write BLOCKED sentinel with reason `"rlp-desk bug: <description>"` and STOP. Do NOT attempt to fix rlp-desk — report the bug to the user.
|
|
444
482
|
|
|
445
483
|
---
|
|
446
484
|
|
|
@@ -463,7 +501,7 @@ Remove:
|
|
|
463
501
|
- `.claude/ralph-desk/logs/<slug>/worker-heartbeat.json`
|
|
464
502
|
- `.claude/ralph-desk/logs/<slug>/verifier-heartbeat.json`
|
|
465
503
|
- `.claude/ralph-desk/memos/<slug>-escalation.md`
|
|
466
|
-
Note: `logs/<slug>/self-verification-data.json
|
|
504
|
+
Note: `logs/<slug>/self-verification-data.json`, `self-verification-report-NNN.md`, `campaign-report.md`, `campaign-report-v{N}.md`, `iter-NNN-done-claim.json`, and `iter-NNN-verify-verdict.json` are intentionally preserved across clean for historical comparison.
|
|
467
505
|
|
|
468
506
|
If `--kill-session` is passed, clean up ALL tmux artifacts:
|
|
469
507
|
```bash
|
|
@@ -503,6 +541,8 @@ Run options:
|
|
|
503
541
|
--verify-mode per-us|batch Verification strategy (default: per-us)
|
|
504
542
|
--verify-consensus Cross-engine consensus verification
|
|
505
543
|
--consensus-scope SCOPE When consensus runs: all|final-only (default: all)
|
|
544
|
+
--cb-threshold N CB threshold: consecutive failures before BLOCKED (default: 3)
|
|
545
|
+
--iter-timeout N Per-iteration timeout in seconds, tmux mode only (default: 600)
|
|
506
546
|
--debug Debug logging (logs/<slug>/debug.log)
|
|
507
547
|
--with-self-verification Campaign self-verification analysis (post-loop report)
|
|
508
548
|
```
|
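For readers who want to see how the two new run options and their defaults fit together, here is a small parser sketch. It is illustrative only: the real CLI is implemented in zsh, and `build_parser` and the `prog` name are assumptions. Only the flag names and defaults (3 and 600) come from the help text above.

```python
import argparse


def build_parser():
    """Hypothetical parser covering the options added in this hunk."""
    p = argparse.ArgumentParser(prog="rlp-desk-run-sketch")
    p.add_argument("--cb-threshold", type=int, default=3, metavar="N",
                   help="consecutive failures before the circuit breaker trips")
    p.add_argument("--iter-timeout", type=int, default=600, metavar="N",
                   help="per-iteration timeout in seconds (tmux mode only)")
    p.add_argument("--debug", action="store_true",
                   help="write debug logging to logs/<slug>/debug.log")
    return p
```

For example, `build_parser().parse_args(["--cb-threshold", "5"])` yields `cb_threshold == 5` while `iter_timeout` keeps its default of 600.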