@ai-dev-methodologies/rlp-desk 0.3.0 → 0.3.2

package/README.md CHANGED
@@ -40,9 +40,11 @@ curl -sSL https://raw.githubusercontent.com/ai-dev-methodologies/rlp-desk/main/i
 
  You'll be asked to confirm each item:
  - **Slug** — project identifier
- - **User Stories** — discrete, testable units of work
+ - **User Stories** — discrete, testable units with Given/When/Then acceptance criteria
+ - **Task Type & Risk Level** — code/visual/content/integration/infra × LOW/MEDIUM/HIGH/CRITICAL
  - **Iteration Unit** — one story per iteration (incremental) or all at once (fast)
  - **Verification Commands** — how to check the work
+ - **Ambiguity Gate** — AC quality scoring (IL-2, 0-12 scale, blocks init if < 6)
  - **Models** — which Claude model for Worker/Verifier
 
  ### 3. Run
@@ -97,12 +99,38 @@ for iteration in 1..max_iter:
  8. Update status, report to user, continue or stop
  ```
 
+ ### Verification Policy (v0.3.0)
+
+ RLP Desk enforces a comprehensive verification policy defined in `governance.md`:
+
+ **Iron Laws (§1a)** — 4 absolute rules that cannot be violated:
+ - **IL-1**: No completion claims without fresh verification evidence
+ - **IL-2**: No init without AC quality score ≥ 6 (Ambiguity Gate)
+ - **IL-3**: No pass with TODO in any required verification layer
+ - **IL-4**: No pass without test count ≥ AC count × 3
+
+ **Evidence Gate (§1b)** — 5-step protocol: IDENTIFY → RUN → READ → VERIFY → ONLY THEN claim
+
+ **Risk Classification (§1c)** — Proportional verification layers per risk level:
+
+ | Risk | Required Layers |
+ |------|----------------|
+ | LOW | L1 (Unit) + L3 (E2E) |
+ | MEDIUM | L1 + L2 (Integration) + L3 |
+ | HIGH | L1 + L2 + L3 + L4 (Deploy) |
+ | CRITICAL | L1 + L2 + L3 + L4 + mutation testing |
+
+ **Execution Traceability (§1f)** — Always-on, not flag-gated:
+ - Worker records `execution_steps` in done-claim.json (what was done, in what order, with evidence)
+ - Verifier records `reasoning` in verify-verdict.json (why each judgment was made)
+
  ### Circuit Breakers
 
  | Condition | Action |
  |-----------|--------|
  | Context unchanged for 3 iterations | BLOCKED |
  | Same error repeated twice | Upgrade model, retry once, then BLOCKED |
+ | 3 consecutive failures | Architecture Escalation (§7¾) → report to user |
  | Max iterations reached | TIMEOUT |
 
  ### Model Routing
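
Note on the Verification Policy hunk above: the numeric rules it adds (IL-2, IL-4) and the §1c risk table can be read as three small checks. The following is a minimal Python sketch under that reading, not the package's implementation. All names (`Risk`, `REQUIRED_LAYERS`, `passes_ambiguity_gate`, `passes_test_sufficiency`, `missing_layers`) are hypothetical; only the thresholds and the layer lists come from the diff.

```python
from enum import Enum


class Risk(Enum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"
    CRITICAL = "CRITICAL"


# Required layers per risk level, copied from the §1c table in the hunk above.
REQUIRED_LAYERS = {
    Risk.LOW: ["L1", "L3"],
    Risk.MEDIUM: ["L1", "L2", "L3"],
    Risk.HIGH: ["L1", "L2", "L3", "L4"],
    Risk.CRITICAL: ["L1", "L2", "L3", "L4", "mutation"],
}

MIN_AC_QUALITY_SCORE = 6  # IL-2: init is blocked below this score (0-12 scale)
TESTS_PER_AC = 3          # IL-4: test count must be >= AC count x 3


def passes_ambiguity_gate(ac_quality_score: int) -> bool:
    """IL-2: True when the AC quality score allows init to proceed."""
    if not 0 <= ac_quality_score <= 12:
        raise ValueError("AC quality score must be in 0..12")
    return ac_quality_score >= MIN_AC_QUALITY_SCORE


def passes_test_sufficiency(test_count: int, ac_count: int) -> bool:
    """IL-4: True when there are at least three tests per acceptance criterion."""
    return test_count >= ac_count * TESTS_PER_AC


def missing_layers(risk: Risk, executed: set[str]) -> list[str]:
    """Return required layers for this risk level that were not executed."""
    return [layer for layer in REQUIRED_LAYERS[risk] if layer not in executed]
```

For example, `missing_layers(Risk.HIGH, {"L1", "L3"})` reports `["L2", "L4"]`, which under IL-3 would presumably block a pass until those layers have evidence.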
@@ -140,6 +168,8 @@ for iteration in 1..max_iter:
  | `--codex-reasoning low\|medium\|high` | high | Reasoning effort for Codex |
  | `--verify-mode per-us\|batch` | per-us | Verification strategy (see below) |
  | `--verify-consensus` | off | Cross-engine consensus verification (see below) |
+ | `--debug` | off | Debug logging to `logs/<slug>/debug.log` |
+ | `--with-self-verification` | off | Campaign-level post-loop analysis report |
 
  ## Execution Modes
 
@@ -334,6 +364,7 @@ mkdir my-calc && cd my-calc
  - [Architecture](docs/architecture.md) — Design philosophy, Agent() and tmux execution modes
  - [Getting Started](docs/getting-started.md) — Step-by-step tutorial with the calculator example
  - [Protocol Reference](docs/protocol-reference.md) — Full protocol specification
+ - [Future Plans](docs/TODO-verification-next.md) — P3 items and upcoming features
 
  ## Contributing
 
@@ -48,6 +48,31 @@ RLP Desk supports two modes for running the Leader loop. Both honor the same gov
  | **Agent() — "Smart mode"** (default) | LLM (current session) | Dynamic — Leader reasons about which model to use each iteration | Active Claude Code session | Interactive development, complex routing decisions |
  | **Tmux — "Lean mode"** | Shell script (`run_ralph_desk.zsh`) | Static — set via `WORKER_MODEL`/`VERIFIER_MODEL` env vars | None (runs detached) | Long campaigns, CI, observability, zero-token orchestration |
 
+ ### Verification Policy Layer
+
+ Both modes enforce the same verification policy (governance §1a-§1f):
+
+ ```
+         ┌─────────────────────────────────┐
+         │      Governance (§1a-§1f)       │
+         │    Iron Laws · Evidence Gate    │
+         │  Risk Classification · Layers   │
+         │   Checkpoints · Traceability    │
+         └──────────┬──────────────────────┘
+                    │ enforced by
+   ┌────────────────┼────────────────┐
+   ▼                ▼                ▼
+ Worker Template  Verifier Template  Leader Loop
+ (Test-First,     (12-step process,  (Contract review,
+  12 Shortcuts,    5 reasoning        Checkpoints,
+  execution_steps) categories)        Escalation)
+ ```
+
+ Key design decisions:
+ - **execution_steps** (Worker) and **reasoning** (Verifier) are always-on (§1f), not gated by flags
+ - **`--with-self-verification`** adds post-campaign analysis only — does not change loop behavior
+ - **Risk-proportional layers**: LOW gets L1+L3, CRITICAL gets L1+L2+L3+L4+mutation
+
  **Agent() mode** is synchronous and simple: each `Agent()` call blocks until the subprocess finishes, then the Leader reads the filesystem. No polling, no signal files, no tmux.
 
  **Tmux mode** trades dynamic routing for visibility and independence. The shell Leader writes prompts to files, sends short trigger commands via `tmux send-keys`, and polls structured JSON signal files (`iter-signal.json`, `verify-verdict.json`) for control flow. It uses proven tmux patterns — write-then-notify, pane ID stability, copy-mode guards, heartbeat monitoring — for reliable, race-free orchestration.
@@ -45,10 +45,12 @@ The brainstorm phase interactively determines:
  |------|---------|
  | **Slug** | `loop-test` |
  | **Objective** | Implement calc.py + test_calc.py |
- | **User Stories** | US-001: calculator functions, US-002: pytest tests |
+ | **User Stories** | US-001: calculator functions (Given/When/Then AC format) |
+ | **Task Type & Risk** | code, LOW |
  | **Iteration Unit** | One user story per iteration |
  | **Verification** | `python3 -m pytest test_calc.py -v` |
  | **Models** | Worker: sonnet, Verifier: opus |
+ | **Ambiguity Gate (IL-2)** | AC quality score ≥ 6 required to proceed |
  | **Max Iterations** | 10 |
 
  On approval, brainstorm offers to run `init` automatically.
@@ -81,9 +83,11 @@ This creates the scaffold:
  Edit `.claude/ralph-desk/plans/prd-loop-test.md` to define your user stories and acceptance criteria. See [`examples/calculator/`](../examples/calculator/.claude/ralph-desk/plans/prd-loop-test.md) for a complete example.
 
  Key sections:
- - **User Stories** with specific, testable acceptance criteria
+ - **User Stories** with Given/When/Then acceptance criteria, Task Type, and Risk Level
+ - **Boundary Cases** for each US
+ - **Verification Layers** (L1-L4 based on risk level per governance §1c)
  - **Technical Constraints** (e.g., "Python 3 + pytest only")
- - **Done When** conditions
+ - **Done When** conditions (must reference Evidence Gate §1b)
 
  ## Step 6: Define the Test Spec
 
@@ -147,7 +151,10 @@ If you want to run the loop again:
  ## Tips
 
  - **Start small**: One or two user stories for your first loop
- - **Be specific in acceptance criteria**: "function returns float" is testable; "function works well" is not
+ - **Use Given/When/Then**: "Given 10 and 5, When add, Then 15" not "function works well"
+ - **Set risk levels**: LOW for docs/config, MEDIUM for features, HIGH for deploys, CRITICAL for security/finance
  - **Include verification commands**: The verifier needs concrete commands to run
  - **One story per iteration**: Each worker should do one bounded action
  - **Check logs when stuck**: `logs/<slug>/iter-NNN.worker-prompt.md` shows exactly what the worker received
+ - **Review done-claims**: Worker's `execution_steps` show exactly what was done and in what order
+ - **Review verdict reasoning**: Verifier's `reasoning` shows why each judgment was made
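
Note on the Given/When/Then tip above: the quoted criterion ("Given 10 and 5, When add, Then 15") maps directly onto a pytest-style test. The sketch below is illustrative only. The `add` function is a hypothetical stand-in for whatever `calc.py` defines, and the test name is an assumption.

```python
def add(a, b):
    """Hypothetical stand-in for the add() function in calc.py."""
    return a + b


def test_add_given_10_and_5_then_15():
    # Given 10 and 5
    a, b = 10, 5
    # When add
    result = add(a, b)
    # Then 15
    assert result == 15
```

Writing each acceptance criterion this way keeps the Given, When, and Then parts visible in the test body, so a reviewer can trace a test back to the criterion it covers.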
@@ -136,24 +136,28 @@ Written by the Worker at the end of every iteration. Provides a structured JSON
 
  ### Done Claim (`<slug>-done-claim.json`)
 
- Written by the Worker when claiming all work is complete:
+ Written by the Worker when claiming work is complete. **Must include `execution_steps`** (governance §1f — always-on, not optional):
 
  ```json
  {
-   "iteration": 3,
-   "claimed_at_utc": "2025-01-15T10:30:00Z",
-   "summary": "All user stories implemented and tests passing",
-   "stories_completed": ["US-001", "US-002"],
-   "evidence": {
-     "test_output": "8 passed in 0.05s",
-     "files_created": ["calc.py", "test_calc.py"]
-   }
+   "us_id": "US-001",
+   "claims": ["AC1: add(10,5)=15", "AC2: subtract(10,5)=5"],
+   "execution_steps": [
+     {"step": "write_test", "ac_id": "AC1", "command": null, "exit_code": null, "summary": "wrote tests/test_add.py"},
+     {"step": "verify_red", "ac_id": "AC1", "command": "pytest tests/", "exit_code": 1, "summary": "RED: test fails"},
+     {"step": "implement", "ac_id": "AC1", "command": null, "exit_code": null, "summary": "created add()"},
+     {"step": "verify_green", "ac_id": "AC1", "command": "pytest tests/", "exit_code": 0, "summary": "GREEN: 3 passed"},
+     {"step": "verify_e2e", "ac_id": "AC1", "command": "python -c '...'", "exit_code": 0, "summary": "E2E matches"},
+     {"step": "commit", "ac_id": "AC1", "command": "git commit", "exit_code": 0, "summary": "committed abc1234"}
+   ]
  }
  ```
 
+ `execution_steps` must be a JSON array of objects. Step types from §1f vocabulary: `write_test`, `verify_red`, `implement`, `verify_green`, `refactor`, `verify_e2e`, `commit`, `verify`.
+
  ### Verify Verdict (`<slug>-verify-verdict.json`)
 
- Written by the Verifier after independent verification.
+ Written by the Verifier after independent verification. **Must include `reasoning`** (governance §1f — always-on, not optional).
 
  **Tmux mode polling:** In tmux mode, after dispatching the Verifier, the shell Leader polls for the existence of `verify-verdict.json` (same pattern as `iter-signal.json`). Once it appears, the Leader reads the `verdict` and `recommended_state_transition` fields via `jq` to decide whether to write a COMPLETE sentinel, continue iterating, or write a BLOCKED sentinel.
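
Note on the Done Claim hunk above: the stated shape of `execution_steps` (a JSON array of objects, each with a `step` value from the §1f vocabulary) is easy to check mechanically. The following is a rough Python sketch of such an audit, not the package's own checker. The function name `audit_execution_steps` and the wording of the problem messages are hypothetical. The required keys follow the audit line elsewhere in this diff (`step` + `ac_id` + `command` + `exit_code`), and the rule that `verify_red` should exit non-zero is inferred from the "verify_red present with exit≠0" check in the verdict example.

```python
import json

# Step vocabulary copied from the hunk above (governance §1f).
STEP_VOCABULARY = {
    "write_test", "verify_red", "implement", "verify_green",
    "refactor", "verify_e2e", "commit", "verify",
}
REQUIRED_KEYS = ("step", "ac_id", "command", "exit_code")


def audit_execution_steps(done_claim: dict) -> list[str]:
    """Return a list of problems; an empty list means none were found."""
    steps = done_claim.get("execution_steps")
    if not isinstance(steps, list):
        return ["execution_steps must be a JSON array"]
    problems = []
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"step {i}: not an object")
            continue
        for key in REQUIRED_KEYS:
            if key not in step:
                problems.append(f"step {i}: missing {key}")
        if step.get("step") not in STEP_VOCABULARY:
            problems.append(f"step {i}: unknown step type {step.get('step')!r}")
        # Assumption: a RED run that exits 0 means the test never failed.
        if step.get("step") == "verify_red" and step.get("exit_code") == 0:
            problems.append(f"step {i}: verify_red should exit non-zero")
    return problems


# Usage: parse a done-claim document and audit it.
claim = json.loads(
    '{"us_id": "US-001", "execution_steps": ['
    '{"step": "verify_red", "ac_id": "AC1", "command": "pytest tests/", "exit_code": 1}]}'
)
problems = audit_execution_steps(claim)
```

Here `null` values for `command` and `exit_code` are accepted as present keys, matching the `write_test` and `implement` entries in the example claim.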
 
@@ -172,6 +176,13 @@ Written by the Verifier after independent verification.
        "evidence": "test -f calc.py → exit 0"
      }
    ],
+   "reasoning": [
+     {"check": "IL-1 Evidence Gate", "decision": "pass", "basis": "ran pytest fresh, exit 0"},
+     {"check": "Layer Enforcement", "decision": "pass", "basis": "L1+L3 required and executed"},
+     {"check": "Test Sufficiency", "decision": "pass", "basis": "9 tests / 3 AC = 3.0 ratio"},
+     {"check": "Anti-Gaming", "decision": "pass", "basis": "no tautological tests, no mocking"},
+     {"check": "Worker Process Audit", "decision": "pass", "basis": "verify_red present with exit≠0"}
+   ],
    "missing_evidence": [],
    "issues": [
      {
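
Note on the `reasoning` array added above: each entry in the example carries `check`, `decision`, and `basis`. A reviewer who wants to confirm that every judgment has a stated basis could use something like the sketch below. It is an assumption-laden illustration: the function name `reasoning_gaps` is hypothetical, and treating an empty array as a gap is my reading of "must include `reasoning`", not something the diff states.

```python
def reasoning_gaps(verdict: dict) -> list[str]:
    """Return entries of the verdict's reasoning array that lack a field."""
    entries = verdict.get("reasoning")
    # Assumption: an absent or empty array counts as missing reasoning.
    if not isinstance(entries, list) or not entries:
        return ["reasoning must be a non-empty array"]
    gaps = []
    for i, entry in enumerate(entries):
        for key in ("check", "decision", "basis"):
            if not isinstance(entry, dict) or not entry.get(key):
                gaps.append(f"reasoning[{i}]: missing {key}")
    return gaps
```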
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@ai-dev-methodologies/rlp-desk",
-   "version": "0.3.0",
+   "version": "0.3.2",
    "description": "Fresh-context iterative loops for Claude Code — autonomous task completion with independent verification",
    "scripts": {
      "postinstall": "node scripts/postinstall.js",
@@ -165,7 +165,8 @@ DEBUG=<1 if --debug, else 0> \
  1. Validate scaffold: `.claude/ralph-desk/prompts/<slug>.worker.prompt.md` etc.
  2. Check sentinels (complete/blocked). Found → tell user `/rlp-desk clean <slug>`.
  3. Clean previous `done-claim.json`, `verify-verdict.json`.
- 4. If `--debug`: create/clear `logs/<slug>/debug.log`. Define a helper: to "debug_log" means append a timestamped line to this file via `Bash("echo \"[$(date '+%Y-%m-%d %H:%M:%S')] $msg\" >> .claude/ralph-desk/logs/<slug>/debug.log")`.
+ 4. **Always**: write baseline log entry to `.claude/ralph-desk/logs/<slug>/baseline.log`: `[timestamp] iter=0 phase=start slug=<slug> worker_model=<model> verifier_model=<model>`. Baseline.log captures 1 line per iteration for lightweight post-mortem (always-on, no flag needed).
+ 5. If `--debug`: also create/clear `logs/<slug>/debug.log`. Define a helper: to "debug_log" means append a timestamped line to this file via `Bash("echo \"[$(date '+%Y-%m-%d %H:%M:%S')] $msg\" >> .claude/ralph-desk/logs/<slug>/debug.log")`. When `--debug` is active, debug.log contains all baseline.log fields plus detailed phase logs.
 
  ### Leader Loop
 
@@ -198,11 +199,14 @@ rm -f .claude/ralph-desk/memos/<slug>-verify-verdict.json
  - User specified → use that
  - If `--debug`: debug_log `[EXEC] iter=N phase=model_select worker_model=<model> reason=<reason>`
 
- **④ Build worker prompt**
- - Read `.claude/ralph-desk/prompts/<slug>.worker.prompt.md`
- - Combine with iteration number + memory contract
- - Write to `.claude/ralph-desk/logs/<slug>/iter-NNN.worker-prompt.md` (audit trail)
+ **④ Build worker prompt (Prompt Assembly Protocol)**
+ 1. Capture `WORKING_DIR` once: use `$PWD` from when `/rlp-desk run` was invoked. Store for all prompt construction.
+ 2. Read `.claude/ralph-desk/prompts/<slug>.worker.prompt.md` and use its content **verbatim**. Do NOT rewrite, paraphrase, or regenerate paths. The prompt file contains correct absolute paths from init.
+ 3. Prepend meta comment: `## WORKING_DIR: {absolute path}`. The Worker must use this as its working directory.
+ 4. Append iteration number + memory contract.
+ 5. Write to `.claude/ralph-desk/logs/<slug>/iter-NNN.worker-prompt.md` (audit trail).
  - Note: Worker ALWAYS records execution_steps in done-claim.json per governance §1f. No flag needed.
+ - **Rewriting paths from absolute to relative WILL break worktree campaigns. Only additions (WORKING_DIR header, iteration context) are allowed.**
 
  **④½ Contract review** (agent mode only)
  - Before dispatching Worker, spawn a lightweight review: "Is this iteration contract sufficient to achieve the US's AC? Any missing steps?"
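
Note on the Prompt Assembly Protocol in step ④ above: the five steps amount to "read verbatim, prepend a header, append context, write an audit copy". The Python sketch below illustrates that flow under several assumptions and is not how the Leader is actually implemented (the diff describes an LLM or shell Leader). The function name `assemble_worker_prompt`, the `## Iteration:` heading, and the three-digit zero padding for `iter-NNN` are hypothetical. The file locations and the `## WORKING_DIR:` header come from the diff.

```python
from pathlib import Path


def assemble_worker_prompt(working_dir: Path, slug: str, iteration: int,
                           memory_contract: str) -> Path:
    """Build the iteration prompt and write it to the audit-trail location."""
    desk = working_dir / ".claude" / "ralph-desk"
    # Step 2: read the prompt file and keep its content verbatim.
    template = (desk / "prompts" / f"{slug}.worker.prompt.md").read_text(encoding="utf-8")
    # Steps 3-4: only additions are made, before and after the verbatim text.
    prompt = (
        f"## WORKING_DIR: {working_dir}\n\n"
        f"{template}\n\n"
        f"## Iteration: {iteration}\n\n"  # hypothetical heading for the iteration number
        f"{memory_contract}\n"
    )
    # Step 5: write the assembled prompt as an audit trail (padding width assumed).
    out = desk / "logs" / slug / f"iter-{iteration:03d}.worker-prompt.md"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(prompt, encoding="utf-8")
    return out
```

The template text is never passed through any path manipulation, which is the property the protocol asks for: absolute paths written at init time survive unchanged into every iteration prompt.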
@@ -241,6 +245,8 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
  - Also read `iter-signal.json` for `us_id` field (which US was just completed)
  - If `--debug`: debug_log `[EXEC] iter=N phase=worker_signal status=<stop_status> us_id=<us_id>`
 
+ **CRITICAL: Immediately proceed to ⑦. Do NOT pause, do NOT ask the user, do NOT wait for confirmation. The loop is autonomous.**
+
  **⑦ Execute Verifier**
 
  **Per-US mode** (default, `--verify-mode per-us`):
@@ -262,6 +268,7 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
 
  **⑦a Dispatch Verifier**
  - Note: Verifier ALWAYS records reasoning in verify-verdict.json per governance §1f. No flag needed.
+ - **Prompt Assembly Protocol (same as ④)**: Read verifier prompt file verbatim. Prepend `## WORKING_DIR: {absolute path}`. Do NOT rewrite paths.
  - If `--debug`: debug_log `[EXEC] iter=N phase=verifier engine=<engine> model=<model> scope=<us_id> dispatched=true`
 
  If `--verifier-engine claude` (default):
@@ -309,6 +316,8 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
  - If `--debug`: debug_log `[EXEC] iter=N phase=checkpoint level=<1|2> evidence=<summary>`
  - If `--debug` and consensus: debug_log `[EXEC] iter=N phase=consensus claude=<verdict> codex=<verdict> round=<N>`
 
+ **CRITICAL: Immediately proceed to ⑧. Do NOT pause, do NOT ask the user. Continue the loop.**
+
  **⑧ Write result log and report to user, continue loop**
  - Write `logs/<slug>/iter-NNN.result.md`:
    - Result status `[leader-measured]`
@@ -316,6 +325,7 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
    - Verifier verdict `[leader-measured]`
  - Write `status.json`
  - Report: iteration N, phase, model used, result
+ - **Always**: append to baseline.log: `[timestamp] iter=N verdict=<pass|fail|continue> us=<us_id> model=<worker_model>`
  - If `--debug`: debug_log `[EXEC] iter=N phase=result status=<result> consecutive_failures=<N> verified_us=<list>`
 
  At loop end (COMPLETE, BLOCKED, or TIMEOUT):
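
Note on the always-on baseline.log entry in step ⑧ above: the line format is given, but the timestamp format is not. The sketch below assumes the same `%Y-%m-%d %H:%M:%S` format that the `debug_log` helper uses elsewhere in this diff. Both function names (`baseline_line`, `append_baseline`) are hypothetical, and this is only an illustration of the one-line-per-iteration idea.

```python
from datetime import datetime
from pathlib import Path


def baseline_line(iteration: int, verdict: str, us_id: str, worker_model: str,
                  now: datetime) -> str:
    """Format one per-iteration baseline.log entry."""
    # Assumption: timestamp format borrowed from the debug_log helper.
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    return f"[{stamp}] iter={iteration} verdict={verdict} us={us_id} model={worker_model}"


def append_baseline(log_path: Path, line: str) -> None:
    """Append a single line to baseline.log, creating the directory if needed."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

Because each iteration contributes exactly one `key=value` line, a post-mortem can be done with plain text tools without enabling `--debug`.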
@@ -346,7 +356,7 @@ Per failure: Worker steps → Verifier reasoning → Root cause → Resolution
  ## 3. Worker Process Quality (§1f audit)
  Table: Iter | US | Steps | verify_red? | RED exit≠0? | verify_green? | Test-First? | E2E? | AC linked?
  Aggregate: TDD compliance %, RED confirmation %, E2E evidence %, step completeness %
- Audit: each step must have type from §1f vocabulary + ac_id + command + exit_code
+ Audit: each step object must have "step" field with value from §1f vocabulary (write_test, verify_red, implement, verify_green, refactor, verify_e2e, commit, verify) + ac_id + command + exit_code
 
  ## 4. Verifier Judgment Quality (§1f audit)
  Table: Iter | US | Checks | All Basis? | Independent? | IL-1? | Layer? | Sufficiency? | Anti-Gaming? | Worker Audit?
@@ -472,6 +472,27 @@ GIEOF
    echo " + .gitignore (created with rlp-desk rules)"
  fi
 
+ # --- Post-init validation gate ---
+ INIT_FAIL=0
+ for REQUIRED_FILE in \
+   "$DESK/prompts/$SLUG.worker.prompt.md" \
+   "$DESK/prompts/$SLUG.verifier.prompt.md" \
+   "$DESK/context/$SLUG-latest.md" \
+   "$DESK/memos/$SLUG-memory.md" \
+   "$DESK/plans/prd-$SLUG.md" \
+   "$DESK/plans/test-spec-$SLUG.md"; do
+   if [[ ! -f "$REQUIRED_FILE" ]]; then
+     echo " ✗ MISSING: $REQUIRED_FILE"
+     INIT_FAIL=1
+   fi
+ done
+ if [[ $INIT_FAIL -eq 1 ]]; then
+   echo ""
+   echo "ERROR: Scaffold incomplete. Some required files were not created."
+   echo "Re-run init or check filesystem permissions."
+   exit 1
+ fi
+
  echo ""
  echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
  echo "Scaffold ready: $SLUG"