npm - @ai-dev-methodologies/rlp-desk - Versions diffs - 0.3.0 → 0.3.2 - Mend

@ai-dev-methodologies/rlp-desk 0.3.0 → 0.3.2

Files changed (7)

package/README.md +32 -1
package/docs/architecture.md +25 -0
package/docs/getting-started.md +11 -4
package/docs/protocol-reference.md +21 -10
package/package.json +1 -1
package/src/commands/rlp-desk.md +16 -6
package/src/scripts/init_ralph_desk.zsh +21 -0

package/README.md

@@ -40,9 +40,11 @@ curl -sSL https://raw.githubusercontent.com/ai-dev-methodologies/rlp-desk/main/i
 You'll be asked to confirm each item:
 - **Slug** — project identifier
-- **User Stories** — discrete, testable units of work
+- **User Stories** — discrete, testable units with Given/When/Then acceptance criteria
+- **Task Type & Risk Level** — code/visual/content/integration/infra × LOW/MEDIUM/HIGH/CRITICAL
 - **Iteration Unit** — one story per iteration (incremental) or all at once (fast)
 - **Verification Commands** — how to check the work
+- **Ambiguity Gate** — AC quality scoring (IL-2, 0-12 scale, blocks init if < 6)
 - **Models** — which Claude model for Worker/Verifier
 ### 3. Run
@@ -97,12 +99,38 @@ for iteration in 1..max_iter:
   8. Update status, report to user, continue or stop
 ```
+### Verification Policy (v0.3.0)
+RLP Desk enforces a comprehensive verification policy defined in `governance.md`:
+**Iron Laws (§1a)** — 4 absolute rules that cannot be violated:
+- **IL-1**: No completion claims without fresh verification evidence
+- **IL-2**: No init without AC quality score ≥ 6 (Ambiguity Gate)
+- **IL-3**: No pass with TODO in any required verification layer
+- **IL-4**: No pass without test count ≥ AC count × 3
+**Evidence Gate (§1b)** — 5-step protocol: IDENTIFY → RUN → READ → VERIFY → ONLY THEN claim
+**Risk Classification (§1c)** — Proportional verification layers per risk level:
+| Risk | Required Layers |
+|------|----------------|
+| LOW | L1 (Unit) + L3 (E2E) |
+| MEDIUM | L1 + L2 (Integration) + L3 |
+| HIGH | L1 + L2 + L3 + L4 (Deploy) |
+| CRITICAL | L1 + L2 + L3 + L4 + mutation testing |
+**Execution Traceability (§1f)** — Always-on, not flag-gated:
+- Worker records `execution_steps` in done-claim.json (what was done, in what order, with evidence)
+- Verifier records `reasoning` in verify-verdict.json (why each judgment was made)
 ### Circuit Breakers
 | Condition | Action |
 |-----------|--------|
 | Context unchanged for 3 iterations | BLOCKED |
 | Same error repeated twice | Upgrade model, retry once, then BLOCKED |
+| 3 consecutive failures | Architecture Escalation (§7¾) → report to user |
 | Max iterations reached | TIMEOUT |
 ### Model Routing
@@ -140,6 +168,8 @@ for iteration in 1..max_iter:
 | `--codex-reasoning low\|medium\|high` | high | Reasoning effort for Codex |
 | `--verify-mode per-us\|batch` | per-us | Verification strategy (see below) |
 | `--verify-consensus` | off | Cross-engine consensus verification (see below) |
+| `--debug` | off | Debug logging to `logs/<slug>/debug.log` |
+| `--with-self-verification` | off | Campaign-level post-loop analysis report |
 ## Execution Modes
@@ -334,6 +364,7 @@ mkdir my-calc && cd my-calc
 - [Architecture](docs/architecture.md) — Design philosophy, Agent() and tmux execution modes
 - [Getting Started](docs/getting-started.md) — Step-by-step tutorial with the calculator example
 - [Protocol Reference](docs/protocol-reference.md) — Full protocol specification
+- [Future Plans](docs/TODO-verification-next.md) — P3 items and upcoming features
 ## Contributing
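
The two numeric Iron Laws added to the README above (IL-2 and IL-4) reduce to simple threshold checks. The following is a minimal sketch of those checks, not the package's implementation: the function and constant names are hypothetical, and only the thresholds (score of at least 6 on a 0-12 scale, test count of at least AC count × 3) come from the diff.

```python
# Hypothetical helper names; only the thresholds are taken from the README diff.
AMBIGUITY_GATE_MIN_SCORE = 6   # IL-2: init is blocked when the AC quality score is < 6
AMBIGUITY_GATE_MAX_SCORE = 12  # IL-2: scores are on a 0-12 scale
TESTS_PER_AC = 3               # IL-4: test count must be >= AC count x 3


def passes_ambiguity_gate(ac_quality_score: int) -> bool:
    """IL-2: return True when init may proceed for this AC quality score."""
    if not 0 <= ac_quality_score <= AMBIGUITY_GATE_MAX_SCORE:
        raise ValueError("AC quality score must be between 0 and 12")
    return ac_quality_score >= AMBIGUITY_GATE_MIN_SCORE


def has_sufficient_tests(test_count: int, ac_count: int) -> bool:
    """IL-4: return True when there are at least three tests per acceptance criterion."""
    return test_count >= ac_count * TESTS_PER_AC
```

For instance, 9 tests against 3 acceptance criteria satisfies IL-4, which matches the "9 tests / 3 AC = 3.0 ratio" basis shown in the verdict example later in this diff.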

package/docs/architecture.md

@@ -48,6 +48,31 @@ RLP Desk supports two modes for running the Leader loop. Both honor the same gov
 | **Agent() — "Smart mode"** (default) | LLM (current session) | Dynamic — Leader reasons about which model to use each iteration | Active Claude Code session | Interactive development, complex routing decisions |
 | **Tmux — "Lean mode"** | Shell script (`run_ralph_desk.zsh`) | Static — set via `WORKER_MODEL`/`VERIFIER_MODEL` env vars | None (runs detached) | Long campaigns, CI, observability, zero-token orchestration |
+### Verification Policy Layer
+Both modes enforce the same verification policy (governance §1a-§1f):
+```
+                    ┌─────────────────────────────────┐
+                    │     Governance (§1a-§1f)         │
+                    │  Iron Laws · Evidence Gate       │
+                    │  Risk Classification · Layers    │
+                    │  Checkpoints · Traceability      │
+                    └──────────┬──────────────────────┘
+                               │ enforced by
+              ┌────────────────┼────────────────┐
+              ▼                ▼                ▼
+         Worker Template  Verifier Template  Leader Loop
+         (Test-First,     (12-step process,  (Contract review,
+          12 Shortcuts,    5 reasoning       Checkpoints,
+          execution_steps) categories)       Escalation)
+```
+Key design decisions:
+- **execution_steps** (Worker) and **reasoning** (Verifier) are always-on (§1f), not gated by flags
+- **`--with-self-verification`** adds post-campaign analysis only — does not change loop behavior
+- **Risk-proportional layers**: LOW gets L1+L3, CRITICAL gets L1+L2+L3+L4+mutation
 **Agent() mode** is synchronous and simple: each `Agent()` call blocks until the subprocess finishes, then the Leader reads the filesystem. No polling, no signal files, no tmux.
 **Tmux mode** trades dynamic routing for visibility and independence. The shell Leader writes prompts to files, sends short trigger commands via `tmux send-keys`, and polls structured JSON signal files (`iter-signal.json`, `verify-verdict.json`) for control flow. It uses proven tmux patterns — write-then-notify, pane ID stability, copy-mode guards, heartbeat monitoring — for reliable, race-free orchestration.
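
The risk-proportional layers described in the added section above can be read as a lookup from risk level to required verification layers. This is a rough sketch under that reading: `REQUIRED_LAYERS`, `missing_layers`, and the `"mutation"` key for mutation testing are hypothetical names, while the layer sets themselves are taken from the README table in this diff.

```python
# Layer sets copied from the risk table in the README diff; names are hypothetical.
REQUIRED_LAYERS = {
    "LOW": ["L1", "L3"],
    "MEDIUM": ["L1", "L2", "L3"],
    "HIGH": ["L1", "L2", "L3", "L4"],
    "CRITICAL": ["L1", "L2", "L3", "L4", "mutation"],
}


def missing_layers(risk: str, executed: list[str]) -> list[str]:
    """Return the required layers for `risk` that do not appear in `executed`."""
    required = REQUIRED_LAYERS[risk.upper()]
    return [layer for layer in required if layer not in executed]
```

A verifier-style check could then treat any non-empty result as a failed layer-enforcement decision.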

package/docs/getting-started.md

@@ -45,10 +45,12 @@ The brainstorm phase interactively determines:
 |------|---------|
 | **Slug** | `loop-test` |
 | **Objective** | Implement calc.py + test_calc.py |
-| **User Stories** | US-001: calculator functions, US-002: pytest tests |
+| **User Stories** | US-001: calculator functions (Given/When/Then AC format) |
+| **Task Type & Risk** | code, LOW |
 | **Iteration Unit** | One user story per iteration |
 | **Verification** | `python3 -m pytest test_calc.py -v` |
 | **Models** | Worker: sonnet, Verifier: opus |
+| **Ambiguity Gate (IL-2)** | AC quality score ≥ 6 required to proceed |
 | **Max Iterations** | 10 |
 On approval, brainstorm offers to run `init` automatically.
@@ -81,9 +83,11 @@ This creates the scaffold:
 Edit `.claude/ralph-desk/plans/prd-loop-test.md` to define your user stories and acceptance criteria. See [`examples/calculator/`](../examples/calculator/.claude/ralph-desk/plans/prd-loop-test.md) for a complete example.
 Key sections:
-- **User Stories** with specific, testable acceptance criteria
+- **User Stories** with Given/When/Then acceptance criteria, Task Type, and Risk Level
+- **Boundary Cases** for each US
+- **Verification Layers** (L1-L4 based on risk level per governance §1c)
 - **Technical Constraints** (e.g., "Python 3 + pytest only")
-- **Done When** conditions
+- **Done When** conditions (must reference Evidence Gate §1b)
 ## Step 6: Define the Test Spec
@@ -147,7 +151,10 @@ If you want to run the loop again:
 ## Tips
 - **Start small**: One or two user stories for your first loop
-- **Be specific in acceptance criteria**: "function returns float" is testable; "function works well" is not
+- **Use Given/When/Then**: "Given 10 and 5, When add, Then 15" — not "function works well"
+- **Set risk levels**: LOW for docs/config, MEDIUM for features, HIGH for deploys, CRITICAL for security/finance
 - **Include verification commands**: The verifier needs concrete commands to run
 - **One story per iteration**: Each worker should do one bounded action
 - **Check logs when stuck**: `logs/<slug>/iter-NNN.worker-prompt.md` shows exactly what the worker received
+- **Review done-claims**: Worker's `execution_steps` show exactly what was done and in what order
+- **Review verdict reasoning**: Verifier's `reasoning` shows why each judgment was made
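
The Given/When/Then tip added above ("Given 10 and 5, When add, Then 15") maps directly onto a pytest-style test. The sketch below is illustrative only: `add` is a hypothetical stand-in for the calculator function, and the test name is an assumption rather than anything the package prescribes.

```python
def add(a, b):
    """Hypothetical stand-in for the calculator function under test."""
    return a + b


def test_add_given_10_and_5_then_15():
    # Given: the two operands from the acceptance criterion
    a, b = 10, 5
    # When: the operation under test is applied
    result = add(a, b)
    # Then: the result matches the expected value
    assert result == 15
```

Keeping one test per acceptance criterion clause makes it easier to link each test back to its AC when counting tests per AC.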

package/docs/protocol-reference.md

@@ -136,24 +136,28 @@ Written by the Worker at the end of every iteration. Provides a structured JSON
 ### Done Claim (`<slug>-done-claim.json`)
-Written by the Worker when claiming all work is complete:
+Written by the Worker when claiming work is complete. **Must include `execution_steps`** (governance §1f — always-on, not optional):
 ```json
 {
-  "iteration": 3,
-  "claimed_at_utc": "2025-01-15T10:30:00Z",
-  "summary": "All user stories implemented and tests passing",
-  "stories_completed": ["US-001", "US-002"],
-  "evidence": {
-    "test_output": "8 passed in 0.05s",
-    "files_created": ["calc.py", "test_calc.py"]
-  }
+  "us_id": "US-001",
+  "claims": ["AC1: add(10,5)=15", "AC2: subtract(10,5)=5"],
+  "execution_steps": [
+    {"step": "write_test", "ac_id": "AC1", "command": null, "exit_code": null, "summary": "wrote tests/test_add.py"},
+    {"step": "verify_red", "ac_id": "AC1", "command": "pytest tests/", "exit_code": 1, "summary": "RED: test fails"},
+    {"step": "implement", "ac_id": "AC1", "command": null, "exit_code": null, "summary": "created add()"},
+    {"step": "verify_green", "ac_id": "AC1", "command": "pytest tests/", "exit_code": 0, "summary": "GREEN: 3 passed"},
+    {"step": "verify_e2e", "ac_id": "AC1", "command": "python -c '...'", "exit_code": 0, "summary": "E2E matches"},
+    {"step": "commit", "ac_id": "AC1", "command": "git commit", "exit_code": 0, "summary": "committed abc1234"}
+  ]
 }
 ```
+`execution_steps` must be a JSON array of objects. Step types from §1f vocabulary: `write_test`, `verify_red`, `implement`, `verify_green`, `refactor`, `verify_e2e`, `commit`, `verify`.
 ### Verify Verdict (`<slug>-verify-verdict.json`)
-Written by the Verifier after independent verification.
+Written by the Verifier after independent verification. **Must include `reasoning`** (governance §1f — always-on, not optional).
 **Tmux mode polling:** In tmux mode, after dispatching the Verifier, the shell Leader polls for the existence of `verify-verdict.json` (same pattern as `iter-signal.json`). Once it appears, the Leader reads the `verdict` and `recommended_state_transition` fields via `jq` to decide whether to write a COMPLETE sentinel, continue iterating, or write a BLOCKED sentinel.
@@ -172,6 +176,13 @@ Written by the Verifier after independent verification.
       "evidence": "test -f calc.py → exit 0"
     }
   ],
+  "reasoning": [
+    {"check": "IL-1 Evidence Gate", "decision": "pass", "basis": "ran pytest fresh, exit 0"},
+    {"check": "Layer Enforcement", "decision": "pass", "basis": "L1+L3 required and executed"},
+    {"check": "Test Sufficiency", "decision": "pass", "basis": "9 tests / 3 AC = 3.0 ratio"},
+    {"check": "Anti-Gaming", "decision": "pass", "basis": "no tautological tests, no mocking"},
+    {"check": "Worker Process Audit", "decision": "pass", "basis": "verify_red present with exit≠0"}
+  ],
   "missing_evidence": [],
   "issues": [
     {
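
The `execution_steps` rules added to the Done Claim section above (a JSON array of objects, each with a `step` value from the §1f vocabulary plus `ac_id`, `command`, and `exit_code`) can be checked mechanically. What follows is a minimal sketch of such an audit, assuming the required keys must be present but may be `null`, as in the example; `audit_done_claim` and its message strings are hypothetical and not part of the package.

```python
import json

# Step vocabulary as listed in the protocol reference diff (§1f).
STEP_VOCABULARY = (
    "write_test", "verify_red", "implement", "verify_green",
    "refactor", "verify_e2e", "commit", "verify",
)
# Keys every step object is expected to carry (values may be null).
REQUIRED_KEYS = ("step", "ac_id", "command", "exit_code")


def audit_done_claim(raw_json: str) -> list[str]:
    """Return a list of problems found in a done-claim document (empty if none)."""
    claim = json.loads(raw_json)
    if not isinstance(claim, dict):
        return ["done claim must be a JSON object"]
    steps = claim.get("execution_steps")
    if not isinstance(steps, list):
        return ["execution_steps must be a JSON array"]
    problems = []
    for index, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"step {index}: not an object")
            continue
        for key in REQUIRED_KEYS:
            if key not in step:
                problems.append(f"step {index}: missing '{key}'")
        step_type = step.get("step")
        if "step" in step and step_type not in STEP_VOCABULARY:
            problems.append(f"step {index}: unknown step type '{step_type}'")
    return problems
```

An empty result means the claim is structurally auditable; it says nothing about whether the recorded evidence is true, which remains the Verifier's job.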

package/package.json

@@ -1,6 +1,6 @@
 {
   "name": "@ai-dev-methodologies/rlp-desk",
-  "version": "0.3.0",
+  "version": "0.3.2",
   "description": "Fresh-context iterative loops for Claude Code — autonomous task completion with independent verification",
   "scripts": {
     "postinstall": "node scripts/postinstall.js",

package/src/commands/rlp-desk.md

@@ -165,7 +165,8 @@ DEBUG=<1 if --debug, else 0> \
 1. Validate scaffold: `.claude/ralph-desk/prompts/<slug>.worker.prompt.md` etc.
 2. Check sentinels (complete/blocked). Found → tell user `/rlp-desk clean <slug>`.
 3. Clean previous `done-claim.json`, `verify-verdict.json`.
-4. If `--debug`: create/clear `logs/<slug>/debug.log`. Define a helper: to "debug_log" means append a timestamped line to this file via `Bash("echo \"[$(date '+%Y-%m-%d %H:%M:%S')] $msg\" >> .claude/ralph-desk/logs/<slug>/debug.log")`.
+4. **Always**: write baseline log entry to `.claude/ralph-desk/logs/<slug>/baseline.log`: `[timestamp] iter=0 phase=start slug=<slug> worker_model=<model> verifier_model=<model>`. Baseline.log captures 1 line per iteration for lightweight post-mortem (always-on, no flag needed).
+5. If `--debug`: also create/clear `logs/<slug>/debug.log`. Define a helper: to "debug_log" means append a timestamped line to this file via `Bash("echo \"[$(date '+%Y-%m-%d %H:%M:%S')] $msg\" >> .claude/ralph-desk/logs/<slug>/debug.log")`. When `--debug` is active, debug.log contains all baseline.log fields plus detailed phase logs.
 ### Leader Loop
@@ -198,11 +199,14 @@ rm -f .claude/ralph-desk/memos/<slug>-verify-verdict.json
 - User specified → use that
 - If `--debug`: debug_log `[EXEC] iter=N phase=model_select worker_model=<model> reason=<reason>`
-**④ Build worker prompt**
-- Read `.claude/ralph-desk/prompts/<slug>.worker.prompt.md`
-- Combine with iteration number + memory contract
-- Write to `.claude/ralph-desk/logs/<slug>/iter-NNN.worker-prompt.md` (audit trail)
+**④ Build worker prompt (Prompt Assembly Protocol)**
+1. Capture `WORKING_DIR` once: use `$PWD` from when `/rlp-desk run` was invoked. Store for all prompt construction.
+2. Read `.claude/ralph-desk/prompts/<slug>.worker.prompt.md` — use its content **verbatim**. Do NOT rewrite, paraphrase, or regenerate paths. The prompt file contains correct absolute paths from init.
+3. Prepend meta comment: `## WORKING_DIR: {absolute path}` — Worker must use this as its working directory.
+4. Append iteration number + memory contract.
+5. Write to `.claude/ralph-desk/logs/<slug>/iter-NNN.worker-prompt.md` (audit trail).
 - Note: Worker ALWAYS records execution_steps in done-claim.json per governance §1f. No flag needed.
+- **Rewriting paths from absolute to relative WILL break worktree campaigns. Only additions (WORKING_DIR header, iteration context) are allowed.**
 **④½ Contract review** (agent mode only)
 - Before dispatching Worker, spawn a lightweight review: "Is this iteration contract sufficient to achieve the US's AC? Any missing steps?"
@@ -241,6 +245,8 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
 - Also read `iter-signal.json` for `us_id` field (which US was just completed)
 - If `--debug`: debug_log `[EXEC] iter=N phase=worker_signal status=<stop_status> us_id=<us_id>`
+**CRITICAL: Immediately proceed to ⑦. Do NOT pause, do NOT ask the user, do NOT wait for confirmation. The loop is autonomous.**
 **⑦ Execute Verifier**
 **Per-US mode** (default, `--verify-mode per-us`):
@@ -262,6 +268,7 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
 **⑦a Dispatch Verifier**
 - Note: Verifier ALWAYS records reasoning in verify-verdict.json per governance §1f. No flag needed.
+- **Prompt Assembly Protocol (same as ④)**: Read verifier prompt file verbatim. Prepend `## WORKING_DIR: {absolute path}`. Do NOT rewrite paths.
 - If `--debug`: debug_log `[EXEC] iter=N phase=verifier engine=<engine> model=<model> scope=<us_id> dispatched=true`
 If `--verifier-engine claude` (default):
@@ -309,6 +316,8 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
 - If `--debug`: debug_log `[EXEC] iter=N phase=checkpoint level=<1|2> evidence=<summary>`
 - If `--debug` and consensus: debug_log `[EXEC] iter=N phase=consensus claude=<verdict> codex=<verdict> round=<N>`
+**CRITICAL: Immediately proceed to ⑧. Do NOT pause, do NOT ask the user. Continue the loop.**
 **⑧ Write result log and report to user, continue loop**
 - Write `logs/<slug>/iter-NNN.result.md`:
   - Result status `[leader-measured]`
@@ -316,6 +325,7 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
   - Verifier verdict `[leader-measured]`
 - Write `status.json`
 - Report: iteration N, phase, model used, result
+- **Always**: append to baseline.log: `[timestamp] iter=N verdict=<pass|fail|continue> us=<us_id> model=<worker_model>`
 - If `--debug`: debug_log `[EXEC] iter=N phase=result status=<result> consecutive_failures=<N> verified_us=<list>`
 At loop end (COMPLETE, BLOCKED, or TIMEOUT):
@@ -346,7 +356,7 @@ Per failure: Worker steps → Verifier reasoning → Root cause → Resolution
 ## 3. Worker Process Quality (§1f audit)
 Table: Iter | US | Steps | verify_red? | RED exit≠0? | verify_green? | Test-First? | E2E? | AC linked?
 Aggregate: TDD compliance %, RED confirmation %, E2E evidence %, step completeness %
-Audit: each step must have type from §1f vocabulary + ac_id + command + exit_code
+Audit: each step object must have "step" field with value from §1f vocabulary (write_test, verify_red, implement, verify_green, refactor, verify_e2e, commit, verify) + ac_id + command + exit_code
 ## 4. Verifier Judgment Quality (§1f audit)
 Table: Iter | US | Checks | All Basis? | Independent? | IL-1? | Layer? | Sufficiency? | Anti-Gaming? | Worker Audit?
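
The Prompt Assembly Protocol introduced in step ④ above is additive: the prompt file content is kept verbatim, a `## WORKING_DIR:` header is prepended, and iteration context is appended. The sketch below illustrates that shape under stated assumptions: the function names, the `## Iteration:` footer format, and the three-digit zero padding for `iter-NNN` are hypothetical, and only the header text and log path layout come from the diff.

```python
def assemble_worker_prompt(prompt_text: str, working_dir: str,
                           iteration: int, memory_contract: str) -> str:
    """Wrap the prompt file content without modifying it.

    Only additions are made: a WORKING_DIR header before the prompt and
    iteration context after it. Paths inside prompt_text are left untouched.
    """
    header = f"## WORKING_DIR: {working_dir}"
    # The footer layout is an assumption; the source only says to append
    # the iteration number and the memory contract.
    footer = f"## Iteration: {iteration}\n\n{memory_contract}"
    return f"{header}\n\n{prompt_text}\n\n{footer}"


def audit_log_path(slug: str, iteration: int) -> str:
    """Build the audit-trail path, assuming NNN means three-digit zero padding."""
    return f".claude/ralph-desk/logs/{slug}/iter-{iteration:03d}.worker-prompt.md"
```

Because the original text is embedded unchanged, absolute paths written at init time survive assembly, which is the property the protocol is protecting for worktree campaigns.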

package/src/scripts/init_ralph_desk.zsh

@@ -472,6 +472,27 @@ GIEOF
   echo "  + .gitignore (created with rlp-desk rules)"
 fi
+# --- Post-init validation gate ---
+INIT_FAIL=0
+for REQUIRED_FILE in \
+  "$DESK/prompts/$SLUG.worker.prompt.md" \
+  "$DESK/prompts/$SLUG.verifier.prompt.md" \
+  "$DESK/context/$SLUG-latest.md" \
+  "$DESK/memos/$SLUG-memory.md" \
+  "$DESK/plans/prd-$SLUG.md" \
+  "$DESK/plans/test-spec-$SLUG.md"; do
+  if [[ ! -f "$REQUIRED_FILE" ]]; then
+    echo "  ✗ MISSING: $REQUIRED_FILE"
+    INIT_FAIL=1
+  fi
+done
+if [[ $INIT_FAIL -eq 1 ]]; then
+  echo ""
+  echo "ERROR: Scaffold incomplete. Some required files were not created."
+  echo "Re-run init or check filesystem permissions."
+  exit 1
+fi
 echo ""
 echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
 echo "Scaffold ready: $SLUG"