npm - @ai-dev-methodologies/rlp-desk - Versions diffs - 0.3.6 → 0.5.0 - Mend

@ai-dev-methodologies/rlp-desk 0.3.6 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (16)

package/README.md +145 -69
package/docs/blueprints/blueprint-v0.4-evolution.md +347 -0
package/docs/plans/cozy-gliding-trinket.md +53 -0
package/docs/plans/toasty-whistling-diffie-agent-a6814625642e956da.md +201 -0
package/docs/plans/toasty-whistling-diffie.md +117 -0
package/docs/prompts/ralplan-codex-review.md +55 -0
package/install.sh +5 -0
package/package.json +1 -1
package/scripts/postinstall.js +5 -0
package/scripts/uninstall.js +1 -0
package/src/commands/rlp-desk.md +252 -70
package/src/governance.md +63 -28
package/src/model-upgrade-table.md +50 -0
package/src/scripts/init_ralph_desk.zsh +329 -13
package/src/scripts/lib_ralph_desk.zsh +837 -0
package/src/scripts/run_ralph_desk.zsh +978 -482

package/README.md CHANGED

@@ -99,6 +99,22 @@ for iteration in 1..max_iter:
   8. Update status, report to user, continue or stop
 ```
+### Live PRD Update
+The Leader computes a hash for `prd-<slug>.md` at startup and again at each iteration using `md5`.
+When the hash changes, it:
+- Logs `prd_changed=true` with `prd_hash`, previous/new US counts, and `new_us`
+- Splits the PRD into per-US files (`prd-<slug>-US-<id>.md`)
+- Splits the test-spec into per-US files (`test-spec-<slug>-US-<id>.md`)
+- Updates the in-memory PRD US list used for per-US dispatch
+- Adds `NOTE: PRD was updated since last iteration. New/changed US may exist.` to the Worker prompt
+If the PRD hash is unchanged, `prd_changed=false` is logged and no re-split is triggered.
+If the PRD file is missing, the process degrades gracefully and continues without failing the campaign loop.
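A minimal sketch of the hash comparison described in the Live PRD Update section, assuming `md5sum` (GNU) or `md5 -q` (BSD/macOS) is on the PATH. The function names `prd_hash` and `prd_change_status` and the `prd_missing=true` field are hypothetical illustrations, not taken from the rlp-desk scripts.

```shell
#!/bin/sh
# Sketch only: names below are hypothetical, not rlp-desk's own.

# Print the MD5 of a file, or an empty string if the file is missing.
prd_hash() {
  [ -f "$1" ] || { printf ''; return 0; }
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | cut -d' ' -f1   # GNU coreutils: "<hash>  <file>"
  else
    md5 -q "$1"                   # BSD/macOS: prints the hash only
  fi
}

# Compare the previous hash with the file's current hash and
# print the log field a leader loop could record.
prd_change_status() {
  prev="$1"; file="$2"
  cur=$(prd_hash "$file")
  if [ -z "$cur" ]; then
    echo "prd_missing=true"       # hypothetical field: degrade gracefully, keep looping
  elif [ "$cur" != "$prev" ]; then
    echo "prd_changed=true prd_hash=$cur"
  else
    echo "prd_changed=false"
  fi
}
```

A caller would store the hash at startup and call `prd_change_status "$stored" "$prd_file"` at each iteration, re-splitting the PRD only when the output starts with `prd_changed=true`.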
 ### Verification Policy (v0.3.0)
 RLP Desk enforces a comprehensive verification policy defined in `governance.md`:
@@ -133,15 +149,75 @@ RLP Desk enforces a comprehensive verification policy defined in `governance.md`
 | 3 consecutive failures | Architecture Escalation (§7¾) → report to user |
 | Max iterations reached | TIMEOUT |
-### Model Routing
+### Verification Strategy (v0.5)
+**Core principle: Worker and Verifier use different AI engines whenever possible.**
+- Per-US: lightweight verification after each user story (catches issues early)
+- Final: top-tier consensus gate before COMPLETE (quality guarantee)
+- Progressive upgrade: auto-upgrade models on consecutive failure (2-attempt windows)
+- Verifier minimum: claude sonnet (haiku cannot verify)
+#### 1. Claude-only (codex not installed)
+Verifier is at or above the Worker's tier (+1 tier at LOW and HIGH, same tier at MEDIUM and CRITICAL). Same-engine verification shares blind spots, so install codex for improved detection.
+| Risk | Worker | Per-US Verifier | Worker upgrade path | Verifier upgrade path |
+|------|--------|-----------------|--------------------|-----------------------|
+| LOW | haiku | sonnet | sonnet → opus | sonnet → opus |
+| MEDIUM | sonnet | sonnet | opus | sonnet → opus |
+| HIGH | sonnet | opus | opus | opus (ceiling) |
+| CRITICAL | opus | opus ⚠ | (ceiling) | (ceiling) |
+Final: **opus solo** ⚠ same-engine warning displayed
+#### 2. Cross-engine: GPT Pro (spark + 5.4)
+Spark is speed-optimized for coding. Use as Worker for LOW-HIGH; 5.4 for CRITICAL.
+| Risk | Worker (codex) | Per-US Verifier (claude) | Worker upgrade path | Verifier upgrade path |
+|------|---------------|--------------------------|--------------------|-----------------------|
+| LOW | spark medium | sonnet | spark high → xhigh | sonnet → opus |
+| MEDIUM | spark high | sonnet | spark xhigh → 5.4 medium | sonnet → opus |
+| HIGH | spark xhigh | opus | 5.4 high → 5.4 xhigh | opus (ceiling) |
+| CRITICAL | 5.4 high | opus | 5.4 xhigh | opus (ceiling) |
+Final: **opus + 5.4 high** (both must PASS)
+#### 3. Cross-engine: Non-Pro (5.4 only)
+| Risk | Worker (codex) | Per-US Verifier (claude) | Worker upgrade path | Verifier upgrade path |
+|------|---------------|--------------------------|--------------------|-----------------------|
+| LOW | 5.4 low | sonnet | 5.4 medium → high | sonnet → opus |
+| MEDIUM | 5.4 medium | sonnet | 5.4 high → xhigh | sonnet → opus |
+| HIGH | 5.4 high | opus | 5.4 xhigh | opus (ceiling) |
+| CRITICAL | 5.4 xhigh | opus | (ceiling) | opus (ceiling) |
+Final: **opus + 5.4 high** (both must PASS)
+#### Final Verify
+| Environment | Engine 1 | Engine 2 | Rule |
+|-------------|----------|----------|------|
+| Claude-only | opus | — | Solo ⚠ |
+| Cross-engine | opus | 5.4 high | Both must PASS → COMPLETE |
+#### Progressive Upgrade (Worker Only)
+Worker auto-upgrades on consecutive failures of the same US. Verifier is fixed at campaign start. Circuit breaker (CB) default: 6.
+```
+fail 1-2: keep current model (2-attempt window)
+fail 3-4: upgrade 1 step (e.g., haiku → sonnet)
+fail 5-6: upgrade 2 steps (e.g., haiku → opus)
+fail 7+:  ceiling reached → BLOCKED
+```
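The 2-attempt windows above can be sketched as a small calculation. This is an illustration under stated assumptions, not the package's implementation: `upgrade_steps` and `model_after` are hypothetical names, the circuit-breaker threshold is passed as an optional second argument defaulting to 6, and the model ladder is supplied by the caller.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# Map a consecutive-failure count to the number of upgrade steps,
# or BLOCKED once the count exceeds the circuit-breaker threshold.
upgrade_steps() {
  fails="$1"; cb="${2:-6}"
  if [ "$fails" -gt "$cb" ]; then echo BLOCKED; return 0; fi
  if [ "$fails" -lt 1 ]; then echo 0; return 0; fi
  echo $(( (fails - 1) / 2 ))     # 1-2 -> 0, 3-4 -> 1, 5-6 -> 2
}

# Pick the model N steps above the starting model on a ladder
# (remaining arguments, lowest tier first), clamped at the top.
model_after() {
  start="$1"; steps="$2"; shift 2
  found=0; pick=""
  for m in "$@"; do
    if [ "$found" -eq 0 ]; then
      if [ "$m" = "$start" ]; then found=1; pick="$m"; fi
    elif [ "$steps" -gt 0 ]; then
      pick="$m"; steps=$((steps - 1))
    fi
  done
  echo "$pick"
}
```

For example, `model_after haiku "$(upgrade_steps 5)" haiku sonnet opus` prints `opus`, matching the "fail 5-6" row.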
-| Scenario | Model |
-|----------|-------|
-| Simple, single-file changes | `haiku` |
-| Standard work (default) | `sonnet` |
-| Architecture changes, multi-file, prior failure | `opus` |
-| Verification (default) | `opus` |
-| Lightweight verification | `sonnet` |
+See `src/model-upgrade-table.md` for full upgrade paths per engine and complexity level.
+#### Sequential Final Verify
+When all US pass individually, the final ALL verify runs **sequentially per-US** instead of one big check. This prevents verifier timeout on large PRDs. After all per-US checks pass, the project's test suite runs once as a cross-US integration check.
 ## Commands
@@ -159,18 +235,29 @@ RLP Desk enforces a comprehensive verification policy defined in `governance.md`
 | Flag | Default | Description |
 |------|---------|-------------|
 | `--max-iter N` | 100 | Maximum iterations before timeout |
-| `--worker-model MODEL` | sonnet | Worker model (haiku/sonnet/opus) |
-| `--verifier-model MODEL` | opus | Verifier model (haiku/sonnet/opus) |
 | `--mode agent\|tmux` | agent | Execution mode (see below) |
-| `--worker-engine claude\|codex` | claude | Engine for Worker (claude uses Agent(), codex uses Bash CLI) |
-| `--verifier-engine claude\|codex` | claude | Engine for Verifier |
-| `--codex-model MODEL` | gpt-5.4 | Model passed to the Codex CLI (when engine=codex) |
-| `--codex-reasoning low\|medium\|high` | high | Reasoning effort for Codex |
+| `--worker-model MODEL` | sonnet | Claude worker model (haiku/sonnet/opus) |
+| `--worker-engine claude\|codex` | claude | Worker engine |
+| `--verifier-model MODEL` | auto | Auto-selected: +1 tier (same-engine) or cross-engine |
+| `--verifier-engine claude\|codex` | auto | Opposite of worker engine if codex available |
+| `--codex-model MODEL` | gpt-5.4 | Codex model (spark requires GPT Pro) |
+| `--codex-reasoning LEVEL` | medium | low/medium/high/xhigh |
 | `--verify-mode per-us\|batch` | per-us | Verification strategy (see below) |
-| `--verify-consensus` | off | Cross-engine consensus verification (see below) |
+| `--lock-worker-model` | off | Disable progressive model upgrade on failure |
 | `--debug` | off | Debug logging to `logs/<slug>/debug.log` |
 | `--with-self-verification` | off | Campaign-level post-loop analysis report |
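One way to read the `auto` default for `--verifier-engine` is sketched below. It assumes codex availability is already known and passed in as `1` or `0`; `pick_verifier_engine` is a hypothetical name, and the actual selection logic in the package may differ.

```shell
#!/bin/sh
# Sketch only: hypothetical helper, not part of rlp-desk.

# Choose the verifier engine: the opposite of the worker engine
# when codex is available, otherwise claude.
pick_verifier_engine() {
  worker="$1"; codex_available="$2"   # codex_available: 1 or 0
  if [ "$codex_available" -ne 1 ]; then
    echo claude
    return 0
  fi
  case "$worker" in
    claude) echo codex ;;
    codex)  echo claude ;;
    *)      echo claude ;;            # unknown engine: fall back to claude
  esac
}
```

A caller could derive the second argument with `command -v codex >/dev/null 2>&1 && echo 1 || echo 0`.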
+### Init Presets
+After `brainstorm`, `init` detects your environment and presents run command presets:
+- **Codex detected** → recommends cross-engine mode (`--worker-model gpt-5.4:high --verify-consensus`)
+- **GPT Pro (spark)** → offers spark preset (`--worker-model gpt-5.3-codex-spark:high`)
+- **Claude-only** → defaults to `--worker-model sonnet` with opus verifier
+- **Basic** → minimal flags for quick iteration
+The brainstorm phase evaluates complexity (US count, file scope, logic, dependencies, code impact) and recommends a starting model. You can override any recommendation.
 ## Execution Modes
 RLP Desk supports two execution modes. Both honor the same governance protocol.
@@ -277,28 +364,18 @@ Uses the `codex` CLI via `Bash()` (agent mode) or as an interactive TUI (tmux mo
 ## Verification Modes
-RLP Desk supports two verification strategies. **Per-US is the default.**
 ### Per-US Verification (default)
-```
-/rlp-desk run calculator
-/rlp-desk run calculator --verify-mode per-us
-```
-Each user story is verified independently after completion, then a final full verification runs after all stories pass:
+Each user story is verified independently, then a final full verification runs:
 ```
-Worker: US-001 → Verifier: US-001 AC only → pass
-Worker: US-002 → Verifier: US-002 AC only → pass
-Worker: US-003 → Verifier: US-003 AC only → pass
-Final full verify: ALL AC → pass → COMPLETE
+Worker: US-001 → Verifier(per-US): US-001 only → pass
+Worker: US-002 → Verifier(per-US): US-002 only → pass
+...
+Final Verify: opus + 5.4 high → both pass → COMPLETE
 ```
-Benefits:
-- Catch issues early, before later stories build on broken foundations
-- Smaller verification scope = faster, more accurate checks
-- Failed verification retries only the specific US
+Per-US catches issues early before later stories build on broken foundations.
 ### Batch Verification
@@ -306,30 +383,7 @@ Benefits:
 /rlp-desk run calculator --verify-mode batch
 ```
-Legacy behavior: Worker completes all stories, then a single verification checks all acceptance criteria at once.
-### Cross-Engine Consensus Verification
-```
-/rlp-desk run calculator --verify-consensus
-```
-When enabled, **both claude and codex verify independently**. Both must pass for verification to succeed.
-```
-Worker completes US → Claude verifies → Codex verifies
-  Both pass → proceed
-  Either fails → combined fix contract → Worker retry
-  3 rounds without consensus → BLOCKED
-```
-Consensus can be combined with per-US mode for maximum rigor:
-```
-/rlp-desk run calculator --verify-mode per-us --verify-consensus
-```
-Prerequisites: Both `claude` and `codex` CLIs must be installed.
+Worker completes all stories, then a single verification checks all AC at once. Final verify still applies.
 ## Project Structure
@@ -337,20 +391,42 @@ After `init`, your project gets this scaffold:
 ```
 your-project/
-└── .claude/ralph-desk/
-    ├── prompts/
-    │   ├── <slug>.worker.prompt.md
-    │   └── <slug>.verifier.prompt.md
-    ├── context/
-    │   └── <slug>-latest.md
-    ├── memos/
-    │   └── <slug>-memory.md
-    ├── plans/
-    │   ├── prd-<slug>.md
-    │   └── test-spec-<slug>.md
-    └── logs/<slug>/
-        └── status.json
-```
+├── .claude/
+│   ├── settings.local.json          # rlp-desk permissions (auto-added by init)
+│   └── ralph-desk/
+│       ├── prompts/
+│       │   ├── <slug>.worker.prompt.md
+│       │   └── <slug>.verifier.prompt.md
+│       ├── context/
+│       │   └── <slug>-latest.md
+│       ├── memos/
+│       │   └── <slug>-memory.md
+│       ├── plans/
+│       │   ├── prd-<slug>.md
+│       │   └── test-spec-<slug>.md
+│       └── logs/<slug>/
+│           └── status.json
+```
+### Local Settings
+`init` automatically adds the following permissions to `.claude/settings.local.json`:
+```json
+{
+  "permissions": {
+    "allow": [
+      "Read(.claude/ralph-desk/**)",
+      "Edit(.claude/ralph-desk/**)",
+      "Write(.claude/ralph-desk/**)"
+    ]
+  }
+}
+```
+**Why:** Claude Code treats `.claude/` files as sensitive and prompts for confirmation on each access, even with `--dangerously-skip-permissions`. Without these permissions, Worker and Verifier agents are blocked by interactive prompts during automated loop execution.
+**Note:** `settings.local.json` is local to your machine and is not committed to git. If the file already exists, permissions are merged without overwriting your existing settings.
 ## Example: Calculator

package/docs/blueprints/blueprint-v0.4-evolution.md ADDED

@@ -0,0 +1,347 @@
+# Blueprint: rlp-desk v0.4 Evolution
+> Design blueprint for rlp-desk's next major direction.
+> Status: CONFIRMED (Deep Interview 4.9% ambiguity) | Author: kyjin | Date: 2026-03-26
+---
+## Vision
+rlp-desk is both a **task execution tool** and a **workflow generator**.
+Users start with unstructured work, iterate through self-verification cycles,
+and the accumulated process naturally becomes a reusable, formalized workflow.
+```
+[Unstructured]                    [Structured]
+brainstorm → run → verify          Workflow (skill + command composition)
+  → re-brainstorm → run → verify   Feedback loop enforcement
+  → run → verify                    Reproducible process
+  → final result            ──▶     (P3: determined after P0-P2 iteration)
+```
+---
+## 1. Debug (`--debug`) — Execution Trace
+### Purpose
+Trace rlp-desk's execution process. Not verbose data dumps — focused logging
+of whether rules were followed and options behaved as configured.
+### Two audiences
+| Audience | Use case |
+|----------|----------|
+| Developer (self) | Verify governance compliance, catch erroneous execution (e.g., codex consensus FAIL treated as PASS) |
+| External users (npm) | Run with `--debug`, attach debug.log + version to bug report |
+### Scope
+- Governance rule compliance (IL-1 through IL-5, checkpoint enforcement)
+- Option behavior verification (consensus, per-us, model routing)
+- Decision points (model upgrades, circuit breaker triggers)
+- NOT implementation details of Worker/Verifier content
+### Current state
+Basic implementation exists in v0.3.6 (debug.log with phase-level entries).
+Refine to match the scoped purpose above — no expansion needed, possibly trimming.
+### Versioning
+debug.log is versioned on re-execution: `debug-v1.log`, `debug-v2.log`, ...
+Preserved for bug tracking across versions.
+---
+## 2. Self-Verification (`--with-self-verification`) — Quality Feedback Loop
+### Purpose
+Evaluate whether the AI implementation meets quality expectations.
+When the user is unsatisfied, self-verification becomes the **input for the next
+execution cycle** — same goal, improved strategy.
+### Status
+Remains an **optional flag** (`--with-self-verification`). Not always-on.
+Simple tasks don't need the re-execution cycle.
+### Current vs. New
+| Aspect | Current (v0.3.x) | New vision |
+|--------|-------------------|------------|
+| Timing | Post-campaign report only | Input for re-execution cycle |
+| Output | Static report | Living document that drives next iteration |
+| Scope | Analysis of what happened | Analysis + recommendations that reshape execution plan |
+| PRD | Not touched | May be refined based on self-verification findings |
+### Re-execution Cycle (same slug)
+```
+brainstorm("auth-refactor")
+  → init → run → self-verification-v1 + campaign-report-v1 + debug-v1
+                              │
+              user: "not satisfied, re-run"
+                              │
+                              ▼
+re-brainstorm("auth-refactor")
+  │
+  ├─ PRD: single file, updated in place if needed (no versioning)
+  ├─ SV report: renamed to self-verification-v1.md (preserved)
+  ├─ Campaign report: renamed to campaign-report-v1.md (preserved)
+  ├─ Debug log: renamed to debug-v1.log (preserved)
+  ├─ Everything else: deleted (test-spec, prompts, context, memos, logs)
+  └─ Re-brainstorm: informed by self-verification-v1
+                              │
+                              ▼
+  → init → run → self-verification-v2 + campaign-report-v2 + debug-v2
+                              │
+                              ...
+```
+### Versioning Rules
+When re-running the same slug:
+1. **PRD** — single file (`prd-<slug>.md`), updated in place if needed.
+   No versioning. PRD is the single source of truth.
+2. **Versioned files (3 total)** — renamed with vN suffix before re-run:
+   - `self-verification-report.md` → `self-verification-v1.md`
+   - `campaign-report.md` → `campaign-report-v1.md`
+   - `debug.log` → `debug-v1.log`
+3. **Everything else** — deleted. Next run regenerates them automatically:
+   - `test-spec-<slug>.md`, `prompts/`, `context/`, `memos/`, `logs/<slug>/*`
+4. **Self-verification as the historical record** — each version's story
+   is told by its self-verification report. No need to preserve iteration
+   logs; SV summarizes what happened.
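The renaming in rule 2 could look roughly like the sketch below. It assumes the next version number is the lowest `vN` not yet present in the directory; `next_version` and `version_file` are hypothetical names, and the real logic is expected to live in `init_ralph_desk.zsh`.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not taken from init_ralph_desk.zsh.

# Print the lowest N for which <dir>/<stem>-vN.<ext> does not exist yet.
next_version() {
  dir="$1"; stem="$2"; ext="$3"
  n=1
  while [ -e "$dir/$stem-v$n.$ext" ]; do n=$((n + 1)); done
  echo "$n"
}

# Rename <dir>/<current> to <dir>/<stem>-vN.<ext>; do nothing if it is absent.
version_file() {
  dir="$1"; current="$2"; stem="$3"; ext="$4"
  [ -f "$dir/$current" ] || return 0
  n=$(next_version "$dir" "$stem" "$ext")
  mv "$dir/$current" "$dir/$stem-v$n.$ext"
}
```

Usage for the three versioned files, with `$logs` standing in for the campaign's log directory: `version_file "$logs" self-verification-report.md self-verification md`, `version_file "$logs" campaign-report.md campaign-report md`, and `version_file "$logs" debug.log debug log`.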
+### Re-execution Detection
+When brainstorm detects an existing slug:
+- Ask the user: "Improve based on previous results, or start fresh?"
+- If improve: version existing files, carry forward PRD, re-brainstorm with SV context
+- If start fresh: clean everything (equivalent to `clean` + new brainstorm)
+### Implementation: Shell Script
+Deterministic file operations (rename, delete, version detection) go in
+`init_ralph_desk.zsh`. AI handles judgment (PRD refinement, strategy changes).
+---
+## 3. Post-Run Reporting — Mandatory Completion Report
+### Purpose
+After `run` completes, the user MUST receive a comprehensive, templated report
+before deciding next steps. This is the **decision surface** for whether to
+re-brainstorm or accept the result.
+### Trigger
+Mandatory after every `run` completion (COMPLETE, BLOCKED, or TIMEOUT).
+Not optional. Not skippable. Applies regardless of SV flag.
+### Output
+- **File**: `logs/<slug>/campaign-report.md` (versioned on re-execution)
+- **Screen**: Full report displayed to user
+### Report Template
+```markdown
+# Campaign Report: <slug>
+## Objective
+<from PRD>
+## Execution Summary
+| Metric | Value |
+|--------|-------|
+| Total iterations | N |
+| Outcome | COMPLETE / BLOCKED / TIMEOUT |
+| Worker model | sonnet / opus |
+| Verifier model | opus |
+| Duration | Xm Ys |
+## User Stories Status
+| US | Description | Status | Iterations | Notes |
+|----|-------------|--------|------------|-------|
+| US-001 | ... | PASS | 2 | — |
+| US-002 | ... | PASS | 4 | 2 fix rounds |
+## Verification Results
+- L1 (Unit): PASS — N tests, N assertions
+- L2 (Integration): PASS / N/A
+- L3 (E2E): PASS — input/output comparison
+- L4 (Deploy): N/A
+## Issues Encountered
+<failures, fix loops, model upgrades, escalations>
+## Cost & Performance
+| Role | Model | Tokens | Duration | Source |
+|------|-------|--------|----------|--------|
+| Worker | sonnet | N | Xm Ys | measured/estimated |
+| Verifier | opus | N | Xm Ys | measured/estimated |
+| **Total** | | **N** | **Xm Ys** | |
+## Self-Verification Summary (if enabled)
+<from self-verification report — strengths, weaknesses, recommendations>
+## Files Changed
+<git diff --stat summary>
+```
+### Post-Report Flow
+```
+[All runs]
+  → Report displayed + saved to file
+[SV enabled only]
+  → "Would you like to re-brainstorm to improve the result?"
+  → Yes: trigger re-execution cycle (§2)
+  → No:  session ends
+[SV not enabled]
+  → Report displayed, session ends (no re-brainstorm question)
+```
+### Rules
+- Report content must reference actual data (status.json, iteration results,
+  self-verification if available) — no fabrication
+- Template is fixed; sections may show "N/A" but cannot be omitted
+- Re-brainstorm question only appears when SV is enabled
+---
+## 4. Workflow Generation — From Ad-hoc to Reproducible (P3, deferred)
+### Purpose
+After multiple self-verification cycles produce a final result,
+automatically generate a **formalized workflow** that captures the
+proven steps as a reusable, enforceable process.
+### Status: DEFERRED
+P3 design will be determined after P0-P2 are working and tested through
+real usage. The user will iterate with the developer to find the right form.
+### Known Direction
+- Output: **skill + command composition** (not rlp-desk PRD format)
+- Invoking the command triggers a combination of skills
+- The command structure enforces feedback loops
+- Leverage existing skill ecosystem (`/find-skills`, etc.)
+- Trust AI models + well-structured skills + checklist-managed feedback loops
+### Success Criteria (confirmed)
+- Process must be reproducible: same workflow → same quality of results
+- Code implementation may differ, but behavior/quality must be equivalent
+- Feedback loop + template documents as structural components
+### Open Questions (to resolve after P0-P2)
+1. Separate subcommand (`/rlp-desk workflow <slug>`) or independent skill?
+2. Minimum self-verification versions required (2? 3?)
+3. How to validate generated workflow actually reproduces results?
+---
+## Feature Relationship
+```
+┌──────────────────────────────────────────────────────────────┐
+│                     rlp-desk execution                       │
+│                                                              │
+│  brainstorm → init → run ──┐                                 │
+│                             │                                │
+│              ┌──────────────┼──────────────────┐             │
+│              │              │                  │             │
+│           --debug    --with-self-verification   │             │
+│           (execution  (quality evaluation)      │             │
+│            trace)           │                   │             │
+│              │              ▼                   │             │
+│              │     self-verification report     │             │
+│              │              │                   │             │
+│              ▼              ▼                   │             │
+│         debug.log   Post-Run Report (mandatory) │             │
+│         (versioned)  + campaign-report.md        │             │
+│              │       (versioned)                 │             │
+│              │         │                        │             │
+│              │    [SV enabled?]                  │             │
+│              │      │         │                  │             │
+│              │     Yes        No                 │             │
+│              │      │         │                  │             │
+│              │      ▼         ▼                  │             │
+│              │  "Re-brainstorm?"  End            │             │
+│              │    │         │                    │             │
+│              │   Yes        No                   │             │
+│              │    │         │                    │             │
+│              │    ▼         ▼                    │             │
+│              │  Re-execute  Accept               │             │
+│              │  (vN+1)     result                │             │
+│              │    │         │                    │             │
+│              │    │    ┌────┘                    │             │
+│              │    ▼    ▼                         │             │
+│              │  Workflow Generation (P3)          │             │
+│              │  (deferred — after P0-P2)          │             │
+│              │                                   │             │
+│              └── Bug report (external users)     │             │
+└──────────────────────────────────────────────────────────────┘
+```
+---
+## Implementation Priority
+| Phase | Feature | Dependency | Scope |
+|-------|---------|------------|-------|
+| P0 | Debug refinement | None | rlp-desk.md, governance.md |
+| P1 | Post-Run Report | None | rlp-desk.md |
+| P2 | Self-Verification redesign | P1 | rlp-desk.md, init_ralph_desk.zsh, governance.md |
+| P3 | Workflow Generation | P2 + real usage | TBD after iteration |
+- P0 and P1 are independent, can be done in parallel.
+- P2 builds on P1 (report triggers re-brainstorm question).
+- P3 requires P2 (needs versioned self-verification data) + real-world testing.
+- Breaking changes allowed (0.x semver). Document in CHANGELOG.
+---
+## Design Decisions Log (from Deep Interview)
+| # | Decision | Rationale |
+|---|----------|-----------|
+| 1 | brainstorm auto-handles re-execution | Natural UX — same command, system detects context |
+| 2 | Mixed judgment (quantitative + user) | Pure metrics miss quality nuance; pure subjective misses patterns |
+| 3 | Breaking changes OK | 0.x semver, clean redesign over backward compat hacks |
+| 4 | P3 essential but deferred | Core vision requires it, but form emerges from P0-P2 usage |
+| 5 | Skill + command composition for P3 | Leverage existing ecosystem, not reinvent |
+| 6 | All features in rlp-desk | Connected UX > separate tools |
+| 7 | Shell script for deterministic ops | AI interpretation unreliable for file manipulation |
+| 8 | SV remains optional | Simple tasks don't need re-execution overhead |
+| 9 | Re-brainstorm only with SV | No SV = no improvement data = no point asking |
+| 10 | Report always mandatory | Users need decision surface regardless of SV |
+| 11 | PRD = single file, no versioning | Source of truth, updated in place |
+| 12 | Only 3 files versioned | SV report + campaign report + debug.log |
+| 13 | Report includes cost section | Token/time tracking for optimization decisions |
+| 14 | Ask user intent on slug reuse | "Improve?" vs "Start fresh?" — don't assume |
+---
+## Open Questions (P3 only)
+1. **Workflow as subcommand or separate skill?** — defer to P0-P2 experience
+2. **Minimum SV versions for generation** — need real data to determine
+3. **Reproducibility validation method** — how to test generated workflow works
+4. **Skill composition mechanics** — how feedback loop enforcement works in practice

package/docs/plans/cozy-gliding-trinket.md ADDED

@@ -0,0 +1,53 @@
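+# Plan: Verify the refactoring in a real run + restart v05-remaining
+## Context
+Engine path refactoring Phases 0-7 are complete (38 TDD structural tests pass).
+However, it has **not been verified in an actual tmux run**. We need to confirm that the refactoring works correctly in a real campaign.
+## Verification Order
+### Step 1: Clean up zombie runners + sentinels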
+```bash
+ps aux | grep run_ralph_desk | grep -v grep | awk '{print $2}' | xargs kill 2>/dev/null
+for p in $(tmux list-panes -F '#{pane_id}' | grep -v '%360'); do tmux kill-pane -t "$p" 2>/dev/null; done
+rm -f .claude/ralph-desk/memos/v05-remaining-blocked.md
+rm -f .claude/ralph-desk/memos/v05-remaining-complete.md
+rm -f .claude/ralph-desk/memos/v05-remaining-done-claim.json
+rm -f .claude/ralph-desk/memos/v05-remaining-verify-verdict.json
+rm -f .claude/ralph-desk/memos/v05-remaining-iter-signal.json
+rm -f .claude/ralph-desk/logs/v05-remaining/session-config.json
+```
+### Step 2: Run the v05-remaining campaign (spark worker)
+```bash
+LOOP_NAME="v05-remaining" ROOT="$PWD" MAX_ITER=15 \
+WORKER_MODEL=gpt-5.3-codex-spark WORKER_ENGINE=codex \
+WORKER_CODEX_MODEL=gpt-5.3-codex-spark WORKER_CODEX_REASONING=medium \
+VERIFIER_MODEL=sonnet VERIFIER_ENGINE=claude \
+VERIFY_MODE=per-us VERIFY_CONSENSUS=0 CB_THRESHOLD=6 \
+ITER_TIMEOUT=600 DEBUG=1 WITH_SELF_VERIFICATION=1 \
+  zsh ~/.claude/ralph-desk/run_ralph_desk.zsh
+```
+(run_in_background=true)
+### Step 3: Verification checklist
+- [ ] 3 panes created (leader + worker + verifier)
+- [ ] codex exec runs in the Worker pane (bash trigger, no false dead-pane detection)
+- [ ] After the Worker finishes, heartbeat reports exited → signal is auto-generated
+- [ ] Verifier (sonnet) starts normally + writes a verdict
+- [ ] Progress reaches US-002 or beyond (US-001 was already verified earlier)
+- [ ] No zombie runners (check with ps)
+### Step 4: Handling failures
+- codex worker fails to start → inspect the trigger script contents + test by running it manually
+- verifier timeout → tail the runner log + check pane state
+- BLOCKED → analyze the sentinel for the cause + fix and retry
+### Step 5: On success
+- Monitor campaign progress (check status)
+- Wait for completion or hand off to the next session
+## Files
+- `src/scripts/run_ralph_desk.zsh`: refactored runner
+- `~/.claude/ralph-desk/run_ralph_desk.zsh`: locally synced copy
+- `.claude/ralph-desk/logs/v05-remaining/`: campaign artifacts