@ai-dev-methodologies/rlp-desk 0.15.1 → 0.15.2

@@ -0,0 +1,178 @@
+ # rlp-desk Stabilization Plan (v0.15.x → v0.16.x)
+
+ > **Status**: ACTIVE. Replaces the misdirected 2026-05-07 "pivot to omc" decision (PR #8 redirect via this plan).
+ > **Goal**: bring rlp-desk to omc /team/ralph/ralplan level of reliability **while preserving rlp-desk's self-driving advantages**.
+ > **Non-goal**: pivoting away from rlp-desk. omc is the **benchmark**, not the replacement.
+
+ ---
+
+ ## 0. Why this plan exists (correction note)
+
+ On the morning of 2026-05-07, I (the assistant) ran `plan-ceo-review` on the question "rlp-desk vs omc /team" and recommended entering maintenance mode and pivoting to omc. The user immediately corrected this: *the goal was always to make rlp-desk work as reliably as omc, NOT to replace it*.
+
+ This plan is the corrected direction: stabilize rlp-desk by learning from omc's patterns and applying them to rlp-desk's substrate, while protecting the four real differentiators that make rlp-desk worth using in the first place.
+
+ The misdirected commit `229e1b6` (the "maintenance mode" banner + FROZEN doc) is reverted in this PR. The pivot prompt-optimizer artifact and BOS validation plan stay on disk but are deferred. They may become useful later as a comparison study, but they are not the active path.
+
+ ---
+
19
+ ## 1. The vision (preserved verbatim)
20
+
21
+ 1. ralph-loop fresh-context per iteration (no context pollution)
22
+ 2. idea → plan distillation
23
+ 3. PRD formalization
24
+ 4. Worker/Verifier cycles with iterative improvement
25
+ 5. **Full autonomy — minimum operator intervention**
26
+
27
+ This vision is the core. Stabilization is in service of it, not a substitute for it.
28
+
29
+ ---
30
+
31
+ ## 2. Differentiators to preserve (rlp-desk-only)
32
+
33
+ These four are the reason rlp-desk exists separately from omc. Stabilization work MUST NOT compromise them:
34
+
35
+ 1. **Multi-engine parallel consensus per iteration**: `--consensus all` runs claude AND codex on every verification, then reconciles. omc /ralph supports `--critic=codex` but as a single critic, not parallel consensus.
36
+ 2. **Multi-mission queue + cross-mission analytics**: `RLP_BACKGROUND=1` chains missions and tracks cross-mission metrics. omc /team is single-task.
37
+ 3. **BLOCK_TAGS P1-D failure taxonomy**: structured `reason_category × recoverable × suggested_action` classification. omc emits simpler verdicts (pass/fail/blocked).
38
+ 4. **Structured SV reports**: post-campaign analytics at `~/.claude/ralph-desk/analytics/<slug>/self-verification-report-NNN.md`. omc has lighter `progress.txt`.
39
+
40
+ These four ARE the value proposition. The stabilization work below is about making the substrate that delivers them as reliable as omc's.
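As a rough illustration of the third differentiator, a block tag can be modeled as a three-field record, in contrast to a flat pass/fail/blocked verdict. Only the field names come from the `reason_category × recoverable × suggested_action` triple above; the concrete category and action values below (`contract_violation`, `retry_iteration`, and so on) are hypothetical placeholders, not rlp-desk's actual vocabulary.

```javascript
// Hypothetical sketch of a BLOCK_TAGS-style record. Only the three field
// names come from the plan; every concrete value here is an assumption.
const REASON_CATEGORIES = ['lifecycle_race', 'contract_violation', 'runtime_gate']; // assumed
const SUGGESTED_ACTIONS = ['retry_iteration', 'switch_engine', 'ask_operator']; // assumed

function makeBlockTag({ reasonCategory, recoverable, suggestedAction }) {
  if (!REASON_CATEGORIES.includes(reasonCategory)) {
    throw new Error(`unknown reason_category: ${reasonCategory}`);
  }
  if (typeof recoverable !== 'boolean') {
    throw new Error('recoverable must be a boolean');
  }
  if (!SUGGESTED_ACTIONS.includes(suggestedAction)) {
    throw new Error(`unknown suggested_action: ${suggestedAction}`);
  }
  return {
    reason_category: reasonCategory,
    recoverable,
    suggested_action: suggestedAction,
  };
}

// Collapsing to a simple verdict discards all three fields, which is the
// information a pass/fail/blocked scheme does not carry.
function toSimpleVerdict(blockTag) {
  return blockTag === null ? 'pass' : 'blocked';
}

const tag = makeBlockTag({
  reasonCategory: 'contract_violation',
  recoverable: true,
  suggestedAction: 'retry_iteration',
});
console.log(tag, toSimpleVerdict(tag));
```

The point of the sketch is only that a leader can branch on `recoverable` and `suggested_action` without parsing free text; how rlp-desk actually encodes these tags is not shown in this diff.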
41
+
42
+ ---
43
+
44
+ ## 3. The 10-bug regression pattern (what we're hardening against)
45
+
46
+ Six weeks (2026-05-01 to 2026-05-07), 10 bugs, each prior fix exposing the next. Categorized:
47
+
48
+ | Cat | Bugs | Root cause cluster |
49
+ |---|---|---|
50
+ | (a) tmux/process lifecycle race | #5, #6, #7, #10 | Long-lived TUI processes in tmux panes; sentinel polling races; recovery hygiene |
51
+ | (b) artifact contract / schema | #3, #4, #8, #9 | Worker/Verifier output contract violations; LLM non-determinism on schema; verified_us persistence |
52
+ | (c) LLM-runtime constraint | #1 | Claude Code `.claude/` self-modification gate blocking sentinel writes |
53
+ | (d) recovery hygiene | #10 | Manual recovery on relaunch silently overwritten |
54
+
55
+ **Per category, what omc does differently** (preliminary — to be verified empirically in §5):
56
+
57
+ - **(a) Lifecycle race**: omc /team uses Claude Code native team primitives (`TeamCreate`, `TaskCreate`, `SendMessage`). No tmux, no long-lived TUI, no sentinel polling. Process lifecycle = subagent lifecycle = single Claude Code call. Race window does not exist.
58
+ - **(b) Contract violations**: omc /ralph uses `prd.json` with `passes: bool` per story + reviewer verifies acceptance criteria. Simpler schema = less surface for LLM to violate. omc also has mandatory deslop pass + regression re-verification (`ai-slop-cleaner` + Step 7.6).
59
+ - **(c) Self-modification gate**: omc skills are read by Claude Code via the Skill tool, not written by Workers. Workers don't touch `.claude/` paths. Gate not encountered.
60
+ - **(d) Recovery**: omc /ralph is session-scoped (`.omc/state/sessions/{sessionId}/prd.json`). Per-session state means relaunch starts fresh; there is no "manual recovery" surface to break.
61
+
62
+ These are the patterns to learn from. Adopting them does NOT require pivoting away from rlp-desk; it requires bringing equivalent semantics into rlp-desk's substrate.
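To make the "simpler schema" point in (b) concrete, here is a minimal sketch of a per-story `passes` check. Only the `passes` boolean is named in the plan; the `stories` array and the `id` field are assumed names used purely for illustration, and the real omc `prd.json` layout may differ.

```javascript
// Sketch of the "passes: bool per story" idea. Only `passes` is named in the
// plan; `stories` and `id` are assumed field names for illustration.
function summarizePrd(prd) {
  const failing = prd.stories.filter((story) => story.passes !== true);
  return {
    done: failing.length === 0,
    remaining: failing.map((story) => story.id),
  };
}

const examplePrd = {
  stories: [
    { id: 'US-001', passes: true },
    { id: 'US-002', passes: false },
  ],
};
console.log(summarizePrd(examplePrd)); // { done: false, remaining: [ 'US-002' ] }
```

With a single boolean per story there is little for a model to get wrong, which is the property the category (b) work tries to approximate within rlp-desk's richer artifacts.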
+
+ ---
+
+ ## 4. Stabilization principles
+
+ 1. **omc is benchmark, not replacement.** Every change in this plan asks "how does omc avoid this failure mode?" then engineers an equivalent for rlp-desk's stack.
+ 2. **Preserve all 4 differentiators.** No change should compromise multi-engine consensus, multi-mission queue, BLOCK_TAGS taxonomy, or SV reports.
+ 3. **Substrate first, features second.** Bug categories (a) and (d) are substrate. Categories (b) and (c) are surface. Fix substrate first; surface improvements compound on a stable base.
+ 4. **Real-LLM SV gate.** The current SV gate's grep+unit-test labeling missed 10 production bugs. SV must be strengthened to actually catch production failure modes (a subset of campaigns run with full claude/codex worker+verifier in CI-like mode).
+ 5. **Increment by category.** Each PR closes ONE bug category, not multiple. This avoids the "fix-of-fix-of-fix" pattern that produced #4 (a regression of #3).
+
+ ---
+
+ ## 5. Concrete workstream (revised, per category)
+
+ ### Phase A: Empirical omc baseline (W1, ~3 days)
+
+ Before changing rlp-desk, measure omc reliably. Three test campaigns:
+
+ | Test | Workload | Measure |
+ |---|---|---|
+ | A1 | omc /ralph "fix small TS error in BOS apps/web/" | operator-touch count, time, cost |
+ | A2 | omc /ralph + multi-iter (3+ stories) on a synthetic PRD | operator-touch count, recovery behavior |
+ | A3 | omc /team "implement small feature with 3:executor" on synthetic task | parallelism behavior, lock contention |
+
+ **Output**: `docs/plans/v0.15-stabilization-omc-baseline.md` with per-test metrics. Not a competition, a *measurement*. It establishes the bar rlp-desk needs to reach.
+
+ ### Phase B: Category (a) substrate hardening (W1-W3, ~2 weeks)
+
+ The largest cluster (4 of 10 bugs). Goal: reduce the tmux/process lifecycle race window to zero in `--mode tmux`. `--mode native` already addresses this differently; the work here is `--mode tmux`.
+
+ Sub-deliverables:
+ - B1: lifecycle audit (every tmux send-keys, sentinel write, and pane reuse, with an ASCII diagram of timing windows)
+ - B2: post-sentinel reaper invariant test (extend Bug #7 fix coverage to all sentinel writes, not just per-US)
+ - B3: real-LLM SV scenario for category (a): actual claude/codex worker dispatched, lifecycle race triggered deterministically, fix verified
+ - B4: lifecycle observability (debug log emits race-window measurements per iteration)
+
+ ### Phase C: Category (d) recovery hygiene completion (W3-W4, ~1 week)
+
+ Bug #10's PR-A fix covers honoring `phase=verify`. Remaining recovery surfaces:
+ - C1: phase=blocked recovery (operator clears blocked sentinel + restarts); currently honored, verify with a test
+ - C2: phase=worker mid-iter crash recovery (leader killed mid-worker dispatch); verify, fix if broken
+ - C3: cross-mission queue recovery (one mission BLOCKED, queue advances); verify
+ - C4: documented operator recovery cookbook with deterministic jq pipelines
+
+ ### Phase D: Category (b) contract hardening (W4-W6, ~2 weeks)
+
+ LLM contract violations are partly inevitable, but the harness can reduce the surface:
+ - D1: schema validator at every artifact write (already exists for some; extend to all done-claim/iter-signal/verdict variants)
+ - D2: feedback loop. When a worker violates the contract, the next iteration's prompt includes the schema error verbatim (omc-style)
+ - D3: verified_us persistence audit (Bug #9): `status.json` is the source of truth, `memory.md` is supplementary, and the contract is made clear in code
+ - D4: real-LLM SV scenario for category (b)
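D1 and D2 together can be sketched as a validator that returns error strings plus a prompt builder that echoes them verbatim. This is a minimal sketch under stated assumptions: the artifact shape (`verdict`, `verified_us`), the allowed verdict values, and both function names are hypothetical and not taken from rlp-desk's code.

```javascript
// Hypothetical verdict schema: field names and allowed values are assumed.
const VERDICT_VALUES = ['pass', 'fail', 'blocked'];

// D1 sketch: return a list of human-readable schema errors (empty if valid).
function validateVerdict(artifact) {
  if (artifact === null || typeof artifact !== 'object') {
    return ['artifact must be a JSON object'];
  }
  const errors = [];
  if (!VERDICT_VALUES.includes(artifact.verdict)) {
    errors.push(
      `verdict must be one of ${VERDICT_VALUES.join('|')}, got ${JSON.stringify(artifact.verdict)}`,
    );
  }
  if (!Array.isArray(artifact.verified_us)) {
    errors.push('verified_us must be an array of user-story ids');
  }
  return errors;
}

// D2 sketch: append the errors verbatim to the next iteration's prompt.
function buildNextPrompt(basePrompt, errors) {
  if (errors.length === 0) return basePrompt;
  const lines = errors.map((e) => `- ${e}`).join('\n');
  return `${basePrompt}\n\nYour previous artifact violated the schema:\n${lines}`;
}

const schemaErrors = validateVerdict({ verdict: 'done', verified_us: 'US-001' });
console.log(buildNextPrompt('Implement the next user story.', schemaErrors));
```

Returning plain strings rather than throwing keeps the validator usable at every artifact write while letting the leader decide whether a violation blocks the iteration or only feeds the next prompt.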
+
+ ### Phase E: Category (c) LLM-runtime constraint awareness (W6-W7, ~1 week)
+
+ `.claude/` self-modification gate (Bug #1):
+ - E1: Worker prompt explicitly states "do NOT touch `.claude/`; sentinel paths are at `.rlp-desk/memos/`" (already done in v0.13.0 path migration; verify)
+ - E2: claude worker pre-flight check: try a no-op write to `.rlp-desk/` before main work; fail fast if blocked
+ - E3: cross-engine fallback: when the claude worker hits the permission gate, fall back mid-flight to a codex worker for that iter (already partial; complete)
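The E2 pre-flight check could look roughly like the following. It is a sketch, not the project's implementation: the function name, the probe file name, and the in-memory fake are all hypothetical. The file-system object is injected (any object exposing `writeFileSync` and `unlinkSync`, such as Node's `fs` module) so the example runs without touching disk.

```javascript
// E2 sketch: probe that the sentinel directory is writable before real work.
// `fsLike` is an injected dependency; the probe file name is an assumption.
function preflightWriteCheck(dir, fsLike) {
  const probePath = `${dir}/.preflight-probe`;
  try {
    fsLike.writeFileSync(probePath, '');
    fsLike.unlinkSync(probePath);
    return { ok: true };
  } catch (err) {
    // Fail fast: surface the reason so the leader can fall back (E3).
    return { ok: false, reason: String(err && err.message ? err.message : err) };
  }
}

// In-memory stand-in for `fs`, used so the example runs without disk access.
function makeFakeFs({ blocked }) {
  const files = new Set();
  return {
    files,
    writeFileSync(path) {
      if (blocked) throw new Error(`EACCES: permission denied, open '${path}'`);
      files.add(path);
    },
    unlinkSync(path) {
      files.delete(path);
    },
  };
}

console.log(preflightWriteCheck('.rlp-desk/memos', makeFakeFs({ blocked: false })));
console.log(preflightWriteCheck('.rlp-desk/memos', makeFakeFs({ blocked: true })));
```

In a real run the second argument would be Node's `fs` module, and a `{ ok: false }` result would be the trigger for the E3 fallback rather than a mid-iteration failure.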
+
+ ### Phase F: Real-LLM SV gate (W7-W8, ~2 weeks)
+
+ The biggest framework upgrade:
+ - F1: define "SV scenario" = complete real campaign (1-3 iter, real claude/codex, real tmux or native) executed in CI nightly
+ - F2: each merged PR adds at least one SV scenario covering the bug it fixed (Bug #1-#10 retroactively)
+ - F3: SV gate becomes "all real-LLM scenarios PASS" before npm publish, replacing the current grep-and-label SV
+ - F4: cost budget for SV gate (~$10-20/run nightly, ~$300-600/month; explicit budget approval needed before W7 starts)
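The F4 monthly range follows from the per-run range if one assumes a single run per night and a 30-day month (both assumptions, since the plan states only the two ranges):

```javascript
// F4 arithmetic: nightly cost range scaled to a month.
// Assumes one run per night and a 30-day month.
function monthlyBudget(perRunLowUsd, perRunHighUsd, runsPerMonth = 30) {
  return {
    lowUsd: perRunLowUsd * runsPerMonth,
    highUsd: perRunHighUsd * runsPerMonth,
  };
}

console.log(monthlyBudget(10, 20)); // { lowUsd: 300, highUsd: 600 }
```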
+
+ ### Release cadence
+
+ - v0.15.2 (this PR): redirect + stabilization plan + Phase A start
+ - v0.15.3-v0.15.7: incremental Phase B-E PRs, each landing one category fix + real-LLM SV scenario for that category
+ - v0.16.0 (~8-10 weeks from 2026-05-07): real-LLM SV gate active + 10-bug regression pattern verified eliminated empirically (3 consecutive campaigns at omc baseline parity or better)
+
+ ---
+
+ ## 6. Success criteria (measurable)
+
+ | Metric | Current (2026-05-07) | v0.16.0 target | Measurement |
+ |---|---|---|---|
+ | Bug discovery rate | 1-2/week | <1/month | git log of bug-report-* files in BOS |
+ | Operator-touch per campaign | unmeasured (high) | <1 per 5 campaigns | new analytics field in `campaign.jsonl` |
+ | Campaign completion rate | unmeasured (low) | >80% | new analytics field |
+ | SV gate catches production bugs | 0/10 | >50% of new bugs caught pre-publish (e.g. Bug #11, if it happens) | post-publish bug review |
+ | Differentiator preservation | 4/4 | 4/4 | regression test per differentiator |
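Two of the targets above can be checked mechanically once the new analytics fields exist. The sketch below assumes one JSON object per line in `campaign.jsonl` with hypothetical fields `operator_touches` (number) and `completed` (boolean); the plan says only that new fields will be added, so these names are placeholders.

```javascript
// Sketch: derive two success metrics from campaign.jsonl content.
// The field names `operator_touches` and `completed` are assumptions.
function summarizeCampaigns(jsonlText) {
  const records = jsonlText
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
  const total = records.length;
  const touches = records.reduce((sum, r) => sum + (r.operator_touches || 0), 0);
  const completed = records.filter((r) => r.completed === true).length;
  return {
    touchesPerCampaign: total === 0 ? 0 : touches / total,
    completionRate: total === 0 ? 0 : completed / total,
  };
}

// Targets from the table: <1 touch per 5 campaigns (0.2) and >80% completion.
function meetsTargets(summary) {
  return summary.touchesPerCampaign < 0.2 && summary.completionRate > 0.8;
}

// Ten synthetic campaigns: one operator touch in total, nine completed.
const sampleJsonl = Array.from({ length: 10 }, (_, i) =>
  JSON.stringify({ operator_touches: i === 0 ? 1 : 0, completed: i !== 9 }),
).join('\n');

const campaignSummary = summarizeCampaigns(sampleJsonl);
console.log(campaignSummary, meetsTargets(campaignSummary));
```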
+
+ ---
+
+ ## 7. What this plan is NOT
+
+ - NOT a pivot away from rlp-desk
+ - NOT a maintenance mode declaration
+ - NOT a plan to delete the Node leader (`--mode tmux` and `--mode agent` Node CLI both stay; deletion is a separate decision deferred until stabilization is complete)
+ - NOT a promise that omc patterns will be copied verbatim; they're inspiration, and the implementation is rlp-desk-native
+
+ ## 8. What this plan IS
+
+ - A correction of the 2026-05-07 misdirection
+ - A category-by-category hardening roadmap with empirical baselines (Phase A)
+ - A real-LLM SV gate replacement for the current theatrical SV (Phase F)
+ - A preservation contract for the 4 differentiators
+ - A concrete release cadence ending in v0.16.0 with measured success criteria
+
+ ---
+
+ ## 9. First action (this PR)
+
+ This PR (`feat/v0.15.2-stabilization-redirect`):
+ - Reverts the maintenance-mode banner in `src/node/run.mjs`
+ - Replaces it with a stabilization-in-progress banner
+ - Removes `docs/plans/v0.16-FROZEN-status.md` (misdirection artifact)
+ - Adds this `docs/plans/v0.15-stabilization-plan.md`
+ - Updates the regex in `tests/node/us008-cli-entrypoint.test.mjs`
+ - Bumps the version to v0.15.2 and publishes to npm so users see the corrected banner
+
+ After this lands, Phase A (omc baseline measurement) starts. That's a separate session.
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@ai-dev-methodologies/rlp-desk",
- "version": "0.15.1",
+ "version": "0.15.2",
  "description": "Fresh-context iterative loops for Claude Code — autonomous task completion with independent verification",
  "scripts": {
  "postinstall": "node scripts/postinstall.js",
package/src/node/run.mjs CHANGED
@@ -405,9 +405,18 @@ async function runRunCommand(args, deps) {
  deps.stderr,
  'For Claude Code Native Agent() campaigns, use `/rlp-desk run --mode native` from a Claude Code session.',
  );
+ // 2026-05-07 (v0.15.2): rlp-desk is in active stabilization. Goal: reach
+ // omc /team/ralph/ralplan level of reliability while preserving
+ // rlp-desk's self-driving advantages (multi-engine consensus, multi-mission
+ // queue, BLOCK_TAGS taxonomy, structured SV reports). omc is the BENCHMARK,
+ // not a replacement. See docs/plans/v0.15-stabilization-plan.md.
  write(
  deps.stderr,
- 'This mode will hard-error in the next major release.',
+ 'SCHEDULED REMOVAL: --mode agent (Node CLI alpha) will be removed in a future major release. Date TBD until stabilization milestones complete.',
+ );
+ write(
+ deps.stderr,
+ 'STABILIZATION IN PROGRESS: rlp-desk is hardening against the 10-bug regression pattern observed 2026-05-01..05-07. See docs/plans/v0.15-stabilization-plan.md.',
  );
  }