@ai-dev-methodologies/rlp-desk 0.15.3 → 0.15.5

Files changed (43)
  1. package/CHANGELOG.md +98 -0
  2. package/README.md +34 -4
  3. package/docs/rlp-desk/failure-modes.md +191 -0
  4. package/package.json +10 -3
  5. package/src/node/MANIFEST.txt +3 -0
  6. package/src/node/prompts/prompt-assembler.mjs +2 -2
  7. package/src/node/run.mjs +70 -3
  8. package/src/node/runner/campaign-main-loop.mjs +97 -13
  9. package/src/node/util/debug-log.mjs +10 -6
  10. package/src/node/util/lifecycle-metrics.mjs +102 -0
  11. package/src/scripts/lib_ralph_desk.zsh +66 -0
  12. package/src/scripts/run_ralph_desk.zsh +23 -3
  13. package/docs/plans/bug-report-overhaul-backlog.md +0 -49
  14. package/docs/plans/bug-report-overhaul-v0.md +0 -238
  15. package/docs/plans/bug-report-overhaul-v1.md +0 -319
  16. package/docs/plans/native-agent-revert.md +0 -184
  17. package/docs/plans/polished-gliding-toucan.md +0 -234
  18. package/docs/plans/pr-e-phase-c1-blocked-recovery-hygiene-v0.md +0 -233
  19. package/docs/plans/spicy-booping-galaxy.md +0 -717
  20. package/docs/plans/strategic-review/rlp-desk-strategic-review.md +0 -125
  21. package/docs/plans/v0.15-stabilization-phase-a-prep.md +0 -130
  22. package/docs/plans/v0.15-stabilization-plan.md +0 -178
  23. package/docs/plans/v0.16-real-llm-sv-gate-spec.md +0 -177
  24. package/docs/rlp-desk/internal/verification-policy-gap-analysis.md +0 -523
  25. package/docs/rlp-desk/internal/verification-strategy-research.md +0 -2097
  26. package/docs/rlp-desk/plans/cozy-gliding-trinket.md +0 -53
  27. package/docs/rlp-desk/plans/frolicking-churning-honey.md +0 -253
  28. package/docs/rlp-desk/plans/keen-sauteeing-snowflake.md +0 -245
  29. package/docs/rlp-desk/plans/mutable-booping-corbato.md +0 -163
  30. package/docs/rlp-desk/plans/rlp-desk-0.11-handoff-7fixes.md +0 -352
  31. package/docs/rlp-desk/plans/rlp-desk-0.11.1-tmux-pane-disappearance.md +0 -260
  32. package/docs/rlp-desk/plans/rlp-desk-elegant-papert-agent-a8cd695ffca2a3ad8.md +0 -84
  33. package/docs/rlp-desk/plans/rlp-desk-elegant-papert.md +0 -270
  34. package/docs/rlp-desk/plans/rlp-desk-tmux-flywheel-routing.md +0 -730
  35. package/docs/rlp-desk/plans/toasty-whistling-diffie-agent-a6814625642e956da.md +0 -201
  36. package/docs/rlp-desk/plans/toasty-whistling-diffie.md +0 -117
  37. package/docs/rlp-desk/plans/validated-snacking-crayon.md +0 -204
  38. package/examples/calculator/.claude/ralph-desk/logs/loop-test/iter-001.worker-output.log +0 -0
  39. package/examples/calculator/.claude/ralph-desk/logs/loop-test/iter-001.worker-prompt.md +0 -38
  40. package/examples/calculator/.claude/ralph-desk/logs/loop-test/iter-001.worker-trigger.sh +0 -28
  41. package/examples/calculator/.claude/ralph-desk/logs/loop-test/session-config.json +0 -25
  42. package/examples/calculator/.claude/ralph-desk/logs/loop-test/status.json +0 -10
  43. package/examples/calculator/.claude/ralph-desk/logs/loop-test/worker-heartbeat.json +0 -1
@@ -1,125 +0,0 @@
1
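- # rlp-desk Strategic Re-evaluation: autoplan input
2
-
3
- > **Goal**: decide whether to keep developing rlp-desk (patch/redesign), scrap and rebuild it (rebuild), or pivot to an existing tool.
4
- > **Core KPI**: autonomous completion of a blueprint from start to finish, with minimal human intervention.
5
- > **Current state**: 10 bugs in 6 weeks, 1-2 new ones found every week, and even manual recovery is broken (#10).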
6
-
7
- ---
8
-
9
- ## 1. Vision (agreed with BOS dev, early 2026)
10
-
11
- 1. Start ralph-loop with a **fresh context** every time (no context pollution)
12
- 2. idea → plan distillation
13
- 3. PRD formalization
14
- 4. Iterative improvement through Worker/Verifier cycles
15
- 5. **Full autonomy**: minimal human intervention
16
-
17
- → Item 5 is the single success criterion. Items 1-4 are mechanisms for achieving it.
18
-
19
- ---
20
-
21
- ## 2. Reality: facts measured over 6 weeks
22
-
23
- ### Bug occurrence pattern
24
-
25
- | # | Date | Category | Recoverable | Next bug exposed by the fix |
26
- |---|---|---|---|---|
27
- | #1 | 2026-05-01 | LLM-runtime (`.claude/` self-modification gate) | partial (worked around with a codex worker) | #2 (tmux session) |
28
- | #2 | 2026-05-01 | tmux session lifecycle | yes | #3 |
29
- | #3 | 2026-05-04 | verifier no-progress | yes | #4 (regression of #3) |
30
- | #4 | 2026-05-05 | #3 regression | partial | #5 |
31
- | #5 | 2026-05-05 | worker dead on reuse (pane lifecycle) | yes | #6 |
32
- | #6 | 2026-05-06 | claude worker idle false-positive | yes | #7 |
33
- | #7 | 2026-05-06 | post-sentinel process race (1m43s drift) | yes | #8 + #10 |
34
- | #8 | 2026-05-06 | worker incomplete + leader A4 fallback | partial | #10 |
35
- | #9 | (separate) | verified_us persistence (status.json) | yes | - |
36
- | #10 | 2026-05-07 | leader ignores phase=verify on relaunch (**even recovery is broken**) | NO (this is the recovery mechanism) | ? |
37
-
38
- **Pattern observed:**
39
- - 1-2 new bugs every week
40
- - Half are new failure modes **exposed by a previous fix** (regression style)
41
- - Bug #10 is the most severe: the recovery mechanism itself is broken, so the operator's manual recovery is nullified when a campaign is BLOCKED
42
-
43
- ### Bug classification by category
44
-
45
- | Category | Bugs | Share | Architecturally inevitable? |
46
- |---|---|---|---|
47
- | (a) tmux/process lifecycle race | #2, #5, #6, #7 | 40% | **YES**: the tmux pane lifecycle is decoupled from the claude/codex TUI lifecycle, so the race window is inherent |
48
- | (b) artifact contract / schema | #3, #4, #8, #9 | 40% | partial: a stricter schema reduces these, but the rate at which LLMs violate the schema is inherent |
49
- | (c) LLM-runtime constraint | #1 | 10% | **YES**: Claude Code's `.claude/` self-modification gate is an external variable |
50
- | (d) recovery hygiene | #10 | 10% | accidental: fixable |
51
-
52
- → **80% is architectural inevitability** (a + c). Strict schema + retry reduces (b), but LLM non-determinism is the limit.
53
-
54
- ### SV gate limitations
55
-
56
- - Every sv-self-verify-*.sh **uses only the Worker/Verifier metaphor** → no real LLM agent run
57
- - 5-category labeling of grep + unit test + regression
58
- - → The fact that 10 bugs were found in production after passing the SV gate = **the framework does not cover production failure modes**
59
- - Production failure modes stem from **LLM/tmux/network/timing non-determinism**, an area unit tests cannot catch
60
-
61
- ### Unfinished in-flight branches
62
-
63
- - `feat/native-agent-revert`: P0 (Bug #7 fix) + P1 (restore slash native prose), unfinished. Plan: agreed after 5 ralplan rounds, codex critic APPROVED
64
- - `feat/bug10-relaunch-hygiene`: PR-A commit 95c0d4e + added SV gate, just completed, not pushed
65
-
66
- → Neither branch has landed = risk that existing fixes are neutralized when a new bug is found
67
-
68
- ---
69
-
70
- ## 3. Four options
71
-
72
- ### Option A: Continue patching (current path)
73
- PR-A (Bug #10) → PR-B (bundler) → PR-C (patterns) → next bug → repeat.
74
-
75
- ### Option B: Fundamental redesign (keep the vision, redesign the architecture)
76
- **Key changes**: drop tmux pane orchestration → dispatch directly via Claude Code Native Agent() / subprocess. Sentinel file → in-memory channel. These two changes remove 80% of category (a).
77
-
78
- ### Option C: Scrap and rebuild (partially revise the vision)
79
- Discard 6 weeks of code. Keep vision item 5 (full autonomy) but redefine item 1 (fresh-context) as a task-isolated subprocess. ralph-loop itself could be dropped and simplified to a plain plan→worker→verify cycle.
80
-
81
- ### Option D: Pivot to existing tool
82
- - ralph plugin (cradle): lightweight, lacks features
83
- - omc (oh-my-claudecode): /ralph + /ralplan + /omc-teams + /autopilot; already has multi-model orchestration
84
- - superpowers: subagent-driven-development, executing-plans, brainstorming; already has a plan→exec cycle
85
- - claude-devfleet: dmux-based, multi-agent
86
-
87
- ---
88
-
89
- ## 4. Evaluation criteria (per option)
90
-
91
- | Criterion | Definition | Rationale |
92
- |---|---|---|
93
- | Vision preservation | Number of the 5 vision items that survive | Likelihood of achieving "full autonomy" |
94
- | Time-to-first-successful-blueprint | Time until a blueprint is first completed autonomously end to end | **The single core KPI** |
95
- | Sunk cost write-off % | Share of code/SV discarded | Reversibility of the decision |
96
- | Bug regression risk | Probability that bugs like #1-#10 recur in the new system | "Another 10 in the next 6 weeks?" |
97
- | Personal capacity ROI | Deliverable per week invested | Sustainability |
98
-
99
- ---
100
-
101
- ## 5. Decision on in-flight branches
102
-
103
- To be decided after this evaluation:
104
- - `feat/native-agent-revert`: land / abandon / re-scope?
105
- - `feat/bug10-relaunch-hygiene`: already at commit 95c0d4e; push or hold?
106
-
107
- Whichever option is chosen, both branches need to be dealt with.
108
-
109
- ---
110
-
111
- ## 6. Constraints
112
-
113
- - BOS dev has to run real campaigns → short-term (1-2 weeks) patching is unavoidable
114
- - rlp-desk is published on npm, so breaking changes are user-facing (consider semver)
115
- - The analysis is advisory only; actual code changes/commit/push require separate user approval
116
-
117
- ---
118
-
119
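- ## 7. Request to autoplan (with this document as input)
120
-
121
- CEO perspective: is rlp-desk solving the right problem? Can the same value already be obtained with another tool?
122
- Eng perspective: given the architectural pattern of the past 6 weeks, can the (a) 80% category be resolved by patching, or is a redesign inevitable?
123
- DX perspective: rlp-desk is a dev tool. Is it a DX failure that the operator (BOS dev) has to hand-write a recovery for 30 minutes every time?
124
-
125
- In each phase: run dual voices (Codex + Claude subagent), produce a consensus table, and steelman the 4 options.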
@@ -1,130 +0,0 @@
1
- # Phase A — Empirical omc Baseline (Prep for Next Session)
2
-
3
- > **Plan reference**: `docs/plans/v0.15-stabilization-plan.md` §5 Phase A
4
- > **Goal**: measure omc /ralph + /team reliability empirically. Establishes the bar rlp-desk needs to reach.
5
- > **Output**: `docs/plans/v0.15-stabilization-omc-baseline.md` with per-test metrics.
6
- > **NOT a competition**: this is a measurement to set the stabilization target. omc is benchmark, not replacement.
7
-
8
- ---
9
-
10
- ## Why baseline measurement comes first
11
-
12
- Before changing rlp-desk's substrate (Phase B-F), we need to know what "omc-level reliability" actually looks like for our workload. Otherwise stabilization targets are guesswork.
13
-
14
- The 4 differentiators rlp-desk MUST preserve are not measured by omc — they're rlp-desk-only by design. So the baseline measurement focuses on what omc DOES have: per-iter Worker→Verifier, PRD-driven loop, multi-agent coordination, mandatory deslop, regression re-verification.
15
-
16
- ## Three test campaigns
17
-
18
- ### A1 — single-iter, single-story
19
- - **Workload**: small TypeScript fix in a sandbox repo (or BOS apps/web/ on a contained file)
20
- - **Tool**: `/oh-my-claudecode:ralph "Fix the unused variable warning in <specific-file>"`
21
- - **Measure**:
22
- - operator-touch count (target = 0)
23
- - total time (entry → completion)
24
- - cost (token usage if available, otherwise rough estimate)
25
- - did mandatory deslop pass run?
26
- - did regression re-verification pass?
27
-
28
- ### A2 — multi-iter, multi-story
29
- - **Workload**: synthetic PRD with 3 stories, each story trivial but distinct
30
- - **Tool**: `/oh-my-claudecode:ralph` with auto-generated prd.json refined to 3 stories
31
- - **Measure**:
32
- - operator-touch count per story
33
- - story completion order (sequential as PRD'd, or chosen by ralph?)
34
- - failure recovery if a story blocks (does it advance to next or stop?)
35
- - prd.json final state (all stories `passes: true`?)
36
- - reviewer behavior (was each story actually verified against acceptance criteria?)
37
-
38
- ### A3 — parallel team
39
- - **Workload**: 3-task synthetic spec where tasks are truly independent
40
- - **Tool**: `/oh-my-claudecode:team 3:executor "<spec>"`
41
- - **Measure**:
42
- - parallelism (do 3 agents really run concurrently?)
43
- - lock contention (any deadlocks on shared task list?)
44
- - inter-agent messaging (any SendMessage between teammates?)
45
- - completion time vs sequential estimate
46
- - cleanup (does TeamDelete actually run on completion?)
47
-
48
- ## Sandbox setup
49
-
50
- Recommend a throwaway test repo to avoid contaminating BOS or rlp-desk:
51
-
52
- ```bash
53
- mkdir /tmp/omc-baseline-sandbox && cd /tmp/omc-baseline-sandbox
54
- git init
55
- # Add 1-2 simple TypeScript files with intentional issues for A1
56
- # Add a synthetic PRD for A2
57
- # Add a 3-task spec for A3
58
- ```
59
-
60
- This isolates measurement from real product work and lets us reset between tests.
61
-
62
- ## Output schema
63
-
64
- `docs/plans/v0.15-stabilization-omc-baseline.md`:
65
-
66
- ```markdown
67
- # omc Baseline Measurement — Phase A Output
68
-
69
- ## A1 (single-iter, single-story)
70
- - Workload: <description>
71
- - Result: PASS / PARTIAL / FAIL
72
- - Operator-touch: N
73
- - Time: Xm Ys
74
- - Cost: $X.XX
75
- - Deslop ran: YES/NO
76
- - Regression re-verify: PASS/FAIL/N-A
77
- - Subjective notes: ...
78
-
79
- ## A2 (multi-iter, multi-story)
80
- - Workload: <description>
81
- - Result: PASS / PARTIAL / FAIL
82
- - Stories completed: N/3
83
- - Operator-touch: N per story
84
- - Failure recovery observed: YES/NO + behavior
85
- - Total time: Xm Ys
86
- - prd.json final state: ...
87
- - Reviewer per-story verification: YES/NO
88
-
89
- ## A3 (parallel team)
90
- - Workload: <description>
91
- - Result: PASS / PARTIAL / FAIL
92
- - Parallelism observed: YES/NO + concurrent count
93
- - Lock contention: NONE/ONE/MULTIPLE
94
- - Total time vs sequential: Xm vs Ym
95
- - Cleanup: COMPLETE/PARTIAL/MANUAL
96
-
97
- ## Synthesis — what is "omc-level reliability"?
98
-
99
- [1-2 paragraphs translating the metrics into a target for rlp-desk]
100
-
101
- ## Phase B implications
102
-
103
- [which omc patterns should rlp-desk Phase B adopt for tmux/process lifecycle race?]
104
- ```
105
-
106
- ## Cost estimate
107
-
108
- 3 small test campaigns: total ~$5-15. Sandbox runs are bounded (no real product blast radius).
109
-
110
- ## How to run (next session)
111
-
112
- Open Claude Code in /tmp/omc-baseline-sandbox (after creating it per above):
113
-
114
- ```
115
- /oh-my-claudecode:ralph "Fix the unused variable warning in src/index.ts"
116
- ```
117
-
118
- Wait for completion. Record metrics. Reset sandbox. Repeat for A2 and A3 with the synthetic PRD/spec.
119
-
120
- After all three: write `docs/plans/v0.15-stabilization-omc-baseline.md` in rlp-desk repo with the synthesis.
121
-
122
- ## What this enables
123
-
124
- Once Phase A is done, Phase B (tmux/process lifecycle race hardening) has a concrete target: "match omc /ralph's A1 behavior on operator-touch within ±20%, while preserving multi-engine consensus + multi-mission queue."
125
-
126
- Without Phase A, Phase B targets are imaginary.
127
-
128
- ## Honest scope note
129
-
130
- This is preparation work. The Phase A run itself happens in a fresh session (sandbox cwd). This prep doc + the v0.15-stabilization-plan.md are the carry-over artifacts.
@@ -1,178 +0,0 @@
1
- # rlp-desk Stabilization Plan (v0.15.x → v0.16.x)
2
-
3
- > **Status**: ACTIVE. Replaces the misdirected 2026-05-07 "pivot to omc" decision (PR #8 redirect via this plan).
4
- > **Goal**: bring rlp-desk to omc /team/ralph/ralplan level of reliability **while preserving rlp-desk's self-driving advantages**.
5
- > **Non-goal**: pivoting away from rlp-desk. omc is the **benchmark**, not the replacement.
6
-
7
- ---
8
-
9
- ## 0. Why this plan exists (correction note)
10
-
11
- On 2026-05-07 morning I (the assistant) ran `plan-ceo-review` on the question "rlp-desk vs omc /team" and produced a recommendation to enter maintenance mode and pivot to omc. The user immediately corrected: *the goal was always to make rlp-desk work as reliably as omc, NOT to replace it*.
12
-
13
- This plan is the corrected direction: stabilize rlp-desk by learning from omc's patterns, applying them to rlp-desk's substrate, while protecting the four real differentiators that make rlp-desk worth using in the first place.
14
-
15
- The misdirected commit `229e1b6` (the "maintenance mode" banner + FROZEN doc) is now reverted in this PR. The pivot prompt-optimizer artifact and BOS validation plan stay on disk but are deferred — they may become useful later as a comparison study, but they are not the active path.
16
-
17
- ---
18
-
19
- ## 1. The vision (preserved verbatim)
20
-
21
- 1. ralph-loop fresh-context per iteration (no context pollution)
22
- 2. idea → plan distillation
23
- 3. PRD formalization
24
- 4. Worker/Verifier cycles with iterative improvement
25
- 5. **Full autonomy — minimum operator intervention**
26
-
27
- This vision is the core. Stabilization is in service of it, not a substitute for it.
28
-
29
- ---
30
-
31
- ## 2. Differentiators to preserve (rlp-desk-only)
32
-
33
- These four are the reason rlp-desk exists separately from omc. Stabilization work MUST NOT compromise them:
34
-
35
- 1. **Multi-engine parallel consensus per iteration**: `--consensus all` runs claude AND codex on every verification, then reconciles. omc /ralph supports `--critic=codex` but as a single critic, not parallel consensus.
36
- 2. **Multi-mission queue + cross-mission analytics**: `RLP_BACKGROUND=1` chains missions and tracks cross-mission metrics. omc /team is single-task.
37
- 3. **BLOCK_TAGS P1-D failure taxonomy**: structured `reason_category × recoverable × suggested_action` classification. omc emits simpler verdicts (pass/fail/blocked).
38
- 4. **Structured SV reports**: post-campaign analytics at `~/.claude/ralph-desk/analytics/<slug>/self-verification-report-NNN.md`. omc has lighter `progress.txt`.
39
-
40
- These four ARE the value proposition. The stabilization work below is about making the substrate that delivers them as reliable as omc's.
41
-
42
- ---
43
-
44
- ## 3. The 10-bug regression pattern (what we're hardening against)
45
-
46
- Six weeks (2026-05-01 to 2026-05-07), 10 bugs, each prior fix exposing the next. Categorized:
47
-
48
- | Cat | Bugs | Root cause cluster |
49
- |---|---|---|
50
- | (a) tmux/process lifecycle race | #5, #6, #7, #10 | Long-lived TUI processes in tmux panes; sentinel polling races; recovery hygiene |
51
- | (b) artifact contract / schema | #3, #4, #8, #9 | Worker/Verifier output contract violations; LLM non-determinism on schema; verified_us persistence |
52
- | (c) LLM-runtime constraint | #1 | Claude Code `.claude/` self-modification gate blocking sentinel writes |
53
- | (d) recovery hygiene | #10 | Manual recovery on relaunch silently overwritten |
54
-
55
- **Per category, what omc does differently** (preliminary — to be verified empirically in §5):
56
-
57
- - **(a) Lifecycle race**: omc /team uses Claude Code native team primitives (`TeamCreate`, `TaskCreate`, `SendMessage`). No tmux, no long-lived TUI, no sentinel polling. Process lifecycle = subagent lifecycle = single Claude Code call. Race window does not exist.
58
- - **(b) Contract violations**: omc /ralph uses `prd.json` with `passes: bool` per story + reviewer verifies acceptance criteria. Simpler schema = less surface for LLM to violate. omc also has mandatory deslop pass + regression re-verification (`ai-slop-cleaner` + Step 7.6).
59
- - **(c) Self-modification gate**: omc skills are read by Claude Code via the Skill tool, not written by Workers. Workers don't touch `.claude/` paths. Gate not encountered.
60
- - **(d) Recovery**: omc /ralph is session-scoped (`.omc/state/sessions/{sessionId}/prd.json`). Per-session state means relaunch starts fresh; there is no "manual recovery" surface to break.
61
-
62
- These are the patterns to learn from. Adopting them does NOT require pivoting away from rlp-desk; it requires bringing equivalent semantics into rlp-desk's substrate.
63
-
64
- ---
65
-
66
- ## 4. Stabilization principles
67
-
68
- 1. **omc is benchmark, not replacement.** Every change in this plan asks "how does omc avoid this failure mode?" then engineers an equivalent for rlp-desk's stack.
69
- 2. **Preserve all 4 differentiators.** No change should compromise multi-engine consensus, multi-mission queue, BLOCK_TAGS taxonomy, or SV reports.
70
- 3. **Substrate first, features second.** Bug categories (a) and (d) are substrate. Categories (b) and (c) are surface. Fix substrate first; surface improvements compound on a stable base.
71
- 4. **Real-LLM SV gate.** The current SV gate's grep+unit-test labeling missed 10 production bugs. SV must be strengthened to actually catch production failure modes (subset of campaigns run with full claude/codex worker+verifier in CI-like mode).
72
- 5. **Increment by category.** Each PR closes ONE bug category, not multiple. Avoids "fix-of-fix-of-fix" pattern that produced #4 (regression of #3).
73
-
74
- ---
75
-
76
- ## 5. Concrete workstream (revised, per category)
77
-
78
- ### Phase A — Empirical omc baseline (W1, ~3 days)
79
-
80
- Before changing rlp-desk, measure omc's reliability empirically. Three test campaigns:
81
-
82
- | Test | Workload | Measure |
83
- |---|---|---|
84
- | A1 | omc /ralph "fix small TS error in BOS apps/web/" | operator-touch count, time, cost |
85
- | A2 | omc /ralph + multi-iter (3+ stories) on a synthetic PRD | operator-touch count, recovery behavior |
86
- | A3 | omc /team "implement small feature with 3:executor" on synthetic task | parallelism behavior, lock contention |
87
-
88
- **Output**: `docs/plans/v0.15-stabilization-omc-baseline.md` with per-test metrics. Not a competition, a *measurement*. Establishes the bar rlp-desk needs to reach.
89
-
90
- ### Phase B — Category (a) substrate hardening (W1-W3, ~2 weeks)
91
-
92
- The largest cluster (4 of 10 bugs). Goal: tmux/process lifecycle race window → 0 in `--mode tmux`. `--mode native` already addresses this differently; the work here is `--mode tmux`.
93
-
94
- Sub-deliverables:
95
- - B1: lifecycle audit (every tmux send-keys / sentinel write / pane reuse — ASCII diagram of timing windows)
96
- - B2: post-sentinel reaper invariant test (extend Bug #7 fix coverage to all sentinel writes, not just per-US)
97
- - B3: real-LLM SV scenario for category (a) — actual claude/codex worker dispatched, lifecycle race triggered deterministically, fix verified
98
- - B4: lifecycle observability (debug log emits race-window measurements per iteration)
99
-
100
- ### Phase C — Category (d) recovery hygiene completion (W3-W4, ~1 week)
101
-
102
- Bug #10's PR-A fix covers `phase=verify` honor. Remaining recovery surfaces:
103
- - C1: phase=blocked recovery (operator clears blocked sentinel + restarts) — currently honored, verify with test
104
- - C2: phase=worker mid-iter crash recovery (leader killed mid-worker dispatch) — verify, fix if broken
105
- - C3: cross-mission queue recovery (one mission BLOCKED, queue advances) — verify
106
- - C4: documented operator recovery cookbook with deterministic jq pipelines
107
-
108
- ### Phase D — Category (b) contract hardening (W4-W6, ~2 weeks)
109
-
110
- LLM contract violations are partly inevitable, but the harness can reduce the surface:
111
- - D1: schema validator at every artifact write (already exists for some; extend to all done-claim/iter-signal/verdict variants)
112
- - D2: feedback loop — when worker violates contract, next iteration's prompt includes the schema error verbatim (omc-style)
113
- - D3: verified_us persistence audit (Bug #9) — `status.json` is the source of truth, memory.md is supplementary, contract clear in code
114
- - D4: real-LLM SV scenario for category (b)
115
-
116
- ### Phase E — Category (c) LLM-runtime constraint awareness (W6-W7, ~1 week)
117
-
118
- `.claude/` self-modification gate (Bug #1):
119
- - E1: Worker prompt explicitly states "do NOT touch `.claude/`; sentinel paths are at `.rlp-desk/memos/`" (already done in v0.13.0 path migration; verify)
120
- - E2: claude worker pre-flight check — try a no-op write to `.rlp-desk/` before main work; fail fast if blocked
121
- - E3: cross-engine fallback — when claude worker hits permission gate, mid-flight fallback to codex worker for that iter (already partial; complete)
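The E2 pre-flight check could look roughly like the sketch below. It assumes the check is a plain create-and-delete probe; the function name `preflight_write_check` and the probe file name are hypothetical and are not taken from rlp-desk.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight probe (illustrative only, not rlp-desk code).
# Usage: preflight_write_check <dir>
# Returns 0 if a throwaway file can be created and removed inside <dir>.
preflight_write_check() {
  local dir=$1 probe
  mkdir -p "$dir" 2>/dev/null || return 1        # directory must be creatable
  probe="$dir/.preflight-probe.$$"               # assumed probe name, unique per PID
  ( : > "$probe" ) 2>/dev/null || return 1       # no-op write; fails fast if blocked
  rm -f "$probe"                                 # leave no trace behind
}
```

A launcher might call `preflight_write_check .rlp-desk || exit 1` before dispatching the main work, so that a blocked path surfaces immediately rather than as a hang.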
122
-
123
- ### Phase F — Real-LLM SV gate (W7-W8, ~2 weeks)
124
-
125
- The biggest framework upgrade:
126
- - F1: define "SV scenario" = complete real campaign (1-3 iter, real claude/codex, real tmux or native) executed in CI nightly
127
- - F2: each merged PR adds at least one SV scenario covering the bug it fixed (Bug #1-#10 retroactively)
128
- - F3: SV gate becomes "all real-LLM scenarios PASS" before npm publish — replaces the current grep-and-label SV
129
- - F4: cost budget for SV gate (~$10-20/run nightly, ~$300-600/month — explicit budget approval needed before W7 starts)
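The monthly figure in F4 follows from the per-run range if one assumes roughly 30 nightly runs per month (an assumption; the plan does not state the run count). A minimal sketch of that arithmetic, using a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: scale a per-run USD cost range to a monthly range.
# Usage: monthly_cost_range <low_usd> <high_usd> [runs_per_month]
monthly_cost_range() {
  local low=$1 high=$2 runs=${3:-30}   # 30 runs/month is an assumption
  echo "$(( low * runs ))-$(( high * runs ))"
}

monthly_cost_range 10 20   # prints 300-600
```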
130
-
131
- ### Release cadence
132
-
133
- - v0.15.2 (this PR): redirect + stabilization plan + Phase A start
134
- - v0.15.3-v0.15.7: incremental Phase B-E PRs, each landing one category fix + real-LLM SV scenario for that category
135
- - v0.16.0 (~8-10 weeks from 2026-05-07): real-LLM SV gate active + 10-bug regression pattern verified eliminated empirically (3 consecutive campaigns at omc baseline parity or better)
136
-
137
- ---
138
-
139
- ## 6. Success criteria (measurable)
140
-
141
- | Metric | Current (2026-05-07) | v0.16.0 target | Measurement |
142
- |---|---|---|---|
143
- | Bug discovery rate | 1-2/week | <1/month | git log of bug-report-* files in BOS |
144
- | Operator-touch per campaign | unmeasured (high) | <1 per 5 campaigns | new analytics field in `campaign.jsonl` |
145
- | Campaign completion rate | unmeasured (low) | >80% | new analytics field |
146
- | SV gate catches production bugs | 0/10 | >50% (5/10 if Bug #11 happens, caught pre-publish) | post-publish bug review |
147
- | Differentiator preservation | 4/4 | 4/4 | regression test per differentiator |
148
-
149
- ---
150
-
151
- ## 7. What this plan is NOT
152
-
153
- - NOT a pivot away from rlp-desk
154
- - NOT a maintenance mode declaration
155
- - NOT a plan to delete the Node leader (`--mode tmux` and `--mode agent` Node CLI both stay; deletion is a separate decision deferred until stabilization complete)
156
- - NOT a promise that omc patterns will be copied verbatim — they're inspiration, the implementation is rlp-desk-native
157
-
158
- ## 8. What this plan IS
159
-
160
- - A correction of the 2026-05-07 misdirection
161
- - A category-by-category hardening roadmap with empirical baselines (Phase A)
162
- - A real-LLM SV gate replacement for the current theatrical SV (Phase F)
163
- - A preservation contract for the 4 differentiators
164
- - A concrete release cadence ending in v0.16.0 with measured success criteria
165
-
166
- ---
167
-
168
- ## 9. First action (this PR)
169
-
170
- This PR (`feat/v0.15.2-stabilization-redirect`):
171
- - Reverts the maintenance-mode banner in `src/node/run.mjs`
172
- - Replaces with stabilization-in-progress banner
173
- - Removes `docs/plans/v0.16-FROZEN-status.md` (misdirection artifact)
174
- - Adds this `docs/plans/v0.15-stabilization-plan.md`
175
- - Updates `tests/node/us008-cli-entrypoint.test.mjs` regex
176
- - Bumps to v0.15.2 + npm publish so users see the corrected banner
177
-
178
- After this lands: Phase A (omc baseline measurement) starts. That's a separate session.
@@ -1,177 +0,0 @@
1
- # Real-LLM SV Gate Framework — Spec (Phase F bootstrap)
2
-
3
- > **Status**: design + first scenario bootstrap. Active stabilization Phase F.
4
- > **Plan reference**: `docs/plans/v0.15-stabilization-plan.md` §5 Phase F.
5
- > **Goal**: replace today's grep+unit-test "SV gate theater" with a framework that ACTUALLY catches production failure modes by running real claude/codex agents against real campaigns.
6
-
7
- ---
8
-
9
- ## 1. The problem in one paragraph
10
-
11
- Today's SV gate (`tests/sv-self-verify-*.sh`, `tests/sv-gate-*.sh`) is grep-and-label theater. Each "scenario" is a shell command (grep, unit test, regression) labeled with five categories (`correctness | integration | security | perf | error-path`) and reported as "Worker → Verifier → PASS". **No real LLM agent runs**. None of the 10 production bugs (#1-#10, 2026-05-01..05-07) were caught by SV gate before shipping. Every Bug #N+1 will follow the same pattern unless the framework changes.
12
-
13
- ## 2. Why current SV gate misses production failures
14
-
15
- Production failure modes that today's gate cannot reproduce:
16
-
17
- | Failure mode | Why grep+unit can't catch | Real-LLM gate can |
18
- |---|---|---|
19
- | Worker hang on `.claude/` self-modification gate (Bug #1) | LLM platform-level constraint, not testable in unit | YES (real claude worker writes sentinel, observes hang) |
20
- | tmux pane lifecycle race (Bug #5/#6/#7) | timing-dependent, requires real TUI lifecycle | YES (real worker + real pane + real reaper) |
21
- | Worker contract violation under non-determinism (Bug #3/#4/#8/#9) | LLM produces malformed/incomplete artifact non-deterministically | YES (run actual worker with real prompt; observe artifact shape over N runs) |
22
- | Recovery hygiene under partial state (Bug #10) | requires real campaign + real BLOCKED + real operator recovery | YES (real campaign brought to BLOCKED, real operator-style recovery, real relaunch) |
23
-
24
- The common thread: **production failures require production-like state**. Unit tests can't produce it; only real campaigns can.
25
-
26
- ## 3. Framework requirements
27
-
28
- A real-LLM SV gate must:
29
-
30
- R1. **Run a complete real campaign** (1-3 iterations) per scenario, not isolated function calls.
31
- R2. **Use real claude or codex worker + verifier**, not stub/mock LLM responses.
32
- R3. **Run inside CI-like environment** (real tmux session + real Claude Code CLI / codex CLI).
33
- R4. **Be deterministic enough to gate a release**: same scenario produces same PASS/FAIL outcome >95% of the time.
34
- R5. **Stay within cost budget**: each scenario <$5; total nightly run <$50.
35
- R6. **Run on a schedule, not per-PR**: too expensive to gate every PR. Nightly is the target. Per-PR can run a subset (sub-$10).
36
- R7. **Capture failure provenance**: when scenario fails, the failure must be reproducible from the captured campaign artifacts (logs, status, sentinel state).
37
- R8. **Cover all 10 historical bugs**: each merged PR adds at least one scenario covering the bug class it touched.
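Requirement R4 can be read as a check over repeated runs of one scenario. The sketch below is one possible interpretation, not part of the harness described here: it treats a scenario as stable when its most common outcome (PASS or FAIL) accounts for more than 95% of runs. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical stability check for R4 (illustrative only).
# Usage: is_release_stable <outcome>...   (each outcome is PASS or FAIL)
# Succeeds when the majority outcome covers more than 95% of the runs.
is_release_stable() {
  local total=$# pass=0 o fail majority
  [ "$total" -gt 0 ] || return 1
  for o in "$@"; do
    if [ "$o" = PASS ]; then pass=$(( pass + 1 )); fi
  done
  fail=$(( total - pass ))
  if [ "$pass" -ge "$fail" ]; then majority=$pass; else majority=$fail; fi
  # Integer comparison of majority/total > 95/100, avoiding floating point.
  [ $(( majority * 100 )) -gt $(( total * 95 )) ]
}
```

Under this reading, 19 identical outcomes out of 20 is exactly 95% and therefore does not pass the strict ">95%" bar.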
38
-
39
- ## 4. Architecture
40
-
41
- ```
42
- ┌─────────────────────────────────────────────────────────────────┐
43
- │ tests/sv-real-llm/                                              │
44
- │                                                                 │
45
- │   scenarios/                                                    │
46
- │     bug-01-claude-self-modification-gate.test.sh                │
47
- │     bug-05-worker-dead-on-reuse.test.sh                         │
48
- │     bug-07-post-sentinel-race.test.sh                           │
49
- │     bug-10-relaunch-phase-verify-hygiene.test.sh                │
50
- │     ...                                                         │
51
- │                                                                 │
52
- │   harness/                                                      │
53
- │     run-scenario.sh            (single scenario runner)         │
54
- │     run-all.sh                 (nightly suite runner)           │
55
- │     budget-guard.sh            (cost budget enforcement)        │
56
- │     capture-artifacts.sh       (post-run forensic bundle)       │
57
- │                                                                 │
58
- │   fixtures/                                                     │
59
- │     minimal-prd-1us.md         (1 user story, deterministic)    │
60
- │     minimal-prd-3us.md         (3 user stories)                 │
61
- │                                                                 │
62
- │   results/                                                      │
63
- │     <date>-<scenario>.json     (per-run outcome + cost + time)  │
64
- │     <date>-<scenario>.bundle/  (full campaign artifacts)        │
65
- └─────────────────────────────────────────────────────────────────┘
66
- ```
67
-
68
- Each scenario is a self-contained shell test:
69
- - Sets up a sandbox campaign
70
- - Runs `node ~/.claude/ralph-desk/node/run.mjs run <slug>` or via slash command
71
- - Asserts on final state (sentinel + status + artifacts)
72
- - Reports PASS/FAIL with provenance
73
-
74
- ## 5. Scenario authoring contract
75
-
76
- Every scenario must define:
77
-
78
- ```bash
79
- # scenario header (machine-readable)
80
- SCENARIO_ID="bug-10-relaunch-phase-verify-hygiene"
81
- SCENARIO_DESCRIPTION="Relaunch with operator-written phase=verify artifacts honored"
82
- SCENARIO_BUG_CATEGORY="d-recovery-hygiene"
83
- SCENARIO_HISTORICAL_BUG="Bug #10 (BOS 2026-05-07)"
84
- SCENARIO_COST_BUDGET_USD="3"
85
- SCENARIO_TIMEOUT_SECONDS="600"
86
- SCENARIO_REQUIRES="claude_cli OR codex_cli; tmux; jq"
87
-
88
- # scenario body: 4 parts
89
- # 1. SETUP — sandbox dir, init campaign, prepare fixture state
90
- # 2. EXERCISE — run actual rlp-desk against the fixture
91
- # 3. ASSERT — check final state matches expected
92
- # 4. REPORT — emit JSON outcome + capture artifacts
93
- ```
94
-
95
- The 4-part shape is non-negotiable. Scenarios that skip ASSERT or REPORT do not count toward gate coverage.
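The four-part contract above can be sketched as a bash skeleton. This is a minimal illustration under stated assumptions: the function names (`scenario_setup`, `scenario_exercise`, `scenario_assert`, `scenario_report`, `run_scenario`), the sandbox layout, and the JSON fields are placeholders, and the EXERCISE step is a stub that edits a file instead of invoking rlp-desk.

```bash
#!/usr/bin/env bash
# Hypothetical scenario skeleton; names and paths are illustrative only.
SCENARIO_ID="example-scenario"
SANDBOX="$(mktemp -d)"

scenario_setup() {      # 1. SETUP: prepare sandbox fixture state
  printf 'helo world\n' > "$SANDBOX/foo.txt"
}

scenario_exercise() {   # 2. EXERCISE: stub standing in for a real rlp-desk run
  sed -i.bak 's/helo/hello/' "$SANDBOX/foo.txt"
}

scenario_assert() {     # 3. ASSERT: final state must match the expectation
  grep -q '^hello world$' "$SANDBOX/foo.txt"
}

scenario_report() {     # 4. REPORT: emit a one-line JSON outcome
  local outcome=$1 seconds=$2
  printf '{"scenario":"%s","outcome":"%s","time_seconds":%d}\n' \
    "$SCENARIO_ID" "$outcome" "$seconds"
}

run_scenario() {
  local start outcome=FAIL
  start=$(date +%s)
  scenario_setup && scenario_exercise && scenario_assert && outcome=PASS
  scenario_report "$outcome" "$(( $(date +%s) - start ))"
  [ "$outcome" = PASS ]   # exit status mirrors the reported outcome
}
```

Keeping REPORT unconditional means a failed ASSERT still produces an outcome record, which matches the requirement that failures carry provenance.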
96
-
97
- ## 6. Cost + time discipline
98
-
99
- - **Budget per scenario**: $5 default, $10 max. Scenarios exceeding budget must be marked `EXPENSIVE` and only run weekly, not nightly.
100
- - **Timeout per scenario**: 600s default, 1800s max. Scenarios exceeding timeout are FAIL automatically.
101
- - **Total nightly budget**: $50. Suite must fit within budget or be sharded across nights.
102
- - **Per-PR subset**: scenarios marked `LIGHT` (under $1, under 60s) run on every PR. Others run nightly.
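
The tiers above could be assigned mechanically, as in this sketch. The function name `classify_scenario`, the `NIGHTLY` label, and the reading of "exceeding budget" as "over the $5 default" are assumptions; the spec itself names only `LIGHT` and `EXPENSIVE`.

```shell
# Hypothetical tier classifier. Thresholds come from the list above; the
# function name, argument order, and "NIGHTLY" label are assumptions.
classify_scenario() {
  # $1 = estimated cost in USD, $2 = expected runtime in seconds
  awk -v cost="$1" -v secs="$2" 'BEGIN {
    if (cost < 1 && secs < 60) print "LIGHT"      # runs on every PR
    else if (cost > 5)         print "EXPENSIVE"  # weekly only (assumed: over the $5 default)
    else                       print "NIGHTLY"    # everything else runs nightly
  }'
}
```

For example, `classify_scenario 3 600` yields `NIGHTLY`, while a zero-cost structural check such as `classify_scenario 0 30` yields `LIGHT`.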
103
-
104
- Cost tracking via `tests/sv-real-llm/harness/budget-guard.sh` — reads `cost-log.jsonl` from rlp-desk and aggregates pre-run/post-run deltas.
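
A rough sketch of the pre-run/post-run delta idea follows. It is not the contents of `budget-guard.sh`: the `cost_usd` field name, the one-JSON-object-per-line layout, and all three function names are assumptions about `cost-log.jsonl`. It uses `sed` and `awk` rather than a JSON parser to stay dependency-free, which only works for flat, single-line records.

```shell
# Sketch of pre-run/post-run cost aggregation. The "cost_usd" field name
# and the one-object-per-line layout are assumptions about cost-log.jsonl.
sum_cost_log() {
  # Print the total of every cost_usd value in the log ($1); 0 if absent.
  if [ ! -f "$1" ]; then
    echo "0.0000"
    return 0
  fi
  sed -n 's/.*"cost_usd"[[:space:]]*:[[:space:]]*\([0-9][0-9.]*\).*/\1/p' "$1" |
    awk '{ total += $1 } END { printf "%.4f\n", total + 0 }'
}

cost_delta() {
  # $1 = total before the run, $2 = total after the run
  awk -v pre="$1" -v post="$2" 'BEGIN { printf "%.4f\n", post - pre }'
}

within_budget() {
  # Succeeds when the amount spent ($1) does not exceed the budget ($2).
  awk -v spent="$1" -v budget="$2" 'BEGIN { exit !(spent <= budget) }'
}
```

A runner would call `sum_cost_log` before and after the scenario, pass both totals to `cost_delta`, and compare the result against `SCENARIO_COST_BUDGET_USD` with `within_budget`.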
105
-
106
- ## 7. First scenario (this PR's bootstrap deliverable)
107
-
108
- `tests/sv-real-llm/scenarios/bug-10-relaunch-phase-verify-hygiene.test.sh` — converts the existing PR-A test (Bug #10 fix) into a real-LLM scenario:
109
-
110
- 1. SETUP: create a sandbox campaign in `/tmp/sv-real-llm-bug-10/` and initialize it with the minimal PRD (1 US, "fix the typo in foo.txt", deterministic). Run iter-1 normally until it reaches the BLOCKED state (deliberately failing the verifier). Then perform an operator-style recovery: write `iter-signal.json` and `done-claim.json` by hand, set `status.phase=verify`, and remove the blocked sentinel.
111
- 2. EXERCISE: relaunch leader (`/oh-my-claudecode:autopilot` or `node run.mjs run`).
112
- 3. ASSERT:
113
- - Leader logs `[recovery] Resuming verify phase` (PR-A audit line)
114
- - No new `iter-001.worker-prompt.md` written (worker dispatch skipped)
115
- - Verifier dispatched and produces verdict
116
- - Final state: complete sentinel OR continue to iter-2
117
- 4. REPORT: write `<date>-bug-10.json` with `{outcome, time_seconds, cost_usd, log_path, captured_state_path}`.
118
-
119
- Cost estimate: $1-3 (single iter-2 with fast worker model + verifier).
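
The ASSERT and REPORT steps above could be sketched as follows. The function names are hypothetical and the caller supplies every path, since the actual log and artifact locations are not specified here. "Not rewritten" is approximated by comparing a checksum taken before the relaunch, which would miss a rewrite that produces identical content.

```shell
# Sketch of the ASSERT and REPORT steps. Function names are hypothetical;
# the caller supplies all paths.
prompt_fingerprint() {
  # Print a checksum of the worker prompt ($1), or "absent" if missing.
  if [ -f "$1" ]; then cksum < "$1"; else echo "absent"; fi
}

assert_verify_resumed() {
  # $1 = leader log, $2 = worker prompt path,
  # $3 = fingerprint recorded before the relaunch
  grep -qF '[recovery] Resuming verify phase' "$1" || {
    echo "FAIL: recovery audit line missing" >&2
    return 1
  }
  [ "$(prompt_fingerprint "$2")" = "$3" ] || {
    echo "FAIL: worker prompt was rewritten" >&2
    return 1
  }
  echo "PASS"
}

write_report() {
  # $1 = outcome, $2 = seconds, $3 = cost, $4 = log path, $5 = state path
  # Note: values are not JSON-escaped, so paths must not contain quotes.
  printf '{"outcome":"%s","time_seconds":%s,"cost_usd":%s,"log_path":"%s","captured_state_path":"%s"}\n' \
    "$1" "$2" "$3" "$4" "$5"
}
```

The fingerprint is recorded at the end of SETUP and checked after EXERCISE; verifier dispatch and the final sentinel would need further checks that this sketch leaves out.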
120
-
121
- ## 8. Scenario coverage roadmap
122
-
- | Scenario | Bug class | Status | Cost | Test type |
- |---|---|---|---|---|
- | bug-10-relaunch-phase-verify-hygiene | (d) recovery | ✅ landed PR #12 | $3 | real-LLM |
- | bug-10-blocked-recovery-counters | (d) recovery | ✅ landed (this PR set) | $2 | real-LLM |
- | bug-07-post-sentinel-race | (a) lifecycle | ✅ landed | $3 | real-LLM |
- | bug-08-worker-incomplete-leader-fallback | (b) contract | ✅ landed | $3 | real-LLM |
- | bug-01-claude-self-modification-gate | (c) LLM-runtime | ✅ landed | $3 | real-LLM |
- | bug-05-worker-dead-on-reuse | (a) lifecycle | ✅ landed | $2 | real-LLM |
- | bug-03-verifier-noprogress | (b) contract | ✅ landed | $0 | structural |
- | bug-04-verifier-noprogress-regression | (b) contract | ✅ landed | $0 | structural |
- | bug-06-claude-worker-idle-noprogress | (a) lifecycle | ✅ landed | $0 | structural |
- | bug-09-verified-us-persistence | (b) contract | ✅ landed | $0 | structural |
135
-
136
- **All 10 historical bugs (#1-#10) covered.** 6 scenarios are real-LLM (require campaigns) gated by `RLP_REAL_LLM_GATE=1`; 4 are structural (zero cost, regression-guard the fix code itself). Future PRs fixing new bug classes MUST add a scenario for that class.
137
-
138
- Real-LLM total budget: ~$16 for a full nightly run (4 × $3 + 2 × $2, per the Cost column of the table). Structural scenarios contribute zero cost. Per-PR LIGHT subset: structural-only (4 scenarios, <60s, $0).
139
-
140
- ## 9. Integration with existing SV gate
141
-
142
- The current `tests/sv-gate-fast.sh` becomes the LIGHT subset. New `tests/sv-gate-real-llm.sh` becomes the FULL nightly suite. Existing scenario format is preserved for grep/unit checks; real-LLM scenarios are additive, not a replacement.
143
-
144
- `tests/sv-self-verify-*.sh` pattern stays — those are per-PR scenarios. New scenarios under `tests/sv-real-llm/scenarios/` are the real-LLM additions.
145
-
146
- ## 10. Out of scope (deferred)
147
-
148
- - **Continuous deployment**: nightly schedule + GHA workflow. Add after first 3 scenarios prove the harness works.
149
- - **Cost dashboard**: aggregate cost trends over time. Add after first $50 month spent.
150
- - **Cross-engine A/B**: same scenario claude vs codex. Add after baseline established.
151
- - **Stress scenarios**: high-load, concurrent campaigns. Add after Phase B-E stabilization complete.
152
-
153
- ## 11. This PR's scope (bootstrap only)
154
-
155
- This PR ships:
156
- 1. This spec doc
157
- 2. `tests/sv-real-llm/harness/run-scenario.sh` — single-scenario runner skeleton
158
- 3. `tests/sv-real-llm/scenarios/bug-10-relaunch-phase-verify-hygiene.test.sh` — first real-LLM scenario, **gated by `RLP_REAL_LLM_GATE=1`** environment variable. Default: NOT executed. Operator runs explicitly.
159
- 4. `tests/sv-real-llm/README.md` — quick how-to-run
160
-
161
- NOT in this PR:
162
- - nightly schedule (deferred)
163
- - remaining 9 scenarios (deferred)
164
- - integration into `sv-gate-fast.sh` or any existing gate (deferred)
165
-
166
- The bootstrap is small ON PURPOSE. Establish the pattern, prove it works on one bug, then scale.
167
-
168
- ## 12. Verification
169
-
170
- - `bash tests/sv-real-llm/scenarios/bug-10-relaunch-phase-verify-hygiene.test.sh` runs in dry-mode (env not set) and prints "SKIPPED — RLP_REAL_LLM_GATE=1 to enable". No cost.
171
- - With `RLP_REAL_LLM_GATE=1` AND claude CLI available: scenario runs end-to-end. PASS = audit log line present + worker prompt iter-001 NOT rewritten + verifier dispatched. Cost ~$1-3.
172
- - Existing sv-gate-fast: 48/48 PASS unchanged.
173
- - Full Node suite: 339/339 PASS unchanged.
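
The opt-in behaviour in the first two bullets could be implemented with a small gate like the sketch below. The function name `real_llm_gate` is an assumption, and the check for an available claude CLI is left out.

```shell
# Sketch of the opt-in gate; the function name is an assumption.
real_llm_gate() {
  if [ "${RLP_REAL_LLM_GATE:-}" = "1" ]; then
    return 0   # gate open: the caller may run the real-LLM scenario
  fi
  # \342\200\224 is the UTF-8 encoding of the dash in the documented message.
  printf 'SKIPPED \342\200\224 RLP_REAL_LLM_GATE=1 to enable\n'
  return 1
}
```

A scenario would start with `real_llm_gate || exit 0`, so an unset variable prints the skip message and exits successfully without spending anything.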
174
-
175
- ## 13. Plan integration
176
-
177
- This PR advances Phase F from §5 of `docs/plans/v0.15-stabilization-plan.md`. After this PR lands, Phase F state changes from "planned" to "bootstrap complete; coverage roadmap active". Each subsequent PR closing a Phase B/C/D/E bug-class fix MUST add the corresponding real-LLM scenario per §8 table.