@ai-dev-methodologies/rlp-desk 0.15.1 → 0.15.3
- package/docs/plans/pr-e-phase-c1-blocked-recovery-hygiene-v0.md +233 -0
- package/docs/plans/v0.15-stabilization-phase-a-prep.md +130 -0
- package/docs/plans/v0.15-stabilization-plan.md +178 -0
- package/docs/plans/v0.16-real-llm-sv-gate-spec.md +177 -0
- package/package.json +1 -1
- package/src/node/run.mjs +10 -1
- package/src/node/runner/campaign-main-loop.mjs +90 -0
- package/src/scripts/lib_ralph_desk.zsh +75 -0
- package/src/scripts/run_ralph_desk.zsh +26 -0
@@ -0,0 +1,233 @@
# PR-E: Phase C1 - Blocked Sentinel Recovery Hygiene (Planner v0)

> **Plan reference**: `docs/plans/v0.15-stabilization-plan.md` §5 Phase C
> **Continuation of**: PR-A (Bug #10 phase=verify recovery, commit `95c0d4e`)
> **Stop rule**: codex critic APPROVE (P0+P1=0) before merge
> **Critic instruction**: approve unless P0 or P1 found

---

## 1. Problem

After PR-A landed (`phase=verify` recovery honored), the next recovery surface is **operator-cleared BLOCKED**.

Today, when operator clears `<slug>-blocked.md` to recover (the documented manual recovery for some BLOCKED reasons), `status.json` retains:
- `phase: "blocked"` (stale)
- `consecutive_failures` and `consecutive_blocks` counters at their pre-BLOCKED values
- `last_block_reason` populated

On leader relaunch:
1. `readCurrentState` (`src/node/runner/campaign-main-loop.mjs:364`) preserves all of these
2. Main loop iterates, tries to dispatch worker
3. If worker fails for any reason, `consecutive_failures` increments from its stale base
4. Circuit breaker may trip immediately even though operator's intent was "fresh start"
5. Result: campaign re-BLOCKs on first failure, operator's recovery effort wasted

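The relaunch sequence above can be illustrated with a small simulation. This is a sketch only: the threshold of 3 and the helper names `recordFailure` and `breakerTrips` are assumptions made for illustration, not the leader's actual circuit-breaker logic.

```javascript
// Hypothetical circuit-breaker threshold; the real value is not stated in this plan.
const MAX_CONSECUTIVE_FAILURES = 3;

// Returns a copy of the state with one more worker failure recorded.
function recordFailure(state) {
  return { ...state, consecutive_failures: state.consecutive_failures + 1 };
}

// True when the (assumed) breaker would re-BLOCK the campaign.
function breakerTrips(state) {
  return state.consecutive_failures >= MAX_CONSECUTIVE_FAILURES;
}

// Stale state left in status.json after the operator removed <slug>-blocked.md.
const stale = { phase: "blocked", consecutive_failures: 2, consecutive_blocks: 1 };
// State the operator presumably expected after clearing the sentinel.
const fresh = { phase: "worker", consecutive_failures: 0, consecutive_blocks: 0 };

console.log(breakerTrips(recordFailure(stale))); // true: one failure on a stale base trips
console.log(breakerTrips(recordFailure(fresh))); // false: one failure on a clean base does not
```

With a stale base of 2, a single new failure reaches the assumed threshold, which is the "re-BLOCK on first failure" outcome described in step 5.
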
This is the same class as Bug #10 (PR-A): operator's recovery intent silently discarded because leader doesn't recognize the recovery surface.

---

## 2. Principles (3)

1. **Operator's recovery intent is the source of truth.** When BLOCKED sentinel is gone but status.json still says blocked + counters stale, the operator clearly meant to reset state. Leader must recognize and honor.
2. **Recovery validation must be strict (mirror PR-A).** Auto-honoring without checks risks accidental honor of crashed-mid-write states. PR-A's 5-check pattern is adapted to the blocked-recovery context (4 checks after review, see §7).
3. **Defensive default: fall through, don't break.** If validation fails, log the reason and proceed with current behavior (no auto-reset). Recovery feature can never make existing flows worse.

## 3. Decision drivers (top 3)

| # | Driver | Why |
|---|---|---|
| D1 | **Operator recovery completeness** | PR-A covered phase=verify; phase=blocked is the most-common operator recovery (clear sentinel → relaunch). Closing this gap completes the pair. |
| D2 | **Mirror PR-A pattern** | Same shape (entry-time validate + flag + audit log) reduces cognitive load for future readers. Both Node + zsh same way. |
| D3 | **Counter reset honesty** | Operator clearing the sentinel implies intent to retry from clean state. Stale counters silently re-BLOCK = surprise. |

---

## 4. Viable options

### Option A: Entry-time blocked-recovery branch (mirror of PR-A) **[recommended]**

After `readCurrentState`, before main loop, add a second recovery branch:
- IF `state.phase === 'blocked'` AND blocked sentinel does NOT exist (operator cleared) AND counters are non-zero → **operator-cleared recovery detected**
- Validate (4 checks, see §7)
- On pass: reset phase to 'worker', reset counters to 0, log audit line
- On fail: fall through (current behavior: campaign continues with stale state, may immediately re-BLOCK)

Pros: surgical (single branch, ~30 LOC each side), pattern matches PR-A exactly, defensive default.
Cons: adds another entry-time check (small overhead).

### Option B: Reset on every relaunch unless explicit "preserve counters" flag

Always reset counters when relaunching with no BLOCKED sentinel. Add `--preserve-counters` flag for users who want stale-counter behavior.

Pros: simpler logic.
Cons: changes existing behavior for users who didn't experience this issue. Breaks back-compat for anyone relying on counter persistence across relaunches.

→ **Rejected**: violates principle 3 (defensive default).

### Option C: Document operator workaround instead of code change

Add cookbook entry: "after clearing blocked sentinel, also `jq` zero out counters in status.json".

Pros: zero code change.
Cons: pushes burden to operator. Same class of failure as Bug #10's pre-PR-A state; the leader should recognize recovery, not require operator jq pipelines.

→ **Rejected**: violates principle 1.

**Recommendation: A.**

---

## 5. Scope

### P0: must land

1. **Node leader entry-time blocked-recovery branch** (`src/node/runner/campaign-main-loop.mjs`):
   - New helper `_validateBlockedRecovery({ paths, state })`: returns `{ ok: bool, reason: string }`. 4 checks (§7).
   - Branch after readCurrentState (around line 1392, where PR-A branch sits): if `phase === 'blocked'` and validator passes, reset phase + counters + log.

2. **zsh runner mirror** (`src/scripts/run_ralph_desk.zsh`):
   - Mirror helper `_validate_blocked_recovery` in `lib_ralph_desk.zsh`
   - Mirror entry-time branch (similar location to PR-A's site at `:3047-3071` range)

### P1: must land

3. **Tests**:
   - `tests/node/test-blocked-recovery-hygiene.test.mjs` (NEW, 5 ACs):
     - AC-BR1: phase=blocked + sentinel absent + counters non-zero + valid → reset + dispatch worker normally
     - AC-BR2: phase=blocked + sentinel PRESENT → don't auto-recover, throw "Run clean first" (existing behavior preserved)
     - AC-BR3: phase=blocked + sentinel absent + counters all zero → fall through (nothing to reset)
     - AC-BR4: phase=verify + sentinel absent → defer to PR-A's branch (no double-handling)
     - AC-BR5: phase=blocked + sentinel absent + `<slug>-blocked.json` sidecar has `recoverable: false` (e.g. `reason_category: mission_abort`) → fall through, log "non-recoverable category <reason_category> from sidecar"
   - `tests/test-blocked-recovery-zsh.sh` (NEW, 5 helper-level scenarios mirroring AC-BR1..5)

### P2: nice-to-have (deferred)

- Cookbook entry in `docs/rlp-desk/getting-started.md` documenting the recovery flow now that leader honors it
- Telemetry analytics: track how often operator-cleared recovery is detected (signal of campaign reliability)

---

## 6. Files to modify

| File | Change | Risk |
|---|---|---|
| `src/node/runner/campaign-main-loop.mjs` | `_validateBlockedRecovery` helper + entry-time branch | LOW (pattern proven by PR-A) |
| `src/scripts/lib_ralph_desk.zsh` | `_validate_blocked_recovery` helper | LOW |
| `src/scripts/run_ralph_desk.zsh` | Entry-time branch (near `:3047-3071` range) | LOW |
| `tests/node/test-blocked-recovery-hygiene.test.mjs` (NEW) | 5 ACs | LOW |
| `tests/test-blocked-recovery-zsh.sh` (NEW) | 5 zsh scenarios | LOW |

Total: 3 modified + 2 new = 5 files. Smaller surface than PR-A.

---

## 7. Validator: 4 checks (`_validateBlockedRecovery`), Codex-revised v2

Codex critic P1-1 finding: v1's Check 4 depended on `last_block_reason` field but **no code path persists that field to status.json**. Both Node `_emitBlockedSentinel` and zsh `write_blocked_sentinel` skip it. So the v1 validator would never block auto-recovery for mission_abort/repeat_axis, exactly the safety case it was designed for.

**v2 fix**: detect non-recoverable categories from the **`<slug>-blocked.json` sidecar** (which `_emitBlockedSentinel` DOES write at L942-965, with `reason_category` + `recoverable` fields), not from status.json. The sidecar persists even when operator manually `rm <slug>-blocked.md`; they don't usually delete the sidecar.

Returns `{ ok: bool, reason: string }`:

1. `state.phase === 'blocked'` (precondition)
2. Blocked sentinel `<slug>-blocked.md` does NOT exist (operator cleared)
3. At least one of: `consecutive_failures > 0`, `consecutive_blocks > 0` (something to reset; if all counters zero, fall through, since there is nothing to recover from)
4. **Sidecar safety check**: if `<slug>-blocked.json` exists AND parses AND has `recoverable: false` → fall through with audit log "non-recoverable category <reason_category> from sidecar". If sidecar absent (e.g. user ran full `clean`) OR sidecar `recoverable: true` → proceed with auto-recovery. Mirrors `_classifyBlock` `recoverable` invariant; no new status field needed.

(Check 5 from v0, 30-day staleness, DROPPED. Architect-flagged as arbitrary.)

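The four checks could be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped helper: `paths.blockedSidecar` (the path of `<slug>-blocked.json`) and the reason strings are hypothetical, and the function name drops the leading underscore to mark it as a stand-in.

```javascript
// Sketch of the four checks in §7. Assumed names: `paths.blockedSidecar`
// (path to <slug>-blocked.json) and every reason string below.
async function validateBlockedRecovery({ paths, state }) {
  const { access, readFile } = await import("node:fs/promises");
  const exists = (p) => access(p).then(() => true, () => false);

  // Check 1: precondition, only a stale "blocked" phase qualifies.
  if (state.phase !== "blocked") {
    return { ok: false, reason: "phase is not blocked" };
  }
  // Check 2: the operator must have removed the sentinel.
  if (await exists(paths.blockedSentinel)) {
    return { ok: false, reason: "blocked sentinel still present" };
  }
  // Check 3: at least one counter must be non-zero, otherwise nothing to reset.
  const failures = state.consecutive_failures ?? 0;
  const blocks = state.consecutive_blocks ?? 0;
  if (failures <= 0 && blocks <= 0) {
    return { ok: false, reason: "no counters to reset" };
  }
  // Check 4: a sidecar that exists, parses, and says recoverable:false vetoes
  // recovery. A missing or unparseable sidecar does not veto.
  let sidecar = null;
  try {
    sidecar = JSON.parse(await readFile(paths.blockedSidecar, "utf8"));
  } catch {
    sidecar = null;
  }
  if (sidecar && sidecar.recoverable === false) {
    return {
      ok: false,
      reason: `non-recoverable category ${sidecar.reason_category} from sidecar`,
    };
  }
  return { ok: true, reason: "operator-cleared blocked recovery" };
}
```

Treating an unparseable sidecar as "no veto" follows the literal wording of check 4 (exists AND parses AND `recoverable: false`); a stricter variant that refuses recovery on a corrupt sidecar would be a different design choice.
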
On pass: caller resets `state.phase = 'worker'`, `state.consecutive_failures = 0`, `state.consecutive_blocks = 0`. Sidecar (if exists) is RENAMED to `<slug>-blocked.json.recovered-<iso>` for audit trail rather than deleted, so operator can inspect what was recovered from. Then logs:

```
[recovery] Operator-cleared BLOCKED detected (was: <last_block_reason>). Resetting counters and resuming as worker. iter=N us_id=<current_us>
```

On fail: log `[recovery] phase=blocked ignored: <reason>` and fall through to existing behavior.

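A minimal sketch of the on-pass step, written as a pure function so the reset, the audit name, and the log line can be inspected without touching the file system. The function name, its parameters, and the sample values (`worker_timeout`, `US-002`, the sidecar path) are hypothetical.

```javascript
// Pure sketch of the "on pass" step. It only computes values; the caller
// would persist the state and rename the sidecar if it exists.
function planBlockedRecovery({ state, sidecarPath, previousReason, iter, usId, now = new Date() }) {
  // Reset phase and both counters; all other status fields are kept as-is.
  const nextState = { ...state, phase: "worker", consecutive_failures: 0, consecutive_blocks: 0 };
  // Audit name in the <slug>-blocked.json.recovered-<iso> form.
  const renamedSidecar = `${sidecarPath}.recovered-${now.toISOString()}`;
  const logLine =
    `[recovery] Operator-cleared BLOCKED detected (was: ${previousReason}). ` +
    `Resetting counters and resuming as worker. iter=${iter} us_id=${usId}`;
  return { nextState, renamedSidecar, logLine };
}

// Example with made-up values.
const plan = planBlockedRecovery({
  state: { phase: "blocked", consecutive_failures: 3, consecutive_blocks: 1, iteration: 7 },
  sidecarPath: "memos/demo-blocked.json",
  previousReason: "worker_timeout", // invented category, for illustration only
  iter: 7,
  usId: "US-002",
  now: new Date("2026-05-07T00:00:00.000Z"),
});
console.log(plan.renamedSidecar);
console.log(plan.logLine);
```

One detail worth checking in a real implementation: an ISO timestamp contains colons, which are not valid in Windows file names, so the `<iso>` suffix may need a different encoding on that platform.
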
---

## 8. Pre-mortem (3 scenarios)

### S1: Auto-recovery hides genuine problem
Campaign keeps BLOCKING because of a real architectural issue. Operator clears sentinel each time. Auto-recovery resets counters → CB never trips → infinite loop of fail+clear.

**Mitigation**: operator-cleared recovery is exactly that: the operator chose to retry. If they keep clearing without fixing, the bug pattern is operator behavior, not leader's. Counters resetting is correct; CB still trips on the freshly-accumulated counters from current session. Leader doesn't enable infinite loops, operator does.

**Residual risk**: low. If operator wants CB to persist across relaunches, they can leave the sentinel and use `clean` workflow instead.

### S2: Mid-write status.json read produces inconsistent state
A previous leader instance crashed mid-`writeStatus`. Relaunch reads partial JSON.

**Mitigation (corrected per Codex critic P2 backlog)**: `writeStatus` uses `writeJson` → `fs.writeFile` directly (NOT atomic rename). Partial writes are theoretically possible. If JSON is malformed, `readJsonIfExists` THROWS (not returns null), so the leader fails fast at startup with parse error, surfacing the corruption to operator. Auto-recovery never proceeds because leader doesn't even reach the validator. This is acceptable: corrupted status.json is operator-visible immediately, not silently recovered. P2 backlog item: consider migrating writeStatus to atomic rename for crash safety, but that's a separate PR.

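For the P2 backlog item, an atomic-rename write might look roughly like the sketch below. `writeJsonAtomic` is a hypothetical name and this is not the package's `writeJson`; it only shows the write-temp-then-rename technique.

```javascript
// Sketch of an atomic JSON write: write a temp file in the same directory,
// then rename it over the target. `writeJsonAtomic` is a hypothetical helper.
async function writeJsonAtomic(filePath, data) {
  const { writeFile, rename, rm } = await import("node:fs/promises");
  // Same directory as the target, so the rename stays on one file system.
  const tmpPath = `${filePath}.${process.pid}.tmp`;
  try {
    await writeFile(tmpPath, JSON.stringify(data, null, 2) + "\n", "utf8");
    // On POSIX file systems rename() replaces the target atomically, so a
    // reader sees either the old file or the new one, never a partial write.
    await rename(tmpPath, filePath);
  } catch (err) {
    await rm(tmpPath, { force: true }); // best-effort cleanup of the temp file
    throw err;
  }
}
```

The sketch does not call `fsync`, so it protects against a crashed writer leaving a half-written file but makes no durability promise across power loss.
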
### S3: Race: operator clears sentinel while leader is starting
Operator deletes blocked.md just as leader's `await exists(paths.blockedSentinel)` runs. Two outcomes:
- Sentinel exists during check → existing "Run clean first" error throws (existing behavior, unchanged)
- Sentinel missing during check → enters validator → if checks pass, recovery proceeds

**Mitigation**: this is a benign race. Both outcomes are valid (operator either succeeded in clearing or didn't). No corruption possible.

---

## 9. Test plan

### Unit (Node)

`tests/node/test-blocked-recovery-hygiene.test.mjs`:

Each AC sets up a fixture (status.json + memos/) per its scenario, runs the leader to first dispatch decision, asserts on dispatch behavior + log content.

- AC-BR1 happy: setup phase=blocked, sentinel absent, consecutive_failures=3 → assert leader dispatches worker (not throw), assert state.consecutive_failures === 0 in next status write, assert audit log line matches
- AC-BR2 sentinel present: setup phase=blocked, sentinel exists → assert leader throws "Run clean first" (existing behavior preserved)
- AC-BR3 nothing to reset: phase=blocked, sentinel absent, all counters zero, last_block_reason empty → assert fall-through (no log line, no reset, normal worker dispatch)
- AC-BR4 phase=verify defers: phase=verify, sentinel absent → assert PR-A logic runs (no blocked recovery handling)
- AC-BR5 non-recoverable category: phase=blocked, sentinel absent, `<slug>-blocked.json` sidecar has `recoverable: false` (e.g. `reason_category: 'mission_abort'`) → assert fall-through with log "non-recoverable category"

### Integration (zsh)

`tests/test-blocked-recovery-zsh.sh` (helper-level, mirrors `test-bug10-zsh-relaunch-hygiene.sh`):

- Scenario BR-Z1: all 4 checks pass → `_validate_blocked_recovery` returns 0
- Scenarios BR-Z2..BR-Z5: one check fails per scenario → returns 1 with reason matching expected substring

### Regression

- Full Node suite: 334/334 must remain green
- Bug #10 PR-A tests (test-relaunch-phase-verify-hygiene.test.mjs) must remain green
- Bug #7 zsh regression must remain green

---

## 10. Verification end-to-end

1. `node --test tests/node/test-blocked-recovery-hygiene.test.mjs` → 5/5 PASS
2. `bash tests/test-blocked-recovery-zsh.sh` → 5/5 PASS
3. Full Node suite + Bug #7 regression unchanged green
4. **`zsh tests/sv-gate-fast.sh` PASS** (governance §1g pre-merge gate, Codex critic P1-2)
5. Manual sandbox: deliberately BLOCKED campaign → operator clears blocked.md → relaunch → leader logs `[recovery] Operator-cleared BLOCKED detected`, counters reset, worker dispatches, campaign continues. Repeat with `recoverable: false` sidecar → leader logs fall-through, no auto-recovery.
6. **AC-BR5 fixture must use real `_emitBlockedSentinel` flow** (Codex critic P1-1): write a sentinel via the actual code path, then test recovery against it. Not hand-authored status.json.

**Release (NOT part of this PR's verification, per Codex critic P1-2 + CLAUDE.md absolute rules)**: any version bump, GitHub release, or npm publish is a SEPARATE user-approved action that follows merge. This PR's verification ends at items 1-6 above. Release decisions are not auto-flow.

---

## 11. ADR (preview)

- **Decision**: extend PR-A's recovery-hygiene pattern to `phase=blocked` operator-cleared scenario
- **Drivers**: D1 operator-recovery completeness, D2 mirror PR-A pattern, D3 counter reset honesty
- **Alternatives considered**: B (always reset, breaks back-compat), C (doc-only, pushes burden to operator)
- **Why chosen**: A surgical, pattern-proven, defensive default, completes Phase C1 without scope creep
- **Consequences**: operator-cleared BLOCKED relaunches now work as intended; no need for jq counter-reset cookbook; logs add `[recovery]` lines visible via `/rlp-desk logs`
- **Follow-ups**: Phase C2 (mid-iter crash recovery), Phase C3 (cross-mission queue recovery), Phase C4 (cookbook entry)

---

## 12. Round-by-round resolution log

| Round | Reviewer | Verdict | Findings |
|---|---|---|---|
| 0 | - | Planner v0 | initial draft |
| 1 | Architect (Claude inline) | ITERATE | 5 edits applied → v1: drop 30-day, add previous_block_reason, expand Check 4 prose, branch ordering, _skipNextWorkerDispatch comment |
| 2 | Codex Critic | ITERATE (0 P0, 2 P1) | P1-1: Check 4 redesigned to use `<slug>-blocked.json` sidecar `recoverable` field (status.json never persists `last_block_reason`). P1-2: §10.5 release auto-flow removed; SV gate + user approval explicit. P2: S2 mitigation prose corrected (writeJson is not atomic rename; readJsonIfExists throws not returns null). All applied → v2 (current). |
| 3 | Codex Critic | **APPROVE** (0 P0, 0 P1) | P1-1/P1-2 both closed. §7 sidecar-based gate validated. §10 sv-gate + user-approved release confirmed. **Loop terminated. Implementation can proceed.** |
@@ -0,0 +1,130 @@
# Phase A: Empirical omc Baseline (Prep for Next Session)

> **Plan reference**: `docs/plans/v0.15-stabilization-plan.md` §5 Phase A
> **Goal**: measure omc /ralph + /team reliability empirically. Establishes the bar rlp-desk needs to reach.
> **Output**: `docs/plans/v0.15-stabilization-omc-baseline.md` with per-test metrics.
> **NOT a competition**: this is a measurement to set the stabilization target. omc is benchmark, not replacement.

---

## Why baseline measurement comes first

Before changing rlp-desk's substrate (Phase B-F), we need to know what "omc-level reliability" actually looks like for our workload. Otherwise stabilization targets are guesswork.

The 4 differentiators rlp-desk MUST preserve are not measured by omc; they're rlp-desk-only by design. So the baseline measurement focuses on what omc DOES have: per-iter Worker→Verifier, PRD-driven loop, multi-agent coordination, mandatory deslop, regression re-verification.

## Three test campaigns

### A1: single-iter, single-story
- **Workload**: small TypeScript fix in a sandbox repo (or BOS apps/web/ on a contained file)
- **Tool**: `/oh-my-claudecode:ralph "Fix the unused variable warning in <specific-file>"`
- **Measure**:
  - operator-touch count (target = 0)
  - total time (entry → completion)
  - cost (token usage if available, otherwise rough estimate)
  - did mandatory deslop pass run?
  - did regression re-verification pass?

### A2: multi-iter, multi-story
- **Workload**: synthetic PRD with 3 stories, each story trivial but distinct
- **Tool**: `/oh-my-claudecode:ralph` with auto-generated prd.json refined to 3 stories
- **Measure**:
  - operator-touch count per story
  - story completion order (sequential as PRD'd, or chosen by ralph?)
  - failure recovery if a story blocks (does it advance to next or stop?)
  - prd.json final state (all stories `passes: true`?)
  - reviewer behavior (was each story actually verified against acceptance criteria?)

### A3: parallel team
- **Workload**: 3-task synthetic spec where tasks are truly independent
- **Tool**: `/oh-my-claudecode:team 3:executor "<spec>"`
- **Measure**:
  - parallelism (do 3 agents really run concurrently?)
  - lock contention (any deadlocks on shared task list?)
  - inter-agent messaging (any SendMessage between teammates?)
  - completion time vs sequential estimate
  - cleanup (does TeamDelete actually run on completion?)

## Sandbox setup

Recommend a throwaway test repo to avoid contaminating BOS or rlp-desk:

```bash
# -p keeps the command from failing if the sandbox directory already exists
mkdir -p /tmp/omc-baseline-sandbox && cd /tmp/omc-baseline-sandbox
git init
# Add 1-2 simple TypeScript files with intentional issues for A1
# Add a synthetic PRD for A2
# Add a 3-task spec for A3
```

This isolates measurement from real product work and lets us reset between tests.

## Output schema

`docs/plans/v0.15-stabilization-omc-baseline.md`:

```markdown
# omc Baseline Measurement: Phase A Output

## A1 (single-iter, single-story)
- Workload: <description>
- Result: PASS / PARTIAL / FAIL
- Operator-touch: N
- Time: Xm Ys
- Cost: $X.XX
- Deslop ran: YES/NO
- Regression re-verify: PASS/FAIL/N-A
- Subjective notes: ...

## A2 (multi-iter, multi-story)
- Workload: <description>
- Result: PASS / PARTIAL / FAIL
- Stories completed: N/3
- Operator-touch: N per story
- Failure recovery observed: YES/NO + behavior
- Total time: Xm Ys
- prd.json final state: ...
- Reviewer per-story verification: YES/NO

## A3 (parallel team)
- Workload: <description>
- Result: PASS / PARTIAL / FAIL
- Parallelism observed: YES/NO + concurrent count
- Lock contention: NONE/ONE/MULTIPLE
- Total time vs sequential: Xm vs Ym
- Cleanup: COMPLETE/PARTIAL/MANUAL

## Synthesis: what is "omc-level reliability"?

[1-2 paragraphs translating the metrics into a target for rlp-desk]

## Phase B implications

[which omc patterns should rlp-desk Phase B adopt for tmux/process lifecycle race?]
```

## Cost estimate

3 small test campaigns: total ~$5-15. Sandbox runs are bounded (no real product blast radius).

## How to run (next session)

Open Claude Code in /tmp/omc-baseline-sandbox (after creating it per above):

```
/oh-my-claudecode:ralph "Fix the unused variable warning in src/index.ts"
```

Wait for completion. Record metrics. Reset sandbox. Repeat for A2 and A3 with the synthetic PRD/spec.

After all three: write `docs/plans/v0.15-stabilization-omc-baseline.md` in rlp-desk repo with the synthesis.

## What this enables

Once Phase A is done, Phase B (tmux/process lifecycle race hardening) has a concrete target: "match omc /ralph's A1 behavior on operator-touch within ±20%, while preserving multi-engine consensus + multi-mission queue."

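The ±20% target could be checked with a small comparison like the one below. This is a sketch: `withinTolerance` is a hypothetical helper, and the zero-baseline rule is an assumption, since the plan does not say how a baseline of 0 (the A1 operator-touch target) should be treated.

```javascript
// Sketch of the ±20% comparison against the omc baseline.
// `withinTolerance` and the zero-baseline rule are assumptions.
function withinTolerance(baseline, measured, tolerance = 0.2) {
  if (baseline === 0) {
    // A relative band around 0 has zero width, so require an exact match.
    return measured === 0;
  }
  return Math.abs(measured - baseline) <= tolerance * Math.abs(baseline);
}

console.log(withinTolerance(10, 11)); // true: 10% above the baseline
console.log(withinTolerance(10, 13)); // false: 30% above the baseline
console.log(withinTolerance(0, 1));   // false: any touch misses a zero baseline
```

If the measured omc operator-touch count does come out at 0, a relative tolerance stops being meaningful and an absolute allowance may be the more useful way to state the Phase B target.
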
Without Phase A, Phase B targets are imaginary.

## Honest scope note

This is preparation work. The Phase A run itself happens in a fresh session (sandbox cwd). This prep doc + the v0.15-stabilization-plan.md are the carry-over artifacts.
@@ -0,0 +1,178 @@
# rlp-desk Stabilization Plan (v0.15.x → v0.16.x)

> **Status**: ACTIVE. Replaces the misdirected 2026-05-07 "pivot to omc" decision (PR #8 redirect via this plan).
> **Goal**: bring rlp-desk to omc /team/ralph/ralplan level of reliability **while preserving rlp-desk's self-driving advantages**.
> **Non-goal**: pivoting away from rlp-desk. omc is the **benchmark**, not the replacement.

---

## 0. Why this plan exists (correction note)

On 2026-05-07 morning I (the assistant) ran `plan-ceo-review` on the question "rlp-desk vs omc /team" and produced a recommendation to enter maintenance mode and pivot to omc. The user immediately corrected: *the goal was always to make rlp-desk work as reliably as omc, NOT to replace it*.

This plan is the corrected direction: stabilize rlp-desk by learning from omc's patterns, applying them to rlp-desk's substrate, while protecting the four real differentiators that make rlp-desk worth using in the first place.

The misdirected commit `229e1b6` (the "maintenance mode" banner + FROZEN doc) is now reverted in this PR. The pivot prompt-optimizer artifact and BOS validation plan stay on disk but are deferred; they may become useful later as a comparison study, but they are not the active path.

---

## 1. The vision (preserved verbatim)

1. ralph-loop fresh-context per iteration (no context pollution)
2. idea → plan distillation
3. PRD formalization
4. Worker/Verifier cycles with iterative improvement
5. **Full autonomy: minimum operator intervention**

This vision is the core. Stabilization is in service of it, not a substitute for it.

---

## 2. Differentiators to preserve (rlp-desk-only)

These four are the reason rlp-desk exists separately from omc. Stabilization work MUST NOT compromise them:

1. **Multi-engine parallel consensus per iteration**: `--consensus all` runs claude AND codex on every verification, then reconciles. omc /ralph supports `--critic=codex` but as a single critic, not parallel consensus.
2. **Multi-mission queue + cross-mission analytics**: `RLP_BACKGROUND=1` chains missions and tracks cross-mission metrics. omc /team is single-task.
3. **BLOCK_TAGS P1-D failure taxonomy**: structured `reason_category × recoverable × suggested_action` classification. omc emits simpler verdicts (pass/fail/blocked).
4. **Structured SV reports**: post-campaign analytics at `~/.claude/ralph-desk/analytics/<slug>/self-verification-report-NNN.md`. omc has lighter `progress.txt`.

These four ARE the value proposition. The stabilization work below is about making the substrate that delivers them as reliable as omc's.

---

## 3. The 10-bug regression pattern (what we're hardening against)

Six weeks (2026-05-01 to 2026-05-07), 10 bugs, each prior fix exposing the next. Categorized:

| Cat | Bugs | Root cause cluster |
|---|---|---|
| (a) tmux/process lifecycle race | #5, #6, #7, #10 | Long-lived TUI processes in tmux panes; sentinel polling races; recovery hygiene |
| (b) artifact contract / schema | #3, #4, #8, #9 | Worker/Verifier output contract violations; LLM non-determinism on schema; verified_us persistence |
| (c) LLM-runtime constraint | #1 | Claude Code `.claude/` self-modification gate blocking sentinel writes |
| (d) recovery hygiene | #10 | Manual recovery on relaunch silently overwritten |

**Per category, what omc does differently** (preliminary, to be verified empirically in §5):

- **(a) Lifecycle race**: omc /team uses Claude Code native team primitives (`TeamCreate`, `TaskCreate`, `SendMessage`). No tmux, no long-lived TUI, no sentinel polling. Process lifecycle = subagent lifecycle = single Claude Code call. Race window does not exist.
- **(b) Contract violations**: omc /ralph uses `prd.json` with `passes: bool` per story + reviewer verifies acceptance criteria. Simpler schema = less surface for LLM to violate. omc also has mandatory deslop pass + regression re-verification (`ai-slop-cleaner` + Step 7.6).
- **(c) Self-modification gate**: omc skills are read by Claude Code via the Skill tool, not written by Workers. Workers don't touch `.claude/` paths. Gate not encountered.
- **(d) Recovery**: omc /ralph is session-scoped (`.omc/state/sessions/{sessionId}/prd.json`). Per-session state means relaunch starts fresh; there is no "manual recovery" surface to break.

These are the patterns to learn from. Adopting them does NOT require pivoting away from rlp-desk; it requires bringing equivalent semantics into rlp-desk's substrate.

---

## 4. Stabilization principles

1. **omc is benchmark, not replacement.** Every change in this plan asks "how does omc avoid this failure mode?" then engineers an equivalent for rlp-desk's stack.
2. **Preserve all 4 differentiators.** No change should compromise multi-engine consensus, multi-mission queue, BLOCK_TAGS taxonomy, or SV reports.
3. **Substrate first, features second.** Bug categories (a) and (d) are substrate. Categories (b) and (c) are surface. Fix substrate first; surface improvements compound on a stable base.
4. **Real-LLM SV gate.** The current SV gate's grep+unit-test labeling missed 10 production bugs. SV must be strengthened to actually catch production failure modes (subset of campaigns run with full claude/codex worker+verifier in CI-like mode).
5. **Increment by category.** Each PR closes ONE bug category, not multiple. Avoids "fix-of-fix-of-fix" pattern that produced #4 (regression of #3).

---

## 5. Concrete workstream (revised, per category)

### Phase A: Empirical omc baseline (W1, ~3 days)

Before changing rlp-desk, measure omc reliably. Three test campaigns:

| Test | Workload | Measure |
|---|---|---|
| A1 | omc /ralph "fix small TS error in BOS apps/web/" | operator-touch count, time, cost |
| A2 | omc /ralph + multi-iter (3+ stories) on a synthetic PRD | operator-touch count, recovery behavior |
| A3 | omc /team "implement small feature with 3:executor" on synthetic task | parallelism behavior, lock contention |

**Output**: `docs/plans/v0.15-stabilization-omc-baseline.md` with per-test metrics. Not a competition, a *measurement*. Establishes the bar rlp-desk needs to reach.

### Phase B: Category (a) substrate hardening (W1-W3, ~2 weeks)

The largest cluster (4 of 10 bugs). Goal: tmux/process lifecycle race window → 0 in `--mode tmux`. `--mode native` already addresses this differently; the work here is `--mode tmux`.

Sub-deliverables:
- B1: lifecycle audit (every tmux send-keys / sentinel write / pane reuse; ASCII diagram of timing windows)
- B2: post-sentinel reaper invariant test (extend Bug #7 fix coverage to all sentinel writes, not just per-US)
- B3: real-LLM SV scenario for category (a): actual claude/codex worker dispatched, lifecycle race triggered deterministically, fix verified
- B4: lifecycle observability (debug log emits race-window measurements per iteration)

|
|
100
|
+
### Phase C — Category (d) recovery hygiene completion (W3-W4, ~1 week)
|
|
101
|
+
|
|
102
|
+
Bug #10's PR-A fix covers `phase=verify` honor. Remaining recovery surfaces:
|
|
103
|
+
- C1: phase=blocked recovery (operator clears blocked sentinel + restarts) — currently honored, verify with test
|
|
104
|
+
- C2: phase=worker mid-iter crash recovery (leader killed mid-worker dispatch) — verify, fix if broken
|
|
105
|
+
- C3: cross-mission queue recovery (one mission BLOCKED, queue advances) — verify
|
|
106
|
+
- C4: documented operator recovery cookbook with deterministic jq pipelines
|
|
107
|
+
|
|
108
|
+
### Phase D — Category (b) contract hardening (W4-W6, ~2 weeks)
|
|
109
|
+
|
|
110
|
+
LLM contract violations are partly inevitable, but the harness can reduce the surface:
|
|
111
|
+
- D1: schema validator at every artifact write (already exists for some; extend to all done-claim/iter-signal/verdict variants)
|
|
112
|
+
- D2: feedback loop — when worker violates contract, next iteration's prompt includes the schema error verbatim (omc-style)
|
|
113
|
+
- D3: verified_us persistence audit (Bug #9) — `status.json` is the source of truth, memory.md is supplementary, contract clear in code
|
|
114
|
+
- D4: real-LLM SV scenario for category (b)
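
The D2 feedback loop can be pictured as a small prompt-building step. This is a minimal sketch only: `buildNextWorkerPrompt`, the feedback heading, and the example error string are hypothetical and do not come from the rlp-desk source.

```javascript
// Hypothetical sketch of D2: carry schema errors from the previous iteration
// into the next worker prompt verbatim, so the worker sees exactly what failed.
function buildNextWorkerPrompt(basePrompt, schemaErrors) {
  if (schemaErrors.length === 0) return basePrompt; // nothing to feed back
  const feedback = schemaErrors.map((e) => `- ${e}`).join("\n");
  // The heading text below is an assumption, not an rlp-desk convention.
  return `${basePrompt}\n\n## Contract violations from the previous iteration\n${feedback}\n`;
}

// Usage with an illustrative (made-up) validator message:
const next = buildNextWorkerPrompt("Implement US-001.", [
  "done-claim.json: missing required field us_id",
]);
console.log(next);
```

Passing the validator's message through unchanged keeps the feedback grounded in what the validator actually reported, rather than in a paraphrase of it.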
|
|
115
|
+
|
|
116
|
+
### Phase E — Category (c) LLM-runtime constraint awareness (W6-W7, ~1 week)
|
|
117
|
+
|
|
118
|
+
`.claude/` self-modification gate (Bug #1):
|
|
119
|
+
- E1: Worker prompt explicitly states "do NOT touch `.claude/`; sentinel paths are at `.rlp-desk/memos/`" (already done in v0.13.0 path migration; verify)
|
|
120
|
+
- E2: claude worker pre-flight check — try a no-op write to `.rlp-desk/` before main work; fail fast if blocked
|
|
121
|
+
- E3: cross-engine fallback — when claude worker hits permission gate, mid-flight fallback to codex worker for that iter (already partial; complete)
|
|
122
|
+
|
|
123
|
+
### Phase F — Real-LLM SV gate (W7-W8, ~2 weeks)
|
|
124
|
+
|
|
125
|
+
The biggest framework upgrade:
|
|
126
|
+
- F1: define "SV scenario" = complete real campaign (1-3 iter, real claude/codex, real tmux or native) executed in CI nightly
|
|
127
|
+
- F2: each merged PR adds at least one SV scenario covering the bug it fixed (Bug #1-#10 retroactively)
|
|
128
|
+
- F3: SV gate becomes "all real-LLM scenarios PASS" before npm publish — replaces the current grep-and-label SV
|
|
129
|
+
- F4: cost budget for SV gate (~$10-20/run nightly, ~$300-600/month — explicit budget approval needed before W7 starts)
|
|
130
|
+
|
|
131
|
+
### Release cadence
|
|
132
|
+
|
|
133
|
+
- v0.15.2 (this PR): redirect + stabilization plan + Phase A start
|
|
134
|
+
- v0.15.3-v0.15.7: incremental Phase B-E PRs, each landing one category fix + real-LLM SV scenario for that category
|
|
135
|
+
- v0.16.0 (~8-10 weeks from 2026-05-07): real-LLM SV gate active + 10-bug regression pattern verified eliminated empirically (3 consecutive campaigns at omc baseline parity or better)
|
|
136
|
+
|
|
137
|
+
---
|
|
138
|
+
|
|
139
|
+
## 6. Success criteria (measurable)
|
|
140
|
+
|
|
141
|
+
| Metric | Current (2026-05-07) | v0.16.0 target | Measurement |
|---|---|---|---|
| Bug discovery rate | 1-2/week | <1/month | git log of bug-report-* files in BOS |
| Operator-touch per campaign | unmeasured (high) | <1 per 5 campaigns | new analytics field in `campaign.jsonl` |
| Campaign completion rate | unmeasured (low) | >80% | new analytics field |
| SV gate catches production bugs | 0/10 | >50% (5/10 if Bug #11 happens, caught pre-publish) | post-publish bug review |
| Differentiator preservation | 4/4 | 4/4 | regression test per differentiator |
|
|
148
|
+
|
|
149
|
+
---
|
|
150
|
+
|
|
151
|
+
## 7. What this plan is NOT
|
|
152
|
+
|
|
153
|
+
- NOT a pivot away from rlp-desk
|
|
154
|
+
- NOT a maintenance mode declaration
|
|
155
|
+
- NOT a plan to delete the Node leader (`--mode tmux` and `--mode agent` Node CLI both stay; deletion is a separate decision deferred until stabilization complete)
|
|
156
|
+
- NOT a promise that omc patterns will be copied verbatim — they're inspiration, the implementation is rlp-desk-native
|
|
157
|
+
|
|
158
|
+
## 8. What this plan IS
|
|
159
|
+
|
|
160
|
+
- A correction of the 2026-05-07 misdirection
|
|
161
|
+
- A category-by-category hardening roadmap with empirical baselines (Phase A)
|
|
162
|
+
- A real-LLM SV gate replacement for the current theatrical SV (Phase F)
|
|
163
|
+
- A preservation contract for the 4 differentiators
|
|
164
|
+
- A concrete release cadence ending in v0.16.0 with measured success criteria
|
|
165
|
+
|
|
166
|
+
---
|
|
167
|
+
|
|
168
|
+
## 9. First action (this PR)
|
|
169
|
+
|
|
170
|
+
This PR (`feat/v0.15.2-stabilization-redirect`):
|
|
171
|
+
- Reverts the maintenance-mode banner in `src/node/run.mjs`
|
|
172
|
+
- Replaces with stabilization-in-progress banner
|
|
173
|
+
- Removes `docs/plans/v0.16-FROZEN-status.md` (misdirection artifact)
|
|
174
|
+
- Adds this `docs/plans/v0.15-stabilization-plan.md`
|
|
175
|
+
- Updates `tests/node/us008-cli-entrypoint.test.mjs` regex
|
|
176
|
+
- Bumps to v0.15.2 + npm publish so users see the corrected banner
|
|
177
|
+
|
|
178
|
+
After this lands: Phase A (omc baseline measurement) starts. That's a separate session.
|
|
@@ -0,0 +1,177 @@
|
|
|
1
|
+
# Real-LLM SV Gate Framework — Spec (Phase F bootstrap)
|
|
2
|
+
|
|
3
|
+
> **Status**: design + first scenario bootstrap. Active stabilization Phase F.
|
|
4
|
+
> **Plan reference**: `docs/plans/v0.15-stabilization-plan.md` §5 Phase F.
|
|
5
|
+
> **Goal**: replace today's grep+unit-test "SV gate theater" with a framework that ACTUALLY catches production failure modes by running real claude/codex agents against real campaigns.
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
## 1. The problem in one paragraph
|
|
10
|
+
|
|
11
|
+
Today's SV gate (`tests/sv-self-verify-*.sh`, `tests/sv-gate-*.sh`) is grep-and-label theater. Each "scenario" is a shell command (grep, unit test, regression) labeled with five categories (`correctness | integration | security | perf | error-path`) and reported as "Worker → Verifier → PASS". **No real LLM agent runs.** None of the 10 production bugs (#1-#10, 2026-05-01..05-07) was caught by the SV gate before shipping. Every Bug #N+1 will follow the same pattern unless the framework changes.
|
|
12
|
+
|
|
13
|
+
## 2. Why current SV gate misses production failures
|
|
14
|
+
|
|
15
|
+
Production failure modes that today's gate cannot reproduce:
|
|
16
|
+
|
|
17
|
+
| Failure mode | Why grep+unit can't catch | Real-LLM gate can |
|---|---|---|
| Worker hang on `.claude/` self-modification gate (Bug #1) | LLM platform-level constraint, not testable in unit | YES (real claude worker writes sentinel, observes hang) |
| tmux pane lifecycle race (Bug #5/#6/#7) | timing-dependent, requires real TUI lifecycle | YES (real worker + real pane + real reaper) |
| Worker contract violation under non-determinism (Bug #3/#4/#8/#9) | LLM produces malformed/incomplete artifact non-deterministically | YES (run actual worker with real prompt; observe artifact shape over N runs) |
| Recovery hygiene under partial state (Bug #10) | requires real campaign + real BLOCKED + real operator recovery | YES (real campaign brought to BLOCKED, real operator-style recovery, real relaunch) |
|
|
23
|
+
|
|
24
|
+
The common thread: **production failures require production-like state**. Unit tests can't produce it; only real campaigns can.
|
|
25
|
+
|
|
26
|
+
## 3. Framework requirements
|
|
27
|
+
|
|
28
|
+
A real-LLM SV gate must:
|
|
29
|
+
|
|
30
|
+
R1. **Run a complete real campaign** (1-3 iterations) per scenario, not isolated function calls.
|
|
31
|
+
R2. **Use real claude or codex worker + verifier**, not stub/mock LLM responses.
|
|
32
|
+
R3. **Run inside CI-like environment** (real tmux session + real Claude Code CLI / codex CLI).
|
|
33
|
+
R4. **Be deterministic enough to gate a release**: same scenario produces same PASS/FAIL outcome >95% of the time.
|
|
34
|
+
R5. **Stay within cost budget**: each scenario <$5 by default (§6 allows up to $10 for scenarios marked `EXPENSIVE`); total nightly run <$50.
|
|
35
|
+
R6. **Run on a schedule, not per-PR**: too expensive to gate every PR. Nightly is the target. Per-PR can run a subset (sub-$10).
|
|
36
|
+
R7. **Capture failure provenance**: when scenario fails, the failure must be reproducible from the captured campaign artifacts (logs, status, sentinel state).
|
|
37
|
+
R8. **Cover all 10 historical bugs across the four bug categories**: each merged PR adds at least one scenario covering the bug class it touched.
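
R4's threshold can be checked mechanically once a scenario has been repeated several times. The sketch below assumes outcomes are recorded as the strings `PASS` and `FAIL`, and reads "same outcome" as "share of runs that agree with the most common outcome". Both `outcomeStability` and `meetsR4` are hypothetical names, not part of the harness.

```javascript
// Hypothetical helper: fraction of runs that agree with the most common outcome.
function outcomeStability(outcomes) {
  if (outcomes.length === 0) return 0; // no runs, no evidence of stability
  const counts = new Map();
  for (const o of outcomes) counts.set(o, (counts.get(o) ?? 0) + 1);
  return Math.max(...counts.values()) / outcomes.length;
}

// R4: the same PASS/FAIL outcome more than 95% of the time.
function meetsR4(outcomes, threshold = 0.95) {
  return outcomeStability(outcomes) > threshold;
}

// Usage: 18 PASS and 2 FAIL out of 20 runs is 90% stable, below the R4 bar.
const runs = [...Array(18).fill("PASS"), "FAIL", "FAIL"];
console.log(outcomeStability(runs), meetsR4(runs));
```

Note that a scenario which fails consistently also counts as stable under this reading; stability and correctness are separate questions.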
|
|
38
|
+
|
|
39
|
+
## 4. Architecture
|
|
40
|
+
|
|
41
|
+
```
┌─────────────────────────────────────────────────────────────────┐
│ tests/sv-real-llm/                                              │
│                                                                 │
│   scenarios/                                                    │
│     bug-01-claude-self-modification-gate.test.sh                │
│     bug-05-worker-dead-on-reuse.test.sh                         │
│     bug-07-post-sentinel-race.test.sh                           │
│     bug-10-relaunch-phase-verify-hygiene.test.sh                │
│     ...                                                         │
│                                                                 │
│   harness/                                                      │
│     run-scenario.sh        (single scenario runner)             │
│     run-all.sh             (nightly suite runner)               │
│     budget-guard.sh        (cost budget enforcement)            │
│     capture-artifacts.sh   (post-run forensic bundle)           │
│                                                                 │
│   fixtures/                                                     │
│     minimal-prd-1us.md     (1 user story, deterministic)        │
│     minimal-prd-3us.md     (3 user stories)                     │
│                                                                 │
│   results/                                                      │
│     <date>-<scenario>.json     (per-run outcome + cost + time)  │
│     <date>-<scenario>.bundle/  (full campaign artifacts)        │
└─────────────────────────────────────────────────────────────────┘
```
|
|
67
|
+
|
|
68
|
+
Each scenario is a self-contained shell test:
|
|
69
|
+
- Sets up a sandbox campaign
|
|
70
|
+
- Runs `node ~/.claude/ralph-desk/node/run.mjs run <slug>` or via slash command
|
|
71
|
+
- Asserts on final state (sentinel + status + artifacts)
|
|
72
|
+
- Reports PASS/FAIL with provenance
|
|
73
|
+
|
|
74
|
+
## 5. Scenario authoring contract
|
|
75
|
+
|
|
76
|
+
Every scenario must define:
|
|
77
|
+
|
|
78
|
+
```bash
# scenario header (machine-readable)
SCENARIO_ID="bug-10-relaunch-phase-verify-hygiene"
SCENARIO_DESCRIPTION="Relaunch with operator-written phase=verify artifacts honored"
SCENARIO_BUG_CATEGORY="d-recovery-hygiene"
SCENARIO_HISTORICAL_BUG="Bug #10 (BOS 2026-05-07)"
SCENARIO_COST_BUDGET_USD="3"
SCENARIO_TIMEOUT_SECONDS="600"
SCENARIO_REQUIRES="claude_cli OR codex_cli; tmux; jq"

# scenario body: 4 parts
# 1. SETUP — sandbox dir, init campaign, prepare fixture state
# 2. EXERCISE — run actual rlp-desk against the fixture
# 3. ASSERT — check final state matches expected
# 4. REPORT — emit JSON outcome + capture artifacts
```
|
|
94
|
+
|
|
95
|
+
The 4-part shape is non-negotiable. Scenarios that skip ASSERT or REPORT do not count toward gate coverage.
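
Because the scenario header is meant to be machine-readable, a harness could reject scenarios with an incomplete header before spending any budget. The sketch below assumes the header is written as plain `KEY="value"` lines, as in the example above. `parseScenarioHeader` and `missingHeaderKeys` are hypothetical names, and the check covers only the header, not the four-part body.

```javascript
// Header keys taken from the scenario header example in this section.
const REQUIRED_KEYS = [
  "SCENARIO_ID",
  "SCENARIO_DESCRIPTION",
  "SCENARIO_BUG_CATEGORY",
  "SCENARIO_HISTORICAL_BUG",
  "SCENARIO_COST_BUDGET_USD",
  "SCENARIO_TIMEOUT_SECONDS",
  "SCENARIO_REQUIRES",
];

// Hypothetical parser: collects SCENARIO_*="value" assignments from the script text.
function parseScenarioHeader(source) {
  const header = {};
  for (const line of source.split("\n")) {
    const m = line.match(/^(SCENARIO_[A-Z_]+)="(.*)"\s*$/);
    if (m) header[m[1]] = m[2];
  }
  return header;
}

// Returns the required keys that the scenario source does not define.
function missingHeaderKeys(source) {
  const header = parseScenarioHeader(source);
  return REQUIRED_KEYS.filter((k) => !(k in header));
}

// Usage: a header with only two fields is reported as missing the other five.
console.log(missingHeaderKeys('SCENARIO_ID="bug-10"\nSCENARIO_TIMEOUT_SECONDS="600"'));
```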
|
|
96
|
+
|
|
97
|
+
## 6. Cost + time discipline
|
|
98
|
+
|
|
99
|
+
- **Budget per scenario**: $5 default, $10 max. Scenarios exceeding budget must be marked `EXPENSIVE` and only run weekly, not nightly.
|
|
100
|
+
- **Timeout per scenario**: 600s default, 1800s max. Scenarios exceeding timeout are FAIL automatically.
|
|
101
|
+
- **Total nightly budget**: $50. Suite must fit within budget or be sharded across nights.
|
|
102
|
+
- **Per-PR subset**: scenarios marked `LIGHT` (under $1, under 60s) run on every PR. Others run nightly.
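
The scheduling rules above reduce to a small classification. The sketch below copies the thresholds from the list. The function names and the `NIGHTLY` label for "everything else" are assumptions for illustration, and `budgetUsd` defaults to the $5 per-scenario budget.

```javascript
// Hypothetical tiering: LIGHT runs on every PR, EXPENSIVE runs weekly, the rest nightly.
function scheduleTier({ costUsd, seconds, budgetUsd = 5 }) {
  if (costUsd < 1 && seconds < 60) return "LIGHT"; // under $1 and under 60s
  if (costUsd > budgetUsd) return "EXPENSIVE"; // over budget: weekly, not nightly
  return "NIGHTLY";
}

// A scenario that runs past its timeout is an automatic FAIL (600s default).
function timedOut(seconds, timeoutSeconds = 600) {
  return seconds > timeoutSeconds;
}

// Usage: a $3, 400-second real-LLM scenario lands in the nightly tier.
console.log(scheduleTier({ costUsd: 3, seconds: 400 }), timedOut(400));
```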
|
|
103
|
+
|
|
104
|
+
Cost tracking via `tests/sv-real-llm/harness/budget-guard.sh` — reads `cost-log.jsonl` from rlp-desk and aggregates pre-run/post-run deltas.
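
The pre-run/post-run delta can be sketched as follows. The schema of `cost-log.jsonl` is not specified here, so the `cost_usd` field name is an assumption, as are the helper names. This is an illustration of the aggregation, not the contents of `budget-guard.sh`.

```javascript
// Sums a per-entry cost over JSONL text. The `cost_usd` field name is an assumption.
function totalCostUsd(jsonlText) {
  let total = 0;
  for (const line of jsonlText.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines and the trailing newline
    const entry = JSON.parse(line);
    total += Number(entry.cost_usd ?? 0);
  }
  return total;
}

// Cost attributable to one scenario: log total after the run minus total before it.
function runCostDelta(beforeText, afterText) {
  return totalCostUsd(afterText) - totalCostUsd(beforeText);
}

function withinBudget(deltaUsd, budgetUsd) {
  return deltaUsd <= budgetUsd;
}

// Usage with two snapshots of the same (made-up) log:
const before = '{"cost_usd":1.5}\n';
const after = '{"cost_usd":1.5}\n{"cost_usd":2}\n';
console.log(runCostDelta(before, after), withinBudget(runCostDelta(before, after), 3));
```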
|
|
105
|
+
|
|
106
|
+
## 7. First scenario (this PR's bootstrap deliverable)
|
|
107
|
+
|
|
108
|
+
`tests/sv-real-llm/scenarios/bug-10-relaunch-phase-verify-hygiene.test.sh` — converts the existing PR-A test (Bug #10 fix) into a real-LLM scenario:
|
|
109
|
+
|
|
110
|
+
1. SETUP: create sandbox campaign in `/tmp/sv-real-llm-bug-10/`. init with minimal PRD (1 US, "fix the typo in foo.txt" — deterministic). Run iter-1 normally to BLOCKED state (deliberately fail verifier). Operator-style recovery: write iter-signal.json + done-claim.json by hand, set status.phase=verify, remove blocked sentinel.
|
|
111
|
+
2. EXERCISE: relaunch leader (`/oh-my-claudecode:autopilot` or `node run.mjs run`).
|
|
112
|
+
3. ASSERT:
|
|
113
|
+
- Leader logs `[recovery] Resuming verify phase` (PR-A audit line)
|
|
114
|
+
- No new `iter-001.worker-prompt.md` written (worker dispatch skipped)
|
|
115
|
+
- Verifier dispatched and produces verdict
|
|
116
|
+
- Final state: complete sentinel OR continue to iter-2
|
|
117
|
+
4. REPORT: write `<date>-bug-10.json` with `{outcome, time_seconds, cost_usd, log_path, captured_state_path}`.
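
The REPORT step's output can be sketched as a plain object with the five fields named in step 4. The input parameter names, the rounding of elapsed time to whole seconds, and the ISO `YYYY-MM-DD` date format in the file name are assumptions; `buildReport` and `reportFileName` are hypothetical helpers.

```javascript
// Builds the per-run report with the five fields listed in step 4.
function buildReport({ outcome, startedAtMs, finishedAtMs, costUsd, logPath, capturedStatePath }) {
  return {
    outcome, // "PASS" or "FAIL"
    time_seconds: Math.round((finishedAtMs - startedAtMs) / 1000),
    cost_usd: costUsd,
    log_path: logPath,
    captured_state_path: capturedStatePath,
  };
}

// File name following the `<date>-bug-10.json` pattern; the UTC ISO date is an assumption.
function reportFileName(date, shortId) {
  return `${date.toISOString().slice(0, 10)}-${shortId}.json`;
}

// Usage with illustrative paths:
const report = buildReport({
  outcome: "PASS",
  startedAtMs: 0,
  finishedAtMs: 5000,
  costUsd: 1.2,
  logPath: "/tmp/sv-real-llm-bug-10/run.log",
  capturedStatePath: "/tmp/sv-real-llm-bug-10/bundle",
});
console.log(reportFileName(new Date("2026-05-07T12:00:00Z"), "bug-10"), JSON.stringify(report));
```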
|
|
118
|
+
|
|
119
|
+
Cost estimate: $1-3 (single iter-2 with fast worker model + verifier).
|
|
120
|
+
|
|
121
|
+
## 8. Scenario coverage roadmap
|
|
122
|
+
|
|
123
|
+
| Scenario | Bug class | Status | Cost | Test type |
|---|---|---|---|---|
| bug-10-relaunch-phase-verify-hygiene | (d) recovery | ✅ landed PR #12 | $3 | real-LLM |
| bug-10-blocked-recovery-counters | (d) recovery | ✅ landed (this PR set) | $2 | real-LLM |
| bug-07-post-sentinel-race | (a) lifecycle | ✅ landed | $3 | real-LLM |
| bug-08-worker-incomplete-leader-fallback | (b) contract | ✅ landed | $3 | real-LLM |
| bug-01-claude-self-modification-gate | (c) LLM-runtime | ✅ landed | $3 | real-LLM |
| bug-05-worker-dead-on-reuse | (a) lifecycle | ✅ landed | $2 | real-LLM |
| bug-03-verifier-noprogress | (b) contract | ✅ landed | $0 | structural |
| bug-04-verifier-noprogress-regression | (b) contract | ✅ landed | $0 | structural |
| bug-06-claude-worker-idle-noprogress | (a) lifecycle | ✅ landed | $0 | structural |
| bug-09-verified-us-persistence | (b) contract | ✅ landed | $0 | structural |
|
|
135
|
+
|
|
136
|
+
**All 10 historical bugs (#1-#10) covered.** 6 scenarios are real-LLM (require campaigns) gated by `RLP_REAL_LLM_GATE=1`; 4 are structural (zero cost, regression-guard the fix code itself). Future PRs fixing new bug classes MUST add a scenario for that class.
|
|
137
|
+
|
|
138
|
+
Real-LLM total budget: ~$16 for a full nightly run, the sum of the six per-scenario costs in the table ($3 + $2 + $3 + $3 + $3 + $2). Structural scenarios contribute zero cost. Per-PR LIGHT subset: structural-only (4 scenarios, <60s, $0).
|
|
139
|
+
|
|
140
|
+
## 9. Integration with existing SV gate
|
|
141
|
+
|
|
142
|
+
The current `tests/sv-gate-fast.sh` becomes the LIGHT subset. New `tests/sv-gate-real-llm.sh` becomes the FULL nightly suite. Existing scenario format is preserved for grep/unit checks; real-LLM scenarios are additive, not a replacement.
|
|
143
|
+
|
|
144
|
+
`tests/sv-self-verify-*.sh` pattern stays — those are per-PR scenarios. New scenarios under `tests/sv-real-llm/scenarios/` are the real-LLM additions.
|
|
145
|
+
|
|
146
|
+
## 10. Out of scope (deferred)
|
|
147
|
+
|
|
148
|
+
- **Continuous deployment**: nightly schedule + GHA workflow. Add after first 3 scenarios prove the harness works.
|
|
149
|
+
- **Cost dashboard**: aggregate cost trends over time. Add after first $50 month spent.
|
|
150
|
+
- **Cross-engine A/B**: same scenario claude vs codex. Add after baseline established.
|
|
151
|
+
- **Stress scenarios**: high-load, concurrent campaigns. Add after Phase B-E stabilization complete.
|
|
152
|
+
|
|
153
|
+
## 11. This PR's scope (bootstrap only)
|
|
154
|
+
|
|
155
|
+
This PR ships:
|
|
156
|
+
1. This spec doc
|
|
157
|
+
2. `tests/sv-real-llm/harness/run-scenario.sh` — single-scenario runner skeleton
|
|
158
|
+
3. `tests/sv-real-llm/scenarios/bug-10-relaunch-phase-verify-hygiene.test.sh` — first real-LLM scenario, **gated by `RLP_REAL_LLM_GATE=1`** environment variable. Default: NOT executed. Operator runs explicitly.
|
|
159
|
+
4. `tests/sv-real-llm/README.md` — quick how-to-run
|
|
160
|
+
|
|
161
|
+
NOT in this PR:
|
|
162
|
+
- nightly schedule (deferred)
|
|
163
|
+
- remaining 9 scenarios (deferred)
|
|
164
|
+
- integration into `sv-gate-fast.sh` or any existing gate (deferred)
|
|
165
|
+
|
|
166
|
+
The bootstrap is small ON PURPOSE. Establish the pattern, prove it works on one bug, then scale.
|
|
167
|
+
|
|
168
|
+
## 12. Verification
|
|
169
|
+
|
|
170
|
+
- `bash tests/sv-real-llm/scenarios/bug-10-relaunch-phase-verify-hygiene.test.sh` runs in dry-mode (env not set) and prints "SKIPPED — RLP_REAL_LLM_GATE=1 to enable". No cost.
|
|
171
|
+
- With `RLP_REAL_LLM_GATE=1` AND claude CLI available: scenario runs end-to-end. PASS = audit log line present + worker prompt iter-001 NOT rewritten + verifier dispatched. Cost ~$1-3.
|
|
172
|
+
- Existing sv-gate-fast: 48/48 PASS unchanged.
|
|
173
|
+
- Full Node suite: 339/339 PASS unchanged.
|
|
174
|
+
|
|
175
|
+
## 13. Plan integration
|
|
176
|
+
|
|
177
|
+
This PR advances Phase F from §5 of `docs/plans/v0.15-stabilization-plan.md`. After this PR lands, Phase F state changes from "planned" to "bootstrap complete; coverage roadmap active". Each subsequent PR closing a Phase B/C/D/E bug-class fix MUST add the corresponding real-LLM scenario per §8 table.
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@ai-dev-methodologies/rlp-desk",
|
|
3
|
-
"version": "0.15.
|
|
3
|
+
"version": "0.15.3",
|
|
4
4
|
"description": "Fresh-context iterative loops for Claude Code — autonomous task completion with independent verification",
|
|
5
5
|
"scripts": {
|
|
6
6
|
"postinstall": "node scripts/postinstall.js",
|
package/src/node/run.mjs
CHANGED
|
@@ -405,9 +405,18 @@ async function runRunCommand(args, deps) {
|
|
|
405
405
|
deps.stderr,
|
|
406
406
|
'For Claude Code Native Agent() campaigns, use `/rlp-desk run --mode native` from a Claude Code session.',
|
|
407
407
|
);
|
|
408
|
+
// 2026-05-07 (v0.15.2): rlp-desk is in active stabilization. Goal: reach
|
|
409
|
+
// omc /team/ralph/ralplan level of reliability while preserving
|
|
410
|
+
// rlp-desk's self-driving advantages (multi-engine consensus, multi-mission
|
|
411
|
+
// queue, BLOCK_TAGS taxonomy, structured SV reports). omc is the BENCHMARK,
|
|
412
|
+
// not a replacement. See docs/plans/v0.15-stabilization-plan.md.
|
|
408
413
|
write(
|
|
409
414
|
deps.stderr,
|
|
410
|
-
'
|
|
415
|
+
'SCHEDULED REMOVAL: --mode agent (Node CLI alpha) will be removed in a future major release. Date TBD until stabilization milestones complete.',
|
|
416
|
+
);
|
|
417
|
+
write(
|
|
418
|
+
deps.stderr,
|
|
419
|
+
'STABILIZATION IN PROGRESS: rlp-desk is hardening against the 10-bug regression pattern observed 2026-05-01..05-07. See docs/plans/v0.15-stabilization-plan.md.',
|
|
411
420
|
);
|
|
412
421
|
}
|
|
413
422
|
|
|
@@ -492,6 +492,69 @@ async function _validateOperatorRecoveryArtifacts({ paths, state }) {
|
|
|
492
492
|
return { ok: true, reason: 'all five checks passed' };
|
|
493
493
|
}
|
|
494
494
|
|
|
495
|
+
// PR-E (Phase C1, stabilization): operator-cleared BLOCKED recovery.
|
|
496
|
+
// When operator manually deletes <slug>-blocked.md to recover (a documented
|
|
497
|
+
// flow), counters in status.json (consecutive_failures / consecutive_blocks)
|
|
498
|
+
// stay populated. Without this branch, leader relaunches with stale counters
|
|
499
|
+
// and may immediately re-BLOCK on the first failure even though operator's
|
|
500
|
+
// intent was a fresh start. Pair to PR-A (phase=verify recovery, Bug #10).
|
|
501
|
+
//
|
|
502
|
+
// 4-check validator. Returns { ok, reason }. On any failure, caller falls
|
|
503
|
+
// through to existing behavior — defensive default, never auto-recovers
|
|
504
|
+
// against ambiguous state.
|
|
505
|
+
//
|
|
506
|
+
// Check 4 reads <slug>-blocked.json sidecar (NOT status.json), because
|
|
507
|
+
// status.json never persists `last_block_reason` (blocked-write code path
|
|
508
|
+
// at L920-968 doesn't write that field). The sidecar DOES carry
|
|
509
|
+
// `recoverable: bool` per _classifyBlock contract — that's the canonical
|
|
510
|
+
// non-recoverable signal.
|
|
511
|
+
async function _validateBlockedRecovery({ paths, state }) {
|
|
512
|
+
// Check 1: precondition
|
|
513
|
+
if (state.phase !== 'blocked') {
|
|
514
|
+
return { ok: false, reason: `state.phase is ${state.phase}, not 'blocked'` };
|
|
515
|
+
}
|
|
516
|
+
// Check 2: sentinel cleared by operator
|
|
517
|
+
if (await exists(paths.blockedSentinel)) {
|
|
518
|
+
return { ok: false, reason: 'blocked sentinel still present (operator did not clear)' };
|
|
519
|
+
}
|
|
520
|
+
// Check 3: counters non-zero (something to reset)
|
|
521
|
+
const failures = state.consecutive_failures ?? 0;
|
|
522
|
+
const blocks = state.consecutive_blocks ?? 0;
|
|
523
|
+
if (failures === 0 && blocks === 0) {
|
|
524
|
+
return { ok: false, reason: 'counters already zero, nothing to recover' };
|
|
525
|
+
}
|
|
526
|
+
// Check 4: sidecar safety check
|
|
527
|
+
const sidecarPath = paths.blockedSentinel.replace(/\.md$/, '.json');
|
|
528
|
+
let sidecar = null;
|
|
529
|
+
try {
|
|
530
|
+
sidecar = await readJsonIfExists(sidecarPath);
|
|
531
|
+
} catch (err) {
|
|
532
|
+
// Malformed sidecar — be defensive and fall through.
|
|
533
|
+
return { ok: false, reason: `blocked.json sidecar parse error: ${err?.message ?? err}` };
|
|
534
|
+
}
|
|
535
|
+
if (sidecar && sidecar.recoverable === false) {
|
|
536
|
+
return {
|
|
537
|
+
ok: false,
|
|
538
|
+
reason: `non-recoverable category ${sidecar.reason_category ?? 'unknown'} from sidecar (use clean to reset)`,
|
|
539
|
+
};
|
|
540
|
+
}
|
|
541
|
+
return { ok: true, reason: 'sidecar absent or recoverable=true; recovery permitted' };
|
|
542
|
+
}
|
|
543
|
+
|
|
544
|
+
// PR-E helper: rename the recovered sidecar so operator can audit what was
|
|
545
|
+
// recovered from. Best-effort — failure here is non-fatal.
|
|
546
|
+
async function _archiveRecoveredSidecar(paths) {
|
|
547
|
+
const sidecarPath = paths.blockedSentinel.replace(/\.md$/, '.json');
|
|
548
|
+
if (!(await exists(sidecarPath))) return;
|
|
549
|
+
const iso = new Date().toISOString().replace(/[:.]/g, '-');
|
|
550
|
+
const archivePath = `${sidecarPath}.recovered-${iso}`;
|
|
551
|
+
try {
|
|
552
|
+
await fs.rename(sidecarPath, archivePath);
|
|
553
|
+
} catch (err) {
|
|
554
|
+
console.error(`[recovery] failed to archive sidecar: ${err?.message ?? err}`);
|
|
555
|
+
}
|
|
556
|
+
}
|
|
557
|
+
|
|
495
558
|
async function appendIterationAnalytics(paths, state, usId, verdict, options) {
|
|
496
559
|
await appendCampaignAnalytics(paths.analyticsFile, {
|
|
497
560
|
iter: state.iteration,
|
|
@@ -1414,6 +1477,33 @@ async function _runCampaignBody(slug, options, paths, rootDir) {
|
|
|
1414
1477
|
}
|
|
1415
1478
|
}
|
|
1416
1479
|
|
|
1480
|
+
// PR-E (Phase C1, stabilization): operator-cleared BLOCKED recovery.
|
|
1481
|
+
// Pair to PR-A above. PR-E runs AFTER PR-A so phase=verify takes precedence
|
|
1482
|
+
// when both apply (defensive ordering: never auto-recover phase=blocked if
|
|
1483
|
+
// the operator's actual intent was phase=verify hygiene). Does NOT use
|
|
1484
|
+
// _skipNextWorkerDispatch — counters reset is enough; worker dispatches
|
|
1485
|
+
// normally on the next iteration with a clean state.
|
|
1486
|
+
if (state.phase === 'blocked' && !state._skipNextWorkerDispatch) {
|
|
1487
|
+
const validation = await _validateBlockedRecovery({ paths, state });
|
|
1488
|
+
if (validation.ok) {
|
|
1489
|
+
const previousReason = state.last_block_reason ?? '';
|
|
1490
|
+
console.error(
|
|
1491
|
+
`[recovery] Operator-cleared BLOCKED detected (was: ${previousReason || 'unrecorded'}). Resetting counters and resuming as worker. iter=${state.iteration} us_id=${state.current_us}: ${validation.reason}`,
|
|
1492
|
+
);
|
|
1493
|
+
state.phase = 'worker';
|
|
1494
|
+
state.consecutive_failures = 0;
|
|
1495
|
+
state.consecutive_blocks = 0;
|
|
1496
|
+
state.last_block_reason = '';
|
|
1497
|
+
// Archive sidecar (rename, not delete) so operator can audit the
|
|
1498
|
+
// recovered-from state. Best-effort.
|
|
1499
|
+
await _archiveRecoveredSidecar(paths);
|
|
1500
|
+
} else {
|
|
1501
|
+
console.error(
|
|
1502
|
+
`[recovery] phase=blocked ignored, falling through to existing behavior: ${validation.reason}`,
|
|
1503
|
+
);
|
|
1504
|
+
}
|
|
1505
|
+
}
|
|
1506
|
+
|
|
1417
1507
|
// P1-E Lane Enforcement: snapshot lane mtimes before each iteration,
|
|
1418
1508
|
// compare at the top of the next iteration. Drift on read-only artifacts
|
|
1419
1509
|
// (PRD, test-spec, context) emits a lane_violation_warning event + audit
|
|
@@ -369,6 +369,81 @@ _validate_operator_recovery_artifacts() {
|
|
|
369
369
|
return 0
|
|
370
370
|
}
|
|
371
371
|
|
|
372
|
+
# PR-E (Phase C1, stabilization) — operator-cleared BLOCKED recovery validator.
|
|
373
|
+
# Pair to PR-A (_validate_operator_recovery_artifacts above). Together they
|
|
374
|
+
# close two recovery surfaces: phase=verify (PR-A) and phase=blocked
|
|
375
|
+
# sentinel-cleared (PR-E this helper).
|
|
376
|
+
#
|
|
377
|
+
# Returns 0 when all 4 checks pass; 1 otherwise. Sets BLOCKED_RECOVERY_FAIL_REASON
|
|
378
|
+
# (global) on failure for caller logging. Mirrors Node `_validateBlockedRecovery`
|
|
379
|
+
# in src/node/runner/campaign-main-loop.mjs.
|
|
380
|
+
#
|
|
381
|
+
# Args:
|
|
382
|
+
# $1 blocked sentinel path (.md)
|
|
383
|
+
# $2 blocked sidecar path (.json)
|
|
384
|
+
# $3 status.json path
|
|
385
|
+
_validate_blocked_recovery() {
|
|
386
|
+
local sentinel_md="$1" sidecar_json="$2" status_file="$3"
|
|
387
|
+
BLOCKED_RECOVERY_FAIL_REASON=""
|
|
388
|
+
|
|
389
|
+
# Check 1: precondition — caller verified phase=blocked already
|
|
390
|
+
# (passed in via status read; no need to re-read here)
|
|
391
|
+
|
|
392
|
+
# Check 2: sentinel cleared by operator
|
|
393
|
+
if [[ -f "$sentinel_md" ]]; then
|
|
394
|
+
BLOCKED_RECOVERY_FAIL_REASON="blocked sentinel still present (operator did not clear)"
|
|
395
|
+
return 1
|
|
396
|
+
fi
|
|
397
|
+
|
|
398
|
+
# Check 3: status.json must exist + counters non-zero
|
|
399
|
+
if [[ ! -f "$status_file" ]]; then
|
|
400
|
+
BLOCKED_RECOVERY_FAIL_REASON="status.json missing"
|
|
401
|
+
return 1
|
|
402
|
+
fi
|
|
403
|
+
if ! command -v jq >/dev/null 2>&1; then
|
|
404
|
+
BLOCKED_RECOVERY_FAIL_REASON="jq unavailable; cannot validate"
|
|
405
|
+
return 1
|
|
406
|
+
fi
|
|
407
|
+
local fails blocks
|
|
408
|
+
fails=$(jq -r '.consecutive_failures // 0' "$status_file" 2>/dev/null)
|
|
409
|
+
blocks=$(jq -r '.consecutive_blocks // 0' "$status_file" 2>/dev/null)
|
|
410
|
+
if [[ "$fails" == "0" && "$blocks" == "0" ]]; then
|
|
411
|
+
BLOCKED_RECOVERY_FAIL_REASON="counters already zero, nothing to recover"
|
|
412
|
+
return 1
|
|
413
|
+
fi
|
|
414
|
+
|
|
415
|
+
# Check 4: sidecar safety — if sidecar exists and recoverable=false, fall through
|
|
416
|
+
if [[ -f "$sidecar_json" ]]; then
|
|
417
|
+
if ! jq -e . "$sidecar_json" >/dev/null 2>&1; then
|
|
418
|
+
BLOCKED_RECOVERY_FAIL_REASON="blocked.json sidecar parse error"
|
|
419
|
+
return 1
|
|
420
|
+
fi
|
|
421
|
+
local recoverable category
|
|
422
|
+
recoverable=$(jq -r '.recoverable' "$sidecar_json" 2>/dev/null)
|
|
423
|
+
category=$(jq -r '.reason_category // "unknown"' "$sidecar_json" 2>/dev/null)
|
|
424
|
+
if [[ "$recoverable" == "false" ]]; then
|
|
425
|
+
BLOCKED_RECOVERY_FAIL_REASON="non-recoverable category $category from sidecar (use clean to reset)"
|
|
426
|
+
return 1
|
|
427
|
+
fi
|
|
428
|
+
fi
|
|
429
|
+
|
|
430
|
+
return 0
|
|
431
|
+
}
|
|
432
|
+
|
|
433
|
+
# PR-E helper: rename the recovered sidecar so operator can audit what was
|
|
434
|
+
# recovered from. Best-effort — failure is non-fatal.
|
|
435
|
+
#
|
|
436
|
+
# Args:
|
|
437
|
+
# $1 blocked sidecar path (.json)
|
|
438
|
+
_archive_recovered_sidecar() {
|
|
439
|
+
local sidecar_json="$1"
|
|
440
|
+
[[ -f "$sidecar_json" ]] || return 0
|
|
441
|
+
local iso
|
|
442
|
+
iso=$(date -u +%Y-%m-%dT%H-%M-%SZ)
|
|
443
|
+
mv "$sidecar_json" "${sidecar_json}.recovered-${iso}" 2>/dev/null || true
|
|
444
|
+
return 0
|
|
445
|
+
}
|
|
446
|
+
|
|
372
447
|
# PR-0b-narrow (Plan v6) — stamp leader handshake ack onto the sentinel.
|
|
373
448
|
# Mirror of src/node/shared/fs.mjs::stampAckField. Best-effort, audit-only:
|
|
374
449
|
# any failure is silently swallowed. Sequence:
|
|
@@ -3069,6 +3069,32 @@ main() {
|
|
|
3069
3069
|
fi
|
|
3070
3070
|
fi
|
|
3071
3071
|
|
|
3072
|
+
# PR-E (Phase C1, stabilization): operator-cleared BLOCKED recovery.
|
|
3073
|
+
# Pair to PR-A above. Runs AFTER PR-A (so phase=verify wins) and skipped
|
|
3074
|
+
# when SKIP_NEXT_WORKER=1 (PR-A already honored). Resets stale counters
|
|
3075
|
+
# in status.json when operator manually deleted the BLOCKED sentinel.
|
|
3076
|
+
# Mirrors Node `_validateBlockedRecovery` + branch in campaign-main-loop.mjs.
|
|
3077
|
+
if [[ "$LAST_PHASE" == "blocked" && "$SKIP_NEXT_WORKER" -eq 0 ]]; then
|
|
3078
|
+
local _blocked_sidecar="$MEMOS_DIR/${SLUG}-blocked.json"
|
|
3079
|
+
if _validate_blocked_recovery \
|
|
3080
|
+
"$BLOCKED_SENTINEL" "$_blocked_sidecar" "$STATUS_FILE"; then
|
|
3081
|
+
local _prev_reason
|
|
3082
|
+
_prev_reason=$(jq -r '.last_block_reason // ""' "$STATUS_FILE" 2>/dev/null)
|
|
3083
|
+
log "[recovery] Operator-cleared BLOCKED detected (was: ${_prev_reason:-unrecorded}). Resetting counters and resuming as worker. iter=$ITERATION"
|
|
3084
|
+
log_debug "[recovery] iter=$ITERATION blocked_recovery=applied reason=\"${BLOCKED_RECOVERY_FAIL_REASON:-sidecar absent or recoverable=true}\""
|
|
3085
|
+
# Reset counters in-process. update_status writes fresh status when
|
|
3086
|
+
# next phase transition fires. Operator's intent was a clean restart.
|
|
3087
|
+
CONSECUTIVE_FAILURES=0
|
|
3088
|
+
CONSECUTIVE_BLOCKS=0
|
|
3089
|
+
LAST_BLOCK_REASON=""
|
|
3090
|
+
# Archive sidecar (rename, not delete) for audit trail.
|
|
3091
|
+
_archive_recovered_sidecar "$_blocked_sidecar"
|
|
3092
|
+
else
|
|
3093
|
+
log "[recovery] phase=blocked ignored: ${BLOCKED_RECOVERY_FAIL_REASON}"
|
|
3094
|
+
log_debug "[recovery] iter=$ITERATION blocked_recovery=skipped reason=\"${BLOCKED_RECOVERY_FAIL_REASON}\""
|
|
3095
|
+
fi
|
|
3096
|
+
fi
|
|
3097
|
+
|
|
3072
3098
|
if (( ! SKIP_NEXT_WORKER )); then
|
|
3073
3099
|
# --- governance.md s7 step 8 (cleanup): Clean previous iteration signals ---
|
|
3074
3100
|
# Bug #7 Fix-R cleanup: unlock 0o444 sentinels written by the previous
|