@ai-dev-methodologies/rlp-desk 0.15.3 → 0.15.4
- package/CHANGELOG.md +82 -0
- package/README.md +34 -4
- package/docs/rlp-desk/failure-modes.md +191 -0
- package/package.json +3 -2
- package/src/node/runner/campaign-main-loop.mjs +84 -11
- package/src/node/util/debug-log.mjs +10 -6
- package/src/node/util/lifecycle-metrics.mjs +102 -0
- package/src/scripts/lib_ralph_desk.zsh +66 -0
- package/src/scripts/run_ralph_desk.zsh +18 -0
- package/docs/plans/bug-report-overhaul-backlog.md +0 -49
- package/docs/plans/bug-report-overhaul-v0.md +0 -238
- package/docs/plans/bug-report-overhaul-v1.md +0 -319
- package/docs/plans/native-agent-revert.md +0 -184
- package/docs/plans/polished-gliding-toucan.md +0 -234
- package/docs/plans/pr-e-phase-c1-blocked-recovery-hygiene-v0.md +0 -233
- package/docs/plans/spicy-booping-galaxy.md +0 -717
- package/docs/plans/strategic-review/rlp-desk-strategic-review.md +0 -125
- package/docs/plans/v0.15-stabilization-phase-a-prep.md +0 -130
- package/docs/plans/v0.15-stabilization-plan.md +0 -178
- package/docs/plans/v0.16-real-llm-sv-gate-spec.md +0 -177
|
@@ -1,234 +0,0 @@
|
|
|
1
|
-
# Bug Report #7 — Post-Sentinel Process Race Fix
|
|
2
|
-
|
|
3
|
-
## Context
|
|
4
|
-
|
|
5
|
-
Race window measured by a BOS user on the 19th launch:
|
|
6
|
-
- The iter-1 verifier rewrote `verify-verdict.json` **1m 43s** after the verdict was detected (evidence: file mtime)
|
|
7
|
-
- The iter-1 verifier stayed active for **2m 1s** after the verdict
|
|
8
|
-
- The iter-1 verifier and the iter-2 worker ran concurrently for about **2 minutes**
|
|
9
|
-
|
|
10
|
-
Bug report:
|
|
11
|
-
`/Users/kyjin/dev/doul/bos/docs/exec-plans/active/2026-05-06-rlp-desk-bug-report-7-post-sentinel-process-race.md`
|
|
12
|
-
|
|
13
|
-
### Root cause
|
|
14
|
-
|
|
15
|
-
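The leader moves on to the next iter as soon as it finds `iter-signal.json` / `verify-verdict.json`, but the Worker/Verifier process (claude/codex TUI) that wrote the sentinel is **never explicitly terminated**. The tmux pane stays alive, and the TUI returns to its idle prompt and then runs its own self-review, which rewrites the sentinel, pollutes the working tree, and wastes tokens.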
|
|
16
|
-
|
|
17
|
-
### Affected modes (important)
|
|
18
|
-
|
|
19
|
-
**Both** `--mode tmux` (zsh runner) and `--mode agent` (Node leader) are affected. The Node leader also runs the worker/verifier on real tmux panes through `defaultSendKeys`/`defaultCreatePane` (`src/node/tmux/pane-manager.mjs`); see `src/node/runner/campaign-main-loop.mjs:1077-1080` and `1116-1133`. The initial hypothesis that agent mode was immune is incorrect.
|
|
20
|
-
|
|
21
|
-
### Asymmetry (current state)
|
|
22
|
-
|
|
23
|
-
| Path | Worker post-processing | Verifier post-processing |
|
|
24
|
-
|---|---|---|
|
|
25
|
-
| Node leader | None | None |
|
|
26
|
-
| zsh runner | Cleanup at the start of the next iter (`run_ralph_desk.zsh:2948-2956`); race window 5s+ | Cleanup just before dispatch (`3160-3180`); protected within the same iter, but not after the final iter ends or against cross-iter races |
|
|
27
|
-
|
|
28
|
-
---
|
|
29
|
-
|
|
30
|
-
## Approach (Fix-Q + Fix-R, the minimal surgical combination)
|
|
31
|
-
|
|
32
|
-
| Fix | Effect | Adopted |
|
|
33
|
-
|---|---|---|
|
|
34
|
-
| **Q** Send Ctrl+C to the producing pane as soon as the sentinel is detected, terminating the process | Closes the race directly within ~1 second | **YES (primary)** |
|
|
35
|
-
| **R** `chmod 0444` the sentinel file to block rewrites | Freezes the mtime even if Q is late or fails | **YES (defense-in-depth)** |
|
|
36
|
-
| S Full refactor of the pane lifecycle | Effective, but the surface is too large. Partly covered by the existing prep cleanup (zsh 2948-2956). Violates Karpathy's "surgical changes" principle | NO |
|
|
37
|
-
| T 30s post-sentinel safety-net timeout | Redundant: Q is fail-open and the next iter's prep cleanup is the backup | NO |
|
|
38
|
-
|
|
39
|
-
Rationale:
|
|
40
|
-
- Q kills the producer within ~1 second, which removes the root cause. It mirrors the existing pattern exactly (zsh `run_ralph_desk.zsh:2384-2397`: double Ctrl+C plus `wait_for_pane_ready`).
|
|
41
|
-
- R tolerates chmod failures (EPERM/ENOTSUP are ignored, following the `tryLockFile` precedent at `scripts/postinstall.js:104`). It degrades gracefully in environments where chmod is a no-op, such as WSL1, NTFS, or tmpfs.
|
|
42
|
-
- Dropping S and T keeps the review surface minimal.
|
|
43
|
-
|
|
44
|
-
---
|
|
45
|
-
|
|
46
|
-
## Concrete code changes
|
|
47
|
-
|
|
48
|
-
### Node leader
|
|
49
|
-
|
|
50
|
-
#### 1. `src/node/tmux/pane-manager.mjs`: add helpers (after line 77)
|
|
51
|
-
|
|
52
|
-
New exports:
|
|
53
|
-
- `sendRawKey(paneId, key)`: calls `runTmux(['send-keys', '-t', paneId, key])`. Kept separate from `sendKeys` (which sends literal text with `-l --`) and intended for raw keys such as C-c.
|
|
54
|
-
- `killPaneProcess(paneId, { sendRawKey, waitForExit, gracePeriodMs=800, exitTimeoutMs=5000, log })`:
|
|
55
|
-
1. `sendRawKey('C-c')` → `await sleep(gracePeriodMs)` → `sendRawKey('C-c')` (double press, mirroring zsh `375-376`).
|
|
56
|
-
2. `await waitForExit(paneId, { timeoutMs: exitTimeoutMs }).catch(log)` — fail-open.
|
|
57
|
-
3. A TmuxError from sending the raw key itself is also caught and logged (safe against a pane that is already dead).
|
|
58
|
-
|
|
59
|
-
The existing `waitForProcessExit` (line 55) is reused as is.
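The three steps listed for `killPaneProcess` could look roughly like the sketch below. This is an illustration of the plan, not the shipped helper: the `sleep` utility and the no-op default for `log` are assumptions, and `sendRawKey` / `waitForExit` are injected so the example runs without tmux.

```javascript
// Hypothetical sketch of the planned helper; the real pane-manager.mjs may differ.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function killPaneProcess(paneId, {
  sendRawKey,            // injected: (paneId, key) => Promise<void>
  waitForExit,           // injected: (paneId, { timeoutMs }) => Promise<void>
  gracePeriodMs = 800,
  exitTimeoutMs = 5000,
  log = () => {},        // assumption: logging is optional in this sketch
} = {}) {
  try {
    // Double Ctrl+C, mirroring the zsh runner's double-press pattern.
    await sendRawKey(paneId, 'C-c');
    await sleep(gracePeriodMs);
    await sendRawKey(paneId, 'C-c');
  } catch (err) {
    // The pane may already be dead; log and keep going.
    log(err);
  }
  try {
    await waitForExit(paneId, { timeoutMs: exitTimeoutMs });
  } catch (err) {
    // Fail open: the leader proceeds even if the exit was not observed.
    log(err);
  }
}
```

Because both failure paths are swallowed, the returned promise always resolves, which is the property the fail-open and dead-pane test cases in the testing strategy rely on.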
|
|
60
|
-
|
|
61
|
-
#### 2. `src/node/shared/fs.mjs`: add helpers (after line 61)
|
|
62
|
-
|
|
63
|
-
- `lockSentinelFile(filePath, { log })`: runs `fs.chmod(filePath, 0o444)` and logs a warning only once on error. Mirrors the `tryLockFile` precedent (`scripts/postinstall.js:104`).
|
|
64
|
-
- `unlockSentinelFile(filePath)`: runs `fs.chmod(filePath, 0o644)` and ignores failures. Called just before iter cleanup.
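A minimal sketch of the two helpers described above, assuming a module-level flag is enough to implement "warn only once". The flag name `warnedOnce` and the log message format are hypothetical; only the `chmod` modes come from the plan.

```javascript
// Hypothetical sketch; the planned helpers are meant for src/node/shared/fs.mjs.
let warnedOnce = false; // assumption: a module-level flag gives "warn once" semantics

async function lockSentinelFile(filePath, { log = () => {} } = {}) {
  const { chmod } = await import('node:fs/promises');
  try {
    await chmod(filePath, 0o444); // read-only for owner, group, and others
  } catch (err) {
    // chmod can fail (missing file, EPERM, ENOTSUP); warn once and continue.
    if (!warnedOnce) {
      warnedOnce = true;
      log(`lockSentinelFile: ${err.code ?? err.message}`);
    }
  }
}

async function unlockSentinelFile(filePath) {
  const { chmod } = await import('node:fs/promises');
  try {
    await chmod(filePath, 0o644); // restore owner write before cleanup
  } catch {
    // Ignored on purpose: cleanup proceeds either way.
  }
}
```

Neither function rejects, so a call site can `await` them without its own error handling.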
|
|
65
|
-
|
|
66
|
-
#### 3. `src/node/runner/campaign-main-loop.mjs` — wire + call sites
|
|
67
|
-
|
|
68
|
-
Add DI slots (lines 1077-1080):
|
|
69
|
-
```
|
|
70
|
-
const sendRawKey = options.sendRawKey ?? defaultSendRawKey;
|
|
71
|
-
const waitForProcessExit = options.waitForProcessExit ?? defaultWaitForProcessExit;
|
|
72
|
-
const killPaneProcess = options.killPaneProcess ?? defaultKillPaneProcess;
|
|
73
|
-
const lockSentinel = options.lockSentinelFile ?? lockSentinelFile;
|
|
74
|
-
```
|
|
75
|
-
|
|
76
|
-
Internal wrapper:
|
|
77
|
-
```
|
|
78
|
-
async function reapProducer(paneId, sentinelFile) {
|
|
79
|
-
await killPaneProcess(paneId, { sendRawKey, waitForExit: waitForProcessExit, log: console.error });
|
|
80
|
-
if (sentinelFile) await lockSentinel(sentinelFile, { log: console.error });
|
|
81
|
-
}
|
|
82
|
-
```
|
|
83
|
-
|
|
84
|
-
Call sites (immediately after success and after `validateArtifact` passes):
|
|
85
|
-
|
|
86
|
-
| Site | Line | Call |
|
|
87
|
-
|---|---|---|
|
|
88
|
-
| Flywheel poll | After 1267-1277 (before 1285) | `reapProducer(state.flywheel_pane_id ?? state.verifier_pane_id, paths.flywheelSignalFile)` |
|
|
89
|
-
| Guard poll | After 1305-1315 (before 1323) | `reapProducer(guardPaneId, paths.flywheelGuardVerdictFile)` |
|
|
90
|
-
| Worker poll | After 1422-1432 (before 1456) | `reapProducer(state.worker_pane_id, paths.signalFile)` |
|
|
91
|
-
| Verifier poll | After 1489-1513 (before 1522) | `reapProducer(state.verifier_pane_id, paths.verdictFile)` |
|
|
92
|
-
| Final per-US verifier (`runFinalSequentialVerify`) | After 890-894 (before 896) | `reapProducer(verifierPaneId, paths.verdictFile)`; add `reapProducer` to the `runFinalSequentialVerify` signature and pass it from the caller (1185-1194) |
|
|
93
|
-
|
|
94
|
-
Iter cleanup unlock: call `unlockSentinelFile` just before each `fs.unlink(...)` call:
|
|
95
|
-
- L1291 (`flywheelSignalFile`)
|
|
96
|
-
- L1328 (`flywheelGuardVerdictFile`)
|
|
97
|
-
- Top of the loop (right after 1145): defensively unlock the Worker `signalFile` and the Verifier `verdictFile` (in case the next iter's producer overwrites them via atomic rename)
|
|
98
|
-
|
|
99
|
-
### zsh runner
|
|
100
|
-
|
|
101
|
-
#### 4. `src/scripts/lib_ralph_desk.zsh`: add helpers (after `atomic_write`, following line 245)
|
|
102
|
-
|
|
103
|
-
```
|
|
104
|
-
_kill_pane_process() {
|
|
105
|
-
local pane_id="$1" role="${2:-producer}"
|
|
106
|
-
log_debug "[bug7] kill_pane_process pane=$pane_id role=$role"
|
|
107
|
-
tmux send-keys -t "$pane_id" C-c 2>/dev/null
|
|
108
|
-
sleep 0.5
|
|
109
|
-
tmux send-keys -t "$pane_id" C-c 2>/dev/null
|
|
110
|
-
sleep 1
|
|
111
|
-
wait_for_pane_ready "$pane_id" 5 2>/dev/null || true
|
|
112
|
-
}
|
|
113
|
-
|
|
114
|
-
_lock_sentinel() {
|
|
115
|
-
local file="$1"
|
|
116
|
-
[[ -f "$file" ]] || return 0
|
|
117
|
-
chmod 0444 "$file" 2>/dev/null || true
|
|
118
|
-
}
|
|
119
|
-
|
|
120
|
-
_unlock_sentinel() {
|
|
121
|
-
local file="$1"
|
|
122
|
-
[[ -f "$file" ]] || return 0
|
|
123
|
-
chmod 0644 "$file" 2>/dev/null || true
|
|
124
|
-
}
|
|
125
|
-
```
|
|
126
|
-
|
|
127
|
-
#### 5. `src/scripts/run_ralph_desk.zsh` — call sites
|
|
128
|
-
|
|
129
|
-
| Site | Line | Call |
|
|
130
|
-
|---|---|---|
|
|
131
|
-
| Right after a successful Worker poll | 3003 (inside the `worker_poll_done=1` branch, after `log_debug`) | `_kill_pane_process "$WORKER_PANE" "worker"; _lock_sentinel "$SIGNAL_FILE"` |
|
|
132
|
-
| Right after a successful Verifier poll (main path) | After passing 3202, before 3215 (`ITER_VERIFIER_END`) | `_kill_pane_process "$VERIFIER_PANE" "verifier"; _lock_sentinel "$VERDICT_FILE"` |
|
|
133
|
-
| Final-verify per-US (`run_sequential_final_verify`) | After passing 2524, before entering the next iter | `_kill_pane_process "$VERIFIER_PANE" "verifier-final"; _lock_sentinel "$VERDICT_FILE"` |
|
|
134
|
-
| Codex grace path | `dispatch_verifier_per_us` (right after the grace period ends at 2420, before the `cp` at 2471) | `_kill_pane_process "$VERIFIER_PANE" "verifier-${suffix}"; _lock_sentinel "$VERDICT_FILE"` |
|
|
135
|
-
| Consensus path | Right after each successful `poll_for_signal` inside `run_consensus_verification` | Same pattern |
|
|
136
|
-
|
|
137
|
-
Prep cleanup unlock, just before the cleanup at lines 2948-2956:
|
|
138
|
-
```
|
|
139
|
-
_unlock_sentinel "$SIGNAL_FILE"; _unlock_sentinel "$VERDICT_FILE"
|
|
140
|
-
rm -f "$SIGNAL_FILE" "$DONE_CLAIM_FILE" "$VERDICT_FILE" 2>/dev/null
|
|
141
|
-
```
|
|
142
|
-
|
|
143
|
-
---
|
|
144
|
-
|
|
145
|
-
## Files to modify
|
|
146
|
-
|
|
147
|
-
| File | Change |
|
|
148
|
-
|---|---|
|
|
149
|
-
| `src/node/tmux/pane-manager.mjs` | Add `sendRawKey` and `killPaneProcess` exports |
|
|
150
|
-
| `src/node/shared/fs.mjs` | Add `lockSentinelFile` and `unlockSentinelFile` exports |
|
|
151
|
-
| `src/node/runner/campaign-main-loop.mjs` | DI + `reapProducer` + 5 call sites + iter cleanup unlock |
|
|
152
|
-
| `src/scripts/lib_ralph_desk.zsh` | Add `_kill_pane_process`, `_lock_sentinel`, `_unlock_sentinel` |
|
|
153
|
-
| `src/scripts/run_ralph_desk.zsh` | 4-5 call sites + prep cleanup unlock |
|
|
154
|
-
| `tests/node/us006-campaign-main-loop.test.mjs` | Add `killPaneProcess`/`lockSentinelFile` recorders to `createTmuxFakes()` + 3 Bug-7 tests |
|
|
155
|
-
| `tests/node/test-kill-pane-process.test.mjs` | NEW: helper unit tests |
|
|
156
|
-
| `tests/node/test-lock-sentinel-file.test.mjs` | NEW: chmod unit tests |
|
|
157
|
-
| `tests/test-bug7-post-sentinel-race.sh` | NEW: integration test against real tmux (mirrors the Bug #6 pattern) |
|
|
158
|
-
|
|
159
|
-
Ship as a single PR (the helpers are no-ops without call sites, so the review surface stays small).
|
|
160
|
-
|
|
161
|
-
---
|
|
162
|
-
|
|
163
|
-
## Reused functions (reference)
|
|
164
|
-
|
|
165
|
-
- Node: `pane-manager.mjs:50` `sendKeys`, `pane-manager.mjs:55` `waitForProcessExit` (5s timeout, shell detection)
|
|
166
|
-
- Node: `shared/fs.mjs:6-23` `writeFileAtomic`, `42-61` `writeSentinelExclusive`
|
|
167
|
-
- Node: `scripts/postinstall.js:104` `tryLockFile` (chmod 0o444 precedent)
|
|
168
|
-
- zsh: `lib_ralph_desk.zsh:240-245` `atomic_write`, `1075-1137` `wait_for_pane_ready`
|
|
169
|
-
- zsh: `run_ralph_desk.zsh:2384-2397` proven verifier-cleanup pattern (Ctrl+C + /exit + wait), `375-376/529-530` double Ctrl+C pattern
|
|
170
|
-
|
|
171
|
-
---
|
|
172
|
-
|
|
173
|
-
## Testing strategy
|
|
174
|
-
|
|
175
|
-
### Unit tests (Node)
|
|
176
|
-
|
|
177
|
-
`tests/node/test-kill-pane-process.test.mjs` (NEW):
|
|
178
|
-
- AC1 happy path: calls occur in the order C-c → sleep → C-c → waitForExit (verified with a fake recorder).
|
|
179
|
-
- AC2 fail-open: the helper resolves when `waitForExit` throws a TmuxError.
|
|
180
|
-
- AC3 dead pane: the helper resolves when `sendRawKey` throws.
|
|
181
|
-
- AC4 grace: `gracePeriodMs` is honored (verified with a fake clock or a tolerance).
|
|
182
|
-
|
|
183
|
-
`tests/node/test-lock-sentinel-file.test.mjs` (NEW):
|
|
184
|
-
- AC1: after lock, `(mode & 0o222) === 0` (skipped on filesystems that ignore chmod).
|
|
185
|
-
- AC2: locking a nonexistent path does not throw.
|
|
186
|
-
- AC3: the file is writable after unlock.
|
|
187
|
-
|
|
188
|
-
### Integration tests (Node)
|
|
189
|
-
|
|
190
|
-
Extend `tests/node/us006-campaign-main-loop.test.mjs`:
|
|
191
|
-
1. **Bug-7-A**: after the Worker `pollForSignal` succeeds, verify that `killPaneProcess('%worker')` and `lockSentinelFile(signalFile)` are called, in that order, before the next `dispatchVerifier`.
|
|
192
|
-
2. **Bug-7-B**: after the Verifier verdict passes, `killPaneProcess('%verifier')` and `lockSentinelFile(verdictFile)` are called before the next iter's `dispatchWorker`.
|
|
193
|
-
3. **Bug-7-C**: `run()` completes normally even if `killPaneProcess` throws.
|
|
194
|
-
|
|
195
|
-
Add fake `killPaneProcess`/`lockSentinelFile` recorders to `createTmuxFakes()` (line 83) so the 30+ existing tests stay compatible.
|
|
196
|
-
|
|
197
|
-
### Integration tests (zsh)
|
|
198
|
-
|
|
199
|
-
`tests/test-bug7-post-sentinel-race.sh` (NEW, mirrors the `test-bug6-worker-idle-false-positive.sh` pattern):
|
|
200
|
-
- Scenario 1: start `sleep 600` in a tmux session and call `_kill_pane_process`; `pane_current_command` returns to zsh/bash within 2s.
|
|
201
|
-
- Scenario 2: `_lock_sentinel` → verify mode 0444 → `_unlock_sentinel` → writable → `rm -f` succeeds.
|
|
202
|
-
- Scenario 3 (REAL_E2E gated): 1-iter campaign with a stub claude (writes the sentinel, then `sleep 120`); after 10s the verdict file mtime delta == 0.
|
|
203
|
-
|
|
204
|
-
### Self-Verification scenarios (CLAUDE.md gate, 3 required)
|
|
205
|
-
|
|
206
|
-
Modifying `src/scripts/run_ralph_desk.zsh` is MEDIUM-HIGH risk:
|
|
207
|
-
- **LOW**: helper unit tests and the existing Node/zsh regression tests pass.
|
|
208
|
-
- **MEDIUM**: real 1-iter campaign. Capture `pane_current_command` at the Worker → Verifier transition and verify the pane returns to a shell within 2s. Verify that the verdict file mtime is frozen.
|
|
209
|
-
- **CRITICAL**: 2-iter campaign (verify→fail→verify→pass). The iter-N+1 worker dispatch happens only after the iter-N verifier pane shows `pane_current_command == zsh`; capture timestamped logs. Run in both `--mode agent` and `--mode tmux`.
|
|
210
|
-
|
|
211
|
-
---
|
|
212
|
-
|
|
213
|
-
## Verification end-to-end
|
|
214
|
-
|
|
215
|
-
1. **Unit**: `node --test tests/node/test-kill-pane-process.test.mjs tests/node/test-lock-sentinel-file.test.mjs` passes.
|
|
216
|
-
2. **Integration (Node)**: `node --test tests/node/us006-campaign-main-loop.test.mjs` passes; the call-order assertions act as the regression guard.
|
|
217
|
-
3. **Live tmux**: within 2s of calling `_kill_pane_process`, `tmux display-message -p -t $pane '#{pane_current_command}'` returns `zsh`/`bash`.
|
|
218
|
-
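4. **mtime freeze**: measure `stat -f %m verify-verdict.json` (BSD/macOS syntax; GNU coreutils uses `stat -c %Y`) at detection time and again at +10s; the delta must be 0. This directly counters the 1m43s evidence in the bug report.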
|
|
219
|
-
5. **Pane output**: the `tmux capture-pane -p` output shows no new `Worked for Xm Ys` / `esc to interrupt` markers.
|
|
220
|
-
6. **Both modes**: run the smoke test under `--mode tmux` (zsh runner) and `--mode agent` (Node leader) separately; both must return to a shell within 4 seconds.
|
|
221
|
-
7. **Reproduction scenario**: run one campaign under the same conditions as the 19th launch (claude opus 1m worker + gpt-5.5:high codex verifier), then compare the leader log with the file mtimes; expect zero races.
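The mtime-freeze check in step 4 could also be scripted from Node instead of `stat`. The sketch below is one possible shape; the function name `mtimeDeltaMs` is hypothetical and does not appear in the plan.

```javascript
// Hypothetical helper: measures how much a file's mtime moved during a wait.
async function mtimeDeltaMs(filePath, waitMs) {
  const { stat } = await import('node:fs/promises');
  const before = (await stat(filePath)).mtimeMs;
  await new Promise((resolve) => setTimeout(resolve, waitMs));
  const after = (await stat(filePath)).mtimeMs;
  // 0 means nobody rewrote the file while we waited.
  return after - before;
}
```

For the scenario in the plan, the call would be `await mtimeDeltaMs('verify-verdict.json', 10_000)` right after sentinel detection, with a result of `0` taken as evidence that the producer no longer touches the file.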
|
|
222
|
-
|
|
223
|
-
---
|
|
224
|
-
|
|
225
|
-
## Risk / mitigation
|
|
226
|
-
|
|
227
|
-
| Risk | Likelihood | Mitigation |
|
|
228
|
-
|---|---|---|
|
|
229
|
-
| C-c interrupts the producer in the middle of writing an artifact | LOW: the sentinel is already on disk at detection time | The `MalformedArtifactError` path handles partial writes |
|
|
230
|
-
| chmod 0444 blocks the `unlink` in the next iter's cleanup | LOW | `_unlock_sentinel` / `unlockSentinelFile` runs just before unlink. On most Unix filesystems unlink is governed by directory permissions, so a 0444 file can still be unlinked |
|
|
231
|
-
| Producer rewrites the sentinel via atomic rename (bypassing chmod) | POSSIBLE | Q (kill) stops the producer within ~1s, shrinking the rewrite window from 2 minutes to 1 second. The leader has also already consumed the sentinel in-band |
|
|
232
|
-
| `killPaneProcess` throws on a dead pane | POSSIBLE | Caught inside the helper; unit tests AC2/AC3 guard against regressions |
|
|
233
|
-
| chmod 0444 is a silent no-op (WSL1/NTFS/tmpfs) | OBSERVED (postinstall.js precedent) | Warning logged only once. Degrades gracefully because Q (kill) is the primary defense |
|
|
234
|
-
| Regression in the existing us006 tests | MEDIUM | Add fake helper recorders to `createTmuxFakes()`; existing callers receive them automatically |
|
|
@@ -1,233 +0,0 @@
|
|
|
1
|
-
# PR-E: Phase C1 — Blocked Sentinel Recovery Hygiene (Planner v0)
|
|
2
|
-
|
|
3
|
-
> **Plan reference**: `docs/plans/v0.15-stabilization-plan.md` §5 Phase C
|
|
4
|
-
> **Continuation of**: PR-A (Bug #10 phase=verify recovery, commit `95c0d4e`)
|
|
5
|
-
> **Stop rule**: codex critic APPROVE (P0+P1=0) before merge
|
|
6
|
-
> **Critic instruction**: approve unless P0 or P1 found
|
|
7
|
-
|
|
8
|
-
---
|
|
9
|
-
|
|
10
|
-
## 1. Problem
|
|
11
|
-
|
|
12
|
-
After PR-A landed (`phase=verify` recovery honored), the next recovery surface is **operator-cleared BLOCKED**.
|
|
13
|
-
|
|
14
|
-
Today, when operator clears `<slug>-blocked.md` to recover (the documented manual recovery for some BLOCKED reasons), `status.json` retains:
|
|
15
|
-
- `phase: "blocked"` (stale)
|
|
16
|
-
- `consecutive_failures` and `consecutive_blocks` counters at their pre-BLOCKED values
|
|
17
|
-
- `last_block_reason` populated
|
|
18
|
-
|
|
19
|
-
On leader relaunch:
|
|
20
|
-
1. `readCurrentState` (`src/node/runner/campaign-main-loop.mjs:364`) preserves all of these
|
|
21
|
-
2. Main loop iterates, tries to dispatch worker
|
|
22
|
-
3. If worker fails for any reason, `consecutive_failures` increments from its stale base
|
|
23
|
-
4. Circuit breaker may trip immediately even though operator's intent was "fresh start"
|
|
24
|
-
5. Result: campaign re-BLOCKs on first failure, operator's recovery effort wasted
|
|
25
|
-
|
|
26
|
-
This is the same class as Bug #10 (PR-A): operator's recovery intent silently discarded because leader doesn't recognize the recovery surface.
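To make the failure sequence above concrete, the following sketch shows how a stale counter can trip a circuit breaker on the first new failure. The threshold value, the `tripped` field, and the function name are all hypothetical; the plan does not state the real circuit-breaker limit or its implementation.

```javascript
// Hypothetical threshold; the source does not state the real circuit-breaker limit.
const MAX_CONSECUTIVE_FAILURES = 3;

// Hypothetical model of "worker failed once more".
function recordWorkerFailure(state) {
  const consecutive_failures = (state.consecutive_failures ?? 0) + 1;
  return {
    ...state,
    consecutive_failures,
    tripped: consecutive_failures >= MAX_CONSECUTIVE_FAILURES,
  };
}

// Stale base carried over from the pre-BLOCKED run: one failure trips the breaker.
const stale = recordWorkerFailure({ phase: 'blocked', consecutive_failures: 2 });
// Clean base after an honored recovery: the same failure does not.
const fresh = recordWorkerFailure({ phase: 'worker', consecutive_failures: 0 });
console.log(stale.tripped, fresh.tripped); // prints: true false
```

The two calls differ only in the counter they start from, which is the gap the blocked-recovery branch is meant to close.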
|
|
27
|
-
|
|
28
|
-
---
|
|
29
|
-
|
|
30
|
-
## 2. Principles (3)
|
|
31
|
-
|
|
32
|
-
1. **Operator's recovery intent is the source of truth.** When BLOCKED sentinel is gone but status.json still says blocked + counters stale, the operator clearly meant to reset state. Leader must recognize and honor.
|
|
33
|
-
2. **Recovery validation must be strict (mirror PR-A).** Auto-honoring without checks risks accidental honor of crashed-mid-write states. PR-A's 5-check pattern applied to the blocked-recovery context.
|
|
34
|
-
3. **Defensive default — fall through, don't break.** If validation fails, log the reason and proceed with current behavior (no auto-reset). Recovery feature can never make existing flows worse.
|
|
35
|
-
|
|
36
|
-
## 3. Decision drivers (top 3)
|
|
37
|
-
|
|
38
|
-
| # | Driver | Why |
|
|
39
|
-
|---|---|---|
|
|
40
|
-
| D1 | **Operator recovery completeness** | PR-A covered phase=verify; phase=blocked is the most-common operator recovery (clear sentinel → relaunch). Closing this gap completes the pair. |
|
|
41
|
-
| D2 | **Mirror PR-A pattern** | Same shape (entry-time validate + flag + audit log) reduces cognitive load for future readers. Both Node + zsh same way. |
|
|
42
|
-
| D3 | **Counter reset honesty** | Operator clearing the sentinel implies intent to retry from clean state. Stale counters silently re-BLOCK = surprise. |
|
|
43
|
-
|
|
44
|
-
---
|
|
45
|
-
|
|
46
|
-
## 4. Viable options
|
|
47
|
-
|
|
48
|
-
### Option A — Entry-time blocked-recovery branch (mirror of PR-A) **[recommended]**
|
|
49
|
-
|
|
50
|
-
After `readCurrentState`, before main loop, add a second recovery branch:
|
|
51
|
-
- IF `state.phase === 'blocked'` AND blocked sentinel does NOT exist (operator cleared) AND counters are non-zero → **operator-cleared recovery detected**
|
|
52
|
-
- Validate (4 checks, see §7)
|
|
53
|
-
- On pass: reset phase to 'worker', reset counters to 0, log audit line
|
|
54
|
-
- On fail: fall through (current behavior — campaign continues with stale state, may immediately re-BLOCK)
|
|
55
|
-
|
|
56
|
-
Pros: surgical (single branch, ~30 LOC each side), pattern matches PR-A exactly, defensive default.
|
|
57
|
-
Cons: adds another entry-time check (small overhead).
|
|
58
|
-
|
|
59
|
-
### Option B — Reset on every relaunch unless explicit "preserve counters" flag
|
|
60
|
-
|
|
61
|
-
Always reset counters when relaunching with no BLOCKED sentinel. Add `--preserve-counters` flag for users who want stale-counter behavior.
|
|
62
|
-
|
|
63
|
-
Pros: simpler logic.
|
|
64
|
-
Cons: changes existing behavior for users who didn't experience this issue. Breaks back-compat for anyone relying on counter persistence across relaunches.
|
|
65
|
-
|
|
66
|
-
→ **Rejected**: violates principle 3 (defensive default).
|
|
67
|
-
|
|
68
|
-
### Option C — Document operator workaround instead of code change
|
|
69
|
-
|
|
70
|
-
Add cookbook entry: "after clearing blocked sentinel, also `jq` zero out counters in status.json".
|
|
71
|
-
|
|
72
|
-
Pros: zero code change.
|
|
73
|
-
Cons: pushes burden to operator. Same class of failure as Bug #10's pre-PR-A state — the leader should recognize recovery, not require operator jq pipelines.
|
|
74
|
-
|
|
75
|
-
→ **Rejected**: violates principle 1.
|
|
76
|
-
|
|
77
|
-
**Recommendation: A.**
|
|
78
|
-
|
|
79
|
-
---
|
|
80
|
-
|
|
81
|
-
## 5. Scope
|
|
82
|
-
|
|
83
|
-
### P0 — must land
|
|
84
|
-
|
|
85
|
-
1. **Node leader entry-time blocked-recovery branch** (`src/node/runner/campaign-main-loop.mjs`):
|
|
86
|
-
- New helper `_validateBlockedRecovery({ paths, state })` — returns `{ ok: bool, reason: string }`. 4 checks (§7).
|
|
87
|
-
- Branch after readCurrentState (around line 1392, where PR-A branch sits) — if `phase === 'blocked'` and validator passes, reset phase + counters + log.
|
|
88
|
-
|
|
89
|
-
2. **zsh runner mirror** (`src/scripts/run_ralph_desk.zsh`):
|
|
90
|
-
- Mirror helper `_validate_blocked_recovery` in `lib_ralph_desk.zsh`
|
|
91
|
-
- Mirror entry-time branch (similar location to PR-A's site at `:3047-3071` range)
|
|
92
|
-
|
|
93
|
-
### P1 — must land
|
|
94
|
-
|
|
95
|
-
3. **Tests**:
|
|
96
|
-
- `tests/node/test-blocked-recovery-hygiene.test.mjs` (NEW, 5 ACs):
|
|
97
|
-
- AC-BR1: phase=blocked + sentinel absent + counters non-zero + valid → reset + dispatch worker normally
|
|
98
|
-
- AC-BR2: phase=blocked + sentinel PRESENT → don't auto-recover, throw "Run clean first" (existing behavior preserved)
|
|
99
|
-
- AC-BR3: phase=blocked + sentinel absent + counters all zero → fall through (nothing to reset)
|
|
100
|
-
- AC-BR4: phase=verify + sentinel absent → defer to PR-A's branch (no double-handling)
|
|
101
|
-
- AC-BR5: phase=blocked + sentinel absent + `<slug>-blocked.json` sidecar marks a non-recoverable category (`mission_abort`, `recoverable: false`) → fall through, log "non-recoverable category, manual review needed"
|
|
102
|
-
- `tests/test-blocked-recovery-zsh.sh` (NEW, 5 helper-level scenarios mirroring AC-BR1..5)
|
|
103
|
-
|
|
104
|
-
### P2 — nice-to-have (deferred)
|
|
105
|
-
|
|
106
|
-
- Cookbook entry in `docs/rlp-desk/getting-started.md` documenting the recovery flow now that leader honors it
|
|
107
|
-
- Telemetry analytics — track how often operator-cleared recovery is detected (signal of campaign reliability)
|
|
108
|
-
|
|
109
|
-
---
|
|
110
|
-
|
|
111
|
-
## 6. Files to modify
|
|
112
|
-
|
|
113
|
-
| File | Change | Risk |
|
|
114
|
-
|---|---|---|
|
|
115
|
-
| `src/node/runner/campaign-main-loop.mjs` | `_validateBlockedRecovery` helper + entry-time branch | LOW (pattern proven by PR-A) |
|
|
116
|
-
| `src/scripts/lib_ralph_desk.zsh` | `_validate_blocked_recovery` helper | LOW |
|
|
117
|
-
| `src/scripts/run_ralph_desk.zsh` | Entry-time branch (near `:3047-3071` range) | LOW |
|
|
118
|
-
| `tests/node/test-blocked-recovery-hygiene.test.mjs` (NEW) | 5 ACs | LOW |
|
|
119
|
-
| `tests/test-blocked-recovery-zsh.sh` (NEW) | 5 zsh scenarios | LOW |
|
|
120
|
-
|
|
121
|
-
Total: 3 modified + 2 new = 5 files. Smaller surface than PR-A.
|
|
122
|
-
|
|
123
|
-
---
|
|
124
|
-
|
|
125
|
-
## 7. Validator: 4 checks (`_validateBlockedRecovery`) — Codex-revised v2
|
|
126
|
-
|
|
127
|
-
Codex critic P1-1 finding: v1's Check 4 depended on `last_block_reason` field but **no code path persists that field to status.json**. Both Node `_emitBlockedSentinel` and zsh `write_blocked_sentinel` skip it. So the v1 validator would never block auto-recovery for mission_abort/repeat_axis — exactly the safety case it was designed for.
|
|
128
|
-
|
|
129
|
-
**v2 fix**: detect non-recoverable categories from the **`<slug>-blocked.json` sidecar** (which `_emitBlockedSentinel` DOES write at L942-965, with `reason_category` + `recoverable` fields), not from status.json. The sidecar persists even when operator manually `rm <slug>-blocked.md` — they don't usually delete the sidecar.
|
|
130
|
-
|
|
131
|
-
Returns `{ ok: bool, reason: string }`:
|
|
132
|
-
|
|
133
|
-
1. `state.phase === 'blocked'` (precondition)
|
|
134
|
-
2. Blocked sentinel `<slug>-blocked.md` does NOT exist (operator cleared)
|
|
135
|
-
3. At least one of: `consecutive_failures > 0`, `consecutive_blocks > 0` (something to reset; if all counters zero, fall through — nothing to recover from)
|
|
136
|
-
4. **Sidecar safety check**: if `<slug>-blocked.json` exists AND parses AND has `recoverable: false` → fall through with audit log "non-recoverable category <reason_category> from sidecar". If sidecar absent (e.g. user ran full `clean`) OR sidecar `recoverable: true` → proceed with auto-recovery. Mirrors `_classifyBlock` `recoverable` invariant; no new status field needed.
|
|
137
|
-
|
|
138
|
-
(Check 5 from v0 — 30-day staleness — DROPPED. Architect-flagged as arbitrary.)
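The four checks above can be sketched as a pure function. This is an illustration under assumptions, not the planned helper: the plan's `_validateBlockedRecovery` takes `{ paths, state }` and reads the files itself, whereas this version takes a pre-computed `sentinelExists` flag and an already parsed `sidecar` object (`null` when the sidecar is absent or unparseable). The reason strings are also hypothetical.

```javascript
// Hypothetical pure-function sketch of the four checks; inputs are pre-read by the caller.
function validateBlockedRecovery({ state, sentinelExists, sidecar }) {
  // Check 1: precondition.
  if (state.phase !== 'blocked') {
    return { ok: false, reason: 'phase is not blocked' };
  }
  // Check 2: the operator must have cleared the sentinel.
  if (sentinelExists) {
    return { ok: false, reason: 'blocked sentinel still present' };
  }
  // Check 3: there must be something to reset.
  const failures = state.consecutive_failures ?? 0;
  const blocks = state.consecutive_blocks ?? 0;
  if (failures === 0 && blocks === 0) {
    return { ok: false, reason: 'no counters to reset' };
  }
  // Check 4: a parsed sidecar with recoverable === false vetoes auto-recovery.
  if (sidecar && sidecar.recoverable === false) {
    return {
      ok: false,
      reason: `non-recoverable category ${sidecar.reason_category} from sidecar`,
    };
  }
  return { ok: true, reason: 'operator-cleared blocked recovery' };
}
```

Keeping the checks free of I/O makes each one testable with plain objects; the file reads would then live in a thin wrapper around this function.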
|
|
139
|
-
|
|
140
|
-
On pass: caller resets `state.phase = 'worker'`, `state.consecutive_failures = 0`, `state.consecutive_blocks = 0`. Sidecar (if exists) is RENAMED to `<slug>-blocked.json.recovered-<iso>` for audit trail rather than deleted, so operator can inspect what was recovered from. Then logs:
|
|
141
|
-
|
|
142
|
-
```
|
|
143
|
-
[recovery] Operator-cleared BLOCKED detected (was: <last_block_reason>). Resetting counters and resuming as worker. iter=N us_id=<current_us>
|
|
144
|
-
```
|
|
145
|
-
|
|
146
|
-
On fail: log `[recovery] phase=blocked ignored: <reason>` and fall through to existing behavior.
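The on-pass reset and sidecar rename described above might be computed as follows. The function name, the return shape, and the use of `Date.prototype.toISOString()` for the `<iso>` suffix are assumptions; the plan only specifies the reset values and the `.recovered-<iso>` naming.

```javascript
// Hypothetical sketch: computes the reset state and the archived sidecar name.
// It performs no I/O; the caller would do the actual rename and status write.
function applyBlockedRecovery(state, sidecarPath, now = new Date()) {
  const nextState = {
    ...state,
    phase: 'worker',
    consecutive_failures: 0,
    consecutive_blocks: 0,
  };
  // Assumption: <iso> is a full ISO 8601 timestamp (contains ':' characters).
  const archivedSidecarPath = sidecarPath
    ? `${sidecarPath}.recovered-${now.toISOString()}`
    : null;
  return { nextState, archivedSidecarPath };
}
```

Returning a new object instead of mutating `state` keeps the pre-recovery values available for the audit log line.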
|
|
147
|
-
|
|
148
|
-
---
|
|
149
|
-
|
|
150
|
-
## 8. Pre-mortem (3 scenarios)
|
|
151
|
-
|
|
152
|
-
### S1 — Auto-recovery hides genuine problem
|
|
153
|
-
Campaign keeps BLOCKING because of a real architectural issue. Operator clears sentinel each time. Auto-recovery resets counters → CB never trips → infinite loop of fail+clear.
|
|
154
|
-
|
|
155
|
-
**Mitigation**: operator-cleared recovery is exactly that — operator chose to retry. If they keep clearing without fixing, the bug pattern is operator behavior, not leader's. Counters resetting is correct; CB still trips on the freshly-accumulated counters from current session. Leader doesn't enable infinite loops, operator does.
|
|
156
|
-
|
|
157
|
-
**Residual risk**: low. If operator wants CB to persist across relaunches, they can leave the sentinel and use `clean` workflow instead.
|
|
158
|
-
|
|
159
|
-
### S2 — Mid-write status.json read produces inconsistent state
|
|
160
|
-
A previous leader instance crashed mid-`writeStatus`. Relaunch reads partial JSON.
|
|
161
|
-
|
|
162
|
-
**Mitigation (corrected per Codex critic P2 backlog)**: `writeStatus` uses `writeJson` → `fs.writeFile` directly (NOT atomic rename). Partial writes are theoretically possible. If JSON is malformed, `readJsonIfExists` THROWS (not returns null) — leader fails fast at startup with parse error, surfacing the corruption to operator. Auto-recovery never proceeds because leader doesn't even reach the validator. This is acceptable: corrupted status.json is operator-visible immediately, not silently recovered. P2 backlog item: consider migrating writeStatus to atomic rename for crash safety, but that's a separate PR.
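For the P2 backlog item mentioned above, an atomic-rename write could look like the sketch below. It is not part of this PR and not the project's `writeStatus`; the function name `writeJsonAtomic` and the temp-file naming are hypothetical.

```javascript
// Hypothetical sketch of a crash-safer JSON write: write a temp file, then rename.
async function writeJsonAtomic(filePath, value) {
  const { writeFile, rename } = await import('node:fs/promises');
  // Assumption: a sibling temp file, so both paths share one filesystem.
  const tmpPath = `${filePath}.${process.pid}.tmp`;
  await writeFile(tmpPath, JSON.stringify(value, null, 2) + '\n');
  // rename() replaces the target atomically on POSIX within one filesystem,
  // so readers see either the old file or the complete new one.
  await rename(tmpPath, filePath);
}
```

A crash between the two steps leaves a stray `.tmp` file but never a half-written `status.json`, which is the property the current `fs.writeFile` path lacks.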
|
|
163
|
-
|
|
164
|
-
### S3 — Race: operator clears sentinel while leader is starting
|
|
165
|
-
Operator deletes blocked.md just as leader's `await exists(paths.blockedSentinel)` runs. Two outcomes:
|
|
166
|
-
- Sentinel exists during check → existing "Run clean first" error throws (existing behavior, unchanged)
|
|
167
|
-
- Sentinel missing during check → enters validator → if checks pass, recovery proceeds
|
|
168
|
-
|
|
169
|
-
**Mitigation**: this is a benign race. Both outcomes are valid (operator either succeeded in clearing or didn't). No corruption possible.
|
|
170
|
-
|
|
171
|
-
---
|
|
172
|
-
|
|
173
|
-
## 9. Test plan
|
|
174
|
-
|
|
175
|
-
### Unit (Node)
|
|
176
|
-
|
|
177
|
-
`tests/node/test-blocked-recovery-hygiene.test.mjs`:
|
|
178
|
-
|
|
179
|
-
Each AC sets up a fixture (status.json + memos/) per its scenario, runs the leader to first dispatch decision, asserts on dispatch behavior + log content.
|
|
180
|
-
|
|
181
|
-
- AC-BR1 happy: setup phase=blocked, sentinel absent, consecutive_failures=3 → assert leader dispatches worker (not throw), assert state.consecutive_failures === 0 in next status write, assert audit log line matches
|
|
182
|
-
- AC-BR2 sentinel present: setup phase=blocked, sentinel exists → assert leader throws "Run clean first" (existing behavior preserved)
|
|
183
|
-
- AC-BR3 nothing to reset: phase=blocked, sentinel absent, all counters zero, last_block_reason empty → assert fall-through (no log line, no reset, normal worker dispatch)
|
|
184
|
-
- AC-BR4 phase=verify defers: phase=verify, sentinel absent → assert PR-A logic runs (no blocked recovery handling)
|
|
185
|
-
- AC-BR5 non-recoverable category: phase=blocked, sentinel absent, `<slug>-blocked.json` sidecar with `reason_category: 'mission_abort'` and `recoverable: false` → assert fall-through with log "non-recoverable category"
|
|
186
|
-
|
|
187
|
-
### Integration (zsh)
|
|
188
|
-
|
|
189
|
-
`tests/test-blocked-recovery-zsh.sh` (helper-level, mirrors `test-bug10-zsh-relaunch-hygiene.sh`):
|
|
190
|
-
|
|
191
|
-
- Scenario BR-Z1: all 4 checks pass → `_validate_blocked_recovery` returns 0
|
|
192
|
-
- Scenarios BR-Z2..BR-Z5: each check fails → returns 1 with reason matching expected substring
|
|
193
|
-
|
|
194
|
-
### Regression
|
|
195
|
-
|
|
196
|
-
- Full Node suite: 334/334 must remain green
|
|
197
|
-
- Bug #10 PR-A tests (test-relaunch-phase-verify-hygiene.test.mjs) must remain green
|
|
198
|
-
- Bug #7 zsh regression must remain green
|
|
199
|
-
|
|
200
|
-
---
|
|
201
|
-
|
|
202
|
-
## 10. Verification end-to-end
|
|
203
|
-
|
|
204
|
-
1. `node --test tests/node/test-blocked-recovery-hygiene.test.mjs` → 5/5 PASS
|
|
205
|
-
2. `bash tests/test-blocked-recovery-zsh.sh` → 5/5 PASS
|
|
206
|
-
3. Full Node suite + Bug #7 regression unchanged green
|
|
207
|
-
4. **`zsh tests/sv-gate-fast.sh` PASS** (governance §1g pre-merge gate, Codex critic P1-2)
|
|
208
|
-
5. Manual sandbox: deliberately BLOCKED campaign → operator clears blocked.md → relaunch → leader logs `[recovery] Operator-cleared BLOCKED detected`, counters reset, worker dispatches, campaign continues. Repeat with `recoverable: false` sidecar → leader logs fall-through, no auto-recovery.
|
|
209
|
-
6. **AC-BR5 fixture must use real `_emitBlockedSentinel` flow** (Codex critic P1-1) — write a sentinel via the actual code path, then test recovery against it. Not hand-authored status.json.
|
|
210
|
-
|
|
211
|
-
**Release (NOT part of this PR's verification — per Codex critic P1-2 + CLAUDE.md absolute rules)**: any version bump, GitHub release, or npm publish is a SEPARATE user-approved action that follows merge. This PR's verification ends at items 1-6 above. Release decisions are not auto-flow.
|
|
212
|
-
|
|
213
|
-
---
|
|
214
|
-
|
|
215
|
-
## 11. ADR (preview)
|
|
216
|
-
|
|
217
|
-
- **Decision**: extend PR-A's recovery-hygiene pattern to `phase=blocked` operator-cleared scenario
|
|
218
|
-
- **Drivers**: D1 operator-recovery completeness, D2 mirror PR-A pattern, D3 counter reset honesty
|
|
219
|
-
- **Alternatives considered**: B (always reset, breaks back-compat), C (doc-only, pushes burden to operator)
|
|
220
|
-
- **Why chosen**: A surgical, pattern-proven, defensive default, completes Phase C1 without scope creep
|
|
221
|
-
- **Consequences**: operator-cleared BLOCKED relaunches now work as intended; no need for jq counter-reset cookbook; logs add `[recovery]` lines visible via `/rlp-desk logs`
|
|
222
|
-
- **Follow-ups**: Phase C2 (mid-iter crash recovery), Phase C3 (cross-mission queue recovery), Phase C4 (cookbook entry)
|
|
223
|
-
|
|
224
|
-
---
|
|
225
|
-
|
|
226
|
-
## 12. Round-by-round resolution log
|
|
227
|
-
|
|
228
|
-
| Round | Reviewer | Verdict | Findings |
|
|
229
|
-
|---|---|---|---|
|
|
230
|
-
| 0 | — | Planner v0 | initial draft |
|
|
231
|
-
| 1 | Architect (Claude inline) | ITERATE | 5 edits applied → v1: drop 30-day, add previous_block_reason, expand Check 4 prose, branch ordering, _skipNextWorkerDispatch comment |
|
|
232
|
-
| 2 | Codex Critic | ITERATE — 0 P0, 2 P1 | P1-1: Check 4 redesigned to use `<slug>-blocked.json` sidecar `recoverable` field (status.json never persists `last_block_reason`). P1-2: §10.5 release auto-flow removed; SV gate + user approval explicit. P2: S2 mitigation prose corrected (writeJson is not atomic rename; readJsonIfExists throws not returns null). All applied → v2 (current). |
|
|
233
|
-
| 3 | Codex Critic | **APPROVE** — 0 P0, 0 P1 | P1-1/P1-2 both closed. §7 sidecar-based gate validated. §10 sv-gate + user-approved release confirmed. **Loop terminated. Implementation can proceed.** |
|