@ai-dev-methodologies/rlp-desk 0.11.0 → 0.12.0

Files changed (44)
  1. package/docs/rlp-desk/artifact-schema.md +99 -0
  2. package/docs/rlp-desk/ci-setup.md +100 -0
  3. package/docs/rlp-desk/e2e-scenarios.md +102 -0
  4. package/docs/rlp-desk/plans/rlp-desk-0.11.1-tmux-pane-disappearance.md +260 -0
  5. package/docs/rlp-desk/plans/rlp-desk-tmux-flywheel-routing.md +730 -0
  6. package/install.sh +93 -20
  7. package/package.json +8 -2
  8. package/scripts/build-node-manifest.js +52 -0
  9. package/scripts/postinstall.js +162 -8
  10. package/src/commands/rlp-desk.md +48 -25
  11. package/src/governance.md +68 -6
  12. package/src/node/MANIFEST.txt +15 -0
  13. package/src/node/cli/command-builder.mjs +25 -5
  14. package/src/node/constants.mjs +19 -0
  15. package/src/node/polling/signal-poller.mjs +119 -3
  16. package/src/node/runner/campaign-main-loop.mjs +470 -41
  17. package/src/node/runner/leader-registry.mjs +100 -0
  18. package/src/node/runner/prompt-dismisser.mjs +200 -0
  19. package/src/node/shared/fs.mjs +38 -0
  20. package/src/node/util/debug-log.mjs +56 -0
  21. package/src/node/util/shell-quote.mjs +12 -0
  22. package/docs/superpowers/plans/2026-04-24-gpt-5-5-default.md +0 -517
  23. package/docs/superpowers/specs/2026-04-24-gpt-5-5-default.md +0 -107
  24. /package/docs/{TODO-verification-next.md → rlp-desk/TODO-verification-next.md} +0 -0
  25. /package/docs/{architecture.md → rlp-desk/architecture.md} +0 -0
  26. /package/docs/{blueprints → rlp-desk/blueprints}/blueprint-flywheel-enhancement.md +0 -0
  27. /package/docs/{blueprints → rlp-desk/blueprints}/blueprint-pivot-step.md +0 -0
  28. /package/docs/{blueprints → rlp-desk/blueprints}/plan-flywheel-enhancement.md +0 -0
  29. /package/docs/{blueprints → rlp-desk/blueprints}/sv-architecture-rethink.md +0 -0
  30. /package/docs/{getting-started.md → rlp-desk/getting-started.md} +0 -0
  31. /package/docs/{internal → rlp-desk/internal}/verification-policy-gap-analysis.md +0 -0
  32. /package/docs/{internal → rlp-desk/internal}/verification-strategy-research.md +0 -0
  33. /package/docs/{multi-mission-orchestration.md → rlp-desk/multi-mission-orchestration.md} +0 -0
  34. /package/docs/{plans → rlp-desk/plans}/cozy-gliding-trinket.md +0 -0
  35. /package/docs/{plans → rlp-desk/plans}/frolicking-churning-honey.md +0 -0
  36. /package/docs/{plans → rlp-desk/plans}/keen-sauteeing-snowflake.md +0 -0
  37. /package/docs/{plans → rlp-desk/plans}/mutable-booping-corbato.md +0 -0
  38. /package/docs/{plans → rlp-desk/plans}/rlp-desk-0.11-handoff-7fixes.md +0 -0
  39. /package/docs/{plans → rlp-desk/plans}/rlp-desk-elegant-papert-agent-a8cd695ffca2a3ad8.md +0 -0
  40. /package/docs/{plans → rlp-desk/plans}/rlp-desk-elegant-papert.md +0 -0
  41. /package/docs/{plans → rlp-desk/plans}/toasty-whistling-diffie-agent-a6814625642e956da.md +0 -0
  42. /package/docs/{plans → rlp-desk/plans}/toasty-whistling-diffie.md +0 -0
  43. /package/docs/{plans → rlp-desk/plans}/validated-snacking-crayon.md +0 -0
  44. /package/docs/{protocol-reference.md → rlp-desk/protocol-reference.md} +0 -0
@@ -0,0 +1,99 @@
1
+ # rlp-desk Artifact Schema (v5.7 §4.25)
2
+
3
+ > Worker/Verifier write JSON artifacts that the Leader reads. The schema validator at the READ boundary enforces these contracts. **Violation → BLOCKED `contract_violation/malformed_artifact`** (recoverable).
4
+
5
+ ## Validated artifacts
6
+
7
+ | File | Written by | Read by | `signal_type` |
8
+ |------|-----------|---------|---------------|
9
+ | `<slug>-iter-signal.json` | Worker | Leader (worker poll) | `signal` |
10
+ | `<slug>-verify-verdict.json` (per-US) | Verifier | Leader (verifier poll) | `verdict` |
11
+ | `<slug>-verify-verdict.json` (final ALL) | Verifier | Leader (final-verifier poll) | `verdict` |
12
+ | `<slug>-flywheel-signal.json` | Flywheel | Leader (flywheel poll) | `flywheel_signal` |
13
+ | `<slug>-flywheel-guard-verdict.json` | Guard | Leader (guard poll) | `flywheel_guard_verdict` |
14
+ | `<slug>-done-claim.json` | Worker | Leader (analytics, A4 fallback) | `done_claim` |
15
+
16
+ ## Required structural fields (validated by `validateArtifact`)
17
+
18
+ | Field | Type | Constraint | Notes |
19
+ |-------|------|------------|-------|
20
+ | `slug` | string | === campaign slug | OPTIONAL for backward compat. If present, must match. |
21
+ | `iteration` | integer | ≥ `iteration_floor` (current state.iteration) | OPTIONAL for backward compat. Worker may advance, never regress. |
22
+ | `signal_type` | string | === expected per read context | OPTIONAL for backward compat. Discriminates artifacts at read time. |
23
+ | `us_id` | string | ∈ `usList ∪ {'ALL'}` | OPTIONAL for backward compat. Closed-set check. |
24
+
25
+ The validator checks only the minimum structure plus the semantic anchor fields listed above. It does NOT validate downstream business fields (e.g. `verdict.verdict`, `signal.status`); those are checked by their respective consumers.
26
+
27
+ ## Examples
28
+
29
+ ### Valid worker signal
30
+ ```json
31
+ {
32
+ "slug": "sum-fn",
33
+ "iteration": 1,
34
+ "signal_type": "signal",
35
+ "us_id": "US-001",
36
+ "status": "verify",
37
+ "summary": "implementation done; tests pass"
38
+ }
39
+ ```
40
+
41
+ ### Valid verifier verdict
42
+ ```json
43
+ {
44
+ "slug": "sum-fn",
45
+ "iteration": 1,
46
+ "signal_type": "verdict",
47
+ "us_id": "US-001",
48
+ "verdict": "pass",
49
+ "criteria_results": [...]
50
+ }
51
+ ```
52
+
53
+ ### Violation: wrong slug
54
+ ```json
55
+ {
56
+ "slug": "wrong-campaign", // ← BLOCKED contract_violation
57
+ "iteration": 1,
58
+ ...
59
+ }
60
+ ```
61
+ → `Malformed artifact at slug: expected sum-fn, got wrong-campaign`
62
+
63
+ ### Violation: us_id outside allowed set
64
+ ```json
65
+ {
66
+ "us_id": "US-999" // ← BLOCKED contract_violation (US-999 ∉ [US-001, ALL])
67
+ }
68
+ ```
69
+ → `Malformed artifact at us_id: expected one of [US-001, ALL], got US-999`
70
+
71
+ ### Violation: iteration regress
72
+ ```json
73
+ {
74
+ "iteration": 0 // ← floor is 1; regress not allowed
75
+ }
76
+ ```
77
+ → `Malformed artifact at iteration: expected >= 1, got 0`
78
+
79
+ ## Backward compatibility
80
+
81
+ Existing artifacts written before v5.7 §4.25 do not carry `slug`/`signal_type`/`iteration` fields. The validator skips any field not present (`undefined` is allowed). Workers/Verifiers SHOULD start emitting these fields for stronger contract enforcement, but legacy artifacts continue to work.
82
+
83
+ ## Feedback loop closure
84
+
85
+ When `MalformedArtifactError` fires:
86
+ 1. `_handlePollFailure` writes BLOCKED with `reason_category: contract_violation`, `failure_category: malformed_artifact`, `recoverable: true`.
87
+ 2. `reason_detail` includes the structured error: `Malformed artifact at <field>: expected <expected>, got <got>`.
88
+ 3. Operators reviewing `<slug>-blocked.json` see the precise contract violation and can update the Worker prompt template (`prompts/<slug>.worker.prompt.md`) to require the missing/correct field.
89
+ 4. On re-run after fix, the Worker writes a compliant artifact and the campaign proceeds.
90
+
91
+ ## Authoring guidance
92
+
93
+ - Worker prompt templates SHOULD instruct the LLM to include `slug`, `iteration`, `signal_type`, and `us_id` in every JSON artifact.
94
+ - The fix-contract (`buildFixContract` in `campaign-main-loop.mjs`) already feeds verifier failures back to the next Worker; future enhancement: feed `MalformedArtifactError` details directly into the next Worker prompt without requiring user re-run.
95
+
96
+ ## Audit
97
+
98
+ - Schema unit tests: `tests/node/test-artifact-schema.mjs` (7 violation scenarios)
99
+ - E2E: Schema violations are exercised in `tests/sv-gate-full.sh` (REAL campaign E2E asserts `complete.md` or `blocked.md` exists — schema violations route to the latter)
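The field rules and error messages in artifact-schema.md above can be sketched in a few lines of Node. This is a minimal illustration, not the package's implementation: the `ctx` shape (`slug`, `iterationFloor`, `signalType`, `usList`) and the error class internals are assumptions, and only the message format `Malformed artifact at <field>: expected <expected>, got <got>` is taken from the document.

```javascript
// Hypothetical sketch of a read-boundary validator. The real validateArtifact
// signature is not shown in this diff; ctx field names are assumptions.
class MalformedArtifactError extends Error {
  constructor(field, expected, got) {
    super(`Malformed artifact at ${field}: expected ${expected}, got ${got}`);
    this.name = 'MalformedArtifactError';
    this.field = field;
  }
}

function validateArtifact(artifact, ctx) {
  const { slug, iteration, signal_type: signalType, us_id: usId } = artifact;

  // Every field is optional for backward compatibility: undefined is skipped.
  if (slug !== undefined && slug !== ctx.slug) {
    throw new MalformedArtifactError('slug', ctx.slug, slug);
  }
  if (iteration !== undefined) {
    if (!Number.isInteger(iteration)) {
      throw new MalformedArtifactError('iteration', 'integer', iteration);
    }
    // A Worker may advance the iteration but never regress below the floor.
    if (iteration < ctx.iterationFloor) {
      throw new MalformedArtifactError('iteration', `>= ${ctx.iterationFloor}`, iteration);
    }
  }
  if (signalType !== undefined && signalType !== ctx.signalType) {
    throw new MalformedArtifactError('signal_type', ctx.signalType, signalType);
  }
  if (usId !== undefined) {
    // Closed-set check: the campaign's user stories plus the literal 'ALL'.
    const allowed = [...ctx.usList, 'ALL'];
    if (!allowed.includes(usId)) {
      throw new MalformedArtifactError('us_id', `one of [${allowed.join(', ')}]`, usId);
    }
  }
  return artifact;
}
```

Under these assumptions, an empty object passes (legacy artifacts keep working), while a wrong `slug`, an out-of-set `us_id`, or a regressed `iteration` throws with the structured message that ends up in `reason_detail`.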
@@ -0,0 +1,100 @@
1
+ # rlp-desk CI Setup (v5.7 §4.25)
2
+
3
+ > SV gate is a mechanical contract: every PR touching `src/node/**`, `src/scripts/**`, `src/commands/rlp-desk.md`, or `src/governance.md` MUST pass `tests/sv-gate-full.sh` before merge.
4
+
5
+ ## Local development
6
+
7
+ ### Fast gate (~30s)
8
+
9
+ Run before every commit:
10
+
11
+ ```sh
12
+ zsh tests/sv-gate-fast.sh
13
+ # or
14
+ npm run sv-gate:fast
15
+ ```
16
+
17
+ Checks:
18
+ - 35+ code-pattern greps (each tracked v5.7 fix has the expected code)
19
+ - All Node unit tests (~50)
20
+ - 5 critical zsh unit tests
21
+
22
+ ### Full gate (~5 min)
23
+
24
+ Run before merge / release:
25
+
26
+ ```sh
27
+ zsh tests/sv-gate-full.sh
28
+ # or
29
+ npm run sv-gate:full
30
+ ```
31
+
32
+ Adds:
33
+ - REAL tmux E2E (mocked tmux capture, 9 scenarios)
34
+ - REAL campaign E2E (haiku worker/verifier, max-iter 3, iter-timeout 300s)
35
+ - Asserts `<slug>-complete.md` OR `<slug>-blocked.md` exists post-run (file-guarantee invariant)
36
+
37
+ **Pre-conditions for full gate**:
38
+ - Inside a tmux session (`echo $TMUX` not empty)
39
+ - `claude` CLI in PATH
40
+ - `node` >= 16 in PATH
41
+ - `~/.claude/ralph-desk/` synced from latest `src/` (run `bash install.sh` or manual sync)
42
+
43
+ ## GitHub Actions
44
+
45
+ The fast gate runs on every PR via `.github/workflows/sv-gate.yml`:
46
+
47
+ ```yaml
48
+ name: SV Gate
49
+ on: [push, pull_request]
50
+ jobs:
51
+ sv-gate-fast:
52
+ runs-on: macos-latest # zsh + tmux available
53
+ steps:
54
+ - uses: actions/checkout@v4
55
+ - uses: actions/setup-node@v4
56
+ with: { node-version: '22' }
57
+ - run: bash install.sh # syncs to ~/.claude/ralph-desk
58
+ env: { REPO_URL: 'file://${{ github.workspace }}' }
59
+ - run: zsh tests/sv-gate-fast.sh
60
+ ```
61
+
62
+ The full gate (with REAL campaign E2E) is NOT run in CI — it requires:
63
+ - Anthropic API key (haiku worker/verifier)
64
+ - Live tmux session (CI runners are non-interactive)
65
+ - ~3-5 min wallclock per run
66
+
67
+ Operators MUST run `tests/sv-gate-full.sh` locally before merging to `main`.
68
+
69
+ ## Branch protection (manual)
70
+
71
+ Required for the SV gate to be enforceable:
72
+
73
+ 1. Go to `https://github.com/<owner>/rlp-desk/settings/branches`
74
+ 2. Add rule for `main`:
75
+ - ✅ Require a pull request before merging
76
+ - ✅ Require status checks to pass before merging
77
+ - ✅ Search and select: `sv-gate-fast`
78
+ - ✅ Require branches to be up to date before merging
79
+ 3. Document the manual step here. Branch protection cannot be enforced via committed YAML alone — it is a repo-admin setting.
80
+
81
+ ## Forks / non-GitHub repos
82
+
83
+ `tests/sv-gate-fast.sh` and `tests/sv-gate-full.sh` are pure zsh + Node — no GitHub-specific dependencies. Forks should:
84
+
85
+ 1. Run `npm run sv-gate:fast` in their CI (Travis, GitLab CI, etc.) using the same OS-level prereqs (macOS or Linux + zsh + tmux + node + claude CLI).
86
+ 2. Optionally run `npm run sv-gate:full` in a scheduled job (nightly) since it requires live API key.
87
+
88
+ ## Gate failure interpretation
89
+
90
+ | Failure mode | Meaning | Action |
91
+ |--------------|---------|--------|
92
+ | Code-pattern grep failed | Tracked fix's expected code is missing | Restore the fix or update `tests/sv-gate-fast.sh` if the pattern legitimately changed |
93
+ | Node unit test failed | Behavioral regression | Fix the code; do NOT relax the test |
94
+ | zsh unit test failed | Behavioral regression in shell helpers | Fix the helper |
95
+ | REAL tmux E2E failed | Real tmux capture/send-keys broke | Investigate tmux version or pane state |
96
+ | REAL campaign E2E failed (no sentinel) | **FILE-GUARANTEE VIOLATED** — Worker/Verifier exited without artifact AND backstop did NOT catch | Critical bug; investigate `_ensureTerminalSentinel` and `_handlePollFailure` paths |
97
+
98
+ ## Memo: SV gate is the contract
99
+
100
+ The SV gate exists because AI assistants (including the Leader itself) miss steps. Mechanical .sh verification is the only enforceable contract — code review, "I tested it locally", and unit-test-only verification are not sufficient. Plan v5.7 explicitly forbids commits that have not passed `tests/sv-gate-full.sh`.
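The file-guarantee invariant described in ci-setup.md (after a run, `<slug>-complete.md` OR `<slug>-blocked.md` must exist) can be expressed as a small check. The sketch below is only an illustration in Node: the helper name `assertTerminalSentinel` and the idea of passing the sentinel directory as an argument are assumptions, since the gate scripts themselves are not part of this diff.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: returns the terminal sentinel kinds found in `dir`,
// or throws when neither exists (the "FILE-GUARANTEE VIOLATED" case).
function assertTerminalSentinel(dir, slug) {
  const found = ['complete', 'blocked'].filter((kind) =>
    fs.existsSync(path.join(dir, `${slug}-${kind}.md`)));
  if (found.length === 0) {
    throw new Error(
      `file-guarantee violated: neither ${slug}-complete.md nor ${slug}-blocked.md exists in ${dir}`);
  }
  return found;
}
```

A gate script could call such a helper once after the campaign exits and treat the thrown error as the critical failure row in the table above.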
@@ -0,0 +1,102 @@
1
+ # rlp-desk E2E Test Scenarios (v5.7 §4.25)
2
+
3
+ > Two-tier coverage: **Tier A** (deterministic injection, ~ms) runs in `sv-gate-fast`; **Tier B** (real-subprocess + real-tmux + real-claude, seconds–minutes) runs in `sv-gate-full`. Every fix path is covered by at least one tier.
4
+
5
+ ## Tier A — Deterministic injection (sv-gate-fast)
6
+
7
+ Uses `pollForSignal` injection seam (no subprocess spawn) — deterministic, fast, CI-stable.
8
+
9
+ | Scenario | Test file | Asserts |
10
+ |----------|-----------|---------|
11
+ | writeSentinelExclusive O_EXCL race | `tests/node/test-sentinel-exclusive.mjs` | First-writer-wins, parent dir create, EEXIST returns no-op, parallel race |
12
+ | Backstop: missing scaffold | `tests/node/test-leader-exit-invariant.mjs` | `_ensureTerminalSentinel` writes `blocked.md` even on `ensureScaffold` throw |
13
+ | Backstop: pollForSignal throws | `tests/node/test-leader-exit-invariant.mjs` | `_handlePollFailure` writes BLOCKED + run() returns blocked status |
14
+ | Backstop: idempotent first-writer-wins | `tests/node/test-leader-exit-invariant.mjs` | Pre-existing BLOCKED is NOT overwritten by backstop |
15
+ | Lying worker (signal missing) | `tests/node/test-lying-worker.mjs` | BLOCKED `infra_failure/worker_exited_without_artifacts` |
16
+ | Lying verifier (per-US verdict missing) | `tests/node/test-lying-worker.mjs` + `tests/node/sv-e2e/test-lying-verifier.mjs` | BLOCKED `verifier_exited_without_artifacts` |
17
+ | Lying final verifier (US-ALL) | `tests/node/sv-e2e/test-lying-verifier.mjs` | BLOCKED `final_verifier_exited_without_artifacts` |
18
+ | Prompt-blocked (default-No worker) | `tests/node/sv-e2e/test-prompt-blocked.mjs` | BLOCKED `prompt_blocked` |
19
+ | Prompt-blocked (default-No verifier) | `tests/node/sv-e2e/test-prompt-blocked.mjs` | BLOCKED `prompt_blocked` (verifier role) |
20
+ | Schema: empty object | `tests/node/test-artifact-schema.mjs` | No crash |
21
+ | Schema: wrong slug | `tests/node/test-artifact-schema.mjs` | BLOCKED `contract_violation/malformed_artifact` |
22
+ | Schema: us_id outside set | `tests/node/test-artifact-schema.mjs` | BLOCKED `malformed_artifact` |
23
+ | Schema: iteration regress | `tests/node/test-artifact-schema.mjs` | BLOCKED `malformed_artifact` |
24
+ | Schema: iteration not integer | `tests/node/test-artifact-schema.mjs` | BLOCKED `malformed_artifact` |
25
+ | Schema: signal_type mismatch | `tests/node/test-artifact-schema.mjs` | BLOCKED `malformed_artifact` |
26
+ | Schema: valid signal (back-compat) | `tests/node/test-artifact-schema.mjs` | No false positive |
27
+ | Auto-dismiss prompt patterns (24+) | `tests/node/test-prompt-dismisser.mjs` | Each `(y/n)`/`[Y/n]`/`[y/N]` variant + scrollback + unknown-fast-fail + claude v2.x trust |
28
+ | Shell quote (Bug 1) | `tests/node/test-shell-quote.mjs` | POSIX single-quote escape for `[1m]` etc. |
29
+ | Opus 1M context | `tests/node/test-opus-1m-context.mjs` | `ANTHROPIC_BETA` prefix, isOpusModel detection |
30
+
31
+ **Tier A total**: 50+ tests across 11 files. Runtime: ~0.7s. Always runs in CI.
32
+
33
+ ## Tier B — Real-subprocess (sv-gate-full)
34
+
35
+ Uses real tmux session + real `tmux send-keys` / `capture-pane` / real claude haiku CLI. Slow (~5min) but exercises actual production paths.
36
+
37
+ | Scenario | Test | Asserts |
38
+ |----------|------|---------|
39
+ | Real tmux: `[Y/n]` auto-dismiss | `tests/sv-gate-real-e2e.sh` | Real `tmux send-keys Enter` after `auto_dismiss_prompts` |
40
+ | Real tmux: `[y/N]` BLOCK | `tests/sv-gate-real-e2e.sh` | `infra_failure` sentinel written, NO Enter sent |
41
+ | Real tmux: 10s no-progress timeout | `tests/sv-gate-real-e2e.sh` | BLOCKED on freeze regardless of prompt |
42
+ | Real tmux: unknown text + no bracket | `tests/sv-gate-real-e2e.sh` | No false BLOCK, no false Enter |
43
+ | Real tmux: unknown phrasing + `[y/N]` | `tests/sv-gate-real-e2e.sh` | Fast-fail BLOCK (10min wait avoided) |
44
+ | Real tmux: unknown phrasing + `(y/n)` | `tests/sv-gate-real-e2e.sh` | Fast-fail BLOCK |
45
+ | Real tmux: codex `[Y/n]` | `tests/sv-gate-real-e2e.sh` | Auto-dismiss (codex CLI variant) |
46
+ | Real tmux: codex `[y/N]` | `tests/sv-gate-real-e2e.sh` | BLOCK |
47
+ | Real tmux: scrollback contamination | `tests/sv-gate-real-e2e.sh` | Old `[Y/n]` + active `[y/N]` → BLOCK (scan-all) |
48
+ | Real haiku campaign (happy path) | `tests/sv-gate-full.sh` (inline) | `complete.md` written; trust prompt auto-dismissed; tests pass; commit recorded |
49
+
50
+ **Tier B total**: 10+ scenarios. Runtime: ~5 min (1 min for tmux scenarios + ~4 min for haiku campaign). Run before merge / release.
51
+
52
+ ## Coverage matrix (per fix)
53
+
54
+ | Fix | Tier A | Tier B | Bug ID |
55
+ |-----|--------|--------|--------|
56
+ | zsh `[1m]` glob | shell-quote | (haiku campaign launches Opus models when promoted) | Bug 1 |
57
+ | tmux silent SV/flywheel | us012 | (haiku campaign exercises tmux mode) | Bug 2/3 |
58
+ | auto_dismiss prompts | prompt-dismisser | real-e2e #1-9 | Bug 4 |
59
+ | A4 fallback prompt guard | a4_fallback | (haiku campaign) | Bug 5 |
60
+ | scrollback contamination | prompt-dismisser | real-e2e #9 | §4.17.b |
61
+ | unknown-prompt fast-fail | prompt-dismisser | real-e2e #5-6 | §4.18 |
62
+ | Node iterTimeout fwd | (verified by haiku campaign actually completing in ≤300s) | full | §4.19 |
63
+ | claude v2.x trust prompt | prompt-dismisser | full (haiku triggers it) | §4.20 |
64
+ | capture window -50 + whitespace norm | prompt-dismisser | full (haiku narrow-pane wrap) | §4.21 |
65
+ | WorkerExitedError | lying-worker | (full campaign covers happy path; injection covers exit) | §4.22 |
66
+ | tail-15 normalized matching | prompt-dismisser | real-e2e | §4.23 |
67
+ | writeSentinelExclusive O_EXCL | sentinel-exclusive | (full campaign uses it for complete.md) | §4.24 |
68
+ | run() try/finally backstop | leader-exit-invariant | (full campaign verifies success path) | §4.24 §1g |
69
+ | _handlePollFailure | lying-worker, lying-verifier, prompt-blocked | (full campaign success path) | §4.25 |
70
+ | validateArtifact schema | artifact-schema | full (haiku artifacts schema-compliant) | §4.25 P1 |
71
+
72
+ Every fix has at least one Tier A test. Tier B exercises the production-realistic paths (real tmux, real subprocess, real claude haiku).
73
+
74
+ ## Running the gates
75
+
76
+ ```sh
77
+ # Fast gate (~0.7s, every commit)
78
+ zsh tests/sv-gate-fast.sh
79
+ # or
80
+ npm run sv-gate:fast
81
+
82
+ # Full gate (~5 min, before merge/release)
83
+ zsh tests/sv-gate-full.sh
84
+ # or
85
+ npm run sv-gate:full
86
+ ```
87
+
88
+ `sv-gate-full` requires:
89
+ - Inside a tmux session (`echo $TMUX` non-empty)
90
+ - `claude` CLI in PATH with valid auth
91
+ - `node >= 16` in PATH
92
+ - `~/.claude/ralph-desk/` synced from latest `src/` (run `bash install.sh`)
93
+
94
+ ## Adding a new scenario
95
+
96
+ 1. **Determine tier**:
97
+ - Deterministic, no subprocess → Tier A
98
+ - Requires real tmux/claude/network → Tier B
99
+ 2. **Tier A**: add `tests/node/sv-e2e/test-<name>.mjs` (or extend existing file). Use `pollForSignal` injection seam. Update `NODE_TESTS` array in `tests/sv-gate-fast.sh`.
100
+ 3. **Tier B**: add scenario to `tests/sv-gate-real-e2e.sh` with `reset_pane_state` between scenarios. The script auto-runs in `sv-gate-full.sh`.
101
+ 4. **Document**: add row to the Coverage matrix in this file.
102
+ 5. **Verify**: run `npm run sv-gate:fast` (Tier A) or `npm run sv-gate:full` (both tiers); both must exit 0.
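The first scenario in the Tier A table (first-writer-wins, parent dir create, EEXIST as a no-op) relies on exclusive file creation. A minimal sketch of that technique in Node follows; the function name matches the scenario, but the signature and return value are assumptions, not the package's actual `writeSentinelExclusive`.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Sketch: returns true if this call created the sentinel, false if another
// writer already did. The boolean return value is an assumed convention.
function writeSentinelExclusive(filePath, content) {
  // Create missing parent directories first.
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  let fd;
  try {
    // 'wx' opens for writing and fails with EEXIST if the path exists,
    // so exactly one concurrent writer can succeed.
    fd = fs.openSync(filePath, 'wx');
  } catch (err) {
    if (err.code === 'EEXIST') return false; // first writer already won
    throw err;
  }
  try {
    fs.writeSync(fd, content);
  } finally {
    fs.closeSync(fd);
  }
  return true;
}
```

The design point is that existence check and creation happen in one system call, which is what lets a backstop run unconditionally without overwriting a pre-existing BLOCKED sentinel.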
@@ -0,0 +1,260 @@
1
+ # rlp-desk 0.11.1 — Tmux session/pane lifecycle resilience (ralplan v3)
2
+
3
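+ > v3: Absorbs the Codex Critic ITERATE feedback (7 patches): a single authoritative 5s timeout, mid-iteration pane-death detection, SESSION_NAME collision avoidance via `$$` + rand, documented limits of destroy-unattached, a shasum fallback chain, a mkdir atomic lock, and a mechanical self-V fixture (grep-only checks are forbidden).
4
+ > v2: Absorbs the Architect's first ITERATE feedback: the bug premise was corrected by ground-truth verification. The pane fields in session-config.json were actually recorded correctly; the real defect is that the tmux session itself disappears.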
5
+
6
+ ## Context
7
+
8
+ Consumer handoff `coordination/handoffs/2026-04-26-rlp-desk-tmux-pane-disappearance-bug.md` (P0) plus ground-truth verification.
9
+
10
+ ### Reported symptoms vs. actual ground truth
11
+
12
+ | Item | Reported | Actual (after verification) |
13
+ |---|---|---|
14
+ | session-config.json pane fields | 4 fields = null | leader=%1007, worker=%1016, verifier=%1017 (normal) |
15
+ | tmux pane lifecycle | %1014/%1015 disappeared | session `ai-blog-system-624` itself disappeared → all of its panes died with it |
16
+ | process state | alive | runner pid 83304 alive (only the tmux session was dropped) |
17
+
18
+ **The real problem**: the tmux session's lifetime is tied to the lifetime of the wrapper terminal / claude-code session, so when that is closed externally the session dies and every pane id becomes stale.
19
+
20
+ ### Verification evidence (at time of writing)
21
+
22
+ ```
23
+ $ tmux ls | grep ai-blog
24
+ ai-blog-system-625 (ai-blog-system-624 absent)
25
+
26
+ $ cat .../blog-v31-flywheel-telemetry/runtime/session-config.json | jq .panes
27
+ { "leader": "%1007", "worker": "%1016", "verifier": "%1017" }
28
+ ```
29
+
30
+ → The pane ids were valid when written, but the session has since disappeared, so the panes are dead too.
31
+
32
+ ## Root cause (revised)
33
+
34
+ **Insufficient decoupling of session lifecycle ↔ wrapper lifecycle**:
35
+
36
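+ 1. **H1 (confirmed)**: the runner creates a detached session with `tmux new-session -d -s "$SESSION_NAME"`. However, when the wrapper is spawned via nohup, closing the wrapper's own terminal also disconnects the child tmux client, and once the attached-client count drops to 0 the session is garbage-collected in some environments (especially on a tmux server restart or a manual kill by the user).
37
+ 2. **H2 (confirmed)**: a duplicate wrapper spawn race (96581 + 83265) lets two wrappers enter different missions on the same desk. Cleanup by one side affects the other side's session.
38
+ 3. **H3 (unlikely, discarded)**: a race at pane-id capture time. File verification shows the pane ids are valid, so the capture itself succeeded.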
39
+
40
+ → H1 + H2 are the main culprits. The reported H3 (capture race) contradicts the ground truth and is discarded.
41
+
42
+ ## RALPLAN-DR
43
+
44
+ **Principles**:
45
+ 1. **Fail loud, not silent**: raise an explicit alert when a session/pane dies (detected just before entering the next iteration)
46
+ 2. **Defense-in-depth**: block H1 and H2 together (a single fix is not enough)
47
+ 3. **Backward-compat**: existing single-mission interactive operation stays unchanged
48
+ 4. **Self-verification mechanical**: invoke the changed code directly, plus an anti-tautology grep
49
+
50
+ **Decision Drivers**:
51
+ 1. The wrapper / user notices immediately even when the session disappears due to external causes
52
+ 2. On a duplicate wrapper spawn, the second mover is explicitly rejected
53
+ 3. A written session-config can be distinguished as "live" or "stale"
54
+
55
+ **Viable Options**:
56
+
57
+ - **A (adopted)**: 3-pronged, absorbing the Architect ITERATE feedback:
58
+ - **R12: Pane lifecycle monitor**. Three checkpoints: (a) right after `create_session()`, (b) just before each iteration of the main loop, (c) after every worker/verifier `send-keys`, just before entering the wait-loop. Each one checks `#{pane_dead}` for every pane and `has-session` for the session. On finding a dead one, report BLOCKED immediately with `reason_category=infra_failure` + recoverable=true + suggested_action=restart. **Single authoritative timeout: 5s total, i.e. poll 5 times at 1-second intervals, then fail (resolves the inconsistency flagged by the Critic)**.
59
+ - **R13: Detached session protection**. When RLP_BACKGROUND=1, apply `tmux set -t "$SESSION_NAME" destroy-unattached off` so the session survives even with 0 attached clients. Explicitly check the `tmux new-session` exit code; on failure, retry once with a dedicated new name (`${SESSION_NAME}-bg-$(date +%s)`). **NEW-3: SESSION_NAME already contains the SLUG, so no duplicate suffix is added**.
60
+ - **R14: Project-scoped runner lockfile**. `RUNNER_LOCKFILE_PATH="$DESK/logs/.rlp-desk-runner-$(echo "$ROOT" | shasum | cut -c1-8).lock"`. Blocks duplicate runner spawns from the same project root while allowing concurrent runners from other projects. On a stale pid (`kill -0` fails), refresh the lock and log a notice.
61
+ - B: R14 only (on the premise that blocking the race is enough). H1 (session GC) remains → discarded.
62
+ - C: skip background mode. Breaks the wrapper API → discarded.
63
+
64
+ **Pre-implementation gate (NEW-4)**: before adopting this plan, the ground-truth verification above (the actual session-config.json file + `tmux ls` output) was completed. The real defect is confirmed along two axes: session death and the missing lockfile.
65
+
66
+ ## Resolution plan
67
+
68
+ ### Fix R12: Pane lifecycle monitor + bounded retry
69
+
70
+ **Target**: new helpers in `src/scripts/lib_ralph_desk.zsh` + the main loop entry point in `src/scripts/run_ralph_desk.zsh`
71
+
72
+ **Changes**:
73
+ 1. New in `lib_ralph_desk.zsh`:
74
+ ```zsh
75
+ _verify_pane_alive() {
76
+ local pane_id="$1"
77
+ [[ -z "$pane_id" ]] && return 1
78
+ local dead
79
+ dead=$(tmux display-message -p -t "$pane_id" '#{pane_dead}' 2>/dev/null)
80
+ [[ "$dead" == "0" ]]
81
+ }
82
+ _verify_session_alive() {
83
+ local session="$1"
84
+ [[ -z "$session" ]] && return 1
85
+ tmux has-session -t "$session" 2>/dev/null
86
+ }
87
+ ```
88
+ 2. `run_ralph_desk.zsh` calls the helpers at the 3 checkpoints:
89
+ ```zsh
90
+ _r12_check_lifecycle() {
91
+ local site="$1" # "create" | "iter_start" | "post_send"
92
+ local _attempts=0
93
+ while ! _verify_session_alive "$SESSION_NAME" || \
94
+ ! _verify_pane_alive "$LEADER_PANE" || \
95
+ ! _verify_pane_alive "$WORKER_PANE" || \
96
+ ! _verify_pane_alive "$VERIFIER_PANE"; do
97
+ (( _attempts += 1 ))  # unlike _attempts++, the result is never 0, so the status stays zero under errexit
98
+ if (( _attempts >= 5 )); then
99
+ log_error "[r12:$site] tmux session/pane dead after 5×1s polling (5s total budget). session=$SESSION_NAME panes leader=$LEADER_PANE worker=$WORKER_PANE verifier=$VERIFIER_PANE"
100
+ tmux list-panes -a -F '#{session_name}:#{pane_id} dead=#{pane_dead}' 2>&1 | head -20 >> "$DEBUG_LOG"
101
+ write_blocked_sentinel "tmux session/pane dead during $site" "${CURRENT_US:-ALL}" "infra_failure"
102
+ exit 1
103
+ fi
104
+ sleep 1
105
+ done
106
+ }
107
+ ```
108
+ Call sites: the end of `create_session`, just before main loop entry, and after every `paste_to_pane`/`send-keys`, before the wait-loop starts.
109
+ 3. **Single authoritative timeout: 5s total** (5 × 1s polling); remove every other "3 retries"/"4s" wording.
110
+
111
+ **Verification (us024)**:
112
+ - AC1: the `_verify_pane_alive` + `_verify_session_alive` helpers are defined
113
+ - AC2: callers invoke the helpers at the 3 checkpoints (create_session, main loop iteration entry, post-send-keys)
114
+ - AC3: behavioural: a dead pane id fixture → exit 1 with an `infra_failure` sentinel
115
+ - AC4 (Critic): mid-iter pane kill fixture: kill the worker pane externally right after send-keys → on the next wait-loop entry, R12 reports BLOCKED with `reason_category=infra_failure` within 5s
116
+
117
+ ### Fix R13: Detached session protection + new-session exit-code verify
118
+
119
+ **Target**: `src/scripts/run_ralph_desk.zsh:744` `create_session()`
120
+
121
+ **Changes**:
122
+ 1. Check `$?` immediately after running `tmux new-session -d -s "$SESSION_NAME"`:
123
+ ```zsh
124
+ if ! tmux new-session -d -s "$SESSION_NAME" -x 200 -y 50 -c "$ROOT" 2>/dev/null; then
125
+ if tmux has-session -t "$SESSION_NAME" 2>/dev/null; then
126
+ if [[ "${RLP_BACKGROUND:-0}" == "1" ]]; then
127
+ # daemon mode: avoid name collisions (Critic NEW-3: strengthened with epoch + pid + 4-digit rand)
128
+ SESSION_NAME="${SESSION_NAME}-bg-$(date +%s)-$$"
129
+ while tmux has-session -t "$SESSION_NAME" 2>/dev/null; do
130
+ SESSION_NAME="${SESSION_NAME}-$(awk 'BEGIN{srand();print int(1000+rand()*9000)}')"
131
+ done
132
+ tmux new-session -d -s "$SESSION_NAME" -x 200 -y 50 -c "$ROOT" || die "tmux new-session retry failed: $SESSION_NAME"
133
+ fi
134
+ else
135
+ die "tmux new-session failed and session does not exist: $SESSION_NAME"
136
+ fi
137
+ fi
138
+ ```
139
+ 2. When RLP_BACKGROUND=1, call `tmux set-option -t "$SESSION_NAME" destroy-unattached off` immediately for every new or re-created session, so the session survives even with 0 attached clients.
140
+ **Documented limitation (Critic R13)**: this option is best-effort. **It does not protect against a manual `tmux kill-session` or a tmux server restart**. If either happens the session is gone, and R12 (lifecycle monitor) reports BLOCKED at the next checkpoint.
141
+
142
+ **Verification (us025)**:
143
+ - AC1: on `tmux new-session` failure, retry once with a dedicated name (RLP_BACKGROUND only)
144
+ - AC2: grep for the `destroy-unattached off` call when RLP_BACKGROUND=1
145
+ - AC3: when SESSION_NAME changes, session_name in session-config reflects the final name
146
+
147
+ ### Fix R14: Project-scoped runner lockfile
148
+
149
+ **Target**: around `src/scripts/run_ralph_desk.zsh:231` (LOCKFILE_PATH definition)
150
+
151
+ **Changes**:
152
+ 1. New variables, with a shasum fallback chain (Critic R14 portability):
153
+ ```zsh
154
+ ROOT_HASH=$(printf '%s' "$ROOT" | { shasum 2>/dev/null || sha1sum 2>/dev/null || cksum; } | awk '{print substr($1,1,8)}')
155
+ RUNNER_LOCKFILE_PATH="$DESK/logs/.rlp-desk-runner-$ROOT_HASH.lock"
156
+ RUNNER_LOCKDIR="${RUNNER_LOCKFILE_PATH}.d"
157
+ ```
158
+ 2. Keep the existing `LOCKFILE_PATH` (per-SLUG) as is; it blocks concurrent same-slug runs
159
+ 3. **mkdir atomic lock pattern (Critic R14 race fix)**: closes the check-then-write race:
160
+ ```zsh
161
+ if ! mkdir "$RUNNER_LOCKDIR" 2>/dev/null; then
162
+ existing=$(jq -r '.pid' "$RUNNER_LOCKFILE_PATH" 2>/dev/null || echo 0)
163
+ existing_slug=$(jq -r '.slug // "unknown"' "$RUNNER_LOCKFILE_PATH" 2>/dev/null || echo unknown)
164
+ if [[ "$existing" -gt 0 ]] && kill -0 "$existing" 2>/dev/null; then
165
+ log_error "duplicate rlp-desk runner detected on this project root. existing pid=$existing slug=$existing_slug, this attempt slug=$SLUG. exiting."
166
+ echo " Recover with: rm -rf '$RUNNER_LOCKDIR' '$RUNNER_LOCKFILE_PATH' (after confirming pid $existing is not active)" >&2
167
+ exit 1
168
+ fi
169
+ # stale: another wrapper may already be cleaning up the stale lock, so retry the atomic mkdir
170
+ rm -rf "$RUNNER_LOCKDIR"
171
+ mkdir "$RUNNER_LOCKDIR" 2>/dev/null || {
172
+ log_error "failed to acquire runner lock after stale cleanup; another wrapper raced ahead. exit 1"
173
+ exit 1
174
+ }
175
+ log " stale runner lockfile cleaned (pid $existing dead) — acquired"
176
+ fi
177
+ printf '{"pid":%s,"slug":"%s","root":"%s","started_at":"%s"}\n' \
178
+ "$$" "$SLUG" "$ROOT" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$RUNNER_LOCKFILE_PATH"
179
+ ```
180
+ 4. In the cleanup trap, check own_slug and then rm both `RUNNER_LOCKDIR` and `RUNNER_LOCKFILE_PATH`:
181
+ ```zsh
182
+ if [[ -f "$RUNNER_LOCKFILE_PATH" ]]; then
183
+ own_slug=$(jq -r '.slug' "$RUNNER_LOCKFILE_PATH" 2>/dev/null)
184
+ [[ "$own_slug" == "$SLUG" ]] && rm -rf "$RUNNER_LOCKDIR" "$RUNNER_LOCKFILE_PATH"
185
+ fi
186
+ ```
187
+
188
+ **Verification (us026)**:
189
+ - AC1: the `RUNNER_LOCKFILE_PATH` variable is defined and uses the project root hash
190
+ - AC2: an alive duplicate runner on the same root → exit 1 + explicit message
191
+ - AC3: on a stale pid, the lockfile is refreshed (no exit)
192
+ - AC4: concurrent runners on a different root (different hash) are allowed (multi-project parallelism preserved)
193
+ - AC5: the cleanup trap deletes only when own_slug matches
194
+
195
+ ### Self-verification scenario (mechanical, real fixture)
196
+
197
+ `tests/test_self_verification_0_11_1.sh`: **grep-only checks are forbidden (Critic Self-V)**. Each function:
198
+ 1. Builds a temporary desk fixture (mktemp dir + plans/PRD + memos/)
199
+ 2. Invokes the real helper directly (zsh -c source) or enters a mini runner
200
+ 3. Verifies the concrete process exit code + generated files / log lines
201
+ 4. Uses an anti-tautology grep as a secondary check, not the primary one
202
+
203
+ ```bash
204
+ test_r12_pane_dead_blocks() {
205
+ # 1) call _verify_pane_alive with a fake dead pane id
206
+ # 2) create a live session with tmux new-session, then kill it on purpose
207
+ # 3) check that the helper returns false and that the caller exits 1 and writes the sentinel
208
+ zsh -c "source $LIB; _verify_pane_alive '%99999'" && fail "expected dead detection"
209
+ # ... real fixture run + assert sentinel.md exists with reason_category=infra_failure
210
+ }
211
+ test_r13_session_disambiguation() {
212
+ # 1) tmux new-session -d -s "test-session-fixture" is alive
213
+ # 2) enter a create_session-like path with RLP_BACKGROUND=1 + SESSION_NAME="test-session-fixture"
214
+ # 3) check that the newly created session name is ${name}-bg-... and that it is alive
215
+ }
216
+ test_r14_lockfile_duplicate_reject() {
217
+ # 1) mkdir RUNNER_LOCKDIR
218
+ # 2) write an alive pid to the ${LOCK}/pid file (backgrounded with sleep &)
219
+ # 3) attempt a second mkdir → verify exit 1 + "duplicate" on stderr
220
+ }
221
+ test_r14_lockfile_other_root_allowed() {
222
+ # 1) a lockfile for ROOT=/tmp/r1 exists
223
+ # 2) the hash for ROOT=/tmp/r2 differs → the second mkdir succeeds
224
+ }
225
+ ```
226
+ At the end of each function: (a) verify the exit code, (b) confirm that the generated sentinel/log files exist, (c) use grep as secondary proof that the patched function was called.
227
+
228
+ ## Files to change
229
+
230
+ ```
231
+ src/scripts/run_ralph_desk.zsh # R12 caller, R13 create_session guard, R14 lockfile
232
+ src/scripts/lib_ralph_desk.zsh # R12 _verify_pane_alive, _verify_session_alive
233
+ src/governance.md # new §7h "Tmux session lifecycle" (next to the §7e lane section)
234
+ tests/test_us024_pane_lifecycle.sh
235
+ tests/test_us025_session_disambiguation.sh
236
+ tests/test_us026_runner_lockfile.sh
237
+ tests/test_self_verification_0_11_1.sh
238
+ ```
239
+
240
+ ## Verification
241
+
242
+ 1. **LOW** — `zsh -n`, `node --check` (~10s)
243
+ 2. **MEDIUM**: new us024–026 tests (~30s)
244
+ 3. **CRITICAL**: no regressions in us017–023 + us012–016 + us001/us007 (~3min)
245
+ 4. **Self-verification mapping**: 4 functions, mechanical anti-tautology
246
+
247
+ ## ADR
248
+
249
+ - **Decision**: R12 (pane/session monitor + bounded retry) + R13 (detached session protection + new-session verify) + R14 (project-root-hashed lockfile). The bug report's null-field claim contradicts the ground truth and is discarded; the focus is on the real defects (session lifecycle GC + duplicate wrapper).
250
+ - **Drivers**: restore visual feedback, make duplicate wrappers safe, preserve multi-project parallelism.
251
+ - **Alternatives considered**: R14 only (H1 remains), skip background (API breaking), `--isolated-session` flag (over-engineering).
252
+ - **Consequences**:
253
+ - No impact on existing single-mission interactive use (the R13 dedicated-name retry is RLP_BACKGROUND only)
254
+ - On a duplicate wrapper the second mover is explicitly blocked; the lockfile can be recovered with a user command
255
+ - Each checkpoint adds at most 5s (single authoritative budget: 5×1s polling). The best case is 0s (alive on the first attempt).
256
+ - Concurrent runners in other projects keep working thanks to hash separation
257
+ - **Follow-ups**:
258
+ - tmux pane lifecycle dashboard
259
+ - mission-level pane isolation option (`--isolated-session`)
260
+ - bug-report contract: from now on the consumer attaches evidence files (the actual session-config.json + `tmux ls` output)
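For readers more familiar with the Node side of the package, the R14 mkdir-based atomic lock in the plan above can be sketched in JavaScript. The plan itself targets zsh, so this is only an illustrative translation under assumptions: the function names, the return shape, and the JSON fields written to the lockfile are hypothetical.

```javascript
const fs = require('node:fs');

// True if a process with this pid exists (signal 0 only probes, never kills).
function isAlive(pid) {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return err.code === 'EPERM'; // exists, but owned by another user
  }
}

// Hypothetical sketch: mkdir is atomic, so only one runner can create lockDir.
// The parent directory of lockDir is assumed to exist already.
function acquireRunnerLock(lockDir, lockFile, slug) {
  try {
    fs.mkdirSync(lockDir);
  } catch (err) {
    if (err.code !== 'EEXIST') throw err;
    let existing = 0;
    try {
      existing = JSON.parse(fs.readFileSync(lockFile, 'utf8')).pid;
    } catch {
      existing = 0; // missing or unreadable metadata is treated as stale
    }
    if (isAlive(existing)) return { acquired: false, holderPid: existing };
    // Stale holder: clean up, then retry the atomic mkdir exactly once.
    fs.rmSync(lockDir, { recursive: true, force: true });
    try {
      fs.mkdirSync(lockDir);
    } catch {
      return { acquired: false, holderPid: null }; // another runner raced ahead
    }
  }
  const meta = { pid: process.pid, slug, started_at: new Date().toISOString() };
  fs.writeFileSync(lockFile, JSON.stringify(meta) + '\n');
  return { acquired: true, holderPid: process.pid };
}
```

Two caveats apply to this sketch, and plausibly to any lock of this shape: there is a short window between creating the directory and writing the metadata during which a second runner would read no pid and treat the lock as stale, and a recycled pid can make a dead holder look alive. Writing the metadata with `JSON.stringify` also sidesteps quoting problems if a slug or path contains a double quote.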