@ai-dev-methodologies/rlp-desk 0.9.2 → 0.10.0

package/README.md CHANGED
@@ -244,8 +244,8 @@ When all US pass individually, the final ALL verify runs **sequentially per-US**
244
244
  | `--verifier-model MODEL` | sonnet | per-US verification model (lighter) |
245
245
  | `--final-verifier-model MODEL` | opus | final ALL verification model (stricter) |
246
246
  | `--consensus off\|all\|final-only` | off | Cross-engine consensus scope |
247
- | `--consensus-model MODEL` | gpt-5.4:medium | per-US cross-verifier (lighter) |
248
- | `--final-consensus-model MODEL` | gpt-5.4:high | final cross-verifier (stricter) |
247
+ | `--consensus-model MODEL` | gpt-5.5:medium | per-US cross-verifier (lighter) |
248
+ | `--final-consensus-model MODEL` | gpt-5.5:high | final cross-verifier (stricter) |
249
249
  | `--verify-mode per-us\|batch` | per-us | per-us: verify each US → final ALL |
250
250
  | `--cb-threshold N` | 6 | Consecutive failures → BLOCKED |
251
251
  | `--max-iter N` | 100 | Max iterations → TIMEOUT |
@@ -267,7 +267,7 @@ When `--consensus` is enabled, a second cross-engine verifier runs alongside eac
267
267
  After `brainstorm`, `init` detects your environment and presents run command presets:
268
268
 
269
269
  - **Codex detected (GPT Pro / spark)** → recommends cross-engine mode (`--worker-model spark:high --consensus final-only`)
270
- - **Codex detected (large PRD, AC > 15)** → offers gpt-5.4 preset (`--worker-model gpt-5.4:high --consensus final-only`)
270
+ - **Codex detected (large PRD, AC > 15)** → offers gpt-5.5 preset (`--worker-model gpt-5.5:high --consensus final-only`)
271
271
  - **Claude-only** → defaults to `--debug` with haiku worker and opus final verifier
272
272
  - **Basic** → minimal flags for quick iteration
273
273
 
@@ -363,7 +363,7 @@ npm install -g @openai/codex
363
363
  /rlp-desk run calculator --worker-model spark:high
364
364
 
365
365
  # Customize model and reasoning effort
366
- /rlp-desk run calculator --worker-model gpt-5.4:high
366
+ /rlp-desk run calculator --worker-model gpt-5.5:high
367
367
 
368
368
  # Cross-engine: codex worker, claude verifier (recommended)
369
369
  /rlp-desk run calculator --worker-model spark:high --consensus final-only --debug
@@ -406,7 +406,7 @@ By default, Worker and Verifier stop and ask for human input when they encounter
406
406
  **`--autonomous`** enables fully unattended campaigns:
407
407
 
408
408
  ```bash
409
- /rlp-desk run my-feature --mode tmux --worker-model gpt-5.4:medium --autonomous --debug
409
+ /rlp-desk run my-feature --mode tmux --worker-model gpt-5.5:medium --autonomous --debug
410
410
  ```
411
411
 
412
412
  ### How it works
@@ -0,0 +1,352 @@
1
+ # Blueprint: Flywheel Enhancement
2
+
3
+ > Status: TODO — not yet implemented. Document for future development.
4
+ > Codex-reviewed: 2026-04-13 (pre-implementation design review)
5
+
6
+ ## Summary
7
+
8
+ Three enhancements to the flywheel direction-review step, unified in one blueprint:
9
+
10
+ 1. **Flywheel Guard** — independent verification of flywheel decisions before Worker acts on them
11
+ 2. **CEO Pattern Internalization** — selective additions from plan-ceo-review framework
12
+ 3. **tmux Shell Leader Defer** — Node.js campaign-main-loop.mjs only; zsh run script deferred
13
+
14
+ ## Problem
15
+
16
+ ### Flywheel makes bad direction decisions
17
+
18
+ In the `surge-v3-exit-strategy` campaign, the flywheel ran 3 times. The first two runs made flawed decisions that wasted iterations; the third was correct, but only after the user manually caught both earlier errors:
19
+
20
+ | Flywheel | Decision | Failure | Root Cause |
21
+ |----------|----------|---------|------------|
22
+ | 1st | peak_pct segmentation | look-ahead bias — peak_pct is post-hoc, not available at decision time | No feasibility check on proposed features |
23
+ | 2nd | fixed_tp_5pct by median | median ignores large outliers (4.7x PnL difference invisible) | No metric alignment check against PRD intent |
24
+ | 3rd | breakeven by mean PnL | Correct — but only after user manually caught both prior errors | 2 iterations wasted |
25
+
26
+ The flywheel prompt already has premise challenge, forced alternatives, and 10 cognitive patterns. But it lacks:
27
+ - **Feasibility validation** — can the proposed direction actually be deployed?
28
+ - **Metric scrutiny** — is the optimization metric the right proxy for the real goal?
29
+ - **Independent review** — self-audit by the same agent is structurally weak
30
+
31
+ ### tmux leader gap
32
+
33
+ `campaign-main-loop.mjs` handles flywheel dispatch for both agent and tmux modes via Node.js. `run_ralph_desk.zsh` has no flywheel logic. This is intentional — see §3.
34
+
35
+ ## Design
36
+
37
+ ### 1. Flywheel Guard
38
+
39
+ #### CLI Flags
40
+
41
+ ```
42
+ --flywheel-guard off|on (default: off)
43
+ --flywheel-guard-model MODEL (default: opus)
44
+ ```
45
+
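A minimal sketch of how the two proposed flags could be parsed, assuming a plain `argv` array. The function name `parseGuardFlags` and the option keys are hypothetical and not taken from `run.mjs`; only the flag names, allowed values, and defaults come from the blueprint.

```javascript
// Hypothetical parser for the two proposed flags; the real run.mjs may differ.
function parseGuardFlags(argv) {
  // Defaults as stated in the blueprint: guard off, guard model opus.
  const opts = { flywheelGuard: "off", flywheelGuardModel: "opus" };
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === "--flywheel-guard") {
      const value = argv[++i];
      if (value !== "off" && value !== "on") {
        throw new Error(`--flywheel-guard expects off|on, got ${value}`);
      }
      opts.flywheelGuard = value;
    } else if (argv[i] === "--flywheel-guard-model") {
      const value = argv[++i];
      if (!value) throw new Error("--flywheel-guard-model expects a model name");
      opts.flywheelGuardModel = value;
    }
  }
  return opts;
}
```

Rejecting unknown values early keeps a typo such as `--flywheel-guard yes` from silently leaving the guard off.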
46
+ #### Architecture: Single Independent Guard
47
+
48
+ When `--flywheel-guard on`, every flywheel execution is followed by an independent Guard agent before Worker dispatch. There is no embedded-only (self-audit) phase, because the Codex review confirmed that self-audit is structurally weak for bias detection.
49
+
50
+ ```
51
+ Verifier FAIL
52
+ → Flywheel Agent (fresh context)
53
+ Steps 0A-0F: premise challenge, alternatives, scope decision, contract rewrite
54
+ → Guard Agent (fresh context, different from flywheel)
55
+ Reads: flywheel-signal.json, flywheel-review.md, PRD, campaign memory
56
+ Checks: 4 validation items (see below)
57
+ Writes: flywheel-guard-verdict.json
58
+ → verdict:
59
+ pass → Worker executes flywheel's direction
60
+ fail → Flywheel re-runs with guard feedback injected (max 2 retries)
61
+ inconclusive → Leader escalates to user (BLOCKED with escalation report)
62
+ → 2 retries exhausted + still fail → BLOCKED
63
+ ```
64
+
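The flow above can be sketched as a small retry loop. This is an illustration only: `runFlywheel` and `runGuard` are hypothetical synchronous callbacks standing in for the real agent dispatch, and the sketch reads "max 2 retries" as one initial attempt plus two retries (the CRITICAL scenario later in this blueprint could also be read as two attempts in total, so the limit is a parameter).

```javascript
// Sketch of the guard retry loop. runFlywheel(feedback) returns a decision,
// runGuard(decision) returns { verdict, issues }. Both are hypothetical.
function runGuardedFlywheel(runFlywheel, runGuard, maxRetries = 2) {
  const history = []; // guard issues from every attempt, for the escalation report
  let feedback = null; // no guard feedback on the first flywheel run
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const decision = runFlywheel(feedback);
    const verdict = runGuard(decision);
    history.push(verdict.issues);
    if (verdict.verdict === "pass") {
      return { outcome: "dispatch-worker", decision, attempts: attempt + 1, history };
    }
    if (verdict.verdict === "inconclusive") {
      // Inconclusive is never retried: escalate to the user immediately.
      return { outcome: "blocked", reason: "inconclusive", attempts: attempt + 1, history };
    }
    feedback = verdict.issues; // fail: inject guard feedback into the next flywheel run
  }
  return { outcome: "blocked", reason: "retries-exhausted", attempts: maxRetries + 1, history };
}
```

Keeping `history` across attempts is what lets a BLOCKED report include the guard issues from every attempt rather than only the last one.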
65
+ #### Guard Validation Checks (4 items)
66
+
67
+ **Check 1: Look-ahead Bias**
68
+ List every data feature the flywheel's proposed direction depends on.
69
+ For each feature: is it available at decision time (when the system must act)?
70
+ - `available`: feature exists before the event occurs (e.g., entry time, session start price)
71
+ - `post-hoc`: feature requires future information (e.g., peak_pct, session_end)
72
+ - Any `post-hoc` feature in a deployable direction → FAIL
73
+
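Check 1 is a judgment the Guard agent makes while reading the flywheel's reasoning, but the decision rule itself is mechanical once each feature is labeled. A minimal sketch, assuming the reviewer has already assigned an `availability` label per feature; the function name and object shape are hypothetical, and the `"unclear"` label mirrors the UNCLEAR case in the prompt template.

```javascript
// Check 1 sketch: features is an array of { name, availability }, where
// availability is "available", "post-hoc", or "unclear" (assumed labels).
function checkLookAheadBias(features, deployable = true) {
  const postHoc = features.filter((f) => f.availability === "post-hoc").map((f) => f.name);
  const unclear = features.filter((f) => f.availability === "unclear").map((f) => f.name);
  if (deployable && postHoc.length > 0) {
    // Any post-hoc feature in a deployable direction fails the check.
    return { check: "look-ahead-bias", status: "fail", detail: `post-hoc features: ${postHoc.join(", ")}` };
  }
  if (unclear.length > 0) {
    return { check: "look-ahead-bias", status: "inconclusive", detail: `unclear features: ${unclear.join(", ")}` };
  }
  return { check: "look-ahead-bias", status: "pass", detail: "no blocking look-ahead features found" };
}
```

With the first flywheel failure from the Problem section, a feature list containing `peak_pct` labeled `post-hoc` would return `fail` for a deployable direction.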
74
+ **Check 2: Metric Alignment**
75
+ - State the PRD's optimization metric explicitly
76
+ - Does the flywheel's proposed direction optimize the same metric?
77
+ - Same metric → pass
78
+ - Different metric, not flagged → FAIL (silent metric switch)
79
+ - Different metric, flagged with evidence → FAIL (metric mismatch requires PRD update or user approval)
80
+ - PRD is ground truth. The guard cannot approve off-PRD metric changes autonomously.
81
+
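The Check 2 rule can be written down compactly: any metric other than the PRD's fails, and the `flagged` input only changes the explanation. The sketch below assumes both metrics have been reduced to comparable strings; the function name and the example metric names are hypothetical.

```javascript
// Check 2 sketch: the PRD metric is ground truth, so a mismatch always fails.
function checkMetricAlignment(prdMetric, flywheelMetric, flagged = false) {
  if (prdMetric === flywheelMetric) {
    return { check: "metric-alignment", status: "pass", detail: `both optimize ${prdMetric}` };
  }
  // Flagging a switch does not make it acceptable; it only changes the reason.
  const reason = flagged
    ? "metric mismatch requires PRD update or user approval"
    : "silent metric switch";
  return {
    check: "metric-alignment",
    status: "fail",
    detail: `${reason}: PRD=${prdMetric}, flywheel=${flywheelMetric}`,
  };
}
```

For example, `checkMetricAlignment("mean_pnl", "median_pnl")` returns `fail` with the "silent metric switch" reason, which corresponds to the second flywheel failure described in the Problem section.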
82
+ **Check 3: Deployability**
83
+ - Can the proposed direction's output be used in production?
84
+ - Does it require data, infrastructure, or conditions not available in the deployment environment?
85
+ - Non-deployable direction proposed as champion → FAIL
86
+ - Direction labeled "upper-bound/reference only" → pass, but Guard MUST include `"analysis_only": true` in verdict so Leader skips Worker dispatch (analysis record only, no implementation)
87
+
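A rough sketch of the Check 3 outcome, including the `analysis_only` flag. It assumes the direction has been summarized as a list of requirements plus a `referenceOnly` marker, and that the deployment environment is a list of available capabilities; all of these names are hypothetical.

```javascript
// Check 3 sketch: direction = { referenceOnly, requires: [...] },
// environment = array of capabilities available in production (assumed shape).
function checkDeployability(direction, environment) {
  if (direction.referenceOnly) {
    // Upper-bound/reference directions pass, but must not reach the Worker.
    return { check: "deployability", status: "pass", analysis_only: true, missing: [] };
  }
  const missing = direction.requires.filter((r) => !environment.includes(r));
  return {
    check: "deployability",
    status: missing.length === 0 ? "pass" : "fail",
    analysis_only: false,
    missing,
  };
}
```

Returning `analysis_only` alongside the status keeps the "pass, but skip Worker dispatch" case distinct from an ordinary pass.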
88
+ **Check 4: Repeat Pattern (same-US scoped)**
89
+ - Compare current flywheel decision to prior flywheel decisions **for the current US only**
90
+ - Same direction category (e.g., same scope decision) with same underlying approach → FAIL
91
+ - Different framing of a previously rejected direction → FAIL
92
+ - Guard MUST persist rejected flywheel directions to campaign memory's Rejected Directions section before writing the verdict file, so that re-execution cleanup of the verdict file cannot erase the record.
93
+
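The same-US scoping in Check 4 can be illustrated with a simple comparison. This sketch assumes prior decisions and rejected directions have been extracted from campaign memory into objects with `us`, `scope`, and `approach` fields, which are hypothetical. Detecting a reframed direction by exact `approach` equality is a deliberate simplification; in practice the Guard agent would judge that semantically.

```javascript
// Check 4 sketch: only records for the current US are considered.
function checkRepeatPattern(current, priorDecisions, rejectedDirections) {
  const sameUs = (d) => d.us === current.us;
  const repeated = priorDecisions
    .filter(sameUs)
    .some((d) => d.scope === current.scope && d.approach === current.approach);
  const reframed = rejectedDirections
    .filter(sameUs)
    .some((d) => d.approach === current.approach);
  if (repeated) {
    return { check: "repeat-pattern", status: "fail", detail: "same scope decision and approach as a prior flywheel" };
  }
  if (reframed) {
    return { check: "repeat-pattern", status: "fail", detail: "reframing of a previously rejected direction" };
  }
  return { check: "repeat-pattern", status: "pass", detail: "new approach for this US" };
}
```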
94
+ #### Guard Signal Protocol
95
+
96
+ ```json
97
+ {
98
+ "verdict": "pass|fail|inconclusive",
99
+ "issues": [
100
+ {
101
+ "check": "look-ahead-bias|metric-alignment|deployability|repeat-pattern",
102
+ "status": "pass|fail|inconclusive",
103
+ "detail": "specific finding",
104
+ "evidence": "file:line or data reference"
105
+ }
106
+ ],
107
+ "analysis_only": false,
108
+ "recommendation": "proceed|retry-flywheel|escalate-to-user",
109
+ "timestamp": "ISO"
110
+ }
111
+ ```
112
+
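Because the Leader acts on this file, it may be worth validating it before routing. The following is a minimal sketch of such a validator, not the package's implementation; the function name is hypothetical, and the accepted values are copied from the schema above. The `timestamp` check only requires a string that `Date.parse` accepts.

```javascript
const VERDICTS = ["pass", "fail", "inconclusive"];
const CHECKS = ["look-ahead-bias", "metric-alignment", "deployability", "repeat-pattern"];
const RECOMMENDATIONS = ["proceed", "retry-flywheel", "escalate-to-user"];

// Parses the raw verdict file contents and collects every schema violation.
function validateGuardVerdict(raw) {
  let v;
  try {
    v = JSON.parse(raw);
  } catch (e) {
    return { ok: false, errors: ["invalid JSON"] };
  }
  if (v === null || typeof v !== "object" || Array.isArray(v)) {
    return { ok: false, errors: ["verdict must be a JSON object"] };
  }
  const errors = [];
  if (!VERDICTS.includes(v.verdict)) errors.push("verdict has an unknown value");
  if (!RECOMMENDATIONS.includes(v.recommendation)) errors.push("recommendation has an unknown value");
  if (typeof v.analysis_only !== "boolean") errors.push("analysis_only must be a boolean");
  if (typeof v.timestamp !== "string" || Number.isNaN(Date.parse(v.timestamp))) {
    errors.push("timestamp must be a parseable date string");
  }
  if (!Array.isArray(v.issues)) {
    errors.push("issues must be an array");
  } else {
    v.issues.forEach((issue, i) => {
      if (issue === null || typeof issue !== "object") {
        errors.push(`issues[${i}] must be an object`);
        return;
      }
      if (!CHECKS.includes(issue.check)) errors.push(`issues[${i}].check has an unknown value`);
      if (!VERDICTS.includes(issue.status)) errors.push(`issues[${i}].status has an unknown value`);
    });
  }
  return { ok: errors.length === 0, errors, value: v };
}
```

Treating a malformed verdict as an error, rather than as an implicit pass, avoids dispatching the Worker on a file the Guard never finished writing.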
113
+ #### State Tracking
114
+
115
+ - `flywheel_guard_count`: tracked **per-US** in status.json (not per-campaign)
116
+ - Increments on each guard execution for the current US
117
+ - Resets when US changes or passes verification
118
+ - `ALL` final verification treated as its own bucket
119
+ - Guard files in cleanup: `flywheel-guard-verdict.json` added to re-execution cleanup list
120
+
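The per-US counting rules can be sketched as a small in-memory tracker. This is an assumption-laden illustration: the real state lives in status.json, the `current_us` field and method names are hypothetical, and the final ALL verification is modeled simply as a US id of `"ALL"`.

```javascript
// Sketch of per-US guard counting; persistence to status.json is omitted.
function createGuardCounter() {
  const state = { current_us: null, flywheel_guard_count: 0 };
  return {
    // Called on each guard execution; switching US starts a fresh count.
    recordGuardRun(us) {
      if (state.current_us !== us) {
        state.current_us = us;
        state.flywheel_guard_count = 0;
      }
      state.flywheel_guard_count += 1;
      return state.flywheel_guard_count;
    },
    // Called when the US passes verification; resets its count.
    recordVerifyPass(us) {
      if (state.current_us === us) state.flywheel_guard_count = 0;
    },
    snapshot() {
      return { ...state };
    },
  };
}
```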
121
+ #### Boundary Conditions
122
+
123
+ | Condition | Behavior |
124
+ |-----------|----------|
125
+ | `--flywheel off` | No flywheel, no guard. Guard flag ignored. |
126
+ | `--flywheel on-fail` + `--flywheel-guard off` | Flywheel runs without guard (current behavior). |
127
+ | `--flywheel on-fail` + `--flywheel-guard on` | Every flywheel followed by independent guard. |
128
+ | Final ALL verification fails | Flywheel + guard runs if `--flywheel on-fail`. ALL treated as separate US bucket. |
129
+ | Guard returns `inconclusive` | BLOCKED with escalation report. Leader does NOT retry. |
130
+ | Guard model same as flywheel model | Allowed but not recommended. Different model provides better independence. |
131
+ | Resume after guard BLOCKED | User must clear blocked sentinel. Guard count resets for that US. |
132
+
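The first three rows and the same-model row of this table reduce to a small gating function. The sketch below is hypothetical: `planGuard`, the option keys, and the warning strings are illustrative, and emitting warnings at all is an assumption (the table only says the flag is ignored and that a shared model is not recommended).

```javascript
// Decides whether the guard should run after a flywheel step.
// opts = { flywheel, flywheelGuard, flywheelModel, flywheelGuardModel } (assumed shape).
function planGuard(opts, flywheelRan) {
  const warnings = [];
  if (opts.flywheel === "off") {
    // No flywheel means no guard; the guard flag is ignored.
    if (opts.flywheelGuard === "on") warnings.push("--flywheel-guard ignored because --flywheel is off");
    return { runGuard: false, warnings };
  }
  if (opts.flywheelGuard !== "on" || !flywheelRan) {
    return { runGuard: false, warnings };
  }
  if (opts.flywheelModel && opts.flywheelModel === opts.flywheelGuardModel) {
    // Allowed, but a different model gives better independence.
    warnings.push("guard model matches flywheel model; independence is reduced");
  }
  return { runGuard: true, warnings };
}
```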
133
+ #### Guard Prompt Template
134
+
135
+ ```markdown
136
+ # Flywheel Guard Review
137
+
138
+ You are an independent reviewer verifying whether a flywheel direction decision is safe to execute.
139
+ You have NO prior context about this campaign. Read the files below and evaluate the decision objectively.
140
+
141
+ ## Files to Read (in order)
142
+ 1. PRD: {DESK}/plans/prd-{SLUG}.md — the ground truth for what success means
143
+ 2. Flywheel Decision: {DESK}/memos/{SLUG}-flywheel-signal.json — what the flywheel decided
144
+ 3. Flywheel Analysis: {DESK}/memos/{SLUG}-flywheel-review.md — the flywheel's reasoning
145
+ 4. Campaign Memory: {DESK}/memos/{SLUG}-memory.md — history, rejected directions, key decisions
146
+ 5. Done Claim: {DESK}/memos/{SLUG}-done-claim.json — what the Worker actually produced
147
+ 6. Verify Verdict: {DESK}/memos/{SLUG}-verify-verdict.json — why the Verifier failed it
148
+
149
+ {GUARD_FEEDBACK_SECTION}
150
+
151
+ ## Validation Checks
152
+
153
+ ### Check 1: Look-ahead Bias
154
+ List every data feature the flywheel's proposed direction depends on.
155
+ For each: "feature X — available at decision time: YES/NO/UNCLEAR"
156
+ - YES: feature is known before the event (entry time, session start price, order book state)
157
+ - NO: feature requires future information (peak price, session end, outcome)
158
+ - UNCLEAR: cannot determine from available context → mark inconclusive
159
+ If ANY feature is NO and used in a deployable strategy (not just upper-bound analysis): FAIL.
160
+
161
+ ### Check 2: Metric Alignment
162
+ 1. What metric does the PRD define as the optimization target?
163
+ 2. What metric does the flywheel's direction optimize?
164
+ 3. Are they the same?
165
+ - Same metric → pass
166
+ - Different metric, not flagged → FAIL (silent metric switch)
167
+ - Different metric, flagged with evidence → FAIL with recommendation: "metric mismatch requires PRD update or user approval before proceeding"
168
+ PRD is ground truth. The guard cannot approve off-PRD metric changes autonomously.
169
+
170
+ ### Check 3: Deployability
171
+ Can the proposed direction's output be used in production as-is?
172
+ - Requires post-hoc data → FAIL
173
+ - Requires infrastructure not mentioned in PRD → FAIL
174
+ - Labeled as "upper-bound only" or "reference" → pass, but you MUST include `"analysis_only": true` in your verdict so Leader skips Worker dispatch (no implementation, analysis record only)
175
+
176
+ ### Check 4: Repeat Pattern (same-US scoped)
177
+ Compare to prior flywheel decisions **for the current US only** in campaign memory's Key Decisions section.
178
+ - Same scope decision + same underlying approach as a prior flywheel for this US → FAIL
179
+ - Reframing of a previously rejected direction (check Rejected Directions) → FAIL
180
+ - Genuinely new approach → pass
181
+ Before writing your verdict, you MUST append any rejected flywheel direction to campaign memory's Rejected Directions section. This persists the record before cleanup can erase it.
182
+
183
+ ## Output
184
+ Write verdict to: {DESK}/memos/{SLUG}-flywheel-guard-verdict.json
185
+
186
+ Use this format:
187
+ {
188
+ "verdict": "pass|fail|inconclusive",
189
+ "issues": [...],
190
+ "analysis_only": false,
191
+ "recommendation": "proceed|retry-flywheel|escalate-to-user",
192
+ "timestamp": "ISO"
193
+ }
194
+
195
+ Rules:
196
+ - If ALL checks pass → verdict: pass, recommendation: proceed
197
+ - If ANY check is fail → verdict: fail, recommendation: retry-flywheel
198
+ - If ANY check is inconclusive and none are fail → verdict: inconclusive, recommendation: escalate-to-user
199
+ - Include specific evidence for every check. No "seems fine" or "probably ok."
200
+ ```
201
+
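The three aggregation rules at the end of the template have a strict precedence: any fail wins, then any inconclusive, otherwise pass. A minimal sketch of that precedence, with a hypothetical function name; it could be used to cross-check that the verdict a Guard wrote is consistent with its own per-check statuses.

```javascript
// Derives the overall verdict and recommendation from per-check statuses.
function aggregateVerdict(issues) {
  const statuses = issues.map((issue) => issue.status);
  if (statuses.includes("fail")) {
    return { verdict: "fail", recommendation: "retry-flywheel" };
  }
  if (statuses.includes("inconclusive")) {
    return { verdict: "inconclusive", recommendation: "escalate-to-user" };
  }
  return { verdict: "pass", recommendation: "proceed" };
}
```

Note that a mix of `fail` and `inconclusive` resolves to `fail`, so the flywheel retries before the user is asked to step in.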
202
+ When guard fails and flywheel retries, the `{GUARD_FEEDBACK_SECTION}` is populated:
203
+
204
+ ```markdown
205
+ ## Previous Guard Feedback (MUST address these issues)
206
+ The previous flywheel decision was rejected by the Guard. Issues found:
207
+ {list of guard issues with evidence}
208
+
209
+ You MUST address each issue above. Do NOT repeat the same direction.
210
+ Check Rejected Directions in campaign memory before proposing alternatives.
211
+ ```
212
+
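One way to populate `{GUARD_FEEDBACK_SECTION}` is to render the failed issues from the previous verdict into the block above. This is a sketch under assumptions: the function name and the bullet format for each issue are hypothetical, since the blueprint only specifies "list of guard issues with evidence". On the first run, with no prior verdict, it returns an empty string.

```javascript
// Renders the feedback section from the previous guard verdict's issues.
function renderGuardFeedback(issues) {
  const failed = issues.filter((issue) => issue.status === "fail");
  if (failed.length === 0) return ""; // first run or nothing to address
  const lines = failed.map(
    (issue) => `- [${issue.check}] ${issue.detail} (evidence: ${issue.evidence})`
  );
  return [
    "## Previous Guard Feedback (MUST address these issues)",
    "The previous flywheel decision was rejected by the Guard. Issues found:",
    ...lines,
    "",
    "You MUST address each issue above. Do NOT repeat the same direction.",
    "Check Rejected Directions in campaign memory before proposing alternatives.",
  ].join("\n");
}
```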
213
+ ### 2. CEO Pattern Internalization (Selective)
214
+
215
+ From plan-ceo-review's 16+ cognitive patterns, add **2** to the flywheel prompt's existing 10 patterns:
216
+
217
+ #### Added
218
+
219
+ **11. Proxy Skepticism**
220
+ > Is the metric you're optimizing actually the right proxy for the real goal? What would change if you used a different metric? Name the proxy, name the goal, check the gap.
221
+
222
+ Why: Directly prevents the median-vs-mean failure. The flywheel optimized median gap without questioning whether median was the right proxy for total PnL.
223
+
224
+ Placement: Added to the CEO Cognitive Patterns list (items 1-10 already exist). Also referenced in Step 0D½ context (when applicable).
225
+
226
+ **12. Classification (reversibility x magnitude)**
227
+ > Rate your proposed direction change on two axes: How hard is it to reverse? How large is its impact? Hard-to-reverse + large-magnitude decisions need proportionally stronger evidence.
228
+
229
+ Why: Prevents casual PIVOT decisions on major scope changes without sufficient evidence. Lightweight — one sentence judgment per direction, not a matrix.
230
+
231
+ Placement: Added to Step 0E (Scope Decision) as a judgment criterion.
232
+
233
+ #### Not Added (with rationale)
234
+
235
+ | Pattern | Why Not |
236
+ |---------|---------|
237
+ | Wartime awareness | Mechanical (cb_threshold/2) conflicts with governance CB semantics. Flywheel already has time-value pattern (#6). |
238
+ | Temporal depth (5-10yr) | Iteration-level direction review, not strategic planning. |
239
+ | People-first sequencing | Organizational, not applicable to automated agents. |
240
+ | Hierarchy as service | Organizational. |
241
+ | Narrative coherence | Relevant for product vision, not iteration pivots. |
242
+ | Speed calibration (70% info) | Flywheel already operates on limited info by design. |
243
+ | Founder-mode bias | Human leadership pattern. |
244
+ | Willfulness as strategy | Human trait. |
245
+ | Courage accumulation | Human trait. |
246
+ | Leverage obsession | Too abstract for iteration-level use. |
247
+ | Focus as subtraction | Already covered by simplicity bias (#4). |
248
+ | Paranoid scanning | Already covered by inversion (#3). |
249
+ | Design for trust | UX-specific. |
250
+ | Edge case paranoia | Already covered by evidence > opinion (#10). |
251
+
252
+ #### Updated Flywheel Prompt Cognitive Patterns Section
253
+
254
+ ```markdown
255
+ ## CEO Cognitive Patterns (apply throughout your review)
256
+ 1. First-principles — ignore convention, start from the problem itself
257
+ 2. 10x check — can 2x effort yield 10x better result?
258
+ 3. Inversion — what must be true for this approach to fail?
259
+ 4. Simplicity bias — prefer simple over complex solutions
260
+ 5. User-back — reason backwards from end-user experience
261
+ 6. Time-value — does this direction change save 3+ iterations?
262
+ 7. Sunk cost immunity — ignore what was already invested
263
+ 8. Blast radius — assess impact scope of direction change
264
+ 9. Reversibility — prefer easily reversible decisions
265
+ 10. Evidence > opinion — judge only by this iteration's actual results
266
+ 11. Proxy skepticism — is the optimization metric the right proxy for the real goal?
267
+ 12. Classification — hard-to-reverse + large-magnitude changes need stronger evidence
268
+ ```
269
+
270
+ ### 3. tmux Shell Leader Defer
271
+
272
+ #### Current State
273
+
274
+ `campaign-main-loop.mjs` manages flywheel for both execution modes:
275
+ - **tmux mode**: creates flywheel pane, dispatches via `sendKeys`, polls `flywheel-signal.json`
276
+ - **agent mode**: (planned) Leader calls Agent() with flywheel prompt
277
+
278
+ `run_ralph_desk.zsh` has **no** flywheel logic. It manages Worker + Verifier panes only.
279
+
280
+ #### Defer Rationale
281
+
282
+ 1. Node.js `campaign-main-loop.mjs` already covers tmux mode's flywheel dispatch via pane management
283
+ 2. Duplicating the same logic in zsh creates maintenance burden with no functional gain
284
+ 3. Workstream research is evaluating whether tmux leader should migrate to Node.js entirely
285
+
286
+ #### Decision Point
287
+
288
+ | Research Conclusion | Action |
289
+ |---------------------|--------|
290
+ | Keep zsh leader | Implement flywheel logic in `run_ralph_desk.zsh` — new blueprint |
291
+ | Migrate to Node.js | This defer item is closed. `campaign-main-loop.mjs` is the single implementation. |
292
+
293
+ #### Until Then
294
+
295
+ - `run_ralph_desk.zsh` continues to operate Worker + Verifier only
296
+ - Flywheel is available through `node src/node/run.mjs run <slug> --flywheel on-fail --mode tmux`
297
+ - The Node.js runner handles all tmux pane management for flywheel
298
+
299
+ ## Implementation Scope
300
+
301
+ ### Files Changed
302
+
303
+ | File | Change |
304
+ |------|--------|
305
+ | `src/scripts/init_ralph_desk.zsh` | Flywheel prompt: add patterns #11-12. Guard prompt template (new). Guard files in cleanup list. |
306
+ | `src/node/runner/campaign-main-loop.mjs` | Guard dispatch logic, `flywheel_guard_count` per-US in status, guard verdict polling, retry-with-feedback loop, `inconclusive` → BLOCKED path. |
307
+ | `src/node/run.mjs` | `--flywheel-guard off|on`, `--flywheel-guard-model MODEL` flags. |
308
+ | `src/commands/rlp-desk.md` | Flywheel guard options documentation, guard flow description. |
309
+ | `src/governance.md` | §7 Leader Loop: flywheel guard step after flywheel, before Worker. |
310
+
311
+ ### New Files
312
+
313
+ | File | Content |
314
+ |------|---------|
315
+ | `tests/node/test-flywheel-guard.mjs` | Guard logic unit tests, verdict parsing, retry loop, per-US count tracking. |
316
+
317
+ ### Init Scaffold Additions
318
+
319
+ ```
320
+ .claude/ralph-desk/
321
+ ├── memos/
322
+ │ ├── <slug>-flywheel-guard-verdict.json (runtime; deleted on re-execution)
323
+ ```
324
+
325
+ ## Verification
326
+
327
+ ### TDD Tests
328
+ - Guard dispatch only when `--flywheel-guard on` AND flywheel ran
329
+ - Guard verdict parsing (pass/fail/inconclusive)
330
+ - Retry loop: fail → re-run flywheel with feedback → re-guard (max 2)
331
+ - inconclusive → BLOCKED (no retry)
332
+ - Per-US guard count tracking (increments, resets on US change/pass)
333
+ - ALL bucket treated separately
334
+ - Guard files in cleanup list
335
+
336
+ ### Self-Verification (5 scenarios)
337
+ - **LOW**: `--flywheel-guard off` → flywheel runs without guard (current behavior unchanged)
338
+ - **MEDIUM-1**: `--flywheel-guard on` + flywheel decision with look-ahead bias → guard catches it (Check 1 FAIL) → flywheel retries → corrected direction
339
+ - **MEDIUM-2**: `--flywheel-guard on` + flywheel silently switches optimization metric → guard catches it (Check 2 FAIL, metric mismatch requires PRD update) → escalation
340
+ - **MEDIUM-3**: `--flywheel-guard on` + flywheel proposes direction previously rejected for same US → guard catches it (Check 4 FAIL) → flywheel retries with different approach
341
+ - **CRITICAL**: Guard fails 2x → BLOCKED with escalation report including all guard issues from both attempts
342
+
343
+ ## Dependencies
344
+
345
+ - Requires `--flywheel on-fail` (guard without flywheel is meaningless)
346
+ - Works with any flywheel model and guard model combination
347
+ - Does not require gstack installation
348
+ - Does not require tmux (works in agent mode)
349
+
350
+ ## Priority
351
+
352
+ Medium — implement after flywheel has been battle-tested in more campaigns. Current workaround is user vigilance (which caught all 3 issues in surge-v3-exit-strategy). Guard formalizes that vigilance into a protocol.