@ai-dev-methodologies/rlp-desk 0.10.0 → 0.11.0

package/src/governance.md CHANGED
@@ -248,6 +248,63 @@ Verifier records WHY each judgment was made in `verify-verdict.json`:
248
248
  - Without reasoning, Verifier's verdict is an unsubstantiated judgment
249
249
  - Both are archived in `logs/<slug>/` per existing audit trail pattern
250
250
 
251
+ ### Cost Log (US-023 R11 P2-K)
252
+ `logs/<slug>/cost-log.jsonl` always has at least one entry per campaign. tmux mode runs the estimated path (no LLM SDK token counters), so when prompt/claim/verdict bytes are all zero the entry's `note` field is set to `no_actual_usage_recorded`. Audit pipelines branch on `note` to distinguish "iteration ran but tokens not captured" (tmux estimated path) from "logging broken" (file empty / writer never called). The runner registers `trap '_emit_final_cost_log; cleanup' EXIT INT TERM` so an unconditional final entry is appended even if the campaign exits via an early-return path.
253
+
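An audit pipeline could apply the `note` branch roughly as follows. This is a minimal sketch, not the package's own code: only the `note` field and the `no_actual_usage_recorded` value come from the text above, while `classifyCostLog` and its return labels are hypothetical names.

```javascript
// Sketch: classify the raw text of logs/<slug>/cost-log.jsonl.
// Assumption: the file holds one JSON object per non-empty line.
function classifyCostLog(jsonlText) {
  const entries = jsonlText
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));

  // An empty file means the writer was never called.
  if (entries.length === 0) return 'logging_broken';

  // Every entry flagged: iterations ran, but no token usage was captured.
  const allUnrecorded = entries.every(
    (entry) => entry.note === 'no_actual_usage_recorded',
  );
  return allUnrecorded ? 'usage_not_captured' : 'usage_recorded';
}
```

The three labels keep "tokens not captured" separate from "logging broken", which is the distinction the `note` field exists to make.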
254
+ ### A4 Fallback Audit (US-017 R5 P0-D)
255
+ When Worker writes done-claim.json but forgets iter-signal.json, the runner auto-generates a verify signal as A4 fallback. This produces an opaque `summary="auto-generated by A4 fallback (done-claim without signal)"` that erases debugging context.
256
+
257
+ - Each A4 fallback invocation appends a JSONL entry to `logs/<slug>/a4-fallback-audit.jsonl` (event=`a4_fallback`, iter, us_id, source).
258
+ - **Recommended ratio < 10%** of total iterations (per mission). Above this threshold, the Worker prompt mandate (Step N+1) is failing; investigate prompt clarity or the Worker model.
259
+ - Verifier sets `meta.iter_signal_quality='auto_generated'` when it detects an A4 fallback summary so audit pipelines can join the signal-quality dimension to verdicts.
260
+
261
+ ### BLOCKED Surfacing
262
+ A BLOCKED outcome MUST surface its reason on **FIVE channels at once**: (1) sentinel file (markdown `<slug>-blocked.md` + JSON sidecar `<slug>-blocked.json`), (2) status.json, (3) Leader's stderr console, (4) campaign report, (5) memory.md/latest.md hygiene update (worker mandate per US-020 R8 P1-H — `Blocking History` entry in memory.md and `Known Issues` update in latest.md before the sentinel is written). Sentinel-only is silent failure; operators (and wrappers) must see WHY without grep'ing memo files. The leader propagates `verdict.reason || verdict.summary` into the sentinel reason field, the JSON sidecar, the return object, and the campaign report. The 5th channel survives across iterations: the next worker reads memory.md before re-attempting, preventing same-block-reason loops.
263
+
264
+ When the worker writes a sentinel without performing the 5th-channel hygiene update (memory.md/latest.md mtime older than 5 minutes at sentinel-write time), the runner stamps `meta.blocked_hygiene_violated=true` on the JSON sidecar and emits an analytics event so audit pipelines can track hygiene compliance.
265
+
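The five-minute staleness check can be sketched as a pure function over timestamps. The function and parameter names are hypothetical, and the sketch assumes that a single stale file (either memory.md or latest.md) is enough to count as a violation; the text does not say whether one or both must be stale.

```javascript
const HYGIENE_WINDOW_MS = 5 * 60 * 1000; // "older than 5 minutes" per the text

// Returns true when a hygiene file was last modified more than five
// minutes before the sentinel was written.
// Assumption: one stale file is enough to flag the violation.
function isBlockedHygieneViolated({ memoryMtimeMs, latestMtimeMs, sentinelWriteMs }) {
  const isStale = (mtimeMs) => sentinelWriteMs - mtimeMs > HYGIENE_WINDOW_MS;
  return isStale(memoryMtimeMs) || isStale(latestMtimeMs);
}
```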
266
+ ### Failure Taxonomy (P1-D)
267
+ BLOCKED writes a JSON sidecar (`<slug>-blocked.json`) alongside the markdown sentinel so wrappers can `jq .reason_category` instead of regex'ing free text. Schema:
268
+
269
+ ```json
270
+ {
271
+ "schema_version": "2.0",
272
+ "slug": "<slug>",
273
+ "us_id": "<us_id or ALL>",
274
+ "blocked_at_iter": <int>,
275
+ "blocked_at_utc": "<iso8601>",
276
+ "reason_category": "metric_failure | cross_us_dep | context_limit | infra_failure | repeat_axis | mission_abort",
277
+ "reason_detail": "<full reason text>",
278
+ "failure_category": "spec | implementation | integration | flaky | null",
279
+ "recoverable": true | false,
280
+ "suggested_action": "next_mission_chain | restart | retry_after_fix | terminal_alert"
281
+ }
282
+ ```
283
+
284
+ **Wrapper contract (binding)**:
285
+ - `reason_category` is **PRIMARY** — wrappers MUST branch on this field for recovery decisions.
286
+ - `failure_category` is **SECONDARY, diagnostic only** — do NOT branch on it; logging/triage only.
287
+
288
+ **Category → wrapper recovery action mapping** (defaults set by writer; wrappers may override but should follow):
289
+ - `metric_failure` → `retry_after_fix` (fix PRD/code, retry; recoverable=true)
290
+ - `cross_us_dep` → `retry_after_fix` (move AC to later US or switch to batch mode; recoverable=true)
291
+ - `infra_failure` → `restart` (CLI/network/spawn issue; recoverable=true)
292
+ - `context_limit` → `next_mission_chain` (current mission stale; recoverable=false)
293
+ - `repeat_axis` → `next_mission_chain` (model ceiling reached on this axis; recoverable=false)
294
+ - `mission_abort` → `terminal_alert` (flywheel guard exhausted; recoverable=false)
295
+
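A wrapper that honours this contract might look like the sketch below. The defaults table is copied from the mapping above; `decideRecovery` is a hypothetical helper, and treating an unknown category as terminal is an assumption rather than documented behaviour.

```javascript
// Writer defaults, copied from the category mapping above.
const DEFAULT_RECOVERY = {
  metric_failure: { suggested_action: 'retry_after_fix', recoverable: true },
  cross_us_dep: { suggested_action: 'retry_after_fix', recoverable: true },
  infra_failure: { suggested_action: 'restart', recoverable: true },
  context_limit: { suggested_action: 'next_mission_chain', recoverable: false },
  repeat_axis: { suggested_action: 'next_mission_chain', recoverable: false },
  mission_abort: { suggested_action: 'terminal_alert', recoverable: false },
};

// Branches on reason_category only; failure_category is never consulted.
function decideRecovery(sidecar) {
  const defaults = DEFAULT_RECOVERY[sidecar.reason_category];
  if (!defaults) {
    // Assumption: unknown categories are treated as terminal.
    return { suggested_action: 'terminal_alert', recoverable: false };
  }
  // Values stamped by the writer win over the table, because the writer
  // may deliberately deviate from the defaults.
  return {
    suggested_action: sidecar.suggested_action ?? defaults.suggested_action,
    recoverable: sidecar.recoverable ?? defaults.recoverable,
  };
}
```

Falling back to the table only when the sidecar omits a field keeps the wrapper aligned with whatever the writer actually recorded.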
296
+ **Cross-US token list (cross_us_dep classifier)** — verifier verdict / worker signal text matching ANY of these is classified as `cross_us_dep`:
297
+ - English: `depends on US-`, `blocking US-`, `awaits US-`, `post-iter US-`, `requires US-N`, `cross-US`
298
+ - Korean: `US-N 산출물`, `신규 US-`, `post-iter`
299
+
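The token list translates into a simple matcher. The sketch below is an illustration, not the shipped classifier: it assumes case-sensitive substring matching, and it assumes the `N` in `requires US-N` and `US-N 산출물` stands for one or more digits.

```javascript
// Literal tokens from the list above (case-sensitive match is an assumption).
const CROSS_US_LITERALS = [
  'depends on US-',
  'blocking US-',
  'awaits US-',
  'post-iter US-',
  'cross-US',
  '신규 US-',
  'post-iter',
];

// Assumption: "N" is a placeholder for a numeric US identifier.
const CROSS_US_PATTERNS = [/requires US-\d+/, /US-\d+ 산출물/];

function isCrossUsDep(text) {
  return (
    CROSS_US_LITERALS.some((token) => text.includes(token)) ||
    CROSS_US_PATTERNS.some((pattern) => pattern.test(text))
  );
}
```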
300
+ **Write Order Contract (atomicity invariant)**:
301
+ 1. JSON sidecar written FIRST (`fs.writeFile` / `atomic_write`).
302
+ 2. markdown sentinel written SECOND.
303
+ 3. Invariant: **markdown exists ⇒ JSON exists** (writer enforces order).
304
+ 4. Wrappers SHOULD watch markdown sentinel, then read JSON sidecar. If JSON not yet visible (rare), retry up to 5 × 50ms before failing.
305
+
306
+ `atomic_write` provides per-file rename atomicity; cross-file ordering is enforced by the explicit two-call sequence.
307
+
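On the wrapper side, the retry in step 4 could be sketched like this. `readSidecarWithRetry`, `readSidecar`, and `sleep` are hypothetical names; the reader is injected so the helper does not depend on a particular file API. The sketch reads "5 × 50ms" as five attempts in total, which is an interpretation.

```javascript
// Sketch: retry reading the JSON sidecar, 50 ms apart, before giving up.
// Assumption: five attempts total, sleeping only between attempts.
async function readSidecarWithRetry(readSidecar, {
  attempts = 5,
  delayMs = 50,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt += 1) {
    try {
      return await readSidecar();
    } catch (error) {
      lastError = error; // sidecar not visible yet, or unreadable
      if (attempt < attempts) await sleep(delayMs);
    }
  }
  throw lastError;
}
```

A caller would typically pass a function that reads and parses `<slug>-blocked.json` once the markdown sentinel has been observed.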
251
308
  ## 2. Roles
252
309
 
253
310
  ### Leader (current session)
@@ -486,6 +543,7 @@ for iteration in 1..max_iter:
486
543
  ⑥½ Flywheel direction review (when --flywheel on-fail and consecutive_failures > 0)
487
544
  - Dispatch Flywheel agent (fresh context, --flywheel-model)
488
545
  - Read flywheel-signal.json for direction decision (hold/pivot/reduce/expand)
546
+ - Optional `next_mission_candidate` field (string | null): when present, the leader propagates it to status.json so consumer wrappers can chain the next mission without code edits. See docs/multi-mission-orchestration.md.
489
547
  - If --flywheel-guard on:
490
548
  - Dispatch Guard agent (fresh context, --flywheel-guard-model)
491
549
  - Read flywheel-guard-verdict.json:
@@ -544,6 +602,10 @@ Worker completes US-001 → signal verify (us_id: "US-001")
544
602
 
545
603
  **Batch mode** (`--verify-mode batch`) preserves legacy behavior: Worker signals `verify` only after all work is done, and the Verifier checks all AC at once.
546
604
 
605
+ **Cross-US dependency rule (per-us only):** In per-us mode each AC must reference only artifacts from the same US or from an earlier, already verified US. Future-US references (e.g. "post-iter US-(N+M) batch", "new US-(M) artifact") make the AC unsatisfiable inside a single per-us iteration and are rejected at init time (`init_ralph_desk.zsh` exits 2). Fold cross-US verification into the last measurement US, or run with `--verify-mode batch`.
606
+
607
+ **Cross-mission us_id leak prevention (US-022 R10 P2-J):** When the same `$DESK` directory hosts back-to-back missions, an `iter-signal.json` left over from the prior mission can carry a `us_id` (e.g. `US-005`) that has no corresponding section in the new mission's PRD (`US-001` through `US-003`). The runner would then scope-lock the next iteration to a non-existent US and block. `init_ralph_desk.zsh` runs `_quarantine_stale_signal` (lib_ralph_desk.zsh), which moves any signal whose `us_id` is absent from the new mission's PRD into `.sisyphus/quarantine/iter-signal.<epoch>.json` instead of `rm`-ing it. The PRD US-list extractor `_extract_prd_us_list` recognises three heading variants (`## US-005:`, `## US-005 -`, bare `## US-005`) so legitimate references are not false-flagged. The quarantine file is preserved so the operator can recover it when the apparent leak was actually intentional handoff state.
608
+
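The heading extraction and the quarantine decision can be illustrated in JavaScript. This is a re-expression of the idea under assumptions, not the zsh implementation in lib_ralph_desk.zsh: the regular expression, the function names, and the choice to leave signals with no `us_id` (or with `ALL`) untouched are all hypothetical.

```javascript
// Matches the three heading variants: `## US-005:`, `## US-005 -`, `## US-005`.
const US_HEADING = /^## (US-\d+)(?::| -|\s*$)/;

function extractPrdUsList(prdMarkdown) {
  const ids = [];
  for (const line of prdMarkdown.split('\n')) {
    const match = US_HEADING.exec(line);
    if (match) ids.push(match[1]);
  }
  return ids;
}

// True when the leftover signal points at a US the new PRD does not define.
function shouldQuarantine(signal, prdMarkdown) {
  // Assumption: signals without a us_id, or scoped to ALL, are never stale.
  if (!signal.us_id || signal.us_id === 'ALL') return false;
  return !extractPrdUsList(prdMarkdown).includes(signal.us_id);
}
```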
547
609
  ## 7b. Cross-Engine Consensus Verification
548
610
 
549
611
  Controlled by `--consensus off|all|final-only` (default: `off`).
@@ -625,6 +687,93 @@ If `cb_threshold` or more consecutive fix attempts fail for the same US:
625
687
 
626
688
  In tmux mode: Leader writes `<slug>-escalation.md` with the report and sets BLOCKED sentinel with reason "architecture-escalation."
627
689
 
690
+ ## 7e. Lane Enforcement (P1-E)
691
+
692
+ Default mode is **WARN-only** (`LANE_MODE=warn`). The opt-in `--lane-strict`
693
+ flag (or `LANE_MODE=strict`) escalates lane violations to BLOCKED, but the
694
+ escalation is **downgraded** to `recoverable=true` + `suggested_action=retry_after_fix`
695
+ (NOT `terminal_alert`) so an inaccurate mtime audit does not terminally
696
+ kill a campaign.
697
+
698
+ ### Decision tree
699
+
700
+ | Detection | Default (`warn`) | `--lane-strict` |
701
+ |-----------|-----------------|-----------------|
702
+ | PRD / test-spec / memory mtime changed during a worker iteration | analytics event `event_type=lane_violation_warning` + `log_warn` + audit log entry. Loop continues. | All of the WARN actions PLUS sentinel BLOCKED with `reason_category=infra_failure`, `recoverable=true`, `suggested_action=retry_after_fix`. |
703
+
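The decision table reduces to a small function. The values below are taken from the table; the function name and the shape of the returned object are hypothetical.

```javascript
// Sketch of the lane-violation outcome for a given LANE_MODE.
function laneViolationOutcome(laneMode) {
  const outcome = {
    analyticsEvent: 'lane_violation_warning',
    logWarn: true,
    auditLogEntry: true,
    blocked: null, // warn mode: the loop continues
  };
  if (laneMode === 'strict') {
    // Strict mode adds a BLOCKED sentinel, downgraded to a recoverable one.
    outcome.blocked = {
      reason_category: 'infra_failure',
      recoverable: true,
      suggested_action: 'retry_after_fix',
    };
  }
  return outcome;
}
```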
704
+ ### Channels (zero silent failures)
705
+
706
+ WARN mode is NOT silent — violations always emit on three channels:
707
+ 1. analytics jsonl event (`lane_violation_warning`)
708
+ 2. leader stderr (`log_warn`)
709
+ 3. audit log file `~/.claude/ralph-desk/logs/<slug>/lane-audit.json`
710
+
711
+ The audit log is initialized to `[]` at campaign start so the file always exists;
712
+ each violation appends an entry `{file, mtime_before, mtime_after, iter, lane_mode}`.
713
+
714
+ ### Why downgrade in strict mode
715
+
716
+ The mtime audit is a best-effort heuristic: it cannot accurately attribute the
717
+ modifier (worker vs leader vs external editor). Running an inaccurate
718
+ detector with `terminal_alert` would hand it the power to permanently
719
+ terminate a campaign. The downgrade keeps `recoverable=true` so wrappers
720
+ can re-launch after operator review.
721
+
722
+ ### Non-goals
723
+
724
+ - chmod-based enforcement (would break test fixtures and consumer envs).
725
+ - git_blame-based actor identification (best-effort hint only; verifier IL-2
726
+ is the real lane gate via worker process audit).
727
+ - Auto-launching missions on violation (consumer wrapper responsibility).
728
+
729
+ ## 7f. Test Density Enforcement (US-018 R6 P1-F)
730
+
731
+ Default mode is **WARN-only** (`TEST_DENSITY_MODE=warn`). The opt-in `--test-density-strict` flag escalates a `< 3 tests/AC` finding to a non-zero `init` exit. Worker prompt mandates **>= 3 tests per AC** (happy + negative + boundary categories, IL-4). The test-spec must encode the same density. When the test-spec encodes fewer (e.g., 1 test per AC) the contract collapses: Worker following the prompt fails IL-4, Worker following the spec fails the prompt.
732
+
733
+ `init_ralph_desk.zsh` runs `_lint_test_density` (lib_ralph_desk.zsh) on the generated PRD + test-spec pair before campaign launch.
734
+
735
+ ### Decision tree
736
+
737
+ | Detection | Default (`warn`) | `--test-density-strict` |
738
+ |-----------|-----------------|-----------------|
739
+ | Any US has `test_count < 3 * ac_count` | log_warn to stderr + audit log entry (`logs/<slug>/test-density-audit.jsonl`). Init exits 0. | All WARN actions PLUS init exits 1 with the same message. |
740
+
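The counting gate in the table can be sketched as follows. This is an illustration in JavaScript rather than the zsh `_lint_test_density` itself; the input shape (`usId`, `acCount`, `testCount`) and the warning text are hypothetical.

```javascript
// Each entry describes one US: { usId, acCount, testCount } (assumed shape).
function lintTestDensity(userStories, { strict = false } = {}) {
  const warnings = userStories
    .filter((us) => us.testCount < 3 * us.acCount)
    .map(
      (us) =>
        `${us.usId}: ${us.testCount} tests for ${us.acCount} AC (need >= ${3 * us.acCount})`,
    );
  // Warn mode always exits 0; strict mode exits 1 when any US is under-dense.
  const exitCode = strict && warnings.length > 0 ? 1 : 0;
  return { warnings, exitCode };
}
```

The function only counts, matching the runtime gate; category coverage (happy, negative, boundary) is left to the verifier.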
741
+ ### Why no downgrade in strict mode
742
+
743
+ Test density is a *static* property of the test-spec, deterministically measurable, and observed before any worker runs. There is no risk asymmetry comparable to the lane-mtime audit (which is a best-effort heuristic). Strict mode is a hard fail because the failure is unambiguous: too few tests for the AC count.
744
+
745
+ ### Categorization (happy + negative + boundary)
746
+
747
+ The `>= 3 tests / AC` rule is a coverage floor, not a ceiling. Worker should distribute tests across:
748
+ - **happy**: standard input → expected output
749
+ - **negative**: malformed/missing input → defined error
750
+ - **boundary**: edge of allowed range, off-by-one, empty/max collections
751
+
752
+ If any category is missing for an AC the test-spec generator should densify before init. The runtime gate only counts; the categorization is enforced by the verifier's Test Coverage Audit (governance §1f Verifier reasoning).
753
+
754
+ ## 7g. Signal Vocabulary Extension (US-019 R7 P1-G)
755
+
756
+ The base signal vocabulary (`continue | verify | blocked`) is binary at the iteration level: every AC in the current US either passes together or the whole iteration blocks. When unblocked ACs share an iteration with a single unsatisfiable AC the all-or-nothing semantic discards real progress.
757
+
758
+ `verify_partial` lets the worker emit progress and a deferral in one signal:
759
+
760
+ ```json
761
+ {
762
+ "iteration": N,
763
+ "status": "verify_partial",
764
+ "us_id": "US-001",
765
+ "verified_acs": ["AC1", "AC2"],
766
+ "deferred_acs": ["AC3"],
767
+ "defer_reason": "AC3 depends on US-003 batch artifacts; cross-US"
768
+ }
769
+ ```
770
+
771
+ - Verifier evaluates **only** `verified_acs`. `deferred_acs` are out-of-scope (not fail).
772
+ - Deferred ACs queue for the next iteration or the final ALL verify pass.
773
+ - The runner downgrades to `blocked` with reason `verify_partial_malformed` (reason_category `mission_abort`, but with `recoverable=true` and `suggested_action=retry_after_fix` instead of the category defaults) when `verified_acs` is missing or empty. The verifier would have nothing to evaluate, so silent acceptance would be a false GREEN.
774
+
775
+ The downgrade is intentionally recoverable: the malformed signal is a worker-side prompt regression, not an environment failure, and the operator can fix it in-place.
776
+
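The malformed-signal downgrade can be sketched as a normalisation step. The field values come from the rules above; `normalizeVerifyPartial` and the flat shape of the downgraded object are assumptions made for illustration.

```javascript
// Pass well-formed signals through; downgrade a verify_partial signal
// that has no verified_acs to a recoverable blocked signal.
function normalizeVerifyPartial(signal) {
  if (signal.status !== 'verify_partial') return signal;

  const hasVerified =
    Array.isArray(signal.verified_acs) && signal.verified_acs.length > 0;
  if (hasVerified) return signal;

  return {
    iteration: signal.iteration,
    us_id: signal.us_id,
    status: 'blocked',
    reason: 'verify_partial_malformed',
    reason_category: 'mission_abort',
    recoverable: true,
    suggested_action: 'retry_after_fix',
  };
}
```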
628
777
  ## 8. Circuit Breaker
629
778
 
630
779
  | Condition | Verdict |
@@ -633,12 +782,23 @@ In tmux mode: Leader writes `<slug>-escalation.md` with the report and sets BLOC
633
782
  | Same acceptance criterion fails 2 consecutive iterations | Upgrade model, retry once (Agent mode only; tmux: same model retry); if still failing → Architecture Escalation (§7¾) → BLOCKED |
634
783
  | `cb_threshold` (default: 6) consecutive **fail** verdicts on `cb_threshold` unique criterion IDs | Upgrade to opus, retry once; if still failing → BLOCKED (adjustable via `--cb-threshold`; when `--consensus` is not `off`, effective threshold doubles automatically: default 6 → 12) |
635
784
  | max_iter reached | TIMEOUT (report to user) |
785
+ | Same canonical block reason fires `BLOCK_CB_THRESHOLD` (default: 3) times in a row | Mission abort (`.sisyphus/mission-abort.json` + non-zero exit). US-021 R9 P2-I `consecutive_blocks` counter. |
636
786
 
637
787
  The Leader tracks `consecutive_failures` in `status.json`:
638
788
  - Increments on `fail`, resets on `pass`, **unchanged by `request_info`**.
639
789
  - "Same error" = same acceptance criterion ID in two consecutive **fail** verdicts (`request_info` does not break or contribute to this chain).
640
790
  - "Diverse failures" = `cb_threshold` most recent `fail` verdicts each have a unique criterion ID.
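The threshold doubling and the "diverse failures" test can be expressed compactly. This is a sketch under assumptions: the verdict records are given a hypothetical `{ verdict, criterionId }` shape, and `request_info` verdicts are dropped before the window is taken, following the rule that they neither break nor contribute to a chain.

```javascript
// Effective threshold doubles whenever --consensus is not `off`.
function effectiveCbThreshold(cbThreshold = 6, consensus = 'off') {
  return consensus === 'off' ? cbThreshold : cbThreshold * 2;
}

// True when the `threshold` most recent verdicts are all `fail`
// and each carries a unique criterion ID.
function isDiverseFailure(verdicts, threshold) {
  const relevant = verdicts.filter((v) => v.verdict !== 'request_info');
  const recent = relevant.slice(-threshold);
  if (recent.length < threshold) return false;
  if (recent.some((v) => v.verdict !== 'fail')) return false;
  return new Set(recent.map((v) => v.criterionId)).size === threshold;
}
```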
641
791
 
792
+ ### consecutive_blocks (US-021 R9 P2-I)
793
+
794
+ `consecutive_failures` only counts `fail` verdicts; a worker that signals `blocked` does not advance it, so a contract defect (e.g., test-spec/PRD mismatch) can repeat silently for many iterations. `consecutive_blocks` closes that hole.
795
+
796
+ - Counter increments when the **canonical** block reason matches the previous block's reason (`_canonical_block_reason` strips wrapper prefixes like `hygiene_violated:` and `wrapped:` before comparison so R8 hygiene wrappers don't fragment the chain).
797
+ - Counter resets to 1 when a *different* canonical reason fires.
798
+ - `infra_failure` category is **exempt** — transient API/tmux/process failures are environment problems, not contract defects, and shouldn't trip the abort.
799
+ - The very first iteration (`ITERATION <= 1`) is **exempt** — mission setup blocks (e.g., missing PRD, init misconfig) shouldn't terminate before the first real attempt.
800
+ - When the counter reaches `BLOCK_CB_THRESHOLD` the runner writes `.sisyphus/mission-abort.json` (`{reason, count, last_reason, threshold, timestamp}`) and exits non-zero so wrappers can chain to the next mission instead of looping.
801
+
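The counter rules above can be sketched as a pure state update. The prefix list and the default threshold of 3 come from the text; everything else is hypothetical, including the names, the state shape, and the assumption that an exempt block leaves the counter untouched. The sketch does not model what a non-blocked iteration does to the counter, since the text does not specify it.

```javascript
const WRAPPER_PREFIXES = ['hygiene_violated:', 'wrapped:'];

// Strip wrapper prefixes repeatedly so nested wrappers compare equal.
function canonicalBlockReason(reason) {
  let canonical = reason.trim();
  let stripped = true;
  while (stripped) {
    stripped = false;
    for (const prefix of WRAPPER_PREFIXES) {
      if (canonical.startsWith(prefix)) {
        canonical = canonical.slice(prefix.length).trim();
        stripped = true;
      }
    }
  }
  return canonical;
}

// state: { lastReason, count }. Returns the next state plus an abort flag.
function recordBlock(state, { reason, reasonCategory, iteration }, threshold = 3) {
  // Exemptions: infra failures and the very first iteration.
  // Assumption: exempt blocks leave the counter unchanged.
  if (reasonCategory === 'infra_failure' || iteration <= 1) {
    return { ...state, abort: false };
  }
  const canonical = canonicalBlockReason(reason);
  const count = canonical === state.lastReason ? state.count + 1 : 1;
  return { lastReason: canonical, count, abort: count >= threshold };
}
```

When `abort` becomes true, the runner described above would write `.sisyphus/mission-abort.json` and exit non-zero.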
642
802
  ## 8½. Self-Verification Feedback Loop
643
803
 
644
804
  When `--with-self-verification` is enabled, the SV report feeds back into the next brainstorm cycle:
@@ -165,6 +165,8 @@ export async function generateCampaignReport({
165
165
  now = new Date(),
166
166
  gitDiffProvider = defaultGitDiffProvider,
167
167
  svSummary = 'N/A — --with-self-verification not enabled',
168
+ blockedReason = null,
169
+ blockedCategory = null,
168
170
  }) {
169
171
  await fs.mkdir(path.dirname(reportFile), { recursive: true });
170
172
  await versionFile(reportFile, reportVersionPath);
@@ -197,6 +199,8 @@ export async function generateCampaignReport({
197
199
  '',
198
200
  '## Execution Summary',
199
201
  `- Terminal state: ${terminalState}`,
202
+ ...(blockedReason ? [`- Blocked reason: ${blockedReason}`] : []),
203
+ ...(blockedCategory ? [`- Blocked category: ${blockedCategory}`] : []),
200
204
  `- Iterations run: ${status.iteration ?? 0}`,
201
205
  `- Elapsed: ${elapsed}`,
202
206
  '',
package/src/node/run.mjs CHANGED
@@ -21,6 +21,8 @@ const RUN_DEFAULTS = {
21
21
  lockWorkerModel: false,
22
22
  autonomous: false,
23
23
  withSelfVerification: false,
24
+ laneStrict: false,
25
+ testDensityStrict: false,
24
26
  flywheel: 'off',
25
27
  flywheelModel: 'opus',
26
28
  flywheelGuard: 'off',
@@ -60,6 +62,8 @@ function buildHelpText() {
60
62
  ' --iter-timeout N',
61
63
  ' --debug',
62
64
  ' --autonomous',
65
+ ' --lane-strict',
66
+ ' --test-density-strict',
63
67
  ' --with-self-verification',
64
68
  ' --flywheel off|on-fail',
65
69
  ' --flywheel-model MODEL',
@@ -147,6 +151,14 @@ function parseRunOptions(args, cwd) {
147
151
  case '--autonomous':
148
152
  options.autonomous = true;
149
153
  break;
154
+ case '--lane-strict':
155
+ // P1-E lane enforcement opt-in. Default WARN. governance §7e.
156
+ options.laneStrict = true;
157
+ break;
158
+ case '--test-density-strict':
159
+ // US-018 R6 P1-F test density enforcement opt-in. Default WARN. governance §7f.
160
+ options.testDensityStrict = true;
161
+ break;
150
162
  case '--with-self-verification':
151
163
  options.withSelfVerification = true;
152
164
  break;
@@ -209,7 +221,17 @@ async function runRunCommand(args, deps) {
209
221
 
210
222
  const slug = args[0];
211
223
  const options = parseRunOptions(args.slice(1), deps.cwd);
212
- await deps.runCampaign(slug, options);
224
+ const result = await deps.runCampaign(slug, options);
225
+ // governance §1f BLOCKED Surfacing: surface the blocked reason on stderr so
226
+ // the operator (or wrapper script) does not have to grep memo files.
227
+ if (result && result.status === 'blocked') {
228
+ // P1-D 4-channel surfacing: include category so wrappers can see
229
+ // reason_category alongside the textual reason without parsing JSON.
230
+ const reason = result.reason ? ` — ${result.reason}` : '';
231
+ const cat = result.category ? `, category=${result.category}` : '';
232
+ write(deps.stderr, `Campaign BLOCKED for ${slug} (US=${result.usId}${cat})${reason}`);
233
+ return 2;
234
+ }
213
235
  write(deps.stdout, `Campaign started for ${slug}`);
214
236
  return 0;
215
237
  }