npm - @ai-dev-methodologies/rlp-desk - Versions diffs - 0.10.0 → 0.11.0 - Mend

@ai-dev-methodologies/rlp-desk 0.10.0 → 0.11.0


Files changed (14)

package/docs/blueprints/sv-architecture-rethink.md +84 -0
package/docs/multi-mission-orchestration.md +154 -0
package/docs/plans/rlp-desk-0.11-handoff-7fixes.md +352 -0
package/docs/plans/rlp-desk-elegant-papert-agent-a8cd695ffca2a3ad8.md +84 -0
package/docs/plans/rlp-desk-elegant-papert.md +270 -0
package/docs/protocol-reference.md +82 -0
package/package.json +1 -1
package/src/commands/rlp-desk.md +5 -0
package/src/governance.md +160 -0
package/src/node/reporting/campaign-reporting.mjs +4 -0
package/src/node/run.mjs +23 -1
package/src/node/runner/campaign-main-loop.mjs +284 -10
package/docs/superpowers/plans/2026-04-12-flywheel-redesign.md +0 -704
package/docs/superpowers/specs/2026-04-12-flywheel-redesign.md +0 -161

package/src/governance.md CHANGED

@@ -248,6 +248,63 @@ Verifier records WHY each judgment was made in `verify-verdict.json`:
 - Without reasoning, Verifier's verdict is an unsubstantiated judgment
 - Both are archived in `logs/<slug>/` per existing audit trail pattern
+### Cost Log (US-023 R11 P2-K)
+`logs/<slug>/cost-log.jsonl` always has at least one entry per campaign. tmux mode runs the estimated path (no LLM SDK token counters), so when prompt/claim/verdict bytes are all zero the entry's `note` field is set to `no_actual_usage_recorded`. Audit pipelines branch on `note` to distinguish "iteration ran but tokens not captured" (tmux estimated path) from "logging broken" (file empty / writer never called). The runner registers `trap '_emit_final_cost_log; cleanup' EXIT INT TERM` so an unconditional final entry is appended even if the campaign exits via an early-return path.
+### A4 Fallback Audit (US-017 R5 P0-D)
+When Worker writes done-claim.json but forgets iter-signal.json, the runner auto-generates a verify signal as A4 fallback. This produces an opaque `summary="auto-generated by A4 fallback (done-claim without signal)"` that erases debugging context.
+- Each A4 fallback invocation appends a JSONL entry to `logs/<slug>/a4-fallback-audit.jsonl` (event=`a4_fallback`, iter, us_id, source).
+- **Recommended ratio < 10%** of total iterations (per mission). Above this threshold, Worker prompt mandate (Step N+1) is failing — investigate prompt clarity or Worker model.
+- Verifier sets `meta.iter_signal_quality='auto_generated'` when it detects an A4 fallback summary so audit pipelines can join the signal-quality dimension to verdicts.
+### BLOCKED Surfacing
+A BLOCKED outcome MUST surface its reason on **FIVE channels at once**: (1) sentinel file (markdown `<slug>-blocked.md` + JSON sidecar `<slug>-blocked.json`), (2) status.json, (3) Leader's stderr console, (4) campaign report, (5) memory.md/latest.md hygiene update (worker mandate per US-020 R8 P1-H — `Blocking History` entry in memory.md and `Known Issues` update in latest.md before the sentinel is written). Sentinel-only is silent failure; operators (and wrappers) must see WHY without grep'ing memo files. The leader propagates `verdict.reason || verdict.summary` into the sentinel reason field, the JSON sidecar, the return object, and the campaign report. The 5th channel survives across iterations: the next worker reads memory.md before re-attempting, preventing same-block-reason loops.
+When the worker writes a sentinel without performing the 5th-channel hygiene update (memory.md/latest.md mtime older than 5 minutes at sentinel-write time), the runner stamps `meta.blocked_hygiene_violated=true` on the JSON sidecar and emits an analytics event so audit pipelines can track hygiene compliance.
+### Failure Taxonomy (P1-D)
+BLOCKED writes a JSON sidecar (`<slug>-blocked.json`) alongside the markdown sentinel so wrappers can `jq .reason_category` instead of regex'ing free text. Schema:
+```json
+{
+  "schema_version": "2.0",
+  "slug": "<slug>",
+  "us_id": "<us_id or ALL>",
+  "blocked_at_iter": <int>,
+  "blocked_at_utc": "<iso8601>",
+  "reason_category": "metric_failure | cross_us_dep | context_limit | infra_failure | repeat_axis | mission_abort",
+  "reason_detail": "<full reason text>",
+  "failure_category": "spec | implementation | integration | flaky | null",
+  "recoverable": true | false,
+  "suggested_action": "next_mission_chain | restart | retry_after_fix | terminal_alert"
+}
+```
+**Wrapper contract (binding)**:
+- `reason_category` is **PRIMARY** — wrappers MUST branch on this field for recovery decisions.
+- `failure_category` is **SECONDARY, diagnostic only** — do NOT branch on it; logging/triage only.
+**Category → wrapper recovery action mapping** (defaults set by writer; wrappers may override but should follow):
+- `metric_failure` → `retry_after_fix` (fix PRD/code, retry; recoverable=true)
+- `cross_us_dep` → `retry_after_fix` (move AC to later US or switch to batch mode; recoverable=true)
+- `infra_failure` → `restart` (CLI/network/spawn issue; recoverable=true)
+- `context_limit` → `next_mission_chain` (current mission stale; recoverable=false)
+- `repeat_axis` → `next_mission_chain` (model ceiling reached on this axis; recoverable=false)
+- `mission_abort` → `terminal_alert` (flywheel guard exhausted; recoverable=false)
+**Cross-US token list (cross_us_dep classifier)** — verifier verdict / worker signal text matching ANY of these is classified as `cross_us_dep`:
+- English: `depends on US-`, `blocking US-`, `awaits US-`, `post-iter US-`, `requires US-N`, `cross-US`
+- Korean: `US-N 산출물`, `신규 US-`, `post-iter`
+**Write Order Contract (atomicity invariant)**:
+1. JSON sidecar written FIRST (`fs.writeFile` / `atomic_write`).
+2. markdown sentinel written SECOND.
+3. Invariant: **markdown exists ⇒ JSON exists** (writer enforces order).
+4. Wrappers SHOULD watch markdown sentinel, then read JSON sidecar. If JSON not yet visible (rare), retry up to 5 × 50ms before failing.
+`atomic_write` provides per-file rename atomicity; cross-file ordering is enforced by the explicit two-call sequence.
 ## 2. Roles
 ### Leader (current session)
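The wrapper contract and write-order contract in the hunk above can be sketched from the consumer side as follows. This is a minimal illustration, not code from the package: `recoveryAction`, `readSidecarWithRetry`, and the injected `readText` callback are hypothetical names, and falling back to `terminal_alert` for an unknown category is an assumption rather than documented behavior.

```javascript
// Default category -> action mapping, copied from the governance text above.
const DEFAULT_ACTIONS = {
  metric_failure: 'retry_after_fix',
  cross_us_dep: 'retry_after_fix',
  infra_failure: 'restart',
  context_limit: 'next_mission_chain',
  repeat_axis: 'next_mission_chain',
  mission_abort: 'terminal_alert',
};

// Hypothetical helper: branch on reason_category only (the PRIMARY field).
// failure_category is deliberately ignored because it is diagnostic only.
function recoveryAction(sidecar) {
  const action = DEFAULT_ACTIONS[sidecar.reason_category];
  // Assumption: an unrecognised category is treated as terminal.
  return action === undefined ? 'terminal_alert' : action;
}

// Hypothetical helper: read the JSON sidecar after the markdown sentinel has
// been observed, retrying up to 5 x 50 ms if the JSON is not yet readable.
// readText is any async function that returns the file contents as a string.
async function readSidecarWithRetry(readText, jsonPath, attempts = 5, delayMs = 50) {
  let lastError;
  for (let i = 0; i < attempts; i += 1) {
    try {
      return JSON.parse(await readText(jsonPath));
    } catch (err) {
      lastError = err; // missing file or unparsable JSON: wait, then retry
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

In Node.js, `readText` could be `(p) => fs.promises.readFile(p, 'utf8')`. Injecting it keeps the sketch free of imports and easy to exercise with a fake reader.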
@@ -486,6 +543,7 @@ for iteration in 1..max_iter:
   ⑥½ Flywheel direction review (when --flywheel on-fail and consecutive_failures > 0)
      - Dispatch Flywheel agent (fresh context, --flywheel-model)
      - Read flywheel-signal.json for direction decision (hold/pivot/reduce/expand)
+     - Optional `next_mission_candidate` field (string | null): when present, the leader propagates it to status.json so consumer wrappers can chain the next mission without code edits. See docs/multi-mission-orchestration.md.
      - If --flywheel-guard on:
        - Dispatch Guard agent (fresh context, --flywheel-guard-model)
        - Read flywheel-guard-verdict.json:
@@ -544,6 +602,10 @@ Worker completes US-001 → signal verify (us_id: "US-001")
 **Batch mode** (`--verify-mode batch`) preserves legacy behavior: Worker signals `verify` only after all work is done, and the Verifier checks all AC at once.
+**Cross-US dependency rule (per-us only):** In per-us mode each AC must reference only the same US or earlier verified US' artifacts. Future-US references (e.g. "post-iter US-(N+M) batch", "new US-(M) artifact") make the AC unsatisfiable inside a single per-us iteration and are rejected at init time (`init_ralph_desk.zsh` exits 2). Fold cross-US verification into the last measurement US, or run with `--verify-mode batch`.
+**Cross-mission us_id leak prevention (US-022 R10 P2-J):** When the same `$DESK` directory hosts back-to-back missions, an `iter-signal.json` left over from the prior mission can carry a `us_id` (e.g. `US-005`) that has no corresponding section in the new mission's PRD (`US-001` through `US-003`). The runner would then scope-lock the next iteration to a non-existent US and block. `init_ralph_desk.zsh` runs `_quarantine_stale_signal` (lib_ralph_desk.zsh) which moves any signal whose `us_id` is absent from the new mission's PRD into `.sisyphus/quarantine/iter-signal.<epoch>.json` instead of `rm`-ing it. The PRD US-list extractor `_extract_prd_us_list` recognises three heading variants (`## US-005:`, `## US-005 -`, bare `## US-005`) so legitimate references are not false-flagged. The quarantine file is preserved so the operator can recover when the leak was actually intentional handoff state.
 ## 7b. Cross-Engine Consensus Verification
 Controlled by `--consensus off|all|final-only` (default: `off`).
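The stale-signal quarantine rule in the hunk above is implemented in zsh (`_extract_prd_us_list`, `_quarantine_stale_signal`), and that code is not part of this diff. The JavaScript below is only an illustrative analogue of the described behavior. The function names, the regular expression, and the treatment of `ALL` and of signals without a `us_id` are assumptions.

```javascript
// Matches the three heading variants named in the text:
// "## US-005:", "## US-005 -" and bare "## US-005".
const US_HEADING = /^##\s+(US-\d+)(?::|\s+-|\s*$)/;

// Hypothetical analogue of _extract_prd_us_list: collect US ids in order.
function extractPrdUsList(prdText) {
  const ids = [];
  for (const line of prdText.split('\n')) {
    const match = US_HEADING.exec(line);
    if (match && !ids.includes(match[1])) ids.push(match[1]);
  }
  return ids;
}

// Hypothetical decision helper: true when a leftover iter-signal.json refers
// to a US that the new mission's PRD does not define.
function shouldQuarantine(signal, prdText) {
  // Assumption: signals without a us_id, or scoped to ALL, are left alone.
  if (!signal || typeof signal.us_id !== 'string' || signal.us_id === 'ALL') {
    return false;
  }
  return !extractPrdUsList(prdText).includes(signal.us_id);
}
```

A caller would move the signal file into a quarantine directory when `shouldQuarantine` returns `true`, rather than deleting it, so the operator can still recover it.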
@@ -625,6 +687,93 @@ If `cb_threshold` or more consecutive fix attempts fail for the same US:
 In tmux mode: Leader writes `<slug>-escalation.md` with the report and sets BLOCKED sentinel with reason "architecture-escalation."
+## 7e. Lane Enforcement (P1-E)
+Default mode is **WARN-only** (`LANE_MODE=warn`). The opt-in `--lane-strict`
+flag (or `LANE_MODE=strict`) escalates lane violations to BLOCKED, but the
+escalation is **downgraded** to `recoverable=true` + `suggested_action=retry_after_fix`
+(NOT `terminal_alert`) so an inaccurate mtime audit does not terminally
+kill a campaign.
+### Decision tree
+| Detection | Default (`warn`) | `--lane-strict` |
+|-----------|-----------------|-----------------|
+| PRD / test-spec / memory mtime changed during a worker iteration | analytics event `event_type=lane_violation_warning` + `log_warn` + audit log entry. Loop continues. | All of the WARN actions PLUS sentinel BLOCKED with `reason_category=infra_failure`, `recoverable=true`, `suggested_action=retry_after_fix`. |
+### Channels (Silent failure 0)
+WARN mode is NOT silent — violations always emit on three channels:
+1. analytics jsonl event (`lane_violation_warning`)
+2. leader stderr (`log_warn`)
+3. audit log file `~/.claude/ralph-desk/logs/<slug>/lane-audit.json`
+The audit log is initialized to `[]` at campaign start so the file always exists;
+each violation appends an entry `{file, mtime_before, mtime_after, iter, lane_mode}`.
+### Why downgrade in strict mode
+mtime audit is best-effort heuristic — it cannot accurately attribute the
+modifier (worker vs leader vs external editor). Running an inaccurate
+detector with `terminal_alert` would hand it the power to permanently
+terminate a campaign. The downgrade keeps `recoverable=true` so wrappers
+can re-launch after operator review.
+### Non-goals
+- chmod-based enforcement (would break test fixtures and consumer envs).
+- git_blame-based actor identification (best-effort hint only; verifier IL-2
+  is the real lane gate via worker process audit).
+- Auto-launching missions on violation (consumer wrapper responsibility).
+## 7f. Test Density Enforcement (US-018 R6 P1-F)
+Default mode is **WARN-only** (`TEST_DENSITY_MODE=warn`). The opt-in `--test-density-strict` flag escalates a `< 3 tests/AC` finding to a non-zero `init` exit. The Worker prompt mandates **>= 3 tests per AC** (happy + negative + boundary categories, per IL-4), and the test-spec must encode the same density. When the test-spec encodes fewer (e.g., 1 test per AC), the contract collapses: a Worker following the prompt fails IL-4, and a Worker following the spec fails the prompt.
+`init_ralph_desk.zsh` runs `_lint_test_density` (lib_ralph_desk.zsh) on the generated PRD + test-spec pair before campaign launch.
+### Decision tree
+| Detection | Default (`warn`) | `--test-density-strict` |
+|-----------|-----------------|-----------------|
+| Any US has `test_count < 3 * ac_count` | log_warn to stderr + audit log entry (`logs/<slug>/test-density-audit.jsonl`). Init exits 0. | All WARN actions PLUS init exits 1 with the same message. |
+### Why no downgrade in strict mode
+Test density is a *static* property of the test-spec, deterministically measurable, and observed before any worker runs. There is no risk asymmetry comparable to the lane-mtime audit (which is best-effort heuristic). Strict mode is a hard fail because the failure is unambiguous: too few tests for the AC count.
+### Categorization (happy + negative + boundary)
+The `>= 3 tests / AC` rule is a coverage floor, not a ceiling. Worker should distribute tests across:
+- **happy**: standard input → expected output
+- **negative**: malformed/missing input → defined error
+- **boundary**: edge of allowed range, off-by-one, empty/max collections
+If any category is missing for an AC the test-spec generator should densify before init. The runtime gate only counts; the categorization is enforced by the verifier's Test Coverage Audit (governance §1f Verifier reasoning).
+## 7g. Signal Vocabulary Extension (US-019 R7 P1-G)
+The base signal vocabulary (`continue | verify | blocked`) is binary at the iteration level: every AC in the current US either passes together or the whole iteration blocks. When unblocked ACs share an iteration with a single unsatisfiable AC the all-or-nothing semantic discards real progress.
+`verify_partial` lets the worker emit progress and a deferral in one signal:
+```json
+{
+  "iteration": N,
+  "status": "verify_partial",
+  "us_id": "US-001",
+  "verified_acs": ["AC1", "AC2"],
+  "deferred_acs": ["AC3"],
+  "defer_reason": "AC3 depends on US-003 batch artifacts; cross-US"
+}
+```
+- Verifier evaluates **only** `verified_acs`. `deferred_acs` are out-of-scope (not fail).
+- Deferred ACs queue for the next iteration or the final ALL verify pass.
+- The runner downgrades to `blocked` with reason `verify_partial_malformed` (reason_category `mission_abort`, recoverable=true, suggested_action=retry_after_fix) when `verified_acs` is missing or empty — the verifier has nothing to evaluate, so silent acceptance would be a false GREEN.
+The downgrade is intentionally recoverable: the malformed signal is a worker-side prompt regression, not an environment failure, and the operator can fix it in-place.
 ## 8. Circuit Breaker
 | Condition | Verdict |
@@ -633,12 +782,23 @@ In tmux mode: Leader writes `<slug>-escalation.md` with the report and sets BLOC
 | Same acceptance criterion fails 2 consecutive iterations | Upgrade model, retry once (Agent mode only; tmux: same model retry); if still failing → Architecture Escalation (§7¾) → BLOCKED |
 | `cb_threshold` (default: 6) consecutive **fail** verdicts on `cb_threshold` unique criterion IDs | Upgrade to opus, retry once; if still failing → BLOCKED (adjustable via `--cb-threshold`; when `--consensus` is not `off`, effective threshold doubles automatically: default 6 → 12) |
 | max_iter reached | TIMEOUT (report to user) |
+| Same canonical block reason fires `BLOCK_CB_THRESHOLD` (default: 3) times in a row | Mission abort (`.sisyphus/mission-abort.json` + non-zero exit). US-021 R9 P2-I `consecutive_blocks` counter. |
 The Leader tracks `consecutive_failures` in `status.json`:
 - Increments on `fail`, resets on `pass`, **unchanged by `request_info`**.
 - "Same error" = same acceptance criterion ID in two consecutive **fail** verdicts (`request_info` does not break or contribute to this chain).
 - "Diverse failures" = `cb_threshold` most recent `fail` verdicts each have a unique criterion ID.
+### consecutive_blocks (US-021 R9 P2-I)
+`consecutive_failures` only counts `fail` verdicts; a worker that signals `blocked` does not advance it, so a contract defect (e.g., test-spec/PRD mismatch) can repeat silently for many iterations. `consecutive_blocks` closes that hole.
+- Counter increments when the **canonical** block reason matches the previous block's reason (`_canonical_block_reason` strips wrapper prefixes like `hygiene_violated:` and `wrapped:` before comparison so R8 hygiene wrappers don't fragment the chain).
+- Counter resets to 1 when a *different* canonical reason fires.
+- `infra_failure` category is **exempt** — transient API/tmux/process failures are environment problems, not contract defects, and shouldn't trip the abort.
+- The very first iteration (`ITERATION <= 1`) is **exempt** — mission setup blocks (e.g., missing PRD, init misconfig) shouldn't terminate before the first real attempt.
+- When the counter reaches `BLOCK_CB_THRESHOLD` the runner writes `.sisyphus/mission-abort.json` (`{reason, count, last_reason, threshold, timestamp}`) and exits non-zero so wrappers can chain to the next mission instead of looping.
 ## 8½. Self-Verification Feedback Loop
 When `--with-self-verification` is enabled, the SV report feeds back into the next brainstorm cycle:
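The `consecutive_blocks` rules added in the hunk above can be sketched as a small state transition. This is not the runner's code: `canonicalBlockReason` and `recordBlock` are hypothetical names, and the sketch assumes that an exempt block (an `infra_failure`, or any block on iteration 1) leaves the counter untouched, which the text does not spell out.

```javascript
// Wrapper prefixes the text says are stripped before comparison.
const WRAPPER_PREFIXES = ['hygiene_violated:', 'wrapped:'];

// Hypothetical analogue of _canonical_block_reason: strip known wrapper
// prefixes (possibly nested) and surrounding whitespace.
function canonicalBlockReason(reason) {
  let out = String(reason).trim();
  let stripped = true;
  while (stripped) {
    stripped = false;
    for (const prefix of WRAPPER_PREFIXES) {
      if (out.startsWith(prefix)) {
        out = out.slice(prefix.length).trim();
        stripped = true;
      }
    }
  }
  return out;
}

// Hypothetical transition: state is { lastReason, count }; returns the next
// state plus an abort flag once the threshold (default 3) is reached.
function recordBlock(state, block, threshold = 3) {
  // Assumption: exempt blocks do not change the counter at all.
  if (block.category === 'infra_failure' || block.iteration <= 1) {
    return { lastReason: state.lastReason, count: state.count, abort: false };
  }
  const canonical = canonicalBlockReason(block.reason);
  const count = canonical === state.lastReason ? state.count + 1 : 1;
  return { lastReason: canonical, count, abort: count >= threshold };
}
```

Starting from `{ lastReason: null, count: 0 }`, three blocks in a row whose canonical reason is identical set `abort` to `true` under the default threshold, which is the point at which the text says the runner writes `.sisyphus/mission-abort.json`.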

package/src/node/reporting/campaign-reporting.mjs CHANGED

@@ -165,6 +165,8 @@ export async function generateCampaignReport({
   now = new Date(),
   gitDiffProvider = defaultGitDiffProvider,
   svSummary = 'N/A — --with-self-verification not enabled',
+  blockedReason = null,
+  blockedCategory = null,
 }) {
   await fs.mkdir(path.dirname(reportFile), { recursive: true });
   await versionFile(reportFile, reportVersionPath);
@@ -197,6 +199,8 @@ export async function generateCampaignReport({
     '',
     '## Execution Summary',
     `- Terminal state: ${terminalState}`,
+    ...(blockedReason ? [`- Blocked reason: ${blockedReason}`] : []),
+    ...(blockedCategory ? [`- Blocked category: ${blockedCategory}`] : []),
     `- Iterations run: ${status.iteration ?? 0}`,
     `- Elapsed: ${elapsed}`,
     '',
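The two added lines use a conditional-spread idiom: spreading either a one-element array or an empty array, so the report line appears only when a value is present. A standalone sketch of the same pattern follows. `executionSummaryLines` is a hypothetical function written for illustration, not the package's `generateCampaignReport`.

```javascript
// Build the summary section; blocked lines are emitted only when set.
function executionSummaryLines({
  terminalState,
  blockedReason = null,
  blockedCategory = null,
  iterations = 0,
}) {
  return [
    '## Execution Summary',
    `- Terminal state: ${terminalState}`,
    // Spreading [] contributes nothing, so no blank or "null" line appears.
    ...(blockedReason ? [`- Blocked reason: ${blockedReason}`] : []),
    ...(blockedCategory ? [`- Blocked category: ${blockedCategory}`] : []),
    `- Iterations run: ${iterations}`,
  ];
}
```

Because the check is a truthiness test, an empty-string reason is omitted in the same way as `null`.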

package/src/node/run.mjs CHANGED

@@ -21,6 +21,8 @@ const RUN_DEFAULTS = {
   lockWorkerModel: false,
   autonomous: false,
   withSelfVerification: false,
+  laneStrict: false,
+  testDensityStrict: false,
   flywheel: 'off',
   flywheelModel: 'opus',
   flywheelGuard: 'off',
@@ -60,6 +62,8 @@ function buildHelpText() {
     '  --iter-timeout N',
     '  --debug',
     '  --autonomous',
+    '  --lane-strict',
+    '  --test-density-strict',
     '  --with-self-verification',
     '  --flywheel off|on-fail',
     '  --flywheel-model MODEL',
@@ -147,6 +151,14 @@ function parseRunOptions(args, cwd) {
       case '--autonomous':
         options.autonomous = true;
         break;
+      case '--lane-strict':
+        // P1-E lane enforcement opt-in. Default WARN. governance §7e.
+        options.laneStrict = true;
+        break;
+      case '--test-density-strict':
+        // US-018 R6 P1-F test density enforcement opt-in. Default WARN. governance §7f.
+        options.testDensityStrict = true;
+        break;
       case '--with-self-verification':
         options.withSelfVerification = true;
         break;
@@ -209,7 +221,17 @@ async function runRunCommand(args, deps) {
   const slug = args[0];
   const options = parseRunOptions(args.slice(1), deps.cwd);
-  await deps.runCampaign(slug, options);
+  const result = await deps.runCampaign(slug, options);
+  // governance §1f BLOCKED Surfacing: surface the blocked reason on stderr so
+  // the operator (or wrapper script) does not have to grep memo files.
+  if (result && result.status === 'blocked') {
+    // P1-D 4-channel surfacing: include category so wrappers can see
+    // reason_category alongside the textual reason without parsing JSON.
+    const reason = result.reason ? ` — ${result.reason}` : '';
+    const cat = result.category ? `, category=${result.category}` : '';
+    write(deps.stderr, `Campaign BLOCKED for ${slug} (US=${result.usId}${cat})${reason}`);
+    return 2;
+  }
   write(deps.stdout, `Campaign started for ${slug}`);
   return 0;
 }
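With this change `runRunCommand` returns `2` and writes a single `Campaign BLOCKED ...` line to stderr when the campaign result is blocked. A wrapper that prefers not to read the JSON sidecar could interpret that output roughly as below. `interpretRun` is a hypothetical helper, and treating every exit code `2` as "blocked" is an assumption based only on the hunk shown here.

```javascript
// Exit code returned by runRunCommand for a blocked campaign in the diff above.
const BLOCKED_EXIT_CODE = 2;

// Hypothetical wrapper-side helper: extract US id and category from the
// stderr line "Campaign BLOCKED for <slug> (US=<id>, category=<cat>)...".
function interpretRun(exitCode, stderrText) {
  if (exitCode !== BLOCKED_EXIT_CODE) {
    return { blocked: false, usId: null, category: null };
  }
  const usMatch = /\(US=([^,)]+)/.exec(stderrText);
  const catMatch = /category=([a-z_]+)/.exec(stderrText);
  return {
    blocked: true,
    usId: usMatch ? usMatch[1] : null,
    // category is optional in the message, so it may legitimately be null.
    category: catMatch ? catMatch[1] : null,
  };
}
```

The JSON sidecar remains the structured source described in the governance text. Parsing stderr is a convenience for simple shell-style wrappers.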