npm - @ai-dev-methodologies/rlp-desk - Versions diffs - 0.3.4 → 0.3.6 - Mend

@ai-dev-methodologies/rlp-desk 0.3.4 → 0.3.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (5) hide show

package/package.json +1 -1
package/src/commands/rlp-desk.md +25 -4
package/src/governance.md +4 -0
package/src/scripts/init_ralph_desk.zsh +1 -0
package/src/scripts/run_ralph_desk.zsh +6 -49

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "@ai-dev-methodologies/rlp-desk",
-  "version": "0.3.4",
+  "version": "0.3.6",
   "description": "Fresh-context iterative loops for Claude Code — autonomous task completion with independent verification",
   "scripts": {
     "postinstall": "node scripts/postinstall.js",

package/src/commands/rlp-desk.md CHANGED Viewed

@@ -107,7 +107,7 @@ Available run commands (copy the one you want):
 #   --max-iter N               Max iterations (default: 100)
 ```
-**CRITICAL: Do NOT offer to run for the user. Do NOT ask "실행할까요?" or "shall I run?". The user MUST type the run command themselves. Just present the options, recommend one, and STOP.**
+**CRITICAL: Do NOT offer to run for the user. Do NOT ask "shall I run?" or offer to execute. The user MUST type the run command themselves. Just present the options, recommend one, and STOP.**
 ---
@@ -318,6 +318,8 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
 - **Either fails** → combine issues from both verdicts into a single fix contract → Worker retry
 - Max 3 consensus rounds per US. After 3 rounds → BLOCKED.
+**NO ENGINE PRIORITY (ABSOLUTE):** There is no primary or secondary engine. Claude and Codex have EQUAL weight. If one passes and the other fails, the verdict is FAIL — always. The Leader MUST NOT override, prioritize, or dismiss either engine's verdict. "Claude priority", "primary engine override", "infrastructure failure" (when a valid verdict file exists), or any similar rationalization = governance violation. Infrastructure failure means ONLY: CLI crash (exit ≠ 0), timeout, or verdict file not generated.
 **⑦c Read verdict(s)**
 - Read `verify-verdict.json` (or both `-claude.json` and `-codex.json` if consensus):
   - `pass` + `complete` → write COMPLETE sentinel, report done!
@@ -345,6 +347,10 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
   - Result status `[leader-measured]`
   - Files changed via `git diff --stat HEAD~1 HEAD` `[git-measured]`
   - Verifier verdict `[leader-measured]`
+- **Record cost & performance per iteration**:
+  - Agent mode: record `total_tokens` and `duration_ms` from Agent() return metadata for both Worker and Verifier
+  - Tmux mode: record `duration_seconds` from shell timing. Estimate tokens from file sizes: `(prompt_bytes + done_claim_bytes + verdict_bytes) / 4` — label as "estimated"
+  - Write to `status.json`: `{"iter_N": {"worker_tokens": N, "worker_duration_ms": N, "verifier_tokens": N, "verifier_duration_ms": N, "token_source": "measured|estimated"}}`
 - Write `status.json`
 - Report via tool call: `Bash("echo 'Iter N | US-NNN | verdict | model | next_action'")` — NEVER plain text. This keeps the turn alive for the next iteration.
 - **Always**: append to baseline.log: `[timestamp] iter=N verdict=<pass|fail|continue> us=<us_id> model=<worker_model>`
@@ -362,7 +368,7 @@ After the loop ends, the Leader performs post-campaign analysis:
 3. **Generate versioned report**: `logs/<slug>/self-verification-report-NNN.md` (NNN = auto-increment from existing reports)
 4. **Report to user**: Display the full report content
-Report template (9 sections):
+Report template (10 sections):
 ```
 # Campaign Self-Verification Report: <slug>
@@ -401,7 +407,12 @@ Weaknesses: systemic issues
 ### PRD (ambiguous or oversized ACs) — citing iter/AC
 ### Test-Spec (missing layers, weak mappings) — citing iter/AC
-## 9. Blind Spots
+## 9. Cost & Performance
+Table: Iter | Role | Model | Tokens | Duration | Source
+Aggregate: total Worker tokens, total Verifier tokens, total campaign tokens, total duration
+Source: "measured" (Agent mode) or "estimated" (Tmux mode, from file sizes / 4)
+## 10. Blind Spots
 What this report CANNOT prove from available data
 ## Data Provenance Rule
@@ -454,9 +465,19 @@ Remove:
 - `.claude/ralph-desk/memos/<slug>-escalation.md`
 Note: `logs/<slug>/self-verification-data.json` and `self-verification-report-NNN.md` are intentionally preserved across clean for historical comparison.
-If `--kill-session` is passed, also kill any tmux session matching `rlp-desk-<slug>-*`:
+If `--kill-session` is passed, clean up ALL tmux artifacts:
 ```bash
+# Kill rlp-desk tmux sessions
 tmux list-sessions -F '#{session_name}' 2>/dev/null | grep "^rlp-desk-<slug>-" | while read s; do tmux kill-session -t "$s"; done
+# Kill split panes in current window (Worker/Verifier panes from --mode tmux)
+# Find panes running claude/codex for this slug and kill them
+for pane_id in $(tmux list-panes -F '#{pane_id}:#{pane_current_command}' 2>/dev/null | grep -i 'claude\|codex' | cut -d: -f1); do
+  tmux kill-pane -t "$pane_id" 2>/dev/null
+done
+# Kill any remaining claude/codex processes for this campaign
+ps aux | grep -E "claude.*<slug>|codex.*<slug>" | grep -v grep | awk '{print $2}' | xargs kill 2>/dev/null
 ```
 ## No args or `help`

package/src/governance.md CHANGED Viewed

@@ -24,6 +24,7 @@ IL-1: NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE
 IL-2: NO INIT WITHOUT AC QUALITY SCORE >= 6
 IL-3: NO PASS WITH TODO IN ANY REQUIRED VERIFICATION LAYER
 IL-4: NO PASS WITHOUT TEST COUNT >= AC COUNT x 3
+IL-5: NO PASS WHEN TESTS ARE SKIPPED OR NOT EXECUTED
 ```
 **IL-1: Evidence Mandate**
@@ -83,6 +84,7 @@ Count < 3 per any AC = FAIL.
 | IL-2 | Leader | brainstorm/init | scored (6-dimension rubric) |
 | IL-3 | Verifier | verification time | mechanical (TODO/blank scan) |
 | IL-4 | Verifier | verification time | scored (test count per AC) |
+| IL-5 | Verifier | verification time | mechanical (skip/pending/0-collected scan in test output) |
 - Violation of any Iron Law overrides all other verdict considerations — verdict MUST be FAIL.
 - When an Iron Law is violated, the verdict MUST be `fail` regardless of uncertainty.
@@ -479,6 +481,8 @@ Worker completes US → signal verify
   → Both pass → proceed (next US or COMPLETE)
   → Either fails → combined issues → fix contract → Worker retry
   → Max 3 consensus rounds per US → BLOCKED if still disagreeing
+**NO ENGINE PRIORITY:** Claude and Codex have equal weight. If one passes and the other fails, the verdict is FAIL. No engine may be prioritized or dismissed. Infrastructure failure = CLI crash, timeout, or verdict file not generated — NOT a valid verdict with verdict=fail.
 ```
 **Key rules:**

package/src/scripts/init_ralph_desk.zsh CHANGED Viewed

@@ -159,6 +159,7 @@ Check the iter-signal.json "us_id" field:
 4. **Scope Lock check**: (a) Read the Next Iteration Contract from campaign memory to identify the contracted US. (b) Run \`git diff --name-only\` to list all changed files. (c) For each changed file, verify it is plausibly related to the contracted US's acceptance criteria. (d) Flag files that appear unrelated. (e) Shared infrastructure (types, configs, common utilities) and dependency files are permitted if the AC implies them.
 5. **Layer Enforcement**: check test-spec L1/L2/L3/L4 sections. ANY section with TODO or blank = FAIL (IL-3).
 6. Run fresh verification: execute ALL commands from test-spec verification layers (L1, L2, L3, L4 as applicable)
+   **Skip detection (IL-5)**: After running tests, check output for "skip", "pending", "not run", or "0 items collected". Tests that did not actually execute do NOT count as passed. If test_count_executed < test_count_expected, verdict = FAIL ("skipped tests detected").
 7. Check each criterion against fresh evidence (only for the scoped US, or all if us_id=ALL)
 8. Run smoke test if defined in PRD
 9. **Test Sufficiency (IL-4)**: count test functions exercising each AC. Count < 3 per AC = FAIL.

package/src/scripts/run_ralph_desk.zsh CHANGED Viewed

@@ -1411,56 +1411,13 @@ run_consensus_verification() {
     # Consensus disagreement
     log_debug "[EXEC] iter=$iter phase=consensus_disagreement round=$CONSENSUS_ROUND claude=$CLAUDE_VERDICT codex=$CODEX_VERDICT action=fix_contract"
-    # --- Pre-existing failure detection ---
-    # Get files changed by Worker in this iteration
-    local worker_changed_files=""
-    worker_changed_files=$(cd "$ROOT" && git diff --name-only HEAD~1 HEAD 2>/dev/null || echo "")
-    log_debug "[EXEC] iter=$iter worker_changed_files=\"$worker_changed_files\""
-    # Check if ALL failing issues reference files NOT touched by the worker
-    local has_worker_caused_issues=0
-    local failing_verdict_file=""
-    if [[ "$CLAUDE_VERDICT" = "fail" ]]; then failing_verdict_file="$claude_verdict_file"
-    elif [[ "$CODEX_VERDICT" = "fail" ]]; then failing_verdict_file="$codex_verdict_file"
-    fi
-    if [[ -n "$failing_verdict_file" && -n "$worker_changed_files" ]]; then
-      # Extract file paths mentioned in issues and check against worker changes
-      local issue_files
-      issue_files=$(jq -r '.issues[]? | .description // ""' "$failing_verdict_file" 2>/dev/null)
-      for changed_file in $(echo "$worker_changed_files"); do
-        if echo "$issue_files" | grep -q "$changed_file" 2>/dev/null; then
-          has_worker_caused_issues=1
-          break
-        fi
-      done
-      if (( ! has_worker_caused_issues )); then
-        # None of the failing issues reference files the worker changed
-        log "  Pre-existing failure detected: failing tests are NOT in files changed by Worker."
-        log_debug "[EXEC] iter=$iter pre_existing_failure=true failing_engine=$([ \"$CLAUDE_VERDICT\" = 'fail' ] && echo claude || echo codex)"
-        # Treat as pass — the other engine passed, and failures are pre-existing
-        {
-          echo '{'
-          echo '  "verdict": "pass",'
-          echo '  "verified_at_utc": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'",'
-          echo '  "summary": "Consensus PASS (pre-existing failure filtered): claude='"$CLAUDE_VERDICT"' codex='"$CODEX_VERDICT"'. Failing tests not in worker-changed files.",'
-          echo '  "recommended_state_transition": "complete",'
-          echo '  "pre_existing_failure": true,'
-          echo '  "worker_changed_files": "'"$(echo $worker_changed_files | tr '\n' ',')"'",'
-          echo '  "consensus": {'
-          echo '    "claude": { "verdict": "'"$CLAUDE_VERDICT"'" },'
-          echo '    "codex": { "verdict": "'"$CODEX_VERDICT"'" },'
-          echo '    "round": '"$CONSENSUS_ROUND"
-          echo '  }'
-          echo '}'
-        } | atomic_write "$VERDICT_FILE"
-        return 0
-      fi
-    fi
+    # NOTE: pre_existing_failure heuristic was removed (v0.3.5).
+    # It used unreliable grep-in-description string matching to classify
+    # consensus failures as "pre-existing", bypassing the consensus rule.
+    # Consensus disagreement now ALWAYS flows to fix contract.
+    # Codex CLI crash (no verdict file) is handled upstream via run_single_verifier return 1 → BLOCKED.
-    # --- Worker-caused failure: build fix contract as before ---
+    # --- Consensus disagreement: build fix contract ---
     local fix_contract="$LOGS_DIR/iter-$(printf '%03d' $iter).fix-contract.md"
     {
       echo "# Fix Contract (Consensus Round $CONSENSUS_ROUND, iteration $iter)"