@ai-dev-methodologies/rlp-desk 0.3.4 → 0.3.6

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@ai-dev-methodologies/rlp-desk",
- "version": "0.3.4",
+ "version": "0.3.6",
  "description": "Fresh-context iterative loops for Claude Code — autonomous task completion with independent verification",
  "scripts": {
  "postinstall": "node scripts/postinstall.js",
@@ -107,7 +107,7 @@ Available run commands (copy the one you want):
  # --max-iter N Max iterations (default: 100)
  ```
 
- **CRITICAL: Do NOT offer to run for the user. Do NOT ask "실행할까요?" or "shall I run?". The user MUST type the run command themselves. Just present the options, recommend one, and STOP.**
+ **CRITICAL: Do NOT offer to run for the user. Do NOT ask "shall I run?" or offer to execute. The user MUST type the run command themselves. Just present the options, recommend one, and STOP.**
 
  ---
 
@@ -318,6 +318,8 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
  - **Either fails** → combine issues from both verdicts into a single fix contract → Worker retry
  - Max 3 consensus rounds per US. After 3 rounds → BLOCKED.
 
+ **NO ENGINE PRIORITY (ABSOLUTE):** There is no primary or secondary engine. Claude and Codex have EQUAL weight. If one passes and the other fails, the verdict is FAIL — always. The Leader MUST NOT override, prioritize, or dismiss either engine's verdict. "Claude priority", "primary engine override", "infrastructure failure" (when a valid verdict file exists), or any similar rationalization = governance violation. Infrastructure failure means ONLY: CLI crash (exit ≠ 0), timeout, or verdict file not generated.
+
  **⑦c Read verdict(s)**
  - Read `verify-verdict.json` (or both `-claude.json` and `-codex.json` if consensus):
  - `pass` + `complete` → write COMPLETE sentinel, report done!
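
Reviewer note: the NO ENGINE PRIORITY rule in this hunk reduces to a small decision function. The sketch below is illustrative only. `consensus_verdict` and its arguments are hypothetical names, not functions from the package, and it assumes each engine reports either `pass` or `fail`, with an empty string standing in for "verdict file not generated".

```bash
# Hypothetical helper (not part of rlp-desk): combine two verifier verdicts
# with equal weight. $1 = Claude verdict, $2 = Codex verdict.
# An empty argument models a missing verdict file (infrastructure failure).
consensus_verdict() {
  local claude="$1" codex="$2"

  # Infrastructure failure: at least one engine produced no verdict at all.
  if [ -z "$claude" ] || [ -z "$codex" ]; then
    echo "blocked"
    return 0
  fi

  # Equal weight: only a unanimous pass counts as a pass.
  if [ "$claude" = "pass" ] && [ "$codex" = "pass" ]; then
    echo "pass"
  else
    echo "fail"
  fi
}

consensus_verdict pass fail   # prints "fail"
```

Neither argument position is privileged, so swapping the two engines never changes the result.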
@@ -345,6 +347,10 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
  - Result status `[leader-measured]`
  - Files changed via `git diff --stat HEAD~1 HEAD` `[git-measured]`
  - Verifier verdict `[leader-measured]`
+ - **Record cost & performance per iteration**:
+ - Agent mode: record `total_tokens` and `duration_ms` from Agent() return metadata for both Worker and Verifier
+ - Tmux mode: record `duration_seconds` from shell timing. Estimate tokens from file sizes: `(prompt_bytes + done_claim_bytes + verdict_bytes) / 4` — label as "estimated"
+ - Write to `status.json`: `{"iter_N": {"worker_tokens": N, "worker_duration_ms": N, "verifier_tokens": N, "verifier_duration_ms": N, "token_source": "measured|estimated"}}`
  - Write `status.json`
  - Report via tool call: `Bash("echo 'Iter N | US-NNN | verdict | model | next_action'")` — NEVER plain text. This keeps the turn alive for the next iteration.
  - **Always**: append to baseline.log: `[timestamp] iter=N verdict=<pass|fail|continue> us=<us_id> model=<worker_model>`
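
Reviewer note: the Tmux-mode estimate `(prompt_bytes + done_claim_bytes + verdict_bytes) / 4` and the `status.json` entry in this hunk could be produced roughly as follows. This is a sketch under assumptions: `estimate_tokens` and `status_entry` are hypothetical helper names, the file paths are placeholders, and integer division is assumed for the byte-to-token conversion.

```bash
# Hypothetical helper: estimate tokens as (sum of file sizes in bytes) / 4.
# Files that do not exist are skipped rather than treated as an error.
estimate_tokens() {
  local total=0 size f
  for f in "$@"; do
    if [ -f "$f" ]; then
      size=$(wc -c < "$f")
      total=$((total + size))
    fi
  done
  echo $((total / 4))
}

# Hypothetical helper: print one status.json entry in the shape shown in the diff.
# $1 = iteration, $2 = worker tokens, $3 = worker ms,
# $4 = verifier tokens, $5 = verifier ms, $6 = measured|estimated
status_entry() {
  printf '{"iter_%s": {"worker_tokens": %s, "worker_duration_ms": %s, "verifier_tokens": %s, "verifier_duration_ms": %s, "token_source": "%s"}}\n' \
    "$1" "$2" "$3" "$4" "$5" "$6"
}
```

Because Tmux mode records `duration_seconds` while the `status.json` schema uses milliseconds, a caller would presumably convert first, for example `status_entry 3 "$tokens" $((secs * 1000)) 0 0 estimated`.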
@@ -362,7 +368,7 @@ After the loop ends, the Leader performs post-campaign analysis:
  3. **Generate versioned report**: `logs/<slug>/self-verification-report-NNN.md` (NNN = auto-increment from existing reports)
  4. **Report to user**: Display the full report content
 
- Report template (9 sections):
+ Report template (10 sections):
 
  ```
  # Campaign Self-Verification Report: <slug>
@@ -401,7 +407,12 @@ Weaknesses: systemic issues
  ### PRD (ambiguous or oversized ACs) — citing iter/AC
  ### Test-Spec (missing layers, weak mappings) — citing iter/AC
 
- ## 9. Blind Spots
+ ## 9. Cost & Performance
+ Table: Iter | Role | Model | Tokens | Duration | Source
+ Aggregate: total Worker tokens, total Verifier tokens, total campaign tokens, total duration
+ Source: "measured" (Agent mode) or "estimated" (Tmux mode, from file sizes / 4)
+
+ ## 10. Blind Spots
  What this report CANNOT prove from available data
 
  ## Data Provenance Rule
@@ -454,9 +465,19 @@ Remove:
  - `.claude/ralph-desk/memos/<slug>-escalation.md`
  Note: `logs/<slug>/self-verification-data.json` and `self-verification-report-NNN.md` are intentionally preserved across clean for historical comparison.
 
- If `--kill-session` is passed, also kill any tmux session matching `rlp-desk-<slug>-*`:
+ If `--kill-session` is passed, clean up ALL tmux artifacts:
  ```bash
+ # Kill rlp-desk tmux sessions
  tmux list-sessions -F '#{session_name}' 2>/dev/null | grep "^rlp-desk-<slug>-" | while read s; do tmux kill-session -t "$s"; done
+
+ # Kill split panes in current window (Worker/Verifier panes from --mode tmux)
+ # Find panes running claude/codex for this slug and kill them
+ for pane_id in $(tmux list-panes -F '#{pane_id}:#{pane_current_command}' 2>/dev/null | grep -i 'claude\|codex' | cut -d: -f1); do
+ tmux kill-pane -t "$pane_id" 2>/dev/null
+ done
+
+ # Kill any remaining claude/codex processes for this campaign
+ ps aux | grep -E "claude.*<slug>|codex.*<slug>" | grep -v grep | awk '{print $2}' | xargs kill 2>/dev/null
  ```
 
  ## No args or `help`
package/src/governance.md CHANGED
@@ -24,6 +24,7 @@ IL-1: NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE
  IL-2: NO INIT WITHOUT AC QUALITY SCORE >= 6
  IL-3: NO PASS WITH TODO IN ANY REQUIRED VERIFICATION LAYER
  IL-4: NO PASS WITHOUT TEST COUNT >= AC COUNT x 3
+ IL-5: NO PASS WHEN TESTS ARE SKIPPED OR NOT EXECUTED
  ```
 
  **IL-1: Evidence Mandate**
@@ -83,6 +84,7 @@ Count < 3 per any AC = FAIL.
  | IL-2 | Leader | brainstorm/init | scored (6-dimension rubric) |
  | IL-3 | Verifier | verification time | mechanical (TODO/blank scan) |
  | IL-4 | Verifier | verification time | scored (test count per AC) |
+ | IL-5 | Verifier | verification time | mechanical (skip/pending/0-collected scan in test output) |
 
  - Violation of any Iron Law overrides all other verdict considerations — verdict MUST be FAIL.
  - When an Iron Law is violated, the verdict MUST be `fail` regardless of uncertainty.
@@ -479,6 +481,8 @@ Worker completes US → signal verify
  → Both pass → proceed (next US or COMPLETE)
  → Either fails → combined issues → fix contract → Worker retry
  → Max 3 consensus rounds per US → BLOCKED if still disagreeing
+
+ **NO ENGINE PRIORITY:** Claude and Codex have equal weight. If one passes and the other fails, the verdict is FAIL. No engine may be prioritized or dismissed. Infrastructure failure = CLI crash, timeout, or verdict file not generated — NOT a valid verdict with verdict=fail.
  ```
 
  **Key rules:**
@@ -159,6 +159,7 @@ Check the iter-signal.json "us_id" field:
  4. **Scope Lock check**: (a) Read the Next Iteration Contract from campaign memory to identify the contracted US. (b) Run \`git diff --name-only\` to list all changed files. (c) For each changed file, verify it is plausibly related to the contracted US's acceptance criteria. (d) Flag files that appear unrelated. (e) Shared infrastructure (types, configs, common utilities) and dependency files are permitted if the AC implies them.
  5. **Layer Enforcement**: check test-spec L1/L2/L3/L4 sections. ANY section with TODO or blank = FAIL (IL-3).
  6. Run fresh verification: execute ALL commands from test-spec verification layers (L1, L2, L3, L4 as applicable)
+ **Skip detection (IL-5)**: After running tests, check output for "skip", "pending", "not run", or "0 items collected". Tests that did not actually execute do NOT count as passed. If test_count_executed < test_count_expected, verdict = FAIL ("skipped tests detected").
  7. Check each criterion against fresh evidence (only for the scoped US, or all if us_id=ALL)
  8. Run smoke test if defined in PRD
  9. **Test Sufficiency (IL-4)**: count test functions exercising each AC. Count < 3 per AC = FAIL.
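
Reviewer note: the IL-5 skip-detection step added in this hunk is a mechanical scan plus a count comparison. A minimal sketch follows, assuming the test output is available as a string and the executed and expected counts are already known. `has_skipped_tests` and `il5_verdict` are hypothetical names, and the pattern is only one possible reading of the four marker strings listed in the diff.

```bash
# Hypothetical helper: succeed (exit 0) if the test output contains any of
# the IL-5 markers. The "0 items collected" pattern requires a non-digit
# (or line start) before the 0 so that "10 items collected" does not match.
has_skipped_tests() {
  printf '%s\n' "$1" | grep -Eiq 'skip|pending|not run|(^|[^0-9])0 items collected'
}

# Hypothetical helper: apply IL-5.
# $1 = raw test output, $2 = tests executed, $3 = tests expected
il5_verdict() {
  local output="$1" executed="$2" expected="$3"
  if has_skipped_tests "$output" || [ "$executed" -lt "$expected" ]; then
    echo "fail"   # skipped tests detected
  else
    echo "pass"
  fi
}

il5_verdict "4 passed, 1 skipped" 5 5   # prints "fail"
```

A plain substring scan like this also fires on summary lines such as "0 skipped", so a real verifier would likely need runner-specific parsing. The sketch keeps the literal strings from the diff.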
@@ -1411,56 +1411,13 @@ run_consensus_verification() {
  # Consensus disagreement
  log_debug "[EXEC] iter=$iter phase=consensus_disagreement round=$CONSENSUS_ROUND claude=$CLAUDE_VERDICT codex=$CODEX_VERDICT action=fix_contract"
 
- # --- Pre-existing failure detection ---
- # Get files changed by Worker in this iteration
- local worker_changed_files=""
- worker_changed_files=$(cd "$ROOT" && git diff --name-only HEAD~1 HEAD 2>/dev/null || echo "")
- log_debug "[EXEC] iter=$iter worker_changed_files=\"$worker_changed_files\""
-
- # Check if ALL failing issues reference files NOT touched by the worker
- local has_worker_caused_issues=0
- local failing_verdict_file=""
- if [[ "$CLAUDE_VERDICT" = "fail" ]]; then failing_verdict_file="$claude_verdict_file"
- elif [[ "$CODEX_VERDICT" = "fail" ]]; then failing_verdict_file="$codex_verdict_file"
- fi
-
- if [[ -n "$failing_verdict_file" && -n "$worker_changed_files" ]]; then
- # Extract file paths mentioned in issues and check against worker changes
- local issue_files
- issue_files=$(jq -r '.issues[]? | .description // ""' "$failing_verdict_file" 2>/dev/null)
- for changed_file in $(echo "$worker_changed_files"); do
- if echo "$issue_files" | grep -q "$changed_file" 2>/dev/null; then
- has_worker_caused_issues=1
- break
- fi
- done
-
- if (( ! has_worker_caused_issues )); then
- # None of the failing issues reference files the worker changed
- log " Pre-existing failure detected: failing tests are NOT in files changed by Worker."
- log_debug "[EXEC] iter=$iter pre_existing_failure=true failing_engine=$([ \"$CLAUDE_VERDICT\" = 'fail' ] && echo claude || echo codex)"
-
- # Treat as pass — the other engine passed, and failures are pre-existing
- {
- echo '{'
- echo ' "verdict": "pass",'
- echo ' "verified_at_utc": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'",'
- echo ' "summary": "Consensus PASS (pre-existing failure filtered): claude='"$CLAUDE_VERDICT"' codex='"$CODEX_VERDICT"'. Failing tests not in worker-changed files.",'
- echo ' "recommended_state_transition": "complete",'
- echo ' "pre_existing_failure": true,'
- echo ' "worker_changed_files": "'"$(echo $worker_changed_files | tr '\n' ',')"'",'
- echo ' "consensus": {'
- echo ' "claude": { "verdict": "'"$CLAUDE_VERDICT"'" },'
- echo ' "codex": { "verdict": "'"$CODEX_VERDICT"'" },'
- echo ' "round": '"$CONSENSUS_ROUND"
- echo ' }'
- echo '}'
- } | atomic_write "$VERDICT_FILE"
- return 0
- fi
- fi
+ # NOTE: pre_existing_failure heuristic was removed (v0.3.5).
+ # It used unreliable grep-in-description string matching to classify
+ # consensus failures as "pre-existing", bypassing the consensus rule.
+ # Consensus disagreement now ALWAYS flows to fix contract.
+ # Codex CLI crash (no verdict file) is handled upstream via run_single_verifier return 1 → BLOCKED.
 
- # --- Worker-caused failure: build fix contract as before ---
+ # --- Consensus disagreement: build fix contract ---
  local fix_contract="$LOGS_DIR/iter-$(printf '%03d' $iter).fix-contract.md"
  {
  echo "# Fix Contract (Consensus Round $CONSENSUS_ROUND, iteration $iter)"