@ai-dev-methodologies/rlp-desk 0.3.4 → 0.3.6
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@ai-dev-methodologies/rlp-desk",
|
|
3
|
-
"version": "0.3.
|
|
3
|
+
"version": "0.3.6",
|
|
4
4
|
"description": "Fresh-context iterative loops for Claude Code — autonomous task completion with independent verification",
|
|
5
5
|
"scripts": {
|
|
6
6
|
"postinstall": "node scripts/postinstall.js",
|
package/src/commands/rlp-desk.md
CHANGED
|
@@ -107,7 +107,7 @@ Available run commands (copy the one you want):
|
|
|
107
107
|
# --max-iter N Max iterations (default: 100)
|
|
108
108
|
```
|
|
109
109
|
|
|
110
|
-
**CRITICAL: Do NOT offer to run for the user. Do NOT ask "
|
|
110
|
+
**CRITICAL: Do NOT offer to run for the user. Do NOT ask "shall I run?" or offer to execute. The user MUST type the run command themselves. Just present the options, recommend one, and STOP.**
|
|
111
111
|
|
|
112
112
|
---
|
|
113
113
|
|
|
@@ -318,6 +318,8 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
|
|
|
318
318
|
- **Either fails** → combine issues from both verdicts into a single fix contract → Worker retry
|
|
319
319
|
- Max 3 consensus rounds per US. After 3 rounds → BLOCKED.
|
|
320
320
|
|
|
321
|
+
**NO ENGINE PRIORITY (ABSOLUTE):** There is no primary or secondary engine. Claude and Codex have EQUAL weight. If one passes and the other fails, the verdict is FAIL — always. The Leader MUST NOT override, prioritize, or dismiss either engine's verdict. "Claude priority", "primary engine override", "infrastructure failure" (when a valid verdict file exists), or any similar rationalization = governance violation. Infrastructure failure means ONLY: CLI crash (exit ≠ 0), timeout, or verdict file not generated.
|
|
322
|
+
|
|
321
323
|
**⑦c Read verdict(s)**
|
|
322
324
|
- Read `verify-verdict.json` (or both `-claude.json` and `-codex.json` if consensus):
|
|
323
325
|
- `pass` + `complete` → write COMPLETE sentinel, report done!
|
|
@@ -345,6 +347,10 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
|
|
|
345
347
|
- Result status `[leader-measured]`
|
|
346
348
|
- Files changed via `git diff --stat HEAD~1 HEAD` `[git-measured]`
|
|
347
349
|
- Verifier verdict `[leader-measured]`
|
|
350
|
+
- **Record cost & performance per iteration**:
|
|
351
|
+
- Agent mode: record `total_tokens` and `duration_ms` from Agent() return metadata for both Worker and Verifier
|
|
352
|
+
- Tmux mode: record `duration_seconds` from shell timing. Estimate tokens from file sizes: `(prompt_bytes + done_claim_bytes + verdict_bytes) / 4` — label as "estimated"
|
|
353
|
+
- Write to `status.json`: `{"iter_N": {"worker_tokens": N, "worker_duration_ms": N, "verifier_tokens": N, "verifier_duration_ms": N, "token_source": "measured|estimated"}}`
|
|
348
354
|
- Write `status.json`
|
|
349
355
|
- Report via tool call: `Bash("echo 'Iter N | US-NNN | verdict | model | next_action'")` — NEVER plain text. This keeps the turn alive for the next iteration.
|
|
350
356
|
- **Always**: append to baseline.log: `[timestamp] iter=N verdict=<pass|fail|continue> us=<us_id> model=<worker_model>`
|
|
@@ -362,7 +368,7 @@ After the loop ends, the Leader performs post-campaign analysis:
|
|
|
362
368
|
3. **Generate versioned report**: `logs/<slug>/self-verification-report-NNN.md` (NNN = auto-increment from existing reports)
|
|
363
369
|
4. **Report to user**: Display the full report content
|
|
364
370
|
|
|
365
|
-
Report template (
|
|
371
|
+
Report template (10 sections):
|
|
366
372
|
|
|
367
373
|
```
|
|
368
374
|
# Campaign Self-Verification Report: <slug>
|
|
@@ -401,7 +407,12 @@ Weaknesses: systemic issues
|
|
|
401
407
|
### PRD (ambiguous or oversized ACs) — citing iter/AC
|
|
402
408
|
### Test-Spec (missing layers, weak mappings) — citing iter/AC
|
|
403
409
|
|
|
404
|
-
## 9.
|
|
410
|
+
## 9. Cost & Performance
|
|
411
|
+
Table: Iter | Role | Model | Tokens | Duration | Source
|
|
412
|
+
Aggregate: total Worker tokens, total Verifier tokens, total campaign tokens, total duration
|
|
413
|
+
Source: "measured" (Agent mode) or "estimated" (Tmux mode, from file sizes / 4)
|
|
414
|
+
|
|
415
|
+
## 10. Blind Spots
|
|
405
416
|
What this report CANNOT prove from available data
|
|
406
417
|
|
|
407
418
|
## Data Provenance Rule
|
|
@@ -454,9 +465,19 @@ Remove:
|
|
|
454
465
|
- `.claude/ralph-desk/memos/<slug>-escalation.md`
|
|
455
466
|
Note: `logs/<slug>/self-verification-data.json` and `self-verification-report-NNN.md` are intentionally preserved across clean for historical comparison.
|
|
456
467
|
|
|
457
|
-
If `--kill-session` is passed,
|
|
468
|
+
If `--kill-session` is passed, clean up ALL tmux artifacts:
|
|
458
469
|
```bash
|
|
470
|
+
# Kill rlp-desk tmux sessions
|
|
459
471
|
tmux list-sessions -F '#{session_name}' 2>/dev/null | grep "^rlp-desk-<slug>-" | while read s; do tmux kill-session -t "$s"; done
|
|
472
|
+
|
|
473
|
+
# Kill split panes in current window (Worker/Verifier panes from --mode tmux)
|
|
474
|
+
# Find panes running claude/codex for this slug and kill them
|
|
475
|
+
for pane_id in $(tmux list-panes -F '#{pane_id}:#{pane_current_command}' 2>/dev/null | grep -i 'claude\|codex' | cut -d: -f1); do
|
|
476
|
+
tmux kill-pane -t "$pane_id" 2>/dev/null
|
|
477
|
+
done
|
|
478
|
+
|
|
479
|
+
# Kill any remaining claude/codex processes for this campaign
|
|
480
|
+
ps aux | grep -E "claude.*<slug>|codex.*<slug>" | grep -v grep | awk '{print $2}' | xargs kill 2>/dev/null
|
|
460
481
|
```
|
|
461
482
|
|
|
462
483
|
## No args or `help`
|
package/src/governance.md
CHANGED
|
@@ -24,6 +24,7 @@ IL-1: NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE
|
|
|
24
24
|
IL-2: NO INIT WITHOUT AC QUALITY SCORE >= 6
|
|
25
25
|
IL-3: NO PASS WITH TODO IN ANY REQUIRED VERIFICATION LAYER
|
|
26
26
|
IL-4: NO PASS WITHOUT TEST COUNT >= AC COUNT x 3
|
|
27
|
+
IL-5: NO PASS WHEN TESTS ARE SKIPPED OR NOT EXECUTED
|
|
27
28
|
```
|
|
28
29
|
|
|
29
30
|
**IL-1: Evidence Mandate**
|
|
@@ -83,6 +84,7 @@ Count < 3 per any AC = FAIL.
|
|
|
83
84
|
| IL-2 | Leader | brainstorm/init | scored (6-dimension rubric) |
|
|
84
85
|
| IL-3 | Verifier | verification time | mechanical (TODO/blank scan) |
|
|
85
86
|
| IL-4 | Verifier | verification time | scored (test count per AC) |
|
|
87
|
+
| IL-5 | Verifier | verification time | mechanical (skip/pending/0-collected scan in test output) |
|
|
86
88
|
|
|
87
89
|
- Violation of any Iron Law overrides all other verdict considerations — verdict MUST be FAIL.
|
|
88
90
|
- When an Iron Law is violated, the verdict MUST be `fail` regardless of uncertainty.
|
|
@@ -479,6 +481,8 @@ Worker completes US → signal verify
|
|
|
479
481
|
→ Both pass → proceed (next US or COMPLETE)
|
|
480
482
|
→ Either fails → combined issues → fix contract → Worker retry
|
|
481
483
|
→ Max 3 consensus rounds per US → BLOCKED if still disagreeing
|
|
484
|
+
|
|
485
|
+
**NO ENGINE PRIORITY:** Claude and Codex have equal weight. If one passes and the other fails, the verdict is FAIL. No engine may be prioritized or dismissed. Infrastructure failure = CLI crash, timeout, or verdict file not generated — NOT a valid verdict with verdict=fail.
|
|
482
486
|
```
|
|
483
487
|
|
|
484
488
|
**Key rules:**
|
|
@@ -159,6 +159,7 @@ Check the iter-signal.json "us_id" field:
|
|
|
159
159
|
4. **Scope Lock check**: (a) Read the Next Iteration Contract from campaign memory to identify the contracted US. (b) Run \`git diff --name-only\` to list all changed files. (c) For each changed file, verify it is plausibly related to the contracted US's acceptance criteria. (d) Flag files that appear unrelated. (e) Shared infrastructure (types, configs, common utilities) and dependency files are permitted if the AC implies them.
|
|
160
160
|
5. **Layer Enforcement**: check test-spec L1/L2/L3/L4 sections. ANY section with TODO or blank = FAIL (IL-3).
|
|
161
161
|
6. Run fresh verification: execute ALL commands from test-spec verification layers (L1, L2, L3, L4 as applicable)
|
|
162
|
+
**Skip detection (IL-5)**: After running tests, check output for "skip", "pending", "not run", or "0 items collected". Tests that did not actually execute do NOT count as passed. If test_count_executed < test_count_expected, verdict = FAIL ("skipped tests detected").
|
|
162
163
|
7. Check each criterion against fresh evidence (only for the scoped US, or all if us_id=ALL)
|
|
163
164
|
8. Run smoke test if defined in PRD
|
|
164
165
|
9. **Test Sufficiency (IL-4)**: count test functions exercising each AC. Count < 3 per AC = FAIL.
|
|
@@ -1411,56 +1411,13 @@ run_consensus_verification() {
|
|
|
1411
1411
|
# Consensus disagreement
|
|
1412
1412
|
log_debug "[EXEC] iter=$iter phase=consensus_disagreement round=$CONSENSUS_ROUND claude=$CLAUDE_VERDICT codex=$CODEX_VERDICT action=fix_contract"
|
|
1413
1413
|
|
|
1414
|
-
#
|
|
1415
|
-
#
|
|
1416
|
-
|
|
1417
|
-
|
|
1418
|
-
|
|
1419
|
-
|
|
1420
|
-
# Check if ALL failing issues reference files NOT touched by the worker
|
|
1421
|
-
local has_worker_caused_issues=0
|
|
1422
|
-
local failing_verdict_file=""
|
|
1423
|
-
if [[ "$CLAUDE_VERDICT" = "fail" ]]; then failing_verdict_file="$claude_verdict_file"
|
|
1424
|
-
elif [[ "$CODEX_VERDICT" = "fail" ]]; then failing_verdict_file="$codex_verdict_file"
|
|
1425
|
-
fi
|
|
1426
|
-
|
|
1427
|
-
if [[ -n "$failing_verdict_file" && -n "$worker_changed_files" ]]; then
|
|
1428
|
-
# Extract file paths mentioned in issues and check against worker changes
|
|
1429
|
-
local issue_files
|
|
1430
|
-
issue_files=$(jq -r '.issues[]? | .description // ""' "$failing_verdict_file" 2>/dev/null)
|
|
1431
|
-
for changed_file in $(echo "$worker_changed_files"); do
|
|
1432
|
-
if echo "$issue_files" | grep -q "$changed_file" 2>/dev/null; then
|
|
1433
|
-
has_worker_caused_issues=1
|
|
1434
|
-
break
|
|
1435
|
-
fi
|
|
1436
|
-
done
|
|
1437
|
-
|
|
1438
|
-
if (( ! has_worker_caused_issues )); then
|
|
1439
|
-
# None of the failing issues reference files the worker changed
|
|
1440
|
-
log " Pre-existing failure detected: failing tests are NOT in files changed by Worker."
|
|
1441
|
-
log_debug "[EXEC] iter=$iter pre_existing_failure=true failing_engine=$([ \"$CLAUDE_VERDICT\" = 'fail' ] && echo claude || echo codex)"
|
|
1442
|
-
|
|
1443
|
-
# Treat as pass — the other engine passed, and failures are pre-existing
|
|
1444
|
-
{
|
|
1445
|
-
echo '{'
|
|
1446
|
-
echo ' "verdict": "pass",'
|
|
1447
|
-
echo ' "verified_at_utc": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'",'
|
|
1448
|
-
echo ' "summary": "Consensus PASS (pre-existing failure filtered): claude='"$CLAUDE_VERDICT"' codex='"$CODEX_VERDICT"'. Failing tests not in worker-changed files.",'
|
|
1449
|
-
echo ' "recommended_state_transition": "complete",'
|
|
1450
|
-
echo ' "pre_existing_failure": true,'
|
|
1451
|
-
echo ' "worker_changed_files": "'"$(echo $worker_changed_files | tr '\n' ',')"'",'
|
|
1452
|
-
echo ' "consensus": {'
|
|
1453
|
-
echo ' "claude": { "verdict": "'"$CLAUDE_VERDICT"'" },'
|
|
1454
|
-
echo ' "codex": { "verdict": "'"$CODEX_VERDICT"'" },'
|
|
1455
|
-
echo ' "round": '"$CONSENSUS_ROUND"
|
|
1456
|
-
echo ' }'
|
|
1457
|
-
echo '}'
|
|
1458
|
-
} | atomic_write "$VERDICT_FILE"
|
|
1459
|
-
return 0
|
|
1460
|
-
fi
|
|
1461
|
-
fi
|
|
1414
|
+
# NOTE: pre_existing_failure heuristic was removed (v0.3.5).
|
|
1415
|
+
# It used unreliable grep-in-description string matching to classify
|
|
1416
|
+
# consensus failures as "pre-existing", bypassing the consensus rule.
|
|
1417
|
+
# Consensus disagreement now ALWAYS flows to fix contract.
|
|
1418
|
+
# Codex CLI crash (no verdict file) is handled upstream via run_single_verifier return 1 → BLOCKED.
|
|
1462
1419
|
|
|
1463
|
-
# ---
|
|
1420
|
+
# --- Consensus disagreement: build fix contract ---
|
|
1464
1421
|
local fix_contract="$LOGS_DIR/iter-$(printf '%03d' $iter).fix-contract.md"
|
|
1465
1422
|
{
|
|
1466
1423
|
echo "# Fix Contract (Consensus Round $CONSENSUS_ROUND, iteration $iter)"
|