@ai-dev-methodologies/rlp-desk 0.16.0 → 0.18.0

@@ -599,7 +599,7 @@ Check the iter-signal.json "us_id" field:
  2. Read done claim
  3. Identify scope: run \`git diff --name-only\` to find changed files, then read those files + related imports only
  4. **Scope Lock check**: (a) Read the Next Iteration Contract from campaign memory to identify the contracted US. (b) Run \`git diff --name-only\` to list all changed files. (c) For each changed file, verify it is plausibly related to the contracted US's acceptance criteria. (d) Flag files that appear unrelated. (e) Shared infrastructure (types, configs, common utilities) and dependency files are permitted if the AC implies them.
- 5. **Layer Enforcement**: check test-spec L1/L2/L3/L4 sections. ANY section with TODO or blank = FAIL (IL-3).
+ 5. **Layer Enforcement (IL-3)**: confirm each REQUIRED layer is actually verified by a concrete PASSING check (a per-AC command in the Criteria-to-Verification table counts as L1/L3 coverage). Explicit "## L1/L2/L3" section headers and "N/A" markers are NOT mandatory — their absence is NOT a fail. FAIL a required layer ONLY when its verification is genuinely absent, blank, TODO, or failing never for format alone. (Identical for claude AND codex.)
  6. Run fresh verification: execute ALL commands from test-spec verification layers (L1, L2, L3, L4 as applicable)
  **Skip detection (IL-5)**: After running tests, check output for "skip", "pending", "not run", or "0 items collected". Tests that did not actually execute do NOT count as passed. If test_count_executed < test_count_expected, verdict = FAIL ("skipped tests detected").
  7. Check each criterion against fresh evidence (only for the scoped US, or all if us_id=ALL)
@@ -618,10 +618,11 @@ Check the iter-signal.json "us_id" field:
  - Rationalization red flags: "tests pass so it works" (passing ≠ correct), "Worker is confident" (confidence ≠ evidence), "changes are minimal" (scope ≠ correctness)
  10½. **Worker Process Audit**:
  - Test-first compliance: done-claim execution_steps must show write_test step before implement step for each AC
- - RED phase evidence: at least one verify_red step with exit_code=1 per AC (proves tests were written before passing)
+ - RED phase evidence: at least one verify_red step with exit_code=1 for the US (proves tests were written before passing). Per-AC RED is preferred, but AGGREGATE RED evidence is acceptable — do NOT FAIL merely because red/green is aggregated rather than per-AC.
  - Forbidden shortcuts: check done-claim claims and summary for forbidden phrases ("code inspection", "I'm confident", "too simple", "I'll test after", "already manually tested", "partial check")
  - Step completeness: each AC should have write_test → verify_red → implement → verify_green sequence in execution_steps
  - Planning Step presence: done-claim execution_steps should include a \`plan\` step as the first entry. If missing, record in reasoning as {"check": "Planning Step", "decision": "info", "basis": "plan step present/absent"} — informational only (does not affect pass/fail verdict)
+ 10¾. **FORMAT is not a PASS-blocker, but SUBSTANCE always is (F-17→F-18, identical for claude AND codex)**: when the acceptance criteria are met and their FRESH checks are green (per the Evidence Gate), record pure-FORMAT observations — missing layer-section headers, a missing N/A marker, RED evidence aggregated rather than per-AC — as warnings in reasoning, NOT as a FAIL. The iter-signal.json only identifies WHICH US to verify; its author (Worker vs leader-synthesized) does not change the verdict. But deliverable COMPLETENESS is NOT a format concern — if an AC's work is absent, uncommitted/untracked, or never actually exercised, that is a FAIL. The real correctness gates (Evidence Gate, Test Sufficiency IL-4, Skip detection IL-5, Anti-Gaming) stay strict regardless.
  11. **Reproducibility check**: verify lock file committed, clean install succeeds, security scan passes, env vars documented (per test-spec Reproducibility Gate). Skip if test-spec says "N/A."
  12. Write verdict JSON to: $DESK/memos/$SLUG-verify-verdict.json
  **CRITICAL: You MUST write the verdict as a FILE (not stdout/echo/cat). The Leader polls this file path — terminal output is lost. Evidence strings: include key metrics and exit codes only, do NOT quote full command output or logs verbatim.**
@@ -761,7 +762,7 @@ Based on your decision, update campaign memory:
  current direction. The wrapper polls this field for autonomous
  multi-mission orchestration (rlp-desk does not auto-launch missions —
  the consumer wrapper owns that policy). Field is OPTIONAL; absence is
- treated as null. See docs/multi-mission-orchestration.md for the
+ treated as null. See docs/rlp-desk/multi-mission-orchestration.md for the
  consumer-side polling pattern.
  FLYWHEEL_EOF
 
@@ -240,8 +240,93 @@ record_us_failure() {
  atomic_write() {
    local target="$1"
    local tmp="${target}.tmp.$$"
-   cat > "$tmp"
-   mv "$tmp" "$target"
+   # F-26: check BOTH stages. A truncated tmp (ENOSPC / SIGPIPE / full disk) must
+   # never be atomically renamed into the canonical path — a half-written
+   # complete/blocked/status sentinel would otherwise pass existence checks and
+   # mis-drive (or falsely terminate) the campaign. On failure: drop the tmp,
+   # leave the existing target untouched, and signal the error to callers that
+   # check. Behaviour on success is unchanged.
+   if ! cat > "$tmp"; then
+     rm -f "$tmp" 2>/dev/null
+     return 1
+   fi
+   if ! mv "$tmp" "$target" 2>/dev/null; then
+     rm -f "$tmp" 2>/dev/null
+     return 1
+   fi
+   return 0
+ }
+
+ # =============================================================================
+ # ZSH-4: race-safe per-slug lock acquisition (redesign, v0.17.1)
+ # =============================================================================
+ # Acquire an exclusive lock at $1 (a file holding the owner PID). Race-safe vs:
+ # (a) two concurrent stale-lock recoverers,
+ # (b) a normal starter slipping into the rm/create gap,
+ # (c) a recovery mutex leaked by a crashed recoverer.
+ # Algorithm: fast path is `set -C` (noclobber) atomic create. On contention with
+ # a STALE (dead-owner) lock, recovery is serialized by an atomic `mkdir` mutex
+ # whose own staleness is PID-based (never age-based, so a slow-but-alive recoverer
+ # is never falsely reaped). Inside the mutex we re-read the lock (don't clobber a
+ # live holder that recovered first) and re-acquire with `set -C` (so a starter
+ # that grabbed the lock in the gap wins instead of us). Echoes nothing; returns:
+ # 0 = acquired (caller should set LOCKFILE_ACQUIRED=1 and trap cleanup)
+ # 1 = busy (a live instance holds the lock) OR lost a recovery race — caller exits
+ acquire_slug_lock() {
+   local lockfile="$1"
+   mkdir -p "$(dirname "$lockfile")" 2>/dev/null
+   # Fast path: atomic noclobber create.
+   if (set -C; echo $$ > "$lockfile") 2>/dev/null; then
+     return 0
+   fi
+   local lock_pid
+   lock_pid=$(cat "$lockfile" 2>/dev/null)
+   if [[ -n "$lock_pid" ]] && kill -0 "$lock_pid" 2>/dev/null; then
+     return 1 # a live instance holds it
+   fi
+   # Stale lock (dead/unknown owner) — recover under an atomic mkdir mutex.
+   local rmutex="${lockfile}.recovery.d"
+   # Reap a leaked mutex ONLY when we can prove its owner is dead. An EMPTY owner
+   # is NOT proof of death: it usually means another recoverer just won the `mkdir`
+   # and has not yet written its PID (the window between `mkdir` and the owner
+   # write below). Reaping an empty-owner mutex here is a TOCTOU that deletes a
+   # LIVE mid-creation holder, letting two recoverers both proceed. So: a
+   # present-but-dead owner is reaped immediately; for an empty owner we give a
+   # brief settle window and re-read — if a PID appears it is a live holder and we
+   # do NOT reap (we lose the mkdir below and back off), and only if it stays
+   # empty do we treat it as a genuinely leaked mutex (creator died in the gap).
+   if [[ -d "$rmutex" ]]; then
+     local mowner
+     mowner=$(cat "$rmutex/owner" 2>/dev/null)
+     if [[ -z "$mowner" ]]; then
+       sleep 0.3
+       mowner=$(cat "$rmutex/owner" 2>/dev/null)
+     fi
+     if [[ -z "$mowner" ]] || ! kill -0 "$mowner" 2>/dev/null; then
+       rm -rf "$rmutex" 2>/dev/null
+     fi
+   fi
+   if ! mkdir "$rmutex" 2>/dev/null; then
+     return 1 # another recoverer owns the critical section
+   fi
+   echo $$ > "$rmutex/owner" 2>/dev/null
+   # Critical section: re-read the lock. If a prior recoverer installed a LIVE pid,
+   # do not clobber it.
+   local cur_pid
+   cur_pid=$(cat "$lockfile" 2>/dev/null)
+   if [[ -n "$cur_pid" && "$cur_pid" != "$$" ]] && kill -0 "$cur_pid" 2>/dev/null; then
+     rm -rf "$rmutex" 2>/dev/null
+     return 1
+   fi
+   # Replace the stale lock, re-acquiring with noclobber so a starter that slipped
+   # into the gap (and created the lock) wins — we lose cleanly instead of clobbering.
+   rm -f "$lockfile" 2>/dev/null
+   if ! (set -C; echo $$ > "$lockfile") 2>/dev/null; then
+     rm -rf "$rmutex" 2>/dev/null
+     return 1
+   fi
+   rm -rf "$rmutex" 2>/dev/null
+   return 0
  }

  # =============================================================================
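The fast path of `acquire_slug_lock` depends on `set -C` (noclobber) making the `>` redirection fail when the lock file already exists, so the existence check and the create happen in a single step. A minimal sketch of that idea alone follows. The helper name `try_lock` and the throwaway directory are assumptions for illustration, not part of rlp-desk, and the sketch deliberately omits the stale-lock recovery and the `mkdir` mutex that the hunk above adds.

```shell
#!/bin/sh
# Hypothetical helper (not part of rlp-desk): the noclobber fast path only.
try_lock() {
  # With `set -C`, `>` refuses to overwrite an existing file, so a second
  # caller fails here instead of silently replacing the owner PID.
  ( set -C; echo "$$" > "$1" ) 2>/dev/null
}

demo_dir=$(mktemp -d)
lockfile="$demo_dir/demo-slug.lock"

if try_lock "$lockfile"; then first=acquired; else first=busy; fi
if try_lock "$lockfile"; then second=acquired; else second=busy; fi
lock_owner=$(cat "$lockfile")

echo "first=$first second=$second"   # prints: first=acquired second=busy

rm -rf "$demo_dir"
```

The second attempt reports busy even though it comes from the same process, which is why the real function goes on to read the stored PID and probe it with `kill -0` before deciding whether the lock is stale.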
@@ -587,6 +672,12 @@ update_status() {
      verified_us_json=$(echo "$VERIFIED_US" | tr ',' '\n' | jq -R . | jq -s .)
    fi

+   # D-5: jq-encode the free-text restore fields so a reason/model with special
+   # chars can't corrupt the status JSON (the rest of this builder is echo-based).
+   local _lbr_json _owm_json
+   _lbr_json=$(printf '%s' "${LAST_BLOCK_REASON:-}" | jq -Rs . 2>/dev/null); [[ -z "$_lbr_json" ]] && _lbr_json='""'
+   _owm_json=$(printf '%s' "${_ORIGINAL_WORKER_MODEL:-}" | jq -Rs . 2>/dev/null); [[ -z "$_owm_json" ]] && _owm_json='""'
+
    # Build consensus fields
    local consensus_json=""
    if [[ "$CONSENSUS_MODE" != "off" ]]; then
@@ -615,6 +706,11 @@ update_status() {
      "consensus_mode": "'"$CONSENSUS_MODE"'",
      "last_result": "'"$last_result"'",
      "consecutive_failures": '"$CONSECUTIVE_FAILURES"',
+     "consecutive_blocks": '"${CONSECUTIVE_BLOCKS:-0}"',
+     "last_block_reason": '"$_lbr_json"',
+     "model_upgraded": '"${_MODEL_UPGRADED:-0}"',
+     "same_us_fail_count": '"${_SAME_US_FAIL_COUNT:-0}"',
+     "original_worker_model": '"$_owm_json"',
      "verified_us": '"$verified_us_json"''"$consensus_json"',
      "updated_at_utc": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
    }' | atomic_write "$STATUS_FILE"
@@ -756,9 +852,18 @@ _lint_test_density() {
    us_list=$(grep -oE '^##[[:space:]]+US-[0-9]+' "$prd_file" 2>/dev/null | grep -oE 'US-[0-9]+' | sort -u)
    [[ -z "$us_list" ]] && return 0

-   local audit_dir="${LOGS_DIR:-/tmp}"
-   local audit_file="$audit_dir/test-density-audit.jsonl"
-   [[ -d "$audit_dir" ]] || audit_file="/tmp/test-density-audit.jsonl"
+   # ZSH-8: prefer the campaign LOGS_DIR. When it is unavailable, avoid a fixed,
+   # predictable /tmp name (insecure-temp: symlink/collision risk) by creating a
+   # unique temp file via mktemp; fall back to a PID-scoped name only if mktemp
+   # is missing.
+   local audit_dir="${LOGS_DIR:-}"
+   local audit_file
+   if [[ -n "$audit_dir" && -d "$audit_dir" ]]; then
+     audit_file="$audit_dir/test-density-audit.jsonl"
+   else
+     audit_file=$(mktemp "${TMPDIR:-/tmp}/test-density-audit.XXXXXX" 2>/dev/null) \
+       || audit_file="${TMPDIR:-/tmp}/test-density-audit.$$.jsonl"
+   fi

    local us
    for us in ${(f)us_list}; do
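The ZSH-8 selection logic in the hunk above can be exercised in isolation. The sketch below wraps it in a hypothetical function, `pick_audit_file`, which is an illustration and not a name from rlp-desk. It uses POSIX `[ ]` tests instead of the script's `[[ ]]` so that it runs under any `sh`.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of rlp-desk) around the audit-file selection.
pick_audit_file() {
  dir="$1"
  if [ -n "$dir" ] && [ -d "$dir" ]; then
    # A usable logs directory was supplied: use the fixed name inside it.
    printf '%s\n' "$dir/test-density-audit.jsonl"
  else
    # mktemp creates the file under a random suffix; the PID-scoped name is
    # reached only when mktemp is unavailable or fails.
    mktemp "${TMPDIR:-/tmp}/test-density-audit.XXXXXX" 2>/dev/null \
      || printf '%s\n' "${TMPDIR:-/tmp}/test-density-audit.$$.jsonl"
  fi
}

logs_dir=$(mktemp -d)
with_logs=$(pick_audit_file "$logs_dir")
fallback_a=$(pick_audit_file "")
fallback_b=$(pick_audit_file "/nonexistent-logs-dir")

echo "with_logs=$with_logs"
echo "fallback_a=$fallback_a"
echo "fallback_b=$fallback_b"

# Clean up the demo artifacts.
rm -rf "$logs_dir"
rm -f "$fallback_a" "$fallback_b"
```

When `mktemp` is available, the two fallback paths differ from each other, which is the property the old fixed `/tmp/test-density-audit.jsonl` name lacked.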
@@ -1179,6 +1284,11 @@ Summary: $summary
  Completed at iteration $ITERATION.

  Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)" | atomic_write "$COMPLETE_SENTINEL"
+   # F-26: propagate atomic_write failure — never log success on a failed write.
+   if (( ${pipestatus[-1]:-0} != 0 )); then
+     log_error "FAILED to write COMPLETE sentinel ($COMPLETE_SENTINEL) — IO/disk error; completion NOT durably recorded"
+     return 1
+   fi
    log "COMPLETE sentinel written: $COMPLETE_SENTINEL"
  }
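The F-26 check above reads zsh's `pipestatus` array to recover `atomic_write`'s status from the end of the pipeline. By default a pipeline's own exit status is that of its last command, so the same propagation can be sketched portably by testing the pipeline directly. The block below is a simplified copy of `atomic_write` with plain variables in place of `local`. The demo paths and the `good_rc`/`bad_rc` names are assumptions for illustration, and a missing parent directory stands in for the ENOSPC case.

```shell
#!/bin/sh
# Simplified copy of atomic_write from the diff (plain variables instead of
# `local` so it also runs under a strict POSIX sh).
atomic_write() {
  aw_target="$1"
  aw_tmp="${aw_target}.tmp.$$"
  # Stage 1: write stdin to the temp file; a failed write must not be renamed.
  if ! { cat > "$aw_tmp"; } 2>/dev/null; then
    rm -f "$aw_tmp" 2>/dev/null
    return 1
  fi
  # Stage 2: rename into place; on failure the existing target is untouched.
  if ! mv "$aw_tmp" "$aw_target" 2>/dev/null; then
    rm -f "$aw_tmp" 2>/dev/null
    return 1
  fi
  return 0
}

work_dir=$(mktemp -d)
good="$work_dir/complete.md"
bad="$work_dir/missing-dir/complete.md"   # parent directory does not exist

# The pipeline's status is atomic_write's status, so `if` sees the failure.
if printf 'done\n' | atomic_write "$good"; then good_rc=0; else good_rc=1; fi
if printf 'done\n' | atomic_write "$bad"; then bad_rc=0; else bad_rc=1; fi
good_body=$(cat "$good")

echo "good_rc=$good_rc bad_rc=$bad_rc"   # prints: good_rc=0 bad_rc=1

rm -rf "$work_dir"
```

Capturing `pipestatus` into a variable, as the BLOCKED sentinel code does, is still the better fit when two writes have to be checked together after both have run.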
 
@@ -1288,6 +1398,7 @@ write_blocked_sentinel() {
      suggested_action: $action,
      meta: { blocked_hygiene_violated: $hygiene }
    }' | atomic_write "$json_path"
+   local _bs_json_rc=${pipestatus[-1]:-0}

    echo "BLOCKED: $us_id
  Reason: $reason
@@ -1298,7 +1409,16 @@ Category: $category
  Blocked at iteration $ITERATION.

  Timestamp: $now_iso" | atomic_write "$BLOCKED_SENTINEL"
-
+   local _bs_md_rc=${pipestatus[-1]:-0}
+
+   # F-26: propagate atomic_write failure. The "markdown ⇒ JSON" invariant means a
+   # half-written sentinel must surface loudly, not log false success. (Best-effort
+   # signal: callers already `return 1` after this, so we log+return rather than
+   # restructure every caller.)
+   if (( _bs_md_rc != 0 || _bs_json_rc != 0 )); then
+     log_error "FAILED to durably write BLOCKED sentinel (md_rc=$_bs_md_rc json_rc=$_bs_json_rc) for [$category] $reason — IO/disk error"
+     return 1
+   fi
    log_error "Campaign BLOCKED: [$category] $reason"
    log "BLOCKED sentinel written: $BLOCKED_SENTINEL"
    log "BLOCKED sidecar written: $json_path"