@ai-dev-methodologies/rlp-desk 0.17.0 → 0.18.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -599,7 +599,7 @@ Check the iter-signal.json "us_id" field:
  2. Read done claim
  3. Identify scope: run \`git diff --name-only\` to find changed files, then read those files + related imports only
  4. **Scope Lock check**: (a) Read the Next Iteration Contract from campaign memory to identify the contracted US. (b) Run \`git diff --name-only\` to list all changed files. (c) For each changed file, verify it is plausibly related to the contracted US's acceptance criteria. (d) Flag files that appear unrelated. (e) Shared infrastructure (types, configs, common utilities) and dependency files are permitted if the AC implies them.
- 5. **Layer Enforcement**: check test-spec L1/L2/L3/L4 sections. ANY section with TODO or blank = FAIL (IL-3).
+ 5. **Layer Enforcement (IL-3)**: confirm each REQUIRED layer is actually verified by a concrete PASSING check (a per-AC command in the Criteria-to-Verification table counts as L1/L3 coverage). Explicit "## L1/L2/L3" section headers and "N/A" markers are NOT mandatory — their absence is NOT a fail. FAIL a required layer ONLY when its verification is genuinely absent, blank, TODO, or failing, never for format alone. (Identical for claude AND codex.)
  6. Run fresh verification: execute ALL commands from test-spec verification layers (L1, L2, L3, L4 as applicable)
  **Skip detection (IL-5)**: After running tests, check output for "skip", "pending", "not run", or "0 items collected". Tests that did not actually execute do NOT count as passed. If test_count_executed < test_count_expected, verdict = FAIL ("skipped tests detected").
  7. Check each criterion against fresh evidence (only for the scoped US, or all if us_id=ALL)
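The skip-detection rule (IL-5) can be sketched as a small bash check. This is a hypothetical illustration, not rlp-desk's implementation: the function name `detect_skipped_tests` and the choice to pass the executed and expected counts as arguments are assumptions. The marker phrases are the ones listed in the rule, and the sketch takes one conservative reading of it by treating either signal (a marker hit or a short count) as a FAIL.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the IL-5 skip-detection rule (not rlp-desk code).
# Usage: detect_skipped_tests <test-output-file> <executed-count> <expected-count>
# Prints a verdict line and returns 0 for PASS, 1 for FAIL.
detect_skipped_tests() {
  local output_file="$1" executed="$2" expected="$3"

  # Marker phrases from the rule, matched case-insensitively.
  if grep -Eiq 'skip|pending|not run|0 items collected' "$output_file"; then
    echo "FAIL (skip markers in test output)"
    return 1
  fi

  # Tests that never executed do not count as passed.
  if (( executed < expected )); then
    echo "FAIL (skipped tests detected: $executed of $expected executed)"
    return 1
  fi

  echo "PASS"
}

log=$(mktemp)
printf '12 passed in 0.41s\n' > "$log"
detect_skipped_tests "$log" 12 12              # PASS
detect_skipped_tests "$log" 10 12 || true      # FAIL (skipped tests detected: 10 of 12 executed)
printf '10 passed, 2 skipped in 0.38s\n' > "$log"
detect_skipped_tests "$log" 12 12 || true      # FAIL (skip markers in test output)
rm -f "$log"
```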
@@ -618,10 +618,11 @@ Check the iter-signal.json "us_id" field:
  - Rationalization red flags: "tests pass so it works" (passing ≠ correct), "Worker is confident" (confidence ≠ evidence), "changes are minimal" (scope ≠ correctness)
  10½. **Worker Process Audit**:
  - Test-first compliance: done-claim execution_steps must show write_test step before implement step for each AC
- - RED phase evidence: at least one verify_red step with exit_code=1 per AC (proves tests were written before passing)
+ - RED phase evidence: at least one verify_red step with exit_code=1 for the US (proves tests were written before passing). Per-AC RED is preferred, but AGGREGATE RED evidence is acceptable — do NOT FAIL merely because red/green is aggregated rather than per-AC.
  - Forbidden shortcuts: check done-claim claims and summary for forbidden phrases ("code inspection", "I'm confident", "too simple", "I'll test after", "already manually tested", "partial check")
  - Step completeness: each AC should have write_test → verify_red → implement → verify_green sequence in execution_steps
  - Planning Step presence: done-claim execution_steps should include a \`plan\` step as the first entry. If missing, record in reasoning as {"check": "Planning Step", "decision": "info", "basis": "plan step present/absent"} — informational only (does not affect pass/fail verdict)
+ 10¾. **FORMAT is not a PASS-blocker, but SUBSTANCE always is (F-17→F-18, identical for claude AND codex)**: when the acceptance criteria are met and their FRESH checks are green (per the Evidence Gate), record pure-FORMAT observations — missing layer-section headers, a missing N/A marker, RED evidence aggregated rather than per-AC — as warnings in reasoning, NOT as a FAIL. The iter-signal.json only identifies WHICH US to verify; its author (Worker vs leader-synthesized) does not change the verdict. But deliverable COMPLETENESS is NOT a format concern — if an AC's work is absent, uncommitted/untracked, or never actually exercised, that is a FAIL. The real correctness gates (Evidence Gate, Test Sufficiency IL-4, Skip detection IL-5, Anti-Gaming) stay strict regardless.
  11. **Reproducibility check**: verify lock file committed, clean install succeeds, security scan passes, env vars documented (per test-spec Reproducibility Gate). Skip if test-spec says "N/A."
  12. Write verdict JSON to: $DESK/memos/$SLUG-verify-verdict.json
  **CRITICAL: You MUST write the verdict as a FILE (not stdout/echo/cat). The Leader polls this file path — terminal output is lost. Evidence strings: include key metrics and exit codes only, do NOT quote full command output or logs verbatim.**
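The test-first and RED-evidence checks of the Worker Process Audit can be sketched in bash. Everything here is an assumption for illustration: the function name `audit_steps`, and especially the input format, which presumes the done-claim's `execution_steps` have already been flattened to one `<step> <exit_code>` line per entry (the real JSON layout is not shown in this diff). The sketch checks at the US level, accepting aggregate RED evidence as the 0.18.0 wording allows, and reports a missing leading `plan` step as information only.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of two Worker Process Audit checks (not rlp-desk code).
# Assumed input: one "<step> <exit_code>" pair per line, in execution order.
audit_steps() {
  local steps_file="$1"
  local step code n=0 first_step="" first_test=0 first_impl=0 red_seen=0

  while read -r step code; do
    n=$((n + 1))
    [[ -z "$first_step" ]] && first_step=$step
    case "$step" in
      write_test) (( first_test == 0 )) && first_test=$n ;;
      implement)  (( first_impl == 0 )) && first_impl=$n ;;
      verify_red) [[ "$code" == 1 ]] && red_seen=1 ;;
    esac
  done < "$steps_file"

  # Planning step: informational only, never affects the verdict.
  [[ "$first_step" == plan ]] || echo "info: plan step absent" >&2

  # Test-first: some write_test must come before the first implement.
  if (( first_test == 0 || first_impl == 0 || first_test > first_impl )); then
    echo "FAIL (no write_test before implement)"
    return 1
  fi

  # RED evidence: one failing verify_red for the US is enough (aggregate is OK).
  if (( red_seen == 0 )); then
    echo "FAIL (no verify_red step with exit_code=1)"
    return 1
  fi

  echo "PASS"
}

steps=$(mktemp)
printf '%s\n' 'plan 0' 'write_test 0' 'verify_red 1' 'implement 0' 'verify_green 0' > "$steps"
audit_steps "$steps"                      # PASS
rm -f "$steps"
```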
@@ -240,8 +240,93 @@ record_us_failure() {
  atomic_write() {
  local target="$1"
  local tmp="${target}.tmp.$$"
- cat > "$tmp"
- mv "$tmp" "$target"
+ # F-26: check BOTH stages. A truncated tmp (ENOSPC / SIGPIPE / full disk) must
+ # never be atomically renamed into the canonical path — a half-written
+ # complete/blocked/status sentinel would otherwise pass existence checks and
+ # mis-drive (or falsely terminate) the campaign. On failure: drop the tmp,
+ # leave the existing target untouched, and signal the error to callers that
+ # check. Behaviour on success is unchanged.
+ if ! cat > "$tmp"; then
+ rm -f "$tmp" 2>/dev/null
+ return 1
+ fi
+ if ! mv "$tmp" "$target" 2>/dev/null; then
+ rm -f "$tmp" 2>/dev/null
+ return 1
+ fi
+ return 0
+ }
+
+ # =============================================================================
+ # ZSH-4: race-safe per-slug lock acquisition (redesign, v0.17.1)
+ # =============================================================================
+ # Acquire an exclusive lock at $1 (a file holding the owner PID). Race-safe vs:
+ # (a) two concurrent stale-lock recoverers,
+ # (b) a normal starter slipping into the rm/create gap,
+ # (c) a recovery mutex leaked by a crashed recoverer.
+ # Algorithm: fast path is `set -C` (noclobber) atomic create. On contention with
+ # a STALE (dead-owner) lock, recovery is serialized by an atomic `mkdir` mutex
+ # whose own staleness is PID-based (never age-based, so a slow-but-alive recoverer
+ # is never falsely reaped). Inside the mutex we re-read the lock (don't clobber a
+ # live holder that recovered first) and re-acquire with `set -C` (so a starter
+ # that grabbed the lock in the gap wins instead of us). Echoes nothing; returns:
+ # 0 = acquired (caller should set LOCKFILE_ACQUIRED=1 and trap cleanup)
+ # 1 = busy (a live instance holds the lock) OR lost a recovery race — caller exits
+ acquire_slug_lock() {
+ local lockfile="$1"
+ mkdir -p "$(dirname "$lockfile")" 2>/dev/null
+ # Fast path: atomic noclobber create.
+ if (set -C; echo $$ > "$lockfile") 2>/dev/null; then
+ return 0
+ fi
+ local lock_pid
+ lock_pid=$(cat "$lockfile" 2>/dev/null)
+ if [[ -n "$lock_pid" ]] && kill -0 "$lock_pid" 2>/dev/null; then
+ return 1 # a live instance holds it
+ fi
+ # Stale lock (dead/unknown owner) — recover under an atomic mkdir mutex.
+ local rmutex="${lockfile}.recovery.d"
+ # Reap a leaked mutex ONLY when we can prove its owner is dead. An EMPTY owner
+ # is NOT proof of death: it usually means another recoverer just won the `mkdir`
+ # and has not yet written its PID (the window between `mkdir` and the owner
+ # write below). Reaping an empty-owner mutex here is a TOCTOU that deletes a
+ # LIVE mid-creation holder, letting two recoverers both proceed. So: a
+ # present-but-dead owner is reaped immediately; for an empty owner we give a
+ # brief settle window and re-read — if a PID appears it is a live holder and we
+ # do NOT reap (we lose the mkdir below and back off), and only if it stays
+ # empty do we treat it as a genuinely leaked mutex (creator died in the gap).
+ if [[ -d "$rmutex" ]]; then
+ local mowner
+ mowner=$(cat "$rmutex/owner" 2>/dev/null)
+ if [[ -z "$mowner" ]]; then
+ sleep 0.3
+ mowner=$(cat "$rmutex/owner" 2>/dev/null)
+ fi
+ if [[ -z "$mowner" ]] || ! kill -0 "$mowner" 2>/dev/null; then
+ rm -rf "$rmutex" 2>/dev/null
+ fi
+ fi
+ if ! mkdir "$rmutex" 2>/dev/null; then
+ return 1 # another recoverer owns the critical section
+ fi
+ echo $$ > "$rmutex/owner" 2>/dev/null
+ # Critical section: re-read the lock. If a prior recoverer installed a LIVE pid,
+ # do not clobber it.
+ local cur_pid
+ cur_pid=$(cat "$lockfile" 2>/dev/null)
+ if [[ -n "$cur_pid" && "$cur_pid" != "$$" ]] && kill -0 "$cur_pid" 2>/dev/null; then
+ rm -rf "$rmutex" 2>/dev/null
+ return 1
+ fi
+ # Replace the stale lock, re-acquiring with noclobber so a starter that slipped
+ # into the gap (and created the lock) wins — we lose cleanly instead of clobbering.
+ rm -f "$lockfile" 2>/dev/null
+ if ! (set -C; echo $$ > "$lockfile") 2>/dev/null; then
+ rm -rf "$rmutex" 2>/dev/null
+ return 1
+ fi
+ rm -rf "$rmutex" 2>/dev/null
+ return 0
  }
 
  # =============================================================================
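`acquire_slug_lock` rests on three shell primitives: a noclobber (`set -C`) redirect refuses to overwrite an existing file, `mkdir` fails if the directory already exists, and `kill -0` probes whether a PID can be signalled without actually sending a signal. The bash sketch below exercises each primitive in isolation inside a throwaway `mktemp -d` directory (the paths and the helper `try_create` are hypothetical). It deliberately does not reproduce the recovery-mutex protocol itself.

```shell
#!/usr/bin/env bash
# Exercises the primitives behind acquire_slug_lock; not the full algorithm.
dir=$(mktemp -d)
lock="$dir/demo.lock"                 # hypothetical lock path

# 1) noclobber: with set -C, '>' refuses to overwrite an existing file,
#    so only the first creator wins. The subshell keeps set -C local.
try_create() { (set -C; echo "$1" > "$lock") 2>/dev/null; }
try_create $$ && first=won || first=lost
try_create $$ && second=won || second=lost
echo "noclobber: first=$first second=$second"   # noclobber: first=won second=lost

# 2) mkdir fails if the directory already exists: a one-winner mutex.
mkdir "$dir/mutex.d" 2>/dev/null && m1=won || m1=lost
mkdir "$dir/mutex.d" 2>/dev/null && m2=won || m2=lost
echo "mkdir: first=$m1 second=$m2"              # mkdir: first=won second=lost

# 3) kill -0 sends no signal; it only reports whether the PID can be signalled.
#    A background job that has exited and been reaped by wait counts as dead.
sleep 0 & dead_pid=$!
wait "$dead_pid"
kill -0 $$ 2>/dev/null && self=alive || self=dead
kill -0 "$dead_pid" 2>/dev/null && reaped=alive || reaped=dead
echo "liveness: self=$self reaped=$reaped"      # normally: self=alive reaped=dead

rm -rf "$dir"
```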
@@ -587,6 +672,12 @@ update_status() {
  verified_us_json=$(echo "$VERIFIED_US" | tr ',' '\n' | jq -R . | jq -s .)
  fi
 
+ # D-5: jq-encode the free-text restore fields so a reason/model with special
+ # chars can't corrupt the status JSON (the rest of this builder is echo-based).
+ local _lbr_json _owm_json
+ _lbr_json=$(printf '%s' "${LAST_BLOCK_REASON:-}" | jq -Rs . 2>/dev/null); [[ -z "$_lbr_json" ]] && _lbr_json='""'
+ _owm_json=$(printf '%s' "${_ORIGINAL_WORKER_MODEL:-}" | jq -Rs . 2>/dev/null); [[ -z "$_owm_json" ]] && _owm_json='""'
+
  # Build consensus fields
  local consensus_json=""
  if [[ "$CONSENSUS_MODE" != "off" ]]; then
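The D-5 change relies on `jq -R` (treat input as raw text) combined with `-s` (slurp the whole input), which turns arbitrary text into one valid JSON string literal. A bash sketch of the same pattern, keeping the source's `'""'` fallback for when jq fails or is absent; the helper name `json_string` is hypothetical, and the `|| true` is added only so the assignment stays safe under `set -e`:

```shell
#!/usr/bin/env bash
# Encode arbitrary text as a single JSON string literal, as D-5 does.
# Falls back to an empty JSON string ('""') if jq is missing or fails.
json_string() {
  local out
  out=$(printf '%s' "$1" | jq -Rs . 2>/dev/null) || true
  [[ -z "$out" ]] && out='""'
  printf '%s\n' "$out"
}

reason='disk "full" on /tmp'
json_string "$reason"        # with jq installed: "disk \"full\" on /tmp"

# Echo-based splicing, by contrast, leaves the quotes unescaped:
echo '{"last_block_reason": "'"$reason"'"}'
# {"last_block_reason": "disk "full" on /tmp"}   <- not valid JSON
```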
@@ -615,6 +706,11 @@ update_status() {
  "consensus_mode": "'"$CONSENSUS_MODE"'",
  "last_result": "'"$last_result"'",
  "consecutive_failures": '"$CONSECUTIVE_FAILURES"',
+ "consecutive_blocks": '"${CONSECUTIVE_BLOCKS:-0}"',
+ "last_block_reason": '"$_lbr_json"',
+ "model_upgraded": '"${_MODEL_UPGRADED:-0}"',
+ "same_us_fail_count": '"${_SAME_US_FAIL_COUNT:-0}"',
+ "original_worker_model": '"$_owm_json"',
  "verified_us": '"$verified_us_json"''"$consensus_json"',
  "updated_at_utc": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
  }' | atomic_write "$STATUS_FILE"
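Each new numeric field is spliced into the echo-built JSON with a `${VAR:-0}` default. Without it, an unset or empty counter expands to nothing and leaves `"consecutive_blocks": ` with no value, which no JSON parser accepts. A minimal bash sketch of the same quoting pattern on a one-field object (the object is illustrative only):

```shell
#!/usr/bin/env bash
# The status builder's splice pattern: close the single-quoted JSON text,
# insert a double-quoted expansion, then reopen the single quotes.
CONSECUTIVE_BLOCKS=''   # simulate an unset/empty counter

bare='{ "consecutive_blocks": '"$CONSECUTIVE_BLOCKS"' }'
safe='{ "consecutive_blocks": '"${CONSECUTIVE_BLOCKS:-0}"' }'

echo "$bare"   # { "consecutive_blocks":  }   <- not valid JSON
echo "$safe"   # { "consecutive_blocks": 0 }
```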
@@ -1188,6 +1284,11 @@ Summary: $summary
  Completed at iteration $ITERATION.
 
  Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)" | atomic_write "$COMPLETE_SENTINEL"
+ # F-26: propagate atomic_write failure — never log success on a failed write.
+ if (( ${pipestatus[-1]:-0} != 0 )); then
+ log_error "FAILED to write COMPLETE sentinel ($COMPLETE_SENTINEL) — IO/disk error; completion NOT durably recorded"
+ return 1
+ fi
  log "COMPLETE sentinel written: $COMPLETE_SENTINEL"
  }
 
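F-26 depends on reading the exit status of the last stage of `... | atomic_write`. The script is zsh, where that is `${pipestatus[-1]}` (a 1-based array indexed from the end); in bash the array is `PIPESTATUS` and is 0-based, so for a two-stage pipeline the writer's status is `${PIPESTATUS[1]}`. Either way it must be read by the very next command, because any later command replaces it. The bash sketch below copies this release's `atomic_write` body and adds a hypothetical caller, `write_sentinel`, that refuses to log success after a failed write; a missing parent directory stands in for the disk errors the F-26 comment mentions.

```shell
#!/usr/bin/env bash
# atomic_write as in this release (F-26): write stdin to a tmp file, then rename.
atomic_write() {
  local target="$1"
  local tmp="${target}.tmp.$$"
  if ! cat > "$tmp"; then
    rm -f "$tmp" 2>/dev/null
    return 1
  fi
  if ! mv "$tmp" "$target" 2>/dev/null; then
    rm -f "$tmp" 2>/dev/null
    return 1
  fi
  return 0
}

# Hypothetical caller: propagate a failed write instead of logging success.
write_sentinel() {
  local path="$1"
  printf 'Completed.\n' | atomic_write "$path"
  if (( ${PIPESTATUS[1]} != 0 )); then   # bash: 0-based; zsh: ${pipestatus[-1]}
    echo "FAILED to write sentinel ($path)" >&2
    return 1
  fi
  echo "sentinel written: $path"
}

dir=$(mktemp -d)
write_sentinel "$dir/COMPLETE"                              # succeeds
write_sentinel "$dir/missing/COMPLETE" 2>/dev/null || true  # fails: no parent dir
ls "$dir"                                                   # only COMPLETE, no stray .tmp
rm -rf "$dir"
```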
@@ -1297,6 +1398,7 @@ write_blocked_sentinel() {
  suggested_action: $action,
  meta: { blocked_hygiene_violated: $hygiene }
  }' | atomic_write "$json_path"
+ local _bs_json_rc=${pipestatus[-1]:-0}
 
  echo "BLOCKED: $us_id
  Reason: $reason
@@ -1307,7 +1409,16 @@ Category: $category
  Blocked at iteration $ITERATION.
 
  Timestamp: $now_iso" | atomic_write "$BLOCKED_SENTINEL"
-
+ local _bs_md_rc=${pipestatus[-1]:-0}
+
+ # F-26: propagate atomic_write failure. The "markdown ⇒ JSON" invariant means a
+ # half-written sentinel must surface loudly, not log false success. (Best-effort
+ # signal: callers already `return 1` after this, so we log+return rather than
+ # restructure every caller.)
+ if (( _bs_md_rc != 0 || _bs_json_rc != 0 )); then
+ log_error "FAILED to durably write BLOCKED sentinel (md_rc=$_bs_md_rc json_rc=$_bs_json_rc) for [$category] $reason — IO/disk error"
+ return 1
+ fi
  log_error "Campaign BLOCKED: [$category] $reason"
  log "BLOCKED sentinel written: $BLOCKED_SENTINEL"
  log "BLOCKED sidecar written: $json_path"
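`write_blocked_sentinel` writes twice but checks once, so each writer's status is saved into a variable (`_bs_json_rc`, `_bs_md_rc`) as soon as its pipeline finishes, before the next pipeline overwrites it. A bash sketch of that capture-then-combine pattern; `store` is a hypothetical stand-in for `atomic_write` that fails only when its target is the literal string `FAIL`, and `write_two` is likewise illustrative:

```shell
#!/usr/bin/env bash
# Capture-then-combine: save each pipeline's last-stage status right away,
# because the next pipeline replaces PIPESTATUS.
# store() is a hypothetical stand-in for atomic_write.
store() { cat > /dev/null; [[ "$1" != FAIL ]]; }

write_two() {
  local json_target="$1" md_target="$2"
  printf '{}' | store "$json_target"
  local json_rc=${PIPESTATUS[1]}        # expanded before `local` runs
  printf 'BLOCKED' | store "$md_target"
  local md_rc=${PIPESTATUS[1]}
  if (( json_rc != 0 || md_rc != 0 )); then
    echo "FAILED (md_rc=$md_rc json_rc=$json_rc)"
    return 1
  fi
  echo "OK"
}

write_two a.json a.md                 # OK
write_two FAIL a.md || true           # FAILED (md_rc=0 json_rc=1)
```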