@windyroad/itil 0.47.6 → 0.47.7

@@ -497,5 +497,5 @@
  }
  },
  "name": "wr-itil",
- "version": "0.47.6"
+ "version": "0.47.7"
  }
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@windyroad/itil",
-   "version": "0.47.6",
+   "version": "0.47.7",
    "description": "ITIL-aligned IT service management for Claude Code (problem, and future incident/change skills)",
    "bin": {
      "windyroad-itil": "./bin/install.mjs"
@@ -368,9 +368,26 @@ claude -p \
  ITER_PID=$!

  SIGTERM_SENT=0
+ LAST_POLL_EPOCH=$DISPATCH_START_EPOCH
+ SUSPEND_OFFSET_S=0
+ EXPECTED_POLL_DELTA_S=60 # matches `sleep 60` cadence below
+ SUSPEND_JITTER_S=120 # tolerance above expected before treating gap as suspend (P307)
  while kill -0 "$ITER_PID" 2>/dev/null; do
-   sleep 60
+   sleep "$EXPECTED_POLL_DELTA_S"
    NOW=$(date +%s)
+   # P307 machine-sleep false-kill: when the host suspends between polls,
+   # wall-clock advances while the iter subprocess is itself suspended (no
+   # actual idle work). Detect the wall-clock jump and accumulate it into
+   # SUSPEND_OFFSET_S so IDLE_SECONDS (computed against NOW - SUSPEND_OFFSET_S
+   # below) reads active-elapsed rather than wall-clock-elapsed. Without
+   # this, laptop suspend falsely kills a completing iter (2026-05-26 iter 1
+   # evidence: idle jumped 481s -> 1016s -> 5544s across suspend gaps;
+   # SIGTERM fired at 5544s > 3600s, lost the iter's commit + cost metadata).
+   ACTUAL_POLL_DELTA=$(( NOW - LAST_POLL_EPOCH ))
+   if (( ACTUAL_POLL_DELTA > EXPECTED_POLL_DELTA_S + SUSPEND_JITTER_S )); then
+     SUSPEND_OFFSET_S=$(( SUSPEND_OFFSET_S + ACTUAL_POLL_DELTA - EXPECTED_POLL_DELTA_S ))
+   fi
+   LAST_POLL_EPOCH=$NOW
    LAST_COMMIT_EPOCH=$(git log -1 --format=%at HEAD 2>/dev/null || echo "$DISPATCH_START_EPOCH")
    # LAST_ACTIVITY_MARK = max(DISPATCH_START_EPOCH, last commit timestamp).
    # The dispatch-start floor handles skip-iterations that produce no commit:
@@ -381,7 +398,7 @@ while kill -0 "$ITER_PID" 2>/dev/null; do
    else
      LAST_ACTIVITY_MARK=$DISPATCH_START_EPOCH
    fi
-   IDLE_SECONDS=$(( NOW - LAST_ACTIVITY_MARK ))
+   IDLE_SECONDS=$(( NOW - SUSPEND_OFFSET_S - LAST_ACTIVITY_MARK ))
    if (( IDLE_SECONDS > IDLE_TIMEOUT_S )) && (( SIGTERM_SENT == 0 )); then
      kill -TERM "$ITER_PID" 2>/dev/null || true
      SIGTERM_SENT=1
@@ -409,6 +426,8 @@ rm -f "$ITER_JSON"
 
  **LAST_ACTIVITY_MARK signal trade-off.** The mark is `max(DISPATCH_START_EPOCH, last commit timestamp)`. The dispatch-start floor is intentional: skip-iterations that produce no commit (Step 4 routes a ticket to `action: skipped`) are bounded by `IDLE_TIMEOUT_S` since dispatch start, not by an arbitrarily-stale prior-commit timestamp. This protects against false-positive SIGTERM at iter T=0 when the most recent commit happens to be hours old. The trade-off is the inverse: a skip-iter that runs for `IDLE_TIMEOUT_S` (60 min default) will SIGTERM even though it never had a chance to commit. The 60-min default is well past the typical skip-iter wall-clock (a normal skip completes in seconds), so the trade-off rarely fires in practice; adopters who run unusually long skip-evaluation iters (e.g. deep architect-design probes) should raise `WORK_PROBLEMS_IDLE_TIMEOUT_S` accordingly. Alternative signals considered and rejected: `stat -f%m "$ITER_JSON"` (binary — file mtime only changes on subprocess exit, useless during the idle gap); subprocess RSS-change tracking (noisy; spikes during Agent-tool expansions confound the signal). The git-log signal is the cheapest reliable progress indicator the orchestrator already has.
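
The `max(DISPATCH_START_EPOCH, last commit timestamp)` rule described above can be sketched as a small function. This is an illustration only: the helper name `last_activity_mark` and the epoch values are hypothetical and do not appear in SKILL.md, which computes the mark inline in the poll loop.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of SKILL.md): returns
# max(dispatch_start_epoch, last_commit_epoch).
last_activity_mark() {
  local dispatch_start="$1"
  local last_commit="$2"
  if [ "$last_commit" -gt "$dispatch_start" ]; then
    printf '%d\n' "$last_commit"
  else
    printf '%d\n' "$dispatch_start"
  fi
}

# Stale commit made long before dispatch: the dispatch-start floor wins.
last_activity_mark 10000 2800
# Commit made during the iter: the commit timestamp wins.
last_activity_mark 10000 10450
```

With the illustrative values above, the first call prints `10000` and the second prints `10450`, so an hours-old commit cannot push the idle counter past the timeout at iter T=0.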
 
+ **Machine-sleep false-kill: suspend-detect heuristic (P307).** The `IDLE_SECONDS` computation above subtracts `SUSPEND_OFFSET_S` from wall-clock `NOW`, so the orchestrator measures *active-elapsed* time rather than raw wall-clock time between `LAST_ACTIVITY_MARK` and now. The offset grows by `ACTUAL_POLL_DELTA - EXPECTED_POLL_DELTA_S` whenever a poll observes `ACTUAL_POLL_DELTA > EXPECTED_POLL_DELTA_S + SUSPEND_JITTER_S` (default `60 + 120 = 180s`), i.e. whenever the gap between consecutive `sleep 60` polls far exceeds the cadence the loop scheduled. The driver is the 2026-05-26 iter 1 evidence: the iter's host suspended (lid-close mid-loop) and the next poll observed an idle of 5544s; the wall-clock-only computation tripped SIGTERM at 5544s > 3600s, producing exit 143 + 0-byte JSON (the P147 stuck-before-emit metadata-loss class) and losing a commit + cost metadata for an iter whose semantic work had completed. The suspend-detect heuristic converts that wall-clock-elapsed measure into an approximate active-elapsed measure without needing a monotonic clock. Alternatives considered and rejected: (a) monotonic / active-time clocks (POSIX `CLOCK_MONOTONIC` is not surfaced by `date` or `$EPOCHSECONDS`; would require a C helper or a Python-shim subprocess per poll); (b) an iter-side heartbeat file that the poll loop reads instead of wall-clock (works, but adds an iter-side write contract; suspend-detect is purely orchestrator-side, with no iter-prompt changes). The jitter buffer (`SUSPEND_JITTER_S=120`) is the load-bearing safety margin: it tolerates slow-hook / GC / brief-load-spike jitter (up to 180s total inter-poll delay) without falsely shifting; only genuine suspend / system-clock jumps cross the threshold. Adopters with unusually noisy hosts can raise `SUSPEND_JITTER_S` per environment; lowering it risks counting brief stalls as suspend. The heuristic is asymmetric: it can absorb a 5 min host hang into the offset and treat it as suspend, but the cost is at worst that one iter runs an extra 5 min before SIGTERM (cheaper than losing the iter's commit + metadata to a false-kill).
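
The threshold and increment rules in the paragraph above can be checked against a few sample poll gaps. The following is a minimal sketch under the documented defaults (`EXPECTED_POLL_DELTA_S=60`, `SUSPEND_JITTER_S=120`); the helper name `offset_increment` is hypothetical and exists only for this illustration.

```bash
#!/usr/bin/env bash
EXPECTED_POLL_DELTA_S=60
SUSPEND_JITTER_S=120

# Hypothetical helper: how much a single inter-poll gap would add to
# SUSPEND_OFFSET_S. Gaps at or below expected + jitter add nothing.
offset_increment() {
  local actual_delta="$1"
  if [ "$actual_delta" -gt "$(( EXPECTED_POLL_DELTA_S + SUSPEND_JITTER_S ))" ]; then
    printf '%d\n' "$(( actual_delta - EXPECTED_POLL_DELTA_S ))"
  else
    printf '0\n'
  fi
}

# Normal cadence, mild jitter, the exact boundary, just past it, and the
# 5544s gap from the 2026-05-26 evidence.
for delta in 60 90 180 181 5544; do
  printf 'gap=%ss -> offset += %ss\n' "$delta" "$(offset_increment "$delta")"
done
```

The first three gaps add `0`, the 181s gap adds `121`, and the 5544s gap adds `5484`, which leaves 60s of counted idle time for that poll instead of 5544s.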
+
  **Iteration prompt body (self-contained — the subprocess has no prior conversation context):**

  1. **Context**: this is one iteration of the AFK work-problems loop. The user is AFK. The orchestrator selected `P<NNN> (<title>)` as the highest-WSJF actionable ticket.
@@ -298,3 +298,124 @@ FAKE_EOF
    run grep -nE "P147" "$SKILL_FILE"
    [ "$status" -eq 0 ]
  }
+
+ # ---------------------------------------------------------------------------
+ # P307 machine-sleep false-kill subclass: P121's IDLE_SECONDS = NOW -
+ # LAST_ACTIVITY_MARK computation is wall-clock time. When the host machine
+ # suspends/sleeps between the 60s polls, wall-clock advances while the iter
+ # subprocess is itself suspended (no actual idle work). On resume, IDLE_SECONDS
+ # jumps past the threshold and SIGTERM fires on a subprocess that was
+ # genuinely making progress, not stuck. The 2026-05-26 evidence: poll log
+ # idle jumped non-linearly 481s -> 1016s -> 5544s across suspend gaps,
+ # SIGTERM at idle=5544s > 3600s threshold, exit 143 + 0-byte JSON (the
+ # P147 stuck-before-emit metadata-loss class).
+ #
+ # Fix: detect large wall-clock jumps between consecutive polls (>> 60s
+ # expected) as suspend events and shift LAST_ACTIVITY_MARK forward by the
+ # gap-minus-expected so IDLE_SECONDS approximates active-elapsed rather
+ # than wall-clock-elapsed. Pure-bash heuristic — no monotonic-clock
+ # dependency.
+ #
+ # @problem P307
+
+ # Pure-bash helper mirroring SKILL.md Step 5 suspend-detect math. Tests
+ # below pin the algorithm against parameter combinations exercising
+ # (a) normal poll cadence (no shift), (b) within-jitter delay (no shift),
+ # (c) detected suspend (shift forward by actual-minus-expected),
+ # (d) reproduction of the 2026-05-26 5544s evidence (large shift absorbs
+ # the gap). The shape returned is the EFFECTIVE LAST_ACTIVITY_MARK such
+ # that IDLE_SECONDS = NOW - effective_mark yields active-elapsed.
+ compute_effective_mark() {
+   local prev_mark="$1"
+   local prev_poll="$2"
+   local now="$3"
+   local expected_delta="${4:-60}"
+   local jitter="${5:-120}"
+
+   local actual_delta=$(( now - prev_poll ))
+   local threshold=$(( expected_delta + jitter ))
+   if (( actual_delta > threshold )); then
+     printf '%d\n' $(( prev_mark + actual_delta - expected_delta ))
+   else
+     printf '%d\n' "$prev_mark"
+   fi
+ }
+
+ @test "P307: normal poll cadence (60s actual delta) does NOT shift LAST_ACTIVITY_MARK" {
+   # 60s between polls is the expected `sleep 60` cadence; no suspend; mark
+   # unchanged. Guards against an over-eager heuristic that would shift on
+   # every normal poll.
+   run compute_effective_mark 1000 0 60 60 120
+   [ "$status" -eq 0 ]
+   [ "$output" = "1000" ]
+ }
+
+ @test "P307: within-jitter delay (90s actual delta) does NOT shift LAST_ACTIVITY_MARK" {
+   # 90s between polls is mild jitter (slow hook, GC pause, brief load
+   # spike); below the 60+120=180s suspend threshold; mark unchanged.
+   # Bounded noise must not trigger a shift.
+   run compute_effective_mark 1000 0 90 60 120
+   [ "$status" -eq 0 ]
+   [ "$output" = "1000" ]
+ }
+
+ @test "P307: at-threshold delay (180s actual delta) does NOT shift LAST_ACTIVITY_MARK" {
+   # Exactly at expected+jitter is the boundary; strict-greater-than test
+   # means no shift at the boundary. Adopters tuning the jitter window
+   # know 180s == EXPECTED_POLL_DELTA_S + SUSPEND_JITTER_S is the inclusive
+   # ceiling of the no-shift band.
+   run compute_effective_mark 1000 0 180 60 120
+   [ "$status" -eq 0 ]
+   [ "$output" = "1000" ]
+ }
+
+ @test "P307: detected suspend (300s actual delta) shifts mark forward by actual-minus-expected" {
+   # 300s between polls vastly exceeds the 180s threshold; treat as suspend
+   # event and shift mark forward by 300-60=240s. Effect: IDLE_SECONDS
+   # (NOW - effective_mark) reads 60s instead of 300s, preserving the
+   # subprocess from a wall-clock false-kill.
+   run compute_effective_mark 1000 0 300 60 120
+   [ "$status" -eq 0 ]
+   [ "$output" = "1240" ]
+ }
+
+ @test "P307: reproduces 2026-05-26 iter 1 evidence (5544s suspend gap shifts mark to absorb)" {
+   # Concrete reproduction of the production observation: poll saw idle
+   # jump to 5544s after a multi-hour laptop suspend. Without suspend-detect,
+   # SIGTERM fires at 5544s > 3600s threshold. With suspend-detect, mark
+   # shifts forward by 5544-60=5484s; IDLE_SECONDS = 5544 - 5484 = 60s,
+   # below threshold; iter survives.
+   run compute_effective_mark 0 0 5544 60 120
+   [ "$status" -eq 0 ]
+   [ "$output" = "5484" ]
+ }
+
+ @test "P307: SKILL.md Step 5 documents the suspend-detect heuristic" {
+   # Prose must name the heuristic so adopters reading the SKILL.md know
+   # how the poll loop survives machine-sleep without inventing one.
+   # Accept any of: "suspend-detect", "wall-clock jump", "machine sleep",
+   # "machine-sleep", or the constants EXPECTED_POLL_DELTA_S /
+   # SUSPEND_JITTER_S / SUSPEND_OFFSET_S that name the construct.
+   run grep -niE "suspend.?detect|wall.?clock jump|machine.?sleep|EXPECTED_POLL_DELTA_S|SUSPEND_JITTER_S|SUSPEND_OFFSET_S" "$SKILL_FILE"
+   [ "$status" -eq 0 ]
+ }
+
+ @test "P307: SKILL.md Step 5 cites P307 (machine-sleep false-kill driver)" {
+   run grep -nE "P307" "$SKILL_FILE"
+   [ "$status" -eq 0 ]
+ }
+
+ @test "P307: SKILL.md Step 5 trade-off paragraph names suspend-detect alongside skip-iter trade-off" {
+   # The LAST_ACTIVITY_MARK signal trade-off paragraph (existing at L410)
+   # enumerates alternatives considered and rejected (mtime, RSS). The
+   # suspend-detect addition belongs in the same locus per architect
+   # review — keeps the rationale chain (P121 -> P147 -> trade-off ->
+   # P307 suspend-detect) reading linearly rather than fragmenting into
+   # a separate section. Assert the trade-off paragraph names both
+   # SUSPEND_OFFSET_S (the accumulator) AND the EXPECTED_POLL_DELTA_S +
+   # SUSPEND_JITTER_S threshold so the rationale chain is complete.
+   run grep -nE "LAST_ACTIVITY_MARK signal trade-off" "$SKILL_FILE"
+   [ "$status" -eq 0 ]
+   run grep -niE "SUSPEND_OFFSET_S|suspend.?offset" "$SKILL_FILE"
+   [ "$status" -eq 0 ]
+ }
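
The tests above exercise `compute_effective_mark` one poll at a time and shift the mark forward, whereas the SKILL.md loop accumulates `SUSPEND_OFFSET_S` across polls and subtracts it from `NOW`. The sketch below suggests the two forms give the same idle figure when chained over several polls. The helper is restated in portable form so the block is self-contained, and the poll timestamps and the `base_mark` variable are hypothetical values chosen for illustration, not taken from the package.

```bash
#!/usr/bin/env bash
# Restatement of the test helper above, using [ -gt ] so it also runs
# under plain sh.
compute_effective_mark() {
  local prev_mark="$1" prev_poll="$2" now="$3"
  local expected_delta="${4:-60}" jitter="${5:-120}"
  local actual_delta threshold
  actual_delta=$(( now - prev_poll ))
  threshold=$(( expected_delta + jitter ))
  if [ "$actual_delta" -gt "$threshold" ]; then
    printf '%d\n' "$(( prev_mark + actual_delta - expected_delta ))"
  else
    printf '%d\n' "$prev_mark"
  fi
}

# Hypothetical poll times (seconds since dispatch); two suspend gaps.
base_mark=0      # stands in for LAST_ACTIVITY_MARK with no new commits
mark=$base_mark  # mark-shift form (as in the tests)
offset=0         # accumulator form (as in the SKILL.md loop)
prev=0
for now in 60 120 1500 1560 6000; do
  delta=$(( now - prev ))
  if [ "$delta" -gt 180 ]; then
    offset=$(( offset + delta - 60 ))
  fi
  mark=$(compute_effective_mark "$mark" "$prev" "$now" 60 120)
  printf 'now=%s wall_idle=%s offset_idle=%s mark_idle=%s\n' \
    "$now" "$(( now - base_mark ))" \
    "$(( now - offset - base_mark ))" "$(( now - mark ))"
  prev=$now
done
```

For these inputs the final line reads `now=6000 wall_idle=6000 offset_idle=300 mark_idle=300`: the wall-clock figure exceeds a 3600s timeout, while both adjusted forms agree on 300s. The equivalence is only shown for the case where `LAST_ACTIVITY_MARK` itself does not change during the run.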