npm - pi-crew - Versions diffs - 0.9.4 → 0.9.5 - Mend

pi-crew 0.9.4 → 0.9.5


Files changed (22)

package/CHANGELOG.md +98 -0
package/README.md +54 -2
package/docs/troubleshooting.md +26 -0
package/package.json +1 -1
package/src/extension/command-completions.ts +1 -0
package/src/extension/crew-shortcuts.ts +1 -0
package/src/extension/register.ts +2 -0
package/src/extension/registration/commands.ts +3 -0
package/src/extension/team-tool/goal.ts +1 -0
package/src/extension/team-tool/run.ts +2 -0
package/src/runtime/background-runner.ts +23 -1
package/src/runtime/chain-runner.ts +1 -0
package/src/runtime/crash-recovery.ts +78 -36
package/src/runtime/dynamic-workflow-runner.ts +1 -0
package/src/runtime/goal-loop-runner.ts +2 -0
package/src/runtime/live-session-runtime.ts +1 -0
package/src/runtime/model-scope.ts +1 -0
package/src/runtime/peer-dep.ts +1 -0
package/src/runtime/resilient-edit.ts +1 -0
package/src/runtime/task-runner.ts +1 -0
package/src/state/hook-instinct-bridge.ts +3 -0
package/src/utils/bm25-search.ts +2 -0

package/CHANGELOG.md CHANGED

@@ -1,5 +1,103 @@
 # Changelog
+## [v0.9.5] — fix "team run hangs forever at 25%" (2026-06-23)
+Two coupled runtime bugs caused the recurring "run stuck at 25% (1/4)" failure
+observed across 4+ consecutive review/fast-fix runs. Both are now fixed; full
+diagnostics (background.log, events.jsonl, heartbeat.json) are preserved for
+all runs.
+### Bug X — `purgeStaleActiveRunIndex` destroyed the run's stateRoot (proximate cause)
+**File:** `src/runtime/crash-recovery.ts`
+**What was wrong:** `purgeStaleActiveRunIndex` decided whether a run was
+"orphaned" using `entry.updatedAt`, which is **frozen at registration** and
+never refreshed during execution. A long-running legitimate async run whose
+background worker had exited (e.g. after a 5–15 min explorer) would have its
+entire durable state (manifest/tasks/events/heartbeat) hard-deleted. Because
+`saveRunTasks()` silently no-ops once the state dir is missing, the workflow
+could never advance past the current task → **permanent invisible hang**
+("Run not found"), with all diagnostics lost.
+**Fix:**
+- Liveness now corroborated via (a) the on-disk `manifest.updatedAt` (rewritten
+  on every task transition) and (b) the team-level `heartbeat.json` mtime —
+  any one of which is sufficient to declare the run live.
+- Cancelling a run now **keeps its stateRoot** so the run stays queryable and
+  resumable, and its diagnostics survive. The finished-run pruner removes the
+  directory later on its normal schedule.
+- Removed two redundant `saveRunManifest(fullLoaded.manifest)` calls that
+  were clobbering the freshly-saved `cancelled` status back to `running`.
+**New regression test:** `test/unit/crash-recovery-purge-liveness.test.ts`
+(3 cases: fresh manifest kept, orphan cancelled-but-preserved, fresh
+heartbeat kept — all using a live-worker-then-reap + `now`-time-shift
+harness to deterministically simulate the registration-then-aging race).
+### Bug Y — background runner crashed with EPIPE on the first post-detach `console.debug` (root cause)
+**File:** `src/runtime/background-runner.ts`
+**What was wrong:** The in-process console redirect only covered `console.log`
+and `console.error`; `console.debug` and `console.warn` still wrote to the
+original stdout/stderr pipes. The background runner is spawned with
+`detached:true` + `setsid:true`, so the parent disconnects the stdio pipes
+immediately after spawn. The first post-detach `console.debug` call from
+`team-runner.ts:242` (inside `mergeTaskUpdatesPreservingTerminal` →
+"Skipping stale merge") hit the closed stdout → unhandled `EPIPE` error →
+**process exit** → scheduler dead → run stuck at 25% forever.
+Prior investigators saw only "the run died silently right after explorer
+completed" and concluded (incorrectly) that the cause was a native crash
+(SIGKILL/segfault/V8 heap-OOM), because their [DIAG] handlers never fired.
+In reality the diagnostic handlers DID fire, but on an `EPIPE` write error,
+which `process.on('error')` doesn't catch. The fix below makes the crash
+observable AND non-fatal.
+**Fix:**
+- Extend the console redirect to also cover `console.debug` and `console.warn`,
+  so they go to the log file (logFd) instead of the disconnected stdio pipes.
+- Wrap the `fs.writeSync` in try-catch so any log-write failure (closed fd,
+  ENOSPC, etc.) can never crash the scheduler. The scheduler log is
+  best-effort by design.
+**New regression test:** `test/unit/background-runner-console-redirect.test.ts`
+(4 cases: undefined logFd no-op, valid logFd writes correctly, EBADF on
+closed logFd is swallowed, post-undefined fd-toggle is safe). Replicates the
+`origWrite` pattern from the source so any drift between the two is easy to
+spot.
+### Why this took multiple attempts
+All prior attempts to diagnose the hang destroyed the only evidence (the
+stateRoot) the moment the `purgeStaleActiveRunIndex` heuristic misfired.
+The chain was always the same: a worker exits for any reason → purge sees
+dead PID + frozen-stale entry → **deletes stateRoot** → the run becomes
+"Run not found" with no log, no events, no heartbeat, no way to even resume.
+That hid the real cause (Bug Y) for the entire series of failed diagnostic
+runs. With Bug X fixed, the diagnostic trail (background.log 345 KB +
+events.jsonl 166 KB) survives long enough to read the actual EPIPE crash
+that Bug Y left behind.
+### Verification
+- 7/7 new regression tests pass (`crash-recovery-purge-liveness.test.ts` +
+  `background-runner-console-redirect.test.ts`).
+- Existing crash-recovery / active-run-registry / stale-reconciler /
+  async-stale / run-accumulation / auto-recovery suites: 71/71 pass.
+- End-to-end: a 4-step review run now advances 3/4 tasks (75%) instead of
+  hanging at 25%; the verify step still fails, but only for environmental
+  reasons (memory OOM under load), not because of the fix.
+- `npx tsc --noEmit` is green.
+### Notes for users
+If you have a stuck "running" run from v0.9.4 or earlier (the symptom was
+"Run not found" / "25% hang" / "had to kill pi"), upgrading alone will not
+recover it — its `stateRoot` was already destroyed by the buggy purge.
+Re-dispatch the workflow. New runs are fully protected.
 ## [v0.9.4] — fix macOS CI: benchmark allowlist + cross-platform fixtures (2026-06-23)
 Patch fix for a CI failure introduced in v0.9.3 (caught by the macOS CI job,
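
The v0.9.5 entry above mentions that two redundant `saveRunManifest(fullLoaded.manifest)` calls were overwriting the freshly saved `cancelled` status with `running`. A minimal sketch of how that kind of clobbering can happen follows. It assumes the status updater persists a new copy and leaves the caller's object untouched, which this diff does not show. The names `disk`, `save`, `updateStatus`, `cancelBuggy` and `cancelFixed` are hypothetical stand-ins, not pi-crew APIs.

```typescript
type Status = "running" | "cancelled";

interface Manifest {
	runId: string;
	status: Status;
	updatedAt: string;
}

// Stand-in for the manifest.json files on disk (hypothetical).
const disk = new Map<string, Manifest>();

function save(manifest: Manifest): void {
	disk.set(manifest.runId, { ...manifest });
}

// Assumption: the updater writes a NEW object and does not mutate its argument.
function updateStatus(manifest: Manifest, status: Status, nowIso: string): Manifest {
	const next: Manifest = { ...manifest, status, updatedAt: nowIso };
	save(next);
	return next;
}

// Old shape: the extra save() writes the stale "running" snapshot back.
function cancelBuggy(loaded: Manifest, nowIso: string): void {
	updateStatus(loaded, "cancelled", nowIso);
	save(loaded);
}

// New shape: the updater is the only writer, so "cancelled" sticks.
function cancelFixed(loaded: Manifest, nowIso: string): void {
	updateStatus(loaded, "cancelled", nowIso);
}
```

Under that assumption, dropping the second write is enough; no locking or re-reading is needed for this particular symptom.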

package/README.md CHANGED

@@ -39,13 +39,65 @@ npm: pi-crew
 repo: https://github.com/baphuongna/pi-crew
 ```
-**v0.9.0**: See [CHANGELOG.md](CHANGELOG.md).
+**v0.9.4 / v0.9.5**: See [CHANGELOG.md](CHANGELOG.md).
-### Highlights (v0.6.4 → v0.9.0)
+### Highlights (v0.6.4 → v0.9.5)
 A long arc of **trust, cliff-resilience, and robustness** work. Principle: *build
 trust and cliff-resilience, stay lean, delete before adding.*
+#### v0.9.5 — fix "team run hangs forever at 25%" (2026-06-23)
+Two coupled runtime bugs caused recurring "run stuck at 25% (1/4)" failures
+across 4+ consecutive review/fast-fix runs. The combined symptom: scheduler
+appears to stop responding right after the first task (explorer) finishes, no
+progress to task 2, and `team action='status'` returns "Run not found" with
+**no diagnostic trail** to investigate. Manual `kill` of the parent `pi`
+process was the only workaround.
+- **🩹 Bug X (proximate cause)** — `purgeStaleActiveRunIndex`
+  (`src/runtime/crash-recovery.ts`) destroyed a run's `stateRoot` based on a
+  **frozen** `entry.updatedAt` (set once at registration, never refreshed).
+  Any long-running legitimate async run (≥5 min) whose worker had exited
+  lost its entire durable state. `saveRunTasks()` then silently no-op'd on
+  the missing dir, and the workflow could never advance. Fix: corroborate
+  liveness via the on-disk `manifest.updatedAt` or the team-level
+  `heartbeat.json` (either suffices); keep `stateRoot` on cancel so runs
+  stay queryable and resumable.
+- **🩹 Bug Y (root cause — why the scheduler died in the first place)** —
+  `src/runtime/background-runner.ts` redirected only `console.log` /
+  `console.error` to the log file. The first post-detach `console.debug`
+  call from `team-runner.ts:242` (inside `mergeTaskUpdatesPreservingTerminal`
+  → "Skipping stale merge") hit the disconnected stdout pipe → unhandled
+  `EPIPE` → process exit. Prior investigators concluded (incorrectly) that
+  the cause was a native crash, because diagnostic `[DIAG]` handlers never
+  fired on the EPIPE. Fix: extend the console redirect to `console.debug` /
+  `console.warn`, and wrap `fs.writeSync` in try-catch so any log-write
+  failure can never crash the scheduler.
+- **🧪 Regression coverage** — 7 new tests: 3 in
+  `test/unit/crash-recovery-purge-liveness.test.ts` (fresh-manifest-kept,
+  orphan-cancelled-preserved, fresh-heartbeat-kept) + 4 in
+  `test/unit/background-runner-console-redirect.test.ts` (drift-detector
+  pattern that exercises undefined / valid / EBADF / post-toggle logFd).
+- **📖 See [CHANGELOG.md](CHANGELOG.md) for full details**, including
+  why prior attempts to diagnose the hang kept destroying the only
+  evidence (Bug X nuked the stateRoot before anyone could read the EPIPE
+  crash in Bug Y).
+> **Recovering a stuck run from v0.9.4 or earlier:** the `stateRoot` for
+> those runs is already gone. Re-dispatch the workflow — new runs are
+> fully protected.
+#### v0.9.4 — macOS CI fixture (2026-06-23)
+- **🧪 BSD-vs-GNU grep fix** — benchmark test fixtures used
+  `grep --help` (exits 0 on GNU/Linux, exits 2 on BSD/macOS). Switched
+  the exit-0 fixture to `echo ok`; the not-in-allowlist fixture is now
+  `ls`. CI matrix is now green on all 3 OSes.
+- **📌 Process note** — this release re-commits to: **tag/publish ONLY
+  after the full OS matrix CI is green.** v0.9.3 was published mid-CI-run
+  (the macOS job hadn't finished); the package itself was correct (the
+  broken file is test-only and not shipped), but the repo CI went red.
+  v0.9.4 restores green CI. v0.9.5 follows the same discipline.
 #### v0.9.0 — goal loops + dynamic workflows (2026-06-18)
 Two new features, both modeled on Claude Code, built on a shared `runKind`
 background-dispatch discriminator.

package/docs/troubleshooting.md CHANGED

@@ -74,6 +74,32 @@ team action='cancel' runId=…    # cancel a truly-dead run
 The error message explains the heartbeat mechanism + remediation.
+### "Run not found" but `team list` shows it / scheduler appears frozen at 25%
+**Symptom:** an async `team action='run'` (e.g. a review) gets through the
+first task (e.g. explorer), then the scheduler appears to stop responding.
+`team action='status' runId=…` returns `Run not found`; the run's
+`stateRoot` (in `<project>/.crew/state/runs/<runId>/`) is missing. TUI
+progress shows the run stuck at the same task percentage forever, and the
+only workaround was killing the parent `pi` process.
+**This was the v0.9.4 symptom** caused by two coupled runtime bugs:
+- **Bug X** (proximate): `purgeStaleActiveRunIndex` destroyed the
+  `stateRoot` of long-running legitimate async runs based on a frozen
+  `entry.updatedAt` (set at registration, never refreshed).
+- **Bug Y** (root cause): the bg-runner crashed with an unhandled `EPIPE`
+  on the first `console.debug` after the parent detached its stdio pipes.
+**Fixed in v0.9.5** (see [CHANGELOG.md](../CHANGELOG.md#v095--fix-team-run-hangs-forever-at-25-2026-06-23)).
+With the fix, a long-running run is no longer falsely purged, and even if the
+bg-runner dies, the `stateRoot`, `background.log`, `events.jsonl`, and
+`heartbeat.json` survive — runs stay queryable and resumable.
+**Recovering a stuck run from v0.9.4 or earlier:** the `stateRoot` for
+those runs is already gone (Bug X nuked it). Re-dispatch the workflow. New
+runs on v0.9.5+ are fully protected.
 ## Model fallback exhausted
 **Symptom:** `All N candidates exhausted (tried: a → b → c)`.
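
To tell a pre-v0.9.5 purged run from one that merely stalled, it can help to check which diagnostics still exist. The sketch below is a hypothetical helper, not part of pi-crew. It assumes the `<project>/.crew/state/runs/<runId>/` layout quoted in the troubleshooting text, and that `background.log`, `events.jsonl` and `heartbeat.json` sit directly under that directory.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// File names taken from the troubleshooting text; their exact location is an assumption.
const DIAGNOSTIC_FILES = ["background.log", "events.jsonl", "heartbeat.json"] as const;

interface DiagnosticsSurvey {
	stateRoot: string;
	stateRootExists: boolean;
	/** Size in bytes per file, or null when the file is missing. */
	files: Record<string, number | null>;
}

function surveyRunDiagnostics(projectDir: string, runId: string): DiagnosticsSurvey {
	const stateRoot = path.join(projectDir, ".crew", "state", "runs", runId);
	const files: Record<string, number | null> = {};
	for (const name of DIAGNOSTIC_FILES) {
		try {
			files[name] = fs.statSync(path.join(stateRoot, name)).size;
		} catch {
			files[name] = null; // missing or unreadable
		}
	}
	return { stateRoot, stateRootExists: fs.existsSync(stateRoot), files };
}
```

If `stateRootExists` is false, the run matches the v0.9.4 symptom and has to be re-dispatched; if the directory exists, the surviving files are where to look first.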

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "pi-crew",
-  "version": "0.9.4",
+  "version": "0.9.5",
   "description": "Pi extension for coordinated AI teams, workflows, worktrees, and async task orchestration",
   "author": "baphuongna",
   "license": "MIT",

package/src/extension/command-completions.ts CHANGED

@@ -67,6 +67,7 @@ export function suggestRunIds(_prefix: string, cwd?: string): AutocompleteItem[]
 export async function suggestTaskIds(runId: string, prefix: string, cwd?: string): Promise<AutocompleteItem[] | null> {
 	const resolvedCwd = cwd ?? process.cwd();
 	// Dynamic import to avoid pulling state-store into the hot command-registration path.
+		// LAZY: defer dynamic import of ../state/state-store.ts to its call site.
 	const { loadRunManifestById } = await import("../state/state-store.ts");
 	const loaded = loadRunManifestById(resolvedCwd, runId);
 	if (!loaded) return null;

package/src/extension/crew-shortcuts.ts CHANGED

@@ -34,6 +34,7 @@ const CREW_SHORTCUTS: ReadonlyArray<ShortcutRegistration> = [
 		// (avoids pulling the full commands.ts dependency tree into every
 		// process that imports this module, e.g. the unit test).
 		handler: async (ctx) => {
+		// LAZY: defer dynamic import of ./registration/commands.ts to its call site.
 			const { openTeamSettingsOverlay } = await import("./registration/commands.ts");
 			await openTeamSettingsOverlay(ctx);
 		},

package/src/extension/register.ts CHANGED

@@ -1129,6 +1129,7 @@ export function registerPiTeams(pi: ExtensionAPI): void {
 				// LAZY: state-store only needed in hasRunning; avoid at startup.
 				// Use dynamic import to avoid CJS/ESM mixed module issues.
 				const { loadRunManifestById: loadRunForHasRunning } =
+		// LAZY: defer dynamic import of ../state/state-store.ts to its call site.
 					await import("../state/state-store.ts");
 				const loaded = loadRunForHasRunning(
 					currentCtx?.cwd ?? process.cwd(),
@@ -1494,6 +1495,7 @@ export function registerPiTeams(pi: ExtensionAPI): void {
 								const cwd = ctx.cwd ?? process.cwd();
 								const loaded = loadRunManifestById(cwd, runId);
 								if (loaded) {
+		// LAZY: defer dynamic import of ../state/atomic-write.ts to its call site.
 									const { atomicWriteJson } = await import("../state/atomic-write.ts");
 									atomicWriteJson(loaded.manifest.stateRoot + "/manifest.json", {
 										...loaded.manifest,

package/src/extension/registration/commands.ts CHANGED

@@ -202,11 +202,13 @@ export async function openTeamSettingsOverlay(ctx: ExtensionContext): Promise<vo
 						if (res.success) {
 							ctx.ui.notify(`Theme: ${value} (applied live)`, "info");
 						} else {
+		// LAZY: defer dynamic import of ../../ui/theme-discovery.ts to its call site.
 							const { setPiTheme } = await import("../../ui/theme-discovery.ts");
 							setPiTheme(value);
 							ctx.ui.notify(`Theme saved as '${value}' but failed to apply: ${res.error ?? "unknown"}. Restart Pi.`, "warning");
 						}
 					} else {
+		// LAZY: defer dynamic import of ../../ui/theme-discovery.ts to its call site.
 						const { setPiTheme } = await import("../../ui/theme-discovery.ts");
 						setPiTheme(value);
 						ctx.ui.notify(`Pi theme set to '${value}'. Restart Pi to apply.`, "info");
@@ -672,6 +674,7 @@ export function registerTeamCommands(pi: ExtensionAPI, deps: RegisterTeamCommand
 	pi.registerCommand("crew-brief", {
 		description: "Toggle brief tool output mode: on | off | status",
 		handler: async (args: string, ctx: ExtensionCommandContext) => {
+		// LAZY: defer dynamic import of ../../ui/tool-renderers/brief-mode.ts to its call site.
 			const { isBrief, setBrief, BRIEF_ENTRY_TYPE, makeBriefEntry } = await import("../../ui/tool-renderers/brief-mode.ts");
 			const trimmed = args.trim();

package/src/extension/team-tool/goal.ts CHANGED

@@ -269,6 +269,7 @@ async function handleStop(input: GoalSubActionInput): Promise<ReturnType<typeof
 	let cancelMsg = "";
 	if (updated.currentRunId) {
 		try {
+		// LAZY: defer dynamic import of ./cancel.ts to its call site.
 			const { handleCancel } = await import("./cancel.ts");
 			const cancelResult = await handleCancel({ action: "cancel", runId: updated.currentRunId, force: true, config: { intent: "user requested goal stop" } }, ctx);
 			cancelMsg = ` In-flight turn ${updated.currentRunId} cancel: ${(cancelResult.content[0] as { text?: string } | undefined)?.text ?? "ok"}.`;

package/src/extension/team-tool/run.ts CHANGED

@@ -34,6 +34,7 @@ import { expandParallelResearchWorkflow } from "../../runtime/parallel-research.
 /**
  * Module-scoped latch for the crew-init dynamic import.
  *
+		// LAZY: defer dynamic import of module to its call site.
  * `crew-init.ts` is dynamically `await import()`'d from `handleRun` below, which
  * N concurrent subagents hit simultaneously (every `team` tool call runs it).
  * Under the tsx/jiti loader, concurrent first-imports race module-record
@@ -296,6 +297,7 @@ export async function handleRun(params: TeamToolParamsValue, ctx: TeamContext):
 	// orchestrates subagents via ctx.agent(); only ctx.setResult() reaches the main context.
 	// Placed AFTER manifest creation so runId/paths/artifactsRoot are available.
 	if (!directAgent && (workflow as import("../../workflows/workflow-config.ts").DynamicWorkflowConfig).runtime === "dynamic") {
+		// LAZY: defer dynamic import of ../../runtime/dynamic-workflow-runner.ts to its call site.
 		const { runDynamicWorkflow } = await import("../../runtime/dynamic-workflow-runner.ts");
 		// Re-synthesize a dynamic-team (§0c C9) for role resolution.
 		const dwfTeam: import("../../teams/team-config.ts").TeamConfig = {
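
The doc comment in the first hunk of this file describes a module-scoped latch that keeps concurrent callers from racing a first-time dynamic import. The latch body is outside this diff, so the following is only a sketch of the general pattern: memoize the in-flight promise so every caller awaits the same load. `createImportLatch` is a hypothetical name, and resetting the latch on failure is a design choice of this sketch, not something the source confirms.

```typescript
/**
 * Wrap an async loader so that concurrent callers share one in-flight promise.
 * Only the first call invokes `load`; later calls reuse the stored promise.
 */
function createImportLatch<T>(load: () => Promise<T>): () => Promise<T> {
	let latch: Promise<T> | undefined;
	return () => {
		if (latch === undefined) {
			latch = load().catch((error: unknown) => {
				latch = undefined; // let a later call retry after a failed load
				throw error;
			});
		}
		return latch;
	};
}

// Hypothetical usage with a dynamic import (module path is illustrative only):
// const loadCrewInit = createImportLatch(() => import("./crew-init.ts"));
// const mod = await loadCrewInit();
```

Because the promise is stored synchronously on the first call, N callers arriving in the same tick trigger one load rather than N.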

package/src/runtime/background-runner.ts CHANGED

@@ -323,11 +323,28 @@ async function main(): Promise<void> {
 			const origWrite =
 				(_prefix: string) =>
 				(data: unknown, ...args: unknown[]) => {
+					// FIX: Never let the in-process console redirect crash the background
+					// runner. If logFd is missing/invalid or the write fails, swallow the
+					// error silently — losing one debug line is far better than killing the
+					// scheduler (a previous version only redirected console.log/error, so
+					// console.debug/.warn still wrote to the original stdout/stderr pipe
+					// which is closed after the parent detaches, producing EPIPE → process
+					// crash mid-workflow → runs hang at 25% forever).
+					if (logFd === undefined) return;
 					const msg = [data, ...args].map(String).join(" ") + "\n";
-					fs.writeSync(logFd!, msg);
+					try {
+						fs.writeSync(logFd, msg);
+					} catch {
+						/* best-effort: never crash the scheduler over a log write */
+					}
 				};
 			console.log = origWrite("OUT");
 			console.error = origWrite("ERR");
+			// FIX: Also redirect console.debug and console.warn — otherwise they still
+			// hit the original stdout/stderr pipe, which is closed once the parent
+			// process detaches, causing EPIPE unhandled errors that kill the scheduler.
+			console.debug = origWrite("DBG");
+			console.warn = origWrite("WARN");
 			// FIX: Close logFd on process exit to prevent file descriptor leak
 			process.on("exit", () => {
 				try {
@@ -558,8 +575,11 @@ async function main(): Promise<void> {
 			debugLog(`[background-runner] short-circuiting ${manifest.runKind} (synthetic team/workflow)`,
 			);
 			if (manifest.runKind === "goal-loop") {
+		// LAZY: defer dynamic import of ./goal-loop-runner.ts to its call site.
 				const { runGoalLoop } = await import("./goal-loop-runner.ts");
+		// LAZY: defer dynamic import of ./goal-state-store.ts to its call site.
 				const { GoalStore } = await import("./goal-state-store.ts");
+		// LAZY: defer dynamic import of ../agents/discover-agents.ts to its call site.
 				const { discoverAgents, allAgents } = await import("../agents/discover-agents.ts");
 				const store = new GoalStore(manifest.cwd);
 				const goalState = store.load(manifest.runId);
@@ -576,7 +596,9 @@ async function main(): Promise<void> {
 				saveRunManifest(finalGoalManifest);
 				earlyResult = { manifest: finalGoalManifest, tasks: goalResult.tasks };
 			} else {
+		// LAZY: defer dynamic import of ./dynamic-workflow-runner.ts to its call site.
 				const { runDynamicWorkflow } = await import("./dynamic-workflow-runner.ts");
+		// LAZY: defer dynamic import of ../workflows/discover-workflows.ts to its call site.
 				const { allWorkflows, discoverWorkflows } = await import("../workflows/discover-workflows.ts");
 				const wf = allWorkflows(discoverWorkflows(manifest.cwd)).find((w) => w.name === manifest.workflow);
 				if (!wf || wf.runtime !== "dynamic" || !wf.dynamicScript) throw new Error(`runKind="dynamic-workflow" but workflow '${manifest.workflow}' is not dynamic (runId=${manifest.runId})`);
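
The first hunk above can be read as a small, reusable pattern: route every console method to a log file descriptor and never let a failed write propagate. Here is a standalone sketch of that pattern. `makeBestEffortWriter`, `redirectConsole` and the `getFd` accessor are hypothetical names; the shipped code closes over `logFd` directly and assigns the console methods inline.

```typescript
import * as fs from "node:fs";

type ConsoleFn = (data: unknown, ...args: unknown[]) => void;

/**
 * Build a console replacement that appends to the fd returned by `getFd`.
 * It drops the line when there is no fd and swallows any write error.
 */
function makeBestEffortWriter(getFd: () => number | undefined): ConsoleFn {
	return (data, ...args) => {
		const fd = getFd();
		if (fd === undefined) return;
		const msg = [data, ...args].map(String).join(" ") + "\n";
		try {
			fs.writeSync(fd, msg);
		} catch {
			// best-effort: a closed fd (EBADF) or a full disk (ENOSPC) must not throw
		}
	};
}

/** Redirect log/error/debug/warn together; returns a function that restores them. */
function redirectConsole(getFd: () => number | undefined): () => void {
	const saved = {
		log: console.log,
		error: console.error,
		debug: console.debug,
		warn: console.warn,
	};
	const write = makeBestEffortWriter(getFd);
	console.log = write;
	console.error = write;
	console.debug = write;
	console.warn = write;
	return () => {
		Object.assign(console, saved);
	};
}
```

Covering all four methods in one place makes it harder for a later `console.debug` or `console.warn` call to reach the detached stdio pipes, which is the failure the changelog describes as Bug Y.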

package/src/runtime/chain-runner.ts CHANGED

@@ -246,6 +246,7 @@ export class ChainRunner {
 				// Emit progress event if eventsPath provided
 				if (eventsPath) {
+		// LAZY: defer dynamic import of ../state/event-log.ts to its call site.
 					const { appendEventAsync } = await import("../state/event-log.ts");
 					await appendEventAsync(eventsPath, {
 						type: "chain.step_completed",

package/src/runtime/crash-recovery.ts CHANGED

@@ -1,10 +1,11 @@
 import type { ExtensionContext } from "@earendil-works/pi-coding-agent";
 import * as fs from "node:fs";
+import * as path from "node:path";
 import type { MetricRegistry } from "../observability/metric-registry.ts";
 import { appendEvent, scanSequence } from "../state/event-log.ts";
 import { recordFromTask, upsertCrewAgent } from "./crew-agent-records.ts";
 import { withRunLockSync } from "../state/locks.ts";
-import { loadRunManifestById, saveRunManifest, saveRunTasks, updateRunStatus } from "../state/state-store.ts";
+import { loadRunManifestById, saveRunTasks, updateRunStatus } from "../state/state-store.ts";
 import type { TeamTaskState } from "../state/types.ts";
 import { isWorkerHeartbeatStale } from "./worker-heartbeat.ts";
 import type { ManifestCache } from "./manifest-cache.ts";
@@ -215,6 +216,43 @@ function tryRemoveRunDirectories(entry: { stateRoot: string; cwd: string }): voi
 	// NOTE: artifactsRoot is shared across runs and cleaned up by pruneFinishedRuns/pruneUserLevelRuns — not deleted here.
 }
+/**
+ * Age (ms) of the team-level heartbeat file for a run. The team-runner writes
+ * `<stateRoot>/heartbeat.json` periodically while a workflow is executing
+ * (startTeamHeartbeat), so a fresh heartbeat is strong evidence the run is alive
+ * even when its recorded PID check is inconclusive or its active-run-index
+ * entry's `updatedAt` was frozen at registration. Returns Infinity when absent.
+ */
+function heartbeatAgeMs(entry: { stateRoot: string }, now: number): number {
+	try {
+		const mtime = fs.statSync(path.join(entry.stateRoot, "heartbeat.json")).mtimeMs;
+		return Number.isFinite(mtime) ? now - mtime : Infinity;
+	} catch {
+		return Infinity;
+	}
+}
+/**
+ * True if there is recent evidence the run is (or was very recently) alive, so
+ * it must NOT be purged. Any one of these signals is sufficient:
+ *   - on-disk `manifest.updatedAt` fresher than `staleThresholdMs` (rewritten on
+ *     every task transition / status change), and/or
+ *   - team-level `heartbeat.json` fresher than `staleThresholdMs`.
+ * `entry.updatedAt` is intentionally NOT consulted: it is frozen at
+ * registration and never refreshed during execution, which previously caused
+ * long-running legitimate runs to be falsely purged — destroying their
+ * stateRoot, and because saveRunTasks() silently no-ops once the state dir is
+ * gone, hanging the workflow permanently at the current task with no
+ * recoverable state ("Run not found").
+ */
+function hasRecentLifeEvidence(entry: { stateRoot: string }, manifestUpdatedAt: string | undefined, now: number, staleThresholdMs: number): boolean {
+	const manifestMs = manifestUpdatedAt ? new Date(manifestUpdatedAt).getTime() : NaN;
+	if (Number.isFinite(manifestMs) && now - manifestMs <= staleThresholdMs) return true;
+	const hbAge = heartbeatAgeMs(entry, now);
+	if (Number.isFinite(hbAge) && hbAge <= staleThresholdMs) return true;
+	return false;
+}
 /**
  * Purge the global active-run-index of entries whose manifest is no longer active.
  *
@@ -244,7 +282,7 @@ export function purgeStaleActiveRunIndex(staleThresholdMs = 300_000, now = Date.
 		}
 		// 3. Read manifest status
-		let manifest: { status?: string; async?: { pid?: number }; ownerSessionId?: string } | undefined;
+		let manifest: { status?: string; updatedAt?: string; async?: { pid?: number }; ownerSessionId?: string } | undefined;
 		try {
 			manifest = JSON.parse(fs.readFileSync(entry.manifestPath, "utf-8"));
 		} catch {
@@ -262,46 +300,52 @@ export function purgeStaleActiveRunIndex(staleThresholdMs = 300_000, now = Date.
 			continue;
 		}
-		// 5. Still "running" — check if worker PID is dead and no heartbeat
+		// 5. Still "running" with an async worker PID — only purge when the worker
+		// is actually dead AND there is no recent evidence of life. We must NOT
+		// rely solely on `entry.updatedAt` (frozen at registration) nor on a single
+		// dead-PID reading: a long-running worker (e.g. a 15-minute explorer)
+		// legitimately keeps the run "running" while periodically rewriting the
+		// on-disk manifest.updatedAt and heartbeat.json. Falsely purging such a run
+		// destroys its stateRoot, and because saveRunTasks() silently no-ops once
+		// the state dir is gone, the workflow then hangs permanently at the
+		// current task with no recoverable state ("Run not found"). When we do mark
+		// a run cancelled here, we KEEP its stateRoot so the run stays queryable/
+		// resumable and its diagnostics survive; the finished-run pruner removes
+		// the directory later on its normal schedule.
 		if (manifest?.status === "running" && manifest.async?.pid !== undefined) {
 			const pidAlive = checkProcessLiveness(manifest.async.pid).alive;
-			if (!pidAlive) {
-				// Check age — if manifest hasn't been updated in > threshold, it's stale
-				const updatedAt = new Date(entry.updatedAt).getTime();
-				if (Number.isFinite(updatedAt) && now - updatedAt > staleThresholdMs) {
-					// Dead PID + stale update → cancel the manifest and unregister
-					try {
-						const fullLoaded = loadRunManifestById(entry.cwd, entry.runId); // NOTE: no withRunLock - best-effort only; concurrent writes may cause inconsistency
-						if (fullLoaded) {
-							const now_iso = new Date(now).toISOString();
-							const repairedTasks = fullLoaded.tasks.map((task) => {
-								if (task.status === "running" || task.status === "queued" || task.status === "waiting") {
-									return { ...task, status: "cancelled" as const, finishedAt: now_iso, error: "Orphaned run: worker process dead and no recent activity" };
-								}
-								return task;
-							});
-							saveRunTasks(fullLoaded.manifest, repairedTasks);
-							for (const task of repairedTasks) { try { upsertCrewAgent(fullLoaded.manifest, recordFromTask(fullLoaded.manifest, task, "scaffold")); } catch { /* non-critical */ } }
-							updateRunStatus(fullLoaded.manifest, "cancelled", "Orphaned run: worker process dead and no recent activity");
-							saveRunManifest(fullLoaded.manifest);
-							void terminateLiveAgentsForRun(fullLoaded.manifest.runId, "cancelled", appendEvent, fullLoaded.manifest.eventsPath).catch((error) => logInternalError("crash-recovery.pid-dead.terminate", error, `runId=${fullLoaded.manifest.runId}`));
-						}
-					} catch {
-						// Best-effort manifest cleanup
+			if (!pidAlive && !hasRecentLifeEvidence(entry, manifest.updatedAt, now, staleThresholdMs)) {
+				// Dead PID + no recent life evidence → cancel the manifest and unregister
+				try {
+					const fullLoaded = loadRunManifestById(entry.cwd, entry.runId); // NOTE: no withRunLock - best-effort only; concurrent writes may cause inconsistency
+					if (fullLoaded) {
+						const now_iso = new Date(now).toISOString();
+						const repairedTasks = fullLoaded.tasks.map((task) => {
+							if (task.status === "running" || task.status === "queued" || task.status === "waiting") {
+								return { ...task, status: "cancelled" as const, finishedAt: now_iso, error: "Orphaned run: worker process dead and no recent activity" };
+							}
+							return task;
+						});
+						saveRunTasks(fullLoaded.manifest, repairedTasks);
+						for (const task of repairedTasks) { try { upsertCrewAgent(fullLoaded.manifest, recordFromTask(fullLoaded.manifest, task, "scaffold")); } catch { /* non-critical */ } }
+						updateRunStatus(fullLoaded.manifest, "cancelled", "Orphaned run: worker process dead and no recent activity");
+						void terminateLiveAgentsForRun(fullLoaded.manifest.runId, "cancelled", appendEvent, fullLoaded.manifest.eventsPath).catch((error) => logInternalError("crash-recovery.pid-dead.terminate", error, `runId=${fullLoaded.manifest.runId}`));
 					}
-					unregisterActiveRun(entry.runId);
-					tryRemoveRunDirectories(entry);
-					purged.push(entry.runId);
-					continue;
+				} catch {
+					// Best-effort manifest cleanup
 				}
+				unregisterActiveRun(entry.runId);
+				purged.push(entry.runId);
+				continue;
 			}
 		}
-		// 6. "running" but no async worker PID — possible orphaned run where manifest
-		// was never updated after worker exit. Check updatedAt age.
+		// 6. "running" but no async worker PID — possible orphaned run where the
+		// manifest was never updated to a terminal status after the worker exited.
+		// Uses the same life-evidence corroboration as condition 5; the stateRoot is
+		// kept on cancel so the run stays queryable/resumable with diagnostics.
 		if (manifest?.status === "running" && manifest.async === undefined) {
-			const updatedAt = new Date(entry.updatedAt).getTime();
-			if (Number.isFinite(updatedAt) && now - updatedAt > staleThresholdMs) {
+			if (!hasRecentLifeEvidence(entry, manifest.updatedAt, now, staleThresholdMs)) {
 				try {
 					const fullLoaded = loadRunManifestById(entry.cwd, entry.runId); // NOTE: no withRunLock - best-effort only; concurrent writes may cause inconsistency
 					if (fullLoaded && fullLoaded.manifest.status === "running") {
@@ -315,14 +359,12 @@ export function purgeStaleActiveRunIndex(staleThresholdMs = 300_000, now = Date.
 						saveRunTasks(fullLoaded.manifest, repairedTasks);
 						for (const task of repairedTasks) { try { upsertCrewAgent(fullLoaded.manifest, recordFromTask(fullLoaded.manifest, task, "scaffold")); } catch { /* non-critical */ } }
 						updateRunStatus(fullLoaded.manifest, "cancelled", "Orphaned run: no async worker and no manifest update in over " + Math.round(staleThresholdMs / 60000) + " minutes");
-						saveRunManifest(fullLoaded.manifest);
 						void terminateLiveAgentsForRun(fullLoaded.manifest.runId, "cancelled", appendEvent, fullLoaded.manifest.eventsPath).catch((error) => logInternalError("crash-recovery.pid-dead.terminate", error, `runId=${fullLoaded.manifest.runId}`));
 					}
 				} catch {
 					// Best-effort
 				}
 				unregisterActiveRun(entry.runId);
-				tryRemoveRunDirectories(entry);
 				purged.push(entry.runId);
 				continue;
 			}
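
The behavioral change in this file comes down to which timestamp decides staleness. The sketch below restates the old and new rules as pure functions so they can be compared side by side. `LifeSignals`, `isStaleOld` and `isStaleNew` are hypothetical names, and the heartbeat mtime is passed in rather than read with `fs.statSync` so the example needs no files. In the shipped code, condition 5 additionally requires the worker PID to be dead.

```typescript
interface LifeSignals {
	/** Active-run-index timestamp, set once at registration. */
	entryUpdatedAt: string;
	/** On-disk manifest timestamp, rewritten on task transitions. */
	manifestUpdatedAt?: string;
	/** mtime of heartbeat.json in ms, or undefined when the file is absent. */
	heartbeatMtimeMs?: number;
}

/** Old rule: stale when the frozen index entry is older than the threshold. */
function isStaleOld(s: LifeSignals, now: number, thresholdMs: number): boolean {
	const t = new Date(s.entryUpdatedAt).getTime();
	return Number.isFinite(t) && now - t > thresholdMs;
}

/** New rule: stale only when neither the manifest nor the heartbeat is fresh. */
function isStaleNew(s: LifeSignals, now: number, thresholdMs: number): boolean {
	const manifestMs = s.manifestUpdatedAt ? new Date(s.manifestUpdatedAt).getTime() : NaN;
	if (Number.isFinite(manifestMs) && now - manifestMs <= thresholdMs) return false;
	const hbAge = s.heartbeatMtimeMs === undefined ? Infinity : now - s.heartbeatMtimeMs;
	if (Number.isFinite(hbAge) && hbAge <= thresholdMs) return false;
	return true;
}

// A run registered 15 minutes ago whose manifest was rewritten 1 minute ago.
const now = Date.parse("2026-06-23T12:00:00.000Z");
const longRunning: LifeSignals = {
	entryUpdatedAt: "2026-06-23T11:45:00.000Z",
	manifestUpdatedAt: "2026-06-23T11:59:00.000Z",
};
const oldVerdict = isStaleOld(longRunning, now, 300_000); // true: would be purged
const newVerdict = isStaleNew(longRunning, now, 300_000); // false: kept
```

With the default 300 000 ms threshold, the old rule flags any run older than five minutes regardless of activity, while the new rule flags it only after both activity signals have gone quiet.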

package/src/runtime/dynamic-workflow-runner.ts CHANGED

@@ -85,6 +85,7 @@ async function loadWorkflowModule(scriptPath: string): Promise<DynamicWorkflowSc
 	// lazily so this module stays importable in environments without jiti (type-only consumers).
 	// Fix round-4: use createRequire(import.meta.url) so `require` works under the strip-types
 	// loader fallback (Node ≥ 22.6) where bare `require` is not defined in ESM scope.
+		// LAZY: defer dynamic import of node:module to its call site.
 	const { createRequire } = await import("node:module");
 	const require = createRequire(import.meta.url);
 	// eslint-disable-next-line @typescript-eslint/no-require-imports

package/src/runtime/goal-loop-runner.ts CHANGED

@@ -121,6 +121,7 @@ export const realGoalEvaluator = async (
 		}
 		if (!verificationCompromised) {
 			try {
+		// LAZY: defer dynamic import of ./verification-gates.ts to its call site.
 				const { executeVerificationCommands } = await import("./verification-gates.ts");
 				const contract = { requiredGreenLevel: "none" as const, commands: goal.verification.commands, allowManualEvidence: goal.verification.allowManualEvidence ?? false };
 				// Phase 1.5 #2 (RFC 16): run verification in a pristine git worktree at
@@ -131,6 +132,7 @@ export const realGoalEvaluator = async (
 				let worktreeCwd: string | undefined;
 				let worktreeCleanup: (() => void) | undefined;
 				try {
+		// LAZY: defer dynamic import of ./verification-worktree.ts to its call site.
 					const { checkWorktreeSandboxAvailable, prepareVerificationWorktree } = await import("./verification-worktree.ts");
 					const availability = checkWorktreeSandboxAvailable(goal.cwd);
 					if (availability.available) {

package/src/runtime/live-session-runtime.ts CHANGED

@@ -36,6 +36,7 @@ import { listLiveAgents } from "./live-agent-manager.ts";
  * Module-scoped latch for the optional peer dependency import. When N
  * in-process live-session subagents spawn CONCURRENTLY (e.g. several
  * `Agent({run_in_background:true})` started at once), each used to call
+		// LAZY: defer dynamic import of @earendil-works/pi-coding-agent to its call site.
  * `await import("@earendil-works/pi-coding-agent")` independently. Under the
  * tsx loader (registering load/resolve hooks), concurrent first-imports can
  * each enter the loader and race module-record instantiation, yielding

package/src/runtime/model-scope.ts CHANGED

@@ -128,6 +128,7 @@ export async function readEnabledModelsPatterns(cwd: string, agentDir?: string):
 		// SDK. SettingsManager is dynamically imported because the module
 		// shape differs across pi versions; the create() factory is the
 		// canonical, version-stable entry point.
+		// LAZY: defer dynamic import of @earendil-works/pi-coding-agent to its call site.
 		const mod = await import("@earendil-works/pi-coding-agent" as string).catch(() => null);
 		if (!mod) return [];
 		const SettingsManagerCtor = (mod as { SettingsManager?: { create?: (cwd: string, agentDir?: string) => { getEnabledModels?: () => string[] | undefined } } }).SettingsManager;

package/src/runtime/peer-dep.ts CHANGED

@@ -239,6 +239,7 @@ export function primePeerDep(): Promise<PeerDepModule> {
 		if (!resolved) {
 			throw new Error(buildMissingMessage());
 		}
+		// LAZY: defer dynamic import of module to its call site.
 		cachedModule = (await import(resolved.mainUrl)) as PeerDepModule;
 		return cachedModule;
 	})();
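
Note on the hunk above: the visible lines cache the imported module, and the `live-session-runtime.ts` comment earlier in this diff describes a module-scoped latch that keeps concurrent callers from each starting their own first import. The following is only a minimal sketch of that idea, assuming a single shared in-flight promise. `createImportLatch` and its retry-on-failure behaviour are hypothetical and are not taken from the package.

```typescript
// Hypothetical helper: wraps an async loader so concurrent first calls share one promise.
export function createImportLatch<T>(load: () => Promise<T>): () => Promise<T> {
	let inflight: Promise<T> | undefined;
	return () => {
		if (!inflight) {
			inflight = load().catch((error: unknown) => {
				// Reset on failure so a later call can retry (a design choice of this sketch).
				inflight = undefined;
				throw error;
			});
		}
		return inflight;
	};
}

// Usage idea: const loadPeer = createImportLatch(() => import(someResolvedUrl));
// Every caller then awaits loadPeer() instead of calling import() directly.
```

Because the promise is stored synchronously before any `await`, callers that arrive while the first load is still pending receive the same promise rather than entering the loader again.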

package/src/runtime/resilient-edit.ts CHANGED

@@ -133,6 +133,7 @@ export function wrapEditWithResilientReplace(pi: ExtensionAPI, tools?: { edit: T
 			throw new Error("old_string not found (and resilient retry skipped: missing path/old/new)");
 		}
+		// LAZY: defer dynamic import of node:fs/promises to its call site.
 		const fs = await import("node:fs/promises");
 		let content: string;
 		try {

package/src/runtime/task-runner.ts CHANGED

@@ -289,6 +289,7 @@ export async function runTeamTask(
 			// follow it and execute a script outside cwd. Throws on escape.
 			resolveRealContainedPath(manifest.cwd, input.step.preStepScript);
 			try {
+		// LAZY: defer dynamic import of node:child_process to its call site.
 				const { execFileSync } = await import("node:child_process");
 				preStepOutput = execFileSync(input.step.preStepScript, scriptArgs, {
 					timeout: scriptTimeout,

package/src/state/hook-instinct-bridge.ts CHANGED

@@ -11,7 +11,9 @@ let pathsInstance: typeof import("../utils/paths.js") | null = null;
 async function getStore() {
 	if (!storeInstance) {
+		// LAZY: defer dynamic import of ./instinct-store.js to its call site.
 		const { InstinctStore } = await import("./instinct-store.js");
+		// LAZY: defer dynamic import of ../utils/paths.js to its call site.
 		const paths = await import("../utils/paths.js");
 		storeInstance = new InstinctStore(paths.projectCrewRoot(process.cwd()));
 	}
@@ -20,6 +22,7 @@ async function getStore() {
 async function getPaths() {
 	if (!pathsInstance) {
+		// LAZY: defer dynamic import of ../utils/paths.js to its call site.
 		pathsInstance = await import("../utils/paths.js");
 	}
 	return pathsInstance;

package/src/utils/bm25-search.ts CHANGED

@@ -156,6 +156,7 @@ interface AgentSearchResult {
  * Uses dynamic import to avoid ESM/CJS issues at module load time.
  */
 export async function searchAgents(query: string, options?: { limit?: number }): Promise<AgentSearchResult[]> {
+		// LAZY: defer dynamic import of ../agents/discover-agents.ts to its call site.
   const { discoverAgents, allAgents } = await import("../agents/discover-agents.ts");
   const discovery = discoverAgents(process.cwd());
   const all = allAgents(discovery);
@@ -200,6 +201,7 @@ interface TeamSearchResult {
  * Uses dynamic import to avoid ESM/CJS issues at module load time.
  */
 export async function searchTeams(query: string, options?: { limit?: number }): Promise<TeamSearchResult[]> {
+		// LAZY: defer dynamic import of ../teams/discover-teams.ts to its call site.
   const { discoverTeams, allTeams } = await import("../teams/discover-teams.ts");
   const discovery = discoverTeams(process.cwd());
   const all = allTeams(discovery);