pi-crew 0.8.9 → 0.8.11

package/CHANGELOG.md CHANGED
@@ -1,5 +1,144 @@
1
1
  # Changelog
2
2
 
3
+ ## [0.8.11] — Split-scope install fix + transient-provider fallback (2026-06-17)
4
+
5
+ Bundle of two independent fixes that were triaged from real user reports on
6
+ 2026-06-17. Both are robustness fixes for failure modes that previously
7
+ killed team runs silently.
8
+
9
+ ### 1. `Cannot find module '@earendil-works/pi-coding-agent'` on Windows / global installs
10
+
11
+ **Symptom:** every `team` action (run / parallel / plan) crashed ~1 minute
12
+ after spawn, leaving all tasks permanently `queued`. The detached
13
+ background team-runner child threw:
14
+ ```
15
+ Error: Cannot find module '@earendil-works/pi-coding-agent'
16
+ Require stack:
17
+ - .../.pi/agent/npm/node_modules/pi-crew/src/runtime/skill-instructions.ts
18
+ ```
19
+
20
+ **Root cause:** pi-crew (an extension) is installed under
21
+ `~/.pi/agent/npm/node_modules/<ext>/`, but pi itself (the
22
+ `@earendil-works/pi-coding-agent` package extensions import from) lives in a
23
+ **separate** node_modules tree (nvm / `%APPDATA%\npm` / Volta / fnm /
24
+ pnpm-global). Node's resolver only walks UP ancestor `node_modules`, so a
25
+ static `import { getAgentDir } from "@earendil-works/pi-coding-agent"` in a
26
+ file loaded by the spawned child crashes. This is the **default** layout for
27
+ anyone who installs pi-crew via `pi install` — not a user misconfiguration.
28
+
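To make the root cause above concrete, here is a minimal sketch of the ancestor walk that Node performs for a bare specifier. The helper name `ancestorNodeModules` and the example paths are hypothetical, and the sketch ignores `NODE_PATH` and other details of the real resolver. It only shows that a file under the extension tree never visits a global tree that lives elsewhere on disk.

```typescript
import { posix as path } from "node:path";

/**
 * List the node_modules directories an ancestor walk would visit for a bare
 * specifier imported from `fromFile`. Simplified: POSIX paths only.
 */
export function ancestorNodeModules(fromFile: string): string[] {
  const dirs: string[] = [];
  let dir = path.dirname(fromFile);
  while (true) {
    // Never look for node_modules/node_modules.
    if (path.basename(dir) !== "node_modules") {
      dirs.push(path.join(dir, "node_modules"));
    }
    const parent = path.dirname(dir);
    if (parent === dir) break; // reached the filesystem root
    dir = parent;
  }
  return dirs;
}

// Hypothetical layout: the extension under ~/.pi, pi itself under an nvm prefix.
const searched = ancestorNodeModules(
  "/home/u/.pi/agent/npm/node_modules/pi-crew/src/runtime/skill-instructions.ts",
);
const globalTree = "/home/u/.nvm/versions/node/v22.0.0/lib/node_modules";
console.log(searched.includes(globalTree)); // false: the global tree is never visited
```

Because no ancestor of the extension file is inside the global prefix, the only ways out are an explicit hint or a probe of known global locations, which is the approach the fix below takes.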
29
+ **Additional constraint:** pi-coding-agent ships as **ESM-only**
30
+ (`type:module`, exports map with only an `import` condition). CJS
31
+ `createRequire(dir)(name)` / `require.resolve("<pkg>/package.json")` both
32
+ fail with `ERR_PACKAGE_PATH_NOT_EXPORTED` under node AND jiti/tsx (verified).
33
+ The ONLY working load mechanism is a dynamic `import()` of the resolved ESM
34
+ entry file URL.
35
+
36
+ **Fix — NEW `src/runtime/peer-dep.ts`:**
37
+ - `resolvePeerDep()` (sync): walks `node_modules` **manually** (bypasses the
38
+ restrictive exports map) across 6 strategies — env hint
39
+ (`PI_CREW_PEER_DEP_DIR`), this file, `process.argv[1]`, the node binary's
40
+ global node_modules (covers nvm/Volta/fnm), `npm root -g`, and
41
+ `%APPDATA%\npm`. Memoized.
42
+ - `primePeerDep()` (async): dynamic `import(fileURL)` the resolved ESM entry,
43
+ cache the module namespace. Memoized + retryable on failure.
44
+ - `getAgentDir()` (sync): reads the REAL fork-aware `getAgentDir` from the
45
+ primed cache; falls back to a computed default (`~/.pi/agent`, respecting
46
+ `PI_CODING_AGENT_DIR`) if not primed — **NEVER throws**.
47
+
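The never-throws contract described for `getAgentDir()` can be sketched as follows. This is an illustration under stated assumptions, not the shipped implementation: the names `fallbackAgentDir` and `getAgentDirSafe` are hypothetical, and whether the real fallback normalizes the override with `path.resolve` is an assumption.

```typescript
import * as os from "node:os";
import * as path from "node:path";

/** Computed default used when the real getAgentDir is not available yet. */
export function fallbackAgentDir(
  env: Record<string, string | undefined> = process.env,
): string {
  const override = env.PI_CODING_AGENT_DIR?.trim();
  if (override) return path.resolve(override); // assumption: override is normalized
  return path.join(os.homedir(), ".pi", "agent");
}

/** Never throws: prefer the primed module, otherwise use the fallback. */
export function getAgentDirSafe(
  primed?: { getAgentDir?: () => string },
): string {
  try {
    const dir = primed?.getAgentDir?.();
    if (typeof dir === "string" && dir.length > 0) return dir;
  } catch {
    // A throwing peer implementation must not take the caller down.
  }
  return fallbackAgentDir();
}
```

Keeping the accessor synchronous means call sites that used the static import do not have to become async; only the priming step is asynchronous.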
48
+ **Rewired:**
49
+ - `skill-instructions.ts`, `discover-skills.ts` — static peer-dep import →
50
+ lazy `getAgentDir()` from `peer-dep.ts` (this is the crash site).
51
+ - `background-runner.ts` — `primePeerDep()` before importing `team-runner`
52
+ (child process).
53
+ - `register.ts` — `primePeerDep()` at extension entry (main process).
54
+ - `async-runner.ts` — propagate `PI_CREW_PEER_DEP_DIR` to children so they
55
+ skip the ~200ms `npm root -g` probe.
56
+
57
+ **Tests:** NEW `test/unit/peer-dep-resolver.test.ts` (9 cases) — env-hint
58
+ resolution, manual node_modules walk past exports map, ESM dynamic-import
59
+ loading, memoization, graceful fallback, `PI_CODING_AGENT_DIR` override,
60
+ loadable fileURL under the child's loader.
61
+
62
+ ### 2. `500 api_error "unknown error, 999 (1000)"` aborted the run instead of falling back
63
+
64
+ **Symptom:** when the model provider went hard-down with
65
+ `500 {"type":"error","error":{"type":"api_error","message":"unknown
66
+ error, 999 (1000)"}}`, the run died even when the user had configured a
67
+ fallback model that would have worked.
68
+
69
+ **Root cause:** pi has two safety layers. (1) pi-core provider-retry retries
70
+ 3× with exponential backoff — its regex already matches `500`. (2) pi-crew's
71
+ `model-fallback` layer is the last safety net: when all 3 retries fail, it
72
+ tries the next configured model. But `isRetryableModelFailure`'s pattern
73
+ list covered 429 / rate-limit / 502-504 / overloaded / timeout and **MISSED**
74
+ generic `500`, `api_error`, `unknown error`, and internal/server-error
75
+ phrasings. So a transient provider outage was retried 3× then **aborted**
76
+ instead of failing over.
77
+
78
+ **Fix:** added to `RETRYABLE_MODEL_FAILURE_PATTERNS` —
79
+ `\b500\b`, `\b501\b`, `api_error`, `unknown error`,
80
+ `internal(?:_server)?[ _]error`, `server error`, `bad gateway`.
81
+
82
+ `NON_RETRYABLE` (auth/billing/key) still wins — checked first in
83
+ `isRetryableModelFailure` — so a transient-looking 500 wrapping an auth
84
+ failure won't loop the chain.
85
+
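The ordering described above (non-retryable first, then retryable) can be sketched like this. The retryable patterns are the ones quoted in this entry plus the existing 502-504 entries; the non-retryable patterns are hypothetical stand-ins, since the real auth/billing/key list is not shown here, and `isRetryableFailure` is an illustrative name.

```typescript
// Patterns quoted in the changelog entry, plus the existing 502-504 entries.
const RETRYABLE_PATTERNS: RegExp[] = [
  /\b500\b/,
  /\b501\b/,
  /\b502\b/,
  /\b503\b/,
  /\b504\b/,
  /api_error/i,
  /unknown error/i,
  /internal(?:_server)?[ _]error/i,
  /server error/i,
  /bad gateway/i,
];

// Hypothetical stand-ins for the real auth/billing/key list.
const NON_RETRYABLE_PATTERNS: RegExp[] = [
  /\b401\b/,
  /invalid[ _]api[ _]key/i,
  /billing/i,
];

export function isRetryableFailure(message: string | undefined): boolean {
  if (!message) return false;
  // Auth/billing failures win, even when wrapped in a transient-looking 500.
  if (NON_RETRYABLE_PATTERNS.some((re) => re.test(message))) return false;
  return RETRYABLE_PATTERNS.some((re) => re.test(message));
}

const reported =
  '500 {"type":"error","error":{"type":"api_error","message":"unknown error, 999 (1000)"}}';
console.log(isRetryableFailure(reported)); // true
console.log(isRetryableFailure("500 upstream said: invalid api key")); // false
```

Checking the deny list first is what keeps a misconfigured key from cycling through every model in the fallback chain.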
86
+ **Tests:** 4 regression tests in `test/unit/model-fallback.test.ts` covering
87
+ the exact reported error, generic 5xx, auth-still-blocked, and undefined/empty.
88
+
89
+ ### Verification
90
+
91
+ typecheck clean; peer-dep suite 9/9; model-* suite 57/57; full suite 0 real
92
+ failures (1 known `result-watcher` fs.watch 10s timeout flake passes 7/7 in
93
+ isolation — unrelated).
94
+
95
+ ## [0.8.10] — Pre-warm 3 repro-observed cold-start crash-variant modules (2026-06-17)
96
+
97
+ The post-v0.8.9-restart 6-subagent repro surfaced 3 cold-start crash variants
98
+ in one batch: `existsSync` (peer-dep, latched v0.8.1 + warmup v0.8.6),
99
+ `effectiveRunConfig` (`team-tool/config-patch.ts`), `CREW_README`
100
+ (`state/crew-init.ts`, latched v0.8.9). v0.8.6's warmup covered `team-tool.ts`
101
+ transitively but not these specific modules explicitly — static-graph
102
+ reachability isn't reliable under tsx/jiti interop + concurrent fanout (the
103
+ `handleRun` latch serializes the CALL but not module-body instantiation of
104
+ `run.ts`'s static deps).
105
+
106
+ **Fix:** add the 3 repro-observed modules to `HOT_MODULE_SPECIFIERS` so their
107
+ module bodies instantiate at single-threaded registration:
108
+ `team-tool/run.ts`, `team-tool/config-patch.ts`, `workflows/validate-workflow.ts`.
109
+
110
+ Repro verification: 6/6 subagents clean (was 1/6) under loaded code.
111
+
112
+ ## [0.8.9] — crew-init dynamic-import latch (kills CREW_README TDZ race) (2026-06-17)
113
+
114
+ Module-scoped `loadCrewInit()` latch in `team-tool/run.ts` — concurrent `team`
115
+ tool calls share ONE in-flight import promise. Added `crew-init.ts` to
116
+ `HOT_MODULE_SPECIFIERS`. Targets the `CREW_README` TDZ variant observed in the
117
+ post-v0.8.8 repro.
118
+
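A shared in-flight import promise of the kind described above might look like the following sketch. `createImportLatch` and `loadCrewInitSketch` are hypothetical names, the loader body is a placeholder rather than a real `import()`, and clearing the latch on failure is an assumption borrowed from how `primePeerDep()` is described in 0.8.11.

```typescript
/** Wrap a loader so concurrent callers share one in-flight promise. */
export function createImportLatch<T>(loader: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | undefined;
  return () => {
    if (inFlight) return inFlight;
    const started = loader();
    inFlight = started;
    // Assumption: clear on failure so a later call can retry.
    started.catch(() => {
      if (inFlight === started) inFlight = undefined;
    });
    return started;
  };
}

let loads = 0;
const loadCrewInitSketch = createImportLatch(async () => {
  loads += 1; // stands in for `await import("...")`
  return { CREW_README: "placeholder" };
});

const first = loadCrewInitSketch();
const second = loadCrewInitSketch();
console.log(first === second, loads); // true 1
```

Every concurrent caller awaits the same promise, so the module body is instantiated once and no caller can observe a binding that is still in its temporal dead zone.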
119
+ ## [0.8.8] — Cross-project leak cwd-scope barrier (2026-06-17)
120
+
121
+ `collectInFlightRuns` filtered by STATUS only (queued/planning/running), not
122
+ by project scope. Multiple Pi sessions in the same project shared
123
+ `.crew/state/runs/`, so Session B's compaction picked up Session A's runs in
124
+ OTHER projects and injected them into Session B's continuation prompt.
125
+
126
+ The v0.8.8 (4bd6f5b) `ownerSessionId` filter was **unreliable** —
127
+ `ctx.sessionId` is absent on pi 0.79.6 `ExtensionContext`.
128
+
129
+ **Fix:** `isInProjectScope(run, queryCwd)` in `collectInFlightRuns` — keeps a
130
+ run only if `findRepoRoot(run.cwd) === findRepoRoot(queryCwd)`. Reliable,
131
+ version-independent. Filter at the consumption site, NOT in
132
+ `listRecentRuns`/`collectActiveRuns` (the cross-project dashboard view stays
133
+ unfiltered — 2 run-index tests pin that). Empirically verified: ambient
134
+ status shows only current-project runs, zero foreign-project bleed.
135
+
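The cwd-scope barrier compares repository roots rather than session IDs. Below is a minimal sketch, assuming a repo root is the nearest ancestor that contains `.git` (how the real `findRepoRoot` detects a root is not shown in this entry). The `Sketch` suffixes and the injectable `hasGit` check are hypothetical and exist only to keep the example testable.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

type DirCheck = (dir: string) => boolean;

/** Walk up from `cwd` to the nearest directory that contains `.git`. */
export function findRepoRootSketch(
  cwd: string,
  hasGit: DirCheck = (dir) => fs.existsSync(path.join(dir, ".git")),
): string {
  let dir = cwd;
  while (true) {
    if (hasGit(dir)) return dir;
    const parent = path.dirname(dir);
    if (parent === dir) return cwd; // no repo found: treat the cwd as its own scope
    dir = parent;
  }
}

/** Keep a run only if it belongs to the same repo as the querying session. */
export function isInProjectScopeSketch(
  run: { cwd: string },
  queryCwd: string,
  hasGit?: DirCheck,
): boolean {
  return findRepoRootSketch(run.cwd, hasGit) === findRepoRootSketch(queryCwd, hasGit);
}
```

Applying this predicate only where in-flight runs are consumed leaves listing functions unfiltered, which matches the note above that the cross-project dashboard view stays as it was.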
136
+ ## [0.8.7] — Doctor runtime-warmup status (2026-06-17)
137
+
138
+ `getRuntimeWarmupStatus()` diagnostic + a "Runtime warmup" section in
139
+ `team doctor` showing started/completed/duration/error. "Not started" is NOT
140
+ a doctor error (normal for direct unit-test calls).
141
+
3
142
  ## [0.8.6] — General cold-start race fix (runtime module-graph warmup) (2026-06-17)
4
143
 
5
144
  Fixes the `validateWorkflowForTeam` cold-start crash that v0.8.1 did NOT
package/README.md CHANGED
@@ -9,50 +9,48 @@ npm: pi-crew
9
9
  repo: https://github.com/baphuongna/pi-crew
10
10
  ```
11
11
 
12
- **v0.6.4**: See [CHANGELOG.md](CHANGELOG.md).
13
-
14
- ### Highlights (v0.6.4 → v0.7.0)
15
-
16
- This release implements **Phase 0 + Phase 1** of the long-term roadmap (synthesized from a 10-round research process), plus the **single-agent cliff hedge**. Principle: *build trust and cliff-resilience, stay lean, delete before adding.*
17
-
18
- - **🛡️ Compaction resilience (O10)** — the #1 user pain ("after auto-compact, the task stops midway") is fixed. In-flight crew runs are detected, a resume directive is injected into the compaction summary, and tasks re-attach after compaction.
19
- - **💰 Cost visibility (O1)** `team summary <runId>` now shows a full cost report with per-role attribution and token breakdown (`$0.77 — executor 79%, reviewer 14%...`).
20
- - **✋ Plan-level HITL for any workflow (O5)** — set `runtime.requirePlanApproval = true` to gate any workflow at the plan→execute boundary; approve via `team api op=approve-plan`.
21
- - **🧠 Cross-run memory (O4)** — `.crew/knowledge.md` is auto-injected into every run's system prompt. pi-crew remembers project context across runs.
22
- - **🎯 Single-agent cliff hedge** `team plan singleAgent=true` composes any workflow into one sequential prompt, so pi-crew's mission survives even if multi-agent is obsoleted by large-context models.
23
- - **🧹 2,335 LOC of dead code removed** + **Pi-api seam** centralizing the coupling surface.
24
-
25
- ### Highlights (v0.6.3 → v0.6.4)
26
-
27
- **Visually rich tool rendering**: `team` and `Agent` tool calls now render as framed cards in the Pi TUI with box-drawing borders, colored status badges, and structured layouts
28
- **Merged call+result into ONE connected frame**: the call header and result body now form a single seamless frame instead of two disconnected boxes
29
- - **Animated live progress bar during runs** — real-time `████░░░░ N/M` task progress with elapsed time, rendered DURING the run; indeterminate "starting" phase uses an animated scanning bar
30
- **Compact completion summary**: collapsed cards show `✓ crew run 3/3 done · 1m2s · 26k tok · $0.068` with expand hint and per-agent briefs
31
- - **Critical crash fix on session resume** — `renderCall` was returning a `string` instead of a `Text` component, causing `TypeError: child.render is not a function` when Pi re-rendered stored tool calls
32
- - **Disabled brief tool overrides** — reverted the experimental brief mode that replaced Pi's superior native renderers (syntax highlighting, diff views, full content)
33
- - **Flaky test fix** — `AnimatedMascot` timing tests made CI-load-robust via polling loops
34
- **CI green**: 0 failures on Ubuntu, macOS, and Windows
35
-
36
- ### Highlights (v0.6.2 → v0.6.3)
37
-
38
- **137 commits** since v0.6.1, 200 files changed (+16,955 / −2,057 lines)
39
- **4,792 tests**, 506 test files, **0 failures** across the entire suite
40
- - **Cross-platform CI green** — 0 failures on Ubuntu, macOS, and Windows
41
- - **366 source files**, ~70K lines of TypeScript
42
- **Worktree precondition validation**: friendly errors instead of crashes when cwd is not a git repo or repo is dirty
43
- - **Cross-platform path handling** — `canonicalizePath` with `realpathSync.native` for Windows short-name/long-name aliasing; macOS symlink resolution
44
- - **Scheduled job lifecycle** — spawned runs are tracked, cancelling a job kills its runs
45
- - **Heartbeat false-positive fix** — PID liveness gate prevents dead detection during long LLM responses
46
- - **ENOENT crash fix** — prune/forget race no longer crashes pi when persisting to deleted runs
47
- - **Pipe buffer deadlock fix** — test runner no longer deadlocks when OS pipe buffer fills
48
- **Plugin registry**: extensible framework context injection for Next.js, Vite, Vitest
49
- - **Health score system** — penalty-based scoring with time-series snapshots
50
- - **CrewError taxonomy** — E001–E006 structured error codes replacing raw throws
51
- - **Atomic write v2** — fsync + rename pattern for crash-safe state persistence
52
- - **Pre-push review**: 56 unpushed commits reviewed, 1 release blocker found and fixed
53
- - **Security**: sandbox constructor escape strengthened; env-filter provider key handling fixed
54
- - **State-store race fix** — manifest/tasks mtime false positive eliminated
55
- - **Orphan worker/temp cleanup** — 4-layer defense with session-scoped tracking
12
+ **v0.8.11**: See [CHANGELOG.md](CHANGELOG.md).
13
+
14
+ ### Highlights (v0.6.4 → v0.8.11)
15
+
16
+ A long arc of **trust, cliff-resilience, and robustness** work. Principle: *build
17
+ trust and cliff-resilience, stay lean, delete before adding.*
18
+
19
+ #### v0.8.x: hardening & reliability (2026-06-17)
20
+ - **🛠️ Split-scope install fix (v0.8.11)** — `team` runs no longer crash with
21
+ `Cannot find module '@earendil-works/pi-coding-agent'` when pi-crew and pi
22
+ live in separate node_modules trees (the default for `pi install`). New
23
+ `src/runtime/peer-dep.ts` resolves the ESM-only peer dep across 6 strategies.
24
+ - **🔄 Model fallback on transient 5xx (v0.8.11)** — a hard-down provider
25
+ (`500 api_error "unknown error"`) now triggers the configured fallback
26
+ model instead of aborting the run. `isRetryableModelFailure` extended.
27
+ - **🧊 Cold-start race eliminated (v0.8.6 → v0.8.10)**: under tsx, concurrent
28
+ subagent spawns raced module instantiation (`existsSync` / `CREW_README` /
29
+ `effectiveRunConfig` / `validateWorkflowForTeam`). Fixed graph-wide: warm at
30
+ registration + gate at spawn boundaries + per-site latches. 6/6 repro clean.
31
+ - **🔒 Cross-project leak fixed (v0.8.8)** — ambient status / compaction no
32
+ longer bleed foreign-project runs into the current session. Cwd-scope
33
+ barrier (`isInProjectScope`), version-independent.
34
+ - **🩺 Doctor runtime-warmup status (v0.8.7)**: `team doctor` shows whether
35
+ the module-graph warmup fired.
36
+ - **🔍 Cold-verifier agent (v0.8.4)**: adversarial cross-check that re-derives
37
+ claims WITHOUT trusting prior analysis, catching confirmation bias.
38
+ - **⚡ Per-write validator (v0.8.5)**: zero-cost `JSON.parse` on every
39
+ `write`/`edit`, appends a `🔴` blocker on malformed files.
40
+ - **🎨 Terminal status (v0.8.3)** — tab title + Ghostty native progress bar.
41
+ - **🧠 Skill confidence revived (v0.8.2)**: `adjustConfidence()` was dead
42
+ code; the effectiveness system now actually learns.
43
+ - **🔧 Tool-restriction unification (v0.8.0)** — single `resolveToolPolicy`
44
+ across both spawn paths.
45
+ - **🎯 F6/F1 interop granularity (v0.7.9)** — 7 skill roots, `.pi/agents/`
46
+ tier, tool wildcards, `excludeExtensions` denylist.
47
+
48
+ #### v0.7.0: Phase 0 + Phase 1 roadmap
49
+ - **🛡️ Compaction resilience (O10)** — in-flight runs survive auto-compact.
50
+ - **💰 Cost visibility (O1)** — per-role token + cost attribution.
51
+ - **✋ Plan-level HITL (O5)** — `requirePlanApproval` gates any workflow.
52
+ - **🧠 Cross-run memory (O4)**: `.crew/knowledge.md` injected every run.
53
+ - **🎯 Single-agent cliff hedge**: `team plan singleAgent=true`.
56
54
 
57
55
  ---
58
56
 
@@ -99,6 +97,17 @@ pi-crew # after npm install
99
97
  node ./pi-crew/install.mjs # from local clone
100
98
  ```
101
99
 
100
+ > **Split-scope install note (v0.8.11+):** pi installs extensions under
101
+ > `~/.pi/agent/npm/node_modules/<ext>/`, separate from pi's own
102
+ > node_modules tree (nvm / `%APPDATA%\npm` / Volta / fnm). Since v0.8.11
103
+ > pi-crew resolves the `@earendil-works/pi-coding-agent` peer dep robustly
104
+ > across these layouts — no symlink/NODE_PATH workaround needed. If you ever
105
+ > do hit `Cannot find module '@earendil-works/pi-coding-agent'`, set
106
+ > `PI_CREW_PEER_DEP_DIR=<path to the pi-coding-agent package dir>` as a
107
+ > one-line workaround (or install pi-crew in pi's own scope:
108
+ > `npm install -g pi-crew`).
109
+
110
+
102
111
  ---
103
112
 
104
113
  ## Quick Start
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "pi-crew",
3
- "version": "0.8.9",
3
+ "version": "0.8.11",
4
4
  "description": "Pi extension for coordinated AI teams, workflows, worktrees, and async task orchestration",
5
5
  "author": "baphuongna",
6
6
  "license": "MIT",
@@ -84,6 +84,7 @@ import { runEventBus } from "../ui/run-event-bus.ts";
84
84
  import { createTerminalStatusController, type TerminalStatusController } from "../ui/terminal-status.ts";
85
85
  import { extractPathFromInput, validateWrittenFile, buildValidationBlocker } from "../runtime/per-write-validator.ts";
86
86
  import { startRuntimeWarmup } from "../runtime/runtime-warmup.ts";
87
+ import { primePeerDep } from "../runtime/peer-dep.ts";
87
88
  import { createRunSnapshotCache } from "../ui/run-snapshot-cache.ts";
88
89
  import { closeWatcher } from "../utils/fs-watch.ts";
89
90
  import { RunWatcherRegistry } from "../utils/run-watcher-registry.ts";
@@ -207,6 +208,11 @@ export function registerPiTeams(pi: ExtensionAPI): void {
207
208
  // Warming the graph here + awaiting it at spawn boundaries eliminates the
208
209
  // race window. See src/runtime/runtime-warmup.ts.
209
210
  startRuntimeWarmup();
211
+ // FIX (split-scope install): preload the ESM peer dep so discover-skills /
212
+ // skill-instructions can read the REAL getAgentDir (fork-aware) from cache.
213
+ // Fire-and-forget: getAgentDir() falls back to a safe computed default until
214
+ // this resolves. See src/runtime/peer-dep.ts.
215
+ primePeerDep().catch(() => {});
210
216
  // Deploy bundled themes (crew-dark, crew-dracula, etc.) to ~/.pi/agent/themes/
211
217
  // so Pi's theme loader discovers them. Best-effort, idempotent.
212
218
  deployBundledThemes();
@@ -6,6 +6,7 @@ import { fileURLToPath, pathToFileURL } from "node:url";
6
6
  import { logInternalError } from "../utils/internal-error.ts";
7
7
  import { appendEvent } from "../state/event-log.ts";
8
8
  import { sanitizeEnvSecrets } from "../utils/env-filter.ts";
9
+ import { resolvePeerDepDir, PEER_DEP_DIR_ENV } from "./peer-dep.ts";
9
10
  import {
10
11
  registerWorker,
11
12
  unregisterWorker,
@@ -202,6 +203,15 @@ export async function spawnBackgroundTeamRun(manifest: TeamRunManifest): Promise
202
203
  // FIX: removed delete workarounds — with explicit allowlist, these vars
203
204
  // are no longer auto-leaked. Matches child-pi.ts.
204
205
 
206
+ // FIX (split-scope install): pass the resolved peer-dep dir to the child so
207
+ // it can resolve @earendil-works/pi-coding-agent WITHOUT the ~200ms
208
+ // `npm root -g` probe. No-op when pi-crew and pi are co-located. See
209
+ // src/runtime/peer-dep.ts.
210
+ const peerDepDir = resolvePeerDepDir();
211
+ const childEnv = peerDepDir
212
+ ? { ...filteredEnv, [PEER_DEP_DIR_ENV]: peerDepDir }
213
+ : filteredEnv;
214
+
205
215
  const loader = resolveTypeScriptLoader();
206
216
  if (!loader) {
207
217
  const message = buildLoaderUnavailableMessage(packageRootFromRuntime());
@@ -227,7 +237,7 @@ export async function spawnBackgroundTeamRun(manifest: TeamRunManifest): Promise
227
237
  detached: true,
228
238
  setsid: true,
229
239
  stdio: ["ignore", "pipe", "pipe"],
230
- env: filteredEnv,
240
+ env: childEnv,
231
241
  windowsHide: true,
232
242
  } as unknown as Parameters<typeof spawn>[2];
233
243
  const child = spawn(process.execPath, command.args, spawnOpts);
@@ -19,6 +19,7 @@ import {
19
19
  } from "../workflows/discover-workflows.ts";
20
20
  // Heavy runtime — lazy-loaded to avoid pulling team-runner into background-runner
21
21
  // at module load time. Only needed when a background run actually starts.
22
+ import { primePeerDep } from "./peer-dep.ts";
22
23
  import type { executeTeamRun as ExecuteTeamRunFn } from "./team-runner.ts";
23
24
  import type { TeamRunManifest, TeamTaskState } from "../state/types.ts";
24
25
 
@@ -27,6 +28,10 @@ async function executeTeamRun(
27
28
  ...args: Parameters<typeof ExecuteTeamRunFn>
28
29
  ): Promise<Awaited<ReturnType<typeof ExecuteTeamRunFn>>> {
29
30
  if (!_cachedExecuteTeamRun) {
31
+ // FIX (split-scope install): prime the ESM peer dep BEFORE team-runner is
32
+ // imported, so its transitive skill-instructions.ts can read getAgentDir()
33
+ // from the primed cache instead of crashing on `Cannot find module`.
34
+ await primePeerDep().catch(() => {});
30
35
  // LAZY: avoid pulling team-runner into background-runner at module load time.
31
36
  const mod = await import("./team-runner.ts");
32
37
  _cachedExecuteTeamRun = mod.executeTeamRun;
@@ -186,6 +186,29 @@ const RETRYABLE_MODEL_FAILURE_PATTERNS = [
186
186
  /\b502\b/,
187
187
  /\b503\b/,
188
188
  /\b504\b/,
189
+ //
190
+ // Provider-side 5xx / generic api_error. The pi-core retry layer already
191
+ // retries these (agent-session.ts matches `500|server error|internal error`),
192
+ // but the pi-crew MODEL FALLBACK layer must ALSO treat them as retryable so
193
+ // that when the provider is hard-down across all 3 provider retries, we fail
194
+ // over to the next configured model instead of giving up. Reported case
195
+ // (2026-06-17): `500 {"type":"error","error":{"type":"api_error",
196
+ // "message":"unknown error, 999 (1000)"}}` — a transient provider outage that
197
+ // should trigger the fallback chain, not abort.
198
+ //
199
+ // `api_error` is the Anthropic-style generic error type (vs rate_limit_error
200
+ // / overloaded_error / etc.) and almost always means a transient server fault.
201
+ //
202
+ // `unknown error` is the body of the generic message; `internal`/`server`
203
+ // catch the common phrasings. `\b500\b`/`\b501\b` catch the HTTP status in
204
+ // the rendered error string.
205
+ /\b500\b/,
206
+ /\b501\b/,
207
+ /api_error/i,
208
+ /unknown error/i,
209
+ /internal(?:_server)?[ _]error/i,
210
+ /server error/i,
211
+ /bad gateway/i,
189
212
  ];
190
213
 
191
214
  // These patterns indicate auth/key/billing issues that will never succeed on retry.
@@ -0,0 +1,296 @@
1
+ /**
2
+ * Robust resolution + async loading of the @earendil-works/pi-coding-agent
3
+ * peer dependency. Fixes the "Cannot find module '@earendil-works/pi-coding-agent'"
4
+ * crash that blocks ALL team runs when pi-crew and pi are installed in
5
+ * SEPARATE node_modules trees.
6
+ *
7
+ * PROBLEM (Windows / global installs — reported 2026-06-17)
8
+ * pi-crew is a pi EXTENSION. pi installs extensions under
9
+ * `~/.pi/agent/npm/node_modules/<ext>/`, but pi itself (the
10
+ * @earendil-works/pi-coding-agent package that extensions import from)
11
+ * usually lives in a DIFFERENT node_modules tree — a global one (nvm,
12
+ * %APPDATA%\npm, Volta, fnm, pnpm-global). Node's resolver only walks UP
13
+ * through ancestor `node_modules` of the importing file, so a file under
14
+ * `~/.pi/agent/npm/node_modules/pi-crew/...` CANNOT resolve a peer dep
15
+ * installed under `~/.nvm/.../lib/node_modules/`. Every static
16
+ * `import { X } from "@earendil-works/pi-coding-agent"` that executes inside
17
+ * a SPAWNED CHILD PROCESS (the detached background team runner started by
18
+ * async-runner.spawnBackgroundTeamRun) therefore crashes at module load,
19
+ * leaving all team runs permanently `queued`.
20
+ *
21
+ * ADDITIONAL CONSTRAINT (verified empirically 2026-06-17)
22
+ * pi-coding-agent ships as ESM-only (`"type":"module"`, exports map has only
23
+ * an `import` condition). CJS `require()` / `createRequire(dir)(name)` fails
24
+ * with ERR_PACKAGE_PATH_NOT_EXPORTED under plain node AND under jiti/tsx. The
25
+ * ONLY working load mechanism is a dynamic `import()` of the resolved ESM
26
+ * entry file URL. Hence: sync resolution of the DIR, async load of the MODULE.
27
+ *
28
+ * APPROACH
29
+ * - resolvePeerDep() (sync) — find the install dir across many layouts.
30
+ * - primePeerDep() (async) — dynamic-import the resolved entry, cache
31
+ * the module namespace. Memoized. Called
32
+ * once per process during bootstrap.
33
+ * - getAgentDir() (sync) — read the cached module's getAgentDir.
34
+ * Falls back to a computed default if the
35
+ * cache was never primed, so it NEVER throws.
36
+ */
37
+ import * as fs from "node:fs";
38
+ import * as os from "node:os";
39
+ import * as path from "node:path";
40
+ import { fileURLToPath, pathToFileURL } from "node:url";
41
+ import { resolveNpmGlobalRoot } from "./pi-spawn.ts";
42
+
43
+ /**
44
+ * The pi-coding-agent peer dependency package name(s) we can be loaded by.
45
+ * @earendil-works is the canonical scope; @mariozechner is the historical fork.
46
+ */
47
+ export const PEER_DEP_NAMES = [
48
+ "@earendil-works/pi-coding-agent",
49
+ "@mariozechner/pi-coding-agent",
50
+ ] as const;
51
+
52
+ /**
53
+ * Env var a parent pi-crew process sets on spawned children so they can resolve
54
+ * the peer dep WITHOUT running `npm root -g` (~200ms probe). The resolver
55
+ * checks this FIRST. Absent (older parent, direct invocation, tests) → falls
56
+ * through to the probing strategies. Also lets users override the resolution
57
+ * explicitly as a last-resort fix.
58
+ */
59
+ export const PEER_DEP_DIR_ENV = "PI_CREW_PEER_DEP_DIR";
60
+
61
+ type PeerDepModule = typeof import("@earendil-works/pi-coding-agent");
62
+
63
+ interface ResolvedPeerDep {
64
+ dir: string;
65
+ name: string;
66
+ /** file:// URL of the ESM entry (exports["."].import || main). */
67
+ mainUrl: string;
68
+ }
69
+
70
+ let cachedResolve: ResolvedPeerDep | undefined | null = null;
71
+ let cachedModule: PeerDepModule | undefined;
72
+ let primingPromise: Promise<PeerDepModule> | undefined;
73
+
74
+ /**
75
+ * Build the ordered list of "resolution bases" — paths to seed
76
+ * the manual `findPackageDir` walk from. That walk goes UP `node_modules` from each
77
+ * base's directory, so any base inside (or beside) the peer dep's package
78
+ * tree will find it. Pure given env/process inputs; exported for unit tests.
79
+ */
80
+ export function peerDepResolutionBases(): string[] {
81
+ const bases: string[] = [];
82
+
83
+ // 0. Parent-provided hint (fastest — no probe). Set by async-runner.
84
+ const envHint = process.env[PEER_DEP_DIR_ENV]?.trim();
85
+ if (envHint) bases.push(path.resolve(envHint));
86
+
87
+ // 1. This file's location — works when pi-crew and pi-coding-agent share a
88
+ // node_modules ancestor (the common co-located install).
89
+ bases.push(fileURLToPath(import.meta.url));
90
+
91
+ // 2. The entry script. In the PARENT (main pi process) argv[1] is pi's CLI
92
+ // script, which lives INSIDE pi-coding-agent's package → resolves. In a
93
+ // SPAWNED CHILD argv[1] is a pi-crew script → cheap miss, falls through.
94
+ const argv1 = process.argv[1];
95
+ if (argv1) bases.push(path.resolve(argv1));
96
+
97
+ // 3. The Node binary's global node_modules. Covers nvm / nvm-windows /
98
+ // Volta / fnm where pi-coding-agent is `npm i -g`'d: node is at
99
+ // <prefix>/bin/node and globals live at <prefix>/lib/node_modules.
100
+ try {
101
+ const execDir = path.dirname(fs.realpathSync.native(process.execPath));
102
+ bases.push(path.join(path.dirname(execDir), "lib", "node_modules"));
103
+ // Some layouts (Windows global, or a bare node_modules sibling of bin).
104
+ bases.push(path.join(execDir, "node_modules"));
105
+ } catch {
106
+ /* realpath best-effort */
107
+ }
108
+
109
+ // 4. `npm root -g` — the canonical cross-layout global root (memoized in
110
+ // pi-spawn.ts, ~200ms once). Derive the scoped package dirs from it.
111
+ const npmRoot = resolveNpmGlobalRoot();
112
+ if (npmRoot) {
113
+ for (const pkgName of PEER_DEP_NAMES) {
114
+ bases.push(path.join(npmRoot, ...pkgName.split("/")));
115
+ }
116
+ }
117
+
118
+ // 5. Windows %APPDATA%\npm static layout (legacy npm-global, pre-npm-root-g).
119
+ if (process.env.APPDATA) {
120
+ bases.push(path.join(process.env.APPDATA, "npm", "node_modules"));
121
+ }
122
+
123
+ return bases;
124
+ }
125
+
126
+ /** Pull the ESM entry path out of package.json (exports import || main). */
127
+ function extractEsmMain(pkg: unknown): string | undefined {
128
+ if (!pkg || typeof pkg !== "object") return undefined;
129
+ const p = pkg as Record<string, unknown>;
130
+ const exp = p.exports;
131
+ if (exp && typeof exp === "object") {
132
+ const dot = (exp as Record<string, unknown>)["."];
133
+ if (dot && typeof dot === "object") {
134
+ const d = dot as Record<string, unknown>;
135
+ const rel = d.import ?? d.default ?? d.module;
136
+ if (typeof rel === "string") return rel;
137
+ } else if (typeof dot === "string") {
138
+ return dot;
139
+ }
140
+ }
141
+ const main = p.main;
142
+ return typeof main === "string" ? main : undefined;
143
+ }
144
+
145
+ /**
146
+ * Walk the node_modules resolution algorithm MANUALLY from `start` looking for
147
+ * any of `names`. We do NOT use createRequire/require.resolve here because
148
+ * pi-coding-agent ships an ESM-only package with a restrictive exports map
149
+ * (only the `.` import condition) — `require.resolve("<pkg>/package.json")`
150
+ * and `require.resolve("<pkg>")` both throw ERR_PACKAGE_PATH_NOT_EXPORTED.
151
+ * Reading package.json directly from the walked dir sidesteps the exports map
152
+ * entirely (exports only governs subpath IMPORTS, not raw file reads).
153
+ *
154
+ * At each directory we check BOTH `<dir>/node_modules/<pkg>` (the standard
155
+ * container case) AND `<dir>/<pkg>` (handles a base that IS a node_modules
156
+ * dir, e.g. the output of `npm root -g`), then walk up to root.
157
+ */
158
+ function findPackageDir(
159
+ start: string,
160
+ names: readonly string[],
161
+ ): { dir: string; name: string } | undefined {
162
+ let dir = path.resolve(start);
163
+ try {
164
+ if (fs.statSync(dir).isFile()) dir = path.dirname(dir);
165
+ } catch {
166
+ /* treat as directory */
167
+ }
168
+ while (true) {
169
+ for (const name of names) {
170
+ const segs = name.split("/");
171
+ const candidates = [
172
+ path.join(dir, "node_modules", ...segs, "package.json"),
173
+ path.join(dir, ...segs, "package.json"),
174
+ ];
175
+ for (const pkgJson of candidates) {
176
+ try {
177
+ const pkg = JSON.parse(fs.readFileSync(pkgJson, "utf-8"));
178
+ if (pkg?.name === name) {
179
+ return { dir: path.dirname(pkgJson), name };
180
+ }
181
+ } catch {
182
+ /* not present at this candidate */
183
+ }
184
+ }
185
+ }
186
+ const parent = path.dirname(dir);
187
+ if (parent === dir) break; // reached filesystem root
188
+ dir = parent;
189
+ }
190
+ return undefined;
191
+ }
192
+
193
+ function tryResolveFrom(base: string): ResolvedPeerDep | undefined {
194
+ const found = findPackageDir(base, PEER_DEP_NAMES);
195
+ if (!found) return undefined;
196
+ try {
197
+ const pkg = JSON.parse(
198
+ fs.readFileSync(path.join(found.dir, "package.json"), "utf-8"),
199
+ );
200
+ const mainRel = extractEsmMain(pkg);
201
+ if (!mainRel) return undefined;
202
+ const mainAbs = path.resolve(found.dir, mainRel);
203
+ if (!fs.existsSync(mainAbs)) return undefined;
204
+ return { dir: found.dir, name: found.name, mainUrl: pathToFileURL(mainAbs).href };
205
+ } catch {
206
+ return undefined;
207
+ }
208
+ }
209
+
210
+ /** Resolve the peer dep install dir + ESM entry URL. Memoized (sync). */
211
+ export function resolvePeerDep(): ResolvedPeerDep | undefined {
212
+ if (cachedResolve !== null) return cachedResolve ?? undefined;
213
+ for (const base of peerDepResolutionBases()) {
214
+ const found = tryResolveFrom(base);
215
+ if (found) {
216
+ cachedResolve = found;
217
+ return found;
218
+ }
219
+ }
220
+ cachedResolve = undefined; // mark attempted-and-failed (null means "not yet attempted"); don't re-probe per call
221
+ return undefined;
222
+ }
223
+
224
+ /** Just the install directory (for env-hint propagation to children). */
225
+ export function resolvePeerDepDir(): string | undefined {
226
+ return resolvePeerDep()?.dir;
227
+ }
228
+
229
+ /**
230
+ * Dynamic-import the peer dep module, caching the namespace. Memoized via a
231
+ * shared promise so concurrent callers share one load. On failure the promise
232
+ * is cleared so a later caller can retry. Safe to call repeatedly.
233
+ */
234
+ export function primePeerDep(): Promise<PeerDepModule> {
235
+ if (cachedModule) return Promise.resolve(cachedModule);
236
+ if (primingPromise) return primingPromise;
237
+ primingPromise = (async () => {
238
+ const resolved = resolvePeerDep();
239
+ if (!resolved) {
240
+ throw new Error(buildMissingMessage());
241
+ }
242
+ cachedModule = (await import(resolved.mainUrl)) as PeerDepModule;
243
+ return cachedModule;
244
+ })();
245
+ // Clear on failure so a later caller can retry (e.g. after env fix).
246
+ primingPromise.catch(() => {
247
+ primingPromise = undefined;
248
+ });
249
+ return primingPromise;
250
+ }
251
+
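The shared-promise pattern used by `primePeerDep` can be sketched in isolation. The names `shareLoad`, `flaky`, and `demo` are hypothetical; this is an illustration of the technique under the assumption that the loaded value is never `undefined`, not the package's own code.

```typescript
// Hypothetical generic version of the pattern: one in-flight promise is shared,
// a resolved value is cached, and a rejection clears the slot so a later
// caller can retry.
function shareLoad<T>(load: () => Promise<T>): () => Promise<T> {
  let value: T | undefined;
  let pending: Promise<T> | undefined;
  return () => {
    if (value !== undefined) return Promise.resolve(value);
    if (pending) return pending;
    pending = load().then((v) => {
      value = v;
      return v;
    });
    // Clear on failure; callers still observe the rejection via `pending`.
    pending.catch(() => {
      pending = undefined;
    });
    return pending;
  };
}

let calls = 0;
const flaky = shareLoad(async () => {
  calls += 1;
  if (calls === 1) throw new Error("first attempt fails");
  return { loadedOn: calls };
});

async function demo() {
  // Two concurrent callers receive the same promise, so only one load runs.
  const p1 = flaky();
  const p2 = flaky();
  const sharedFailure = p1 === p2;
  let failed = false;
  try {
    await p1;
  } catch {
    failed = true;
  }
  // The failed promise was cleared, so these calls trigger one fresh load.
  const p3 = flaky();
  const p4 = flaky();
  const sharedRetry = p3 === p4;
  const result = await p3;
  return { calls, sharedFailure, failed, sharedRetry, loadedOn: result.loadedOn };
}
```

Attaching the clearing `.catch` to the stored promise, instead of returning the caught promise, keeps the rejection visible to every caller that is awaiting the original.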
252
+ /** Async module accessor (primes if needed). */
253
+ export async function loadPeerDep(): Promise<PeerDepModule> {
254
+ return primePeerDep();
255
+ }
256
+
257
+ function buildMissingMessage(): string {
258
+ return (
259
+ `pi-crew could not resolve the @earendil-works/pi-coding-agent peer dependency.\n` +
260
+ `This usually means pi-crew and pi are installed in separate node_modules trees\n` +
261
+ `(e.g. pi-crew under ~/.pi/agent/npm/ but pi under an nvm/Volta/fnm global scope).\n` +
262
+ `Resolution bases tried:\n` +
263
+ peerDepResolutionBases().map((b) => ` - ${b}`).join("\n") +
264
+ `\nFix: install pi-crew in the SAME scope as pi, e.g.\n` +
265
+ ` npm install -g @earendil-works/pi-crew\n` +
266
+ `or set the env var ${PEER_DEP_DIR_ENV}=<path to the pi-coding-agent package dir>.`
267
+ );
268
+ }
269
+
270
+ /**
271
+ * Read the user agent dir via the REAL peer-dep getAgentDir (fork-aware:
272
+ * correct for pi, tau, and renamed forks). Sync; reads the primed cache.
273
+ *
274
+ * If the cache was never primed (e.g. called before bootstrap completes, or
275
+ * prime failed), falls back to a computed default so it NEVER throws. The
276
+ * default matches standard pi (`~/.pi/agent`) and respects the
277
+ * `PI_CODING_AGENT_DIR` override — correct for the overwhelmingly common
278
+ * case. Forks rely on the primed real function (register.ts primes at startup).
279
+ */
280
+ export function getAgentDir(): string {
281
+ if (cachedModule?.getAgentDir) {
282
+ try {
283
+ return cachedModule.getAgentDir();
284
+ } catch {
285
+ /* fall through to computed default */
286
+ }
287
+ }
288
+ return process.env.PI_CODING_AGENT_DIR || path.join(os.homedir(), ".pi", "agent");
289
+ }
290
+
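The fallback order in `getAgentDir` can be exercised with a small stand-in. `agentDirOrDefault` and the paths below are hypothetical; the module and environment are passed as parameters purely so the three cases can be shown side by side.

```typescript
import * as os from "node:os";
import * as path from "node:path";

type AgentModule = { getAgentDir?: () => string };

// Hypothetical stand-in for the accessor above: prefer the primed module,
// then the PI_CODING_AGENT_DIR override, then the standard ~/.pi/agent.
function agentDirOrDefault(
  mod: AgentModule | undefined,
  env: Record<string, string | undefined>,
): string {
  if (mod?.getAgentDir) {
    try {
      return mod.getAgentDir();
    } catch {
      /* fall through to computed default */
    }
  }
  return env.PI_CODING_AGENT_DIR || path.join(os.homedir(), ".pi", "agent");
}

// Primed cache: the real function wins (illustrative fork path).
const primed = agentDirOrDefault({ getAgentDir: () => "/forks/tau/agent" }, {});
// Not primed: the env override is used.
const overridden = agentDirOrDefault(undefined, { PI_CODING_AGENT_DIR: "/custom/agent" });
// Primed but throwing: the computed default is returned instead of an error.
const throwing = agentDirOrDefault(
  {
    getAgentDir: () => {
      throw new Error("boom");
    },
  },
  {},
);
```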
291
+ /** @internal — reset all caches for unit tests. */
292
+ export function __resetPeerDepCacheForTest(): void {
293
+ cachedResolve = null;
294
+ cachedModule = undefined;
295
+ primingPromise = undefined;
296
+ }
@@ -53,7 +53,10 @@ const HOT_MODULE_SPECIFIERS = [
53
53
  "./live-session-runtime.ts",
54
54
  "./task-runner.ts",
55
55
  "../extension/team-tool.ts",
56
+ "../extension/team-tool/run.ts", // handleRun path — latched in team-tool.ts but its static graph (config-patch, validate-workflow) still cold-start-races under concurrent fanout
57
+ "../extension/team-tool/config-patch.ts", // effectiveRunConfig (crash variant observed in repro)
56
58
  "../extension/validate-resources.ts",
59
+ "../workflows/validate-workflow.ts", // validateWorkflowForTeam (crash variant observed across sessions)
57
60
  "../state/crew-init.ts", // TDZ-prone top-level consts (CREW_README); dynamically imported by team-tool/run.ts
58
61
  ] as const;
59
62
 
@@ -22,7 +22,11 @@ const PACKAGE_SKILLS_DIR = path.resolve(
22
22
  "skills",
23
23
  );
24
24
  import * as os from "node:os";
25
- import { getAgentDir } from "@earendil-works/pi-coding-agent";
25
+ // peer-dep.ts resolves @earendil-works/pi-coding-agent robustly across install
26
+ // layouts (extension-under-~/.pi + pi-under-global). A static `import { getAgentDir }`
27
+ // here crashes detached child processes when pi-crew and pi live in separate
28
+ // node_modules trees. See src/runtime/peer-dep.ts.
29
+ import { getAgentDir } from "../runtime/peer-dep.ts";
26
30
  const MAX_SKILL_CHARS = 1500;
27
31
  const MAX_TOTAL_CHARS = 6000;
28
32
  const MAX_SKILL_NAME_CHARS = 80;
@@ -2,7 +2,9 @@ import * as fs from "node:fs";
2
2
  import * as os from "node:os";
3
3
  import * as path from "node:path";
4
4
  import { fileURLToPath } from "node:url";
5
- import { getAgentDir } from "@earendil-works/pi-coding-agent";
5
+ // peer-dep.ts resolves @earendil-works/pi-coding-agent robustly across install
6
+ // layouts. See src/runtime/peer-dep.ts (split-scope install fix).
7
+ import { getAgentDir } from "../runtime/peer-dep.ts";
6
8
  import { logInternalError } from "../utils/internal-error.ts";
7
9
  import { isSafePathId, resolveContainedPath, resolveRealContainedPath } from "../utils/safe-paths.ts";
8
10