pi-crew 0.9.5 → 0.9.7

@@ -15,15 +15,22 @@ and intermediate data out of the main context window.
15
15
  export default async function (ctx) {
16
16
  const endpoints = [/* ... */];
17
17
  const shards = chunk(endpoints, 3);
18
+
19
+ ctx.phase("Scan"); // round-12: mark the start of a logical phase
18
20
  const reports = await ctx.fanOut(shards, 3, (s) =>
19
21
  ctx.agent({ role: "explorer", prompt: `Audit ${s.join(",")} for auth + input validation` })
20
22
  );
23
+
24
+ ctx.phase("Synthesize");
21
25
  const synth = await ctx.agent({ role: "analyst", prompt: "Merge + dedupe findings", inputs: reports.map(r => r.artifactPath) });
26
+
27
+ ctx.phase("Review");
22
28
  for (let i = 0; i < 3; i++) {
23
29
  const review = await ctx.review(synth.taskId, "reviewer");
24
30
  if (review.outcome === "accept") break;
25
31
  await ctx.retry(synth.taskId, { feedback: review.feedback });
26
32
  }
33
+
27
34
  ctx.setResult(synth.artifactPath, { summary: "security audit complete" });
28
35
  }
29
36
  ```
@@ -42,20 +49,158 @@ Slash command: `/workflows` lists all workflows (static + dynamic).
42
49
 
43
50
  | Method | Purpose |
44
51
  |---|---|
45
- | `ctx.agent({role, prompt, model?, skill?, maxTurns?, inputs?})` | Spawn one agent, await `{ok, text, structured, artifactPath, usage}`. Concurrency enforced by `ctx.semaphore`. |
52
+ | `ctx.agent({role, prompt, model?, skill?, maxTurns?, inputs?, schema?, worktree?})` | Spawn one agent, await `{ok, text, structured, artifactPath, usage}`. Concurrency enforced by `ctx.semaphore`. `schema?` (round-13) is a TypeBox schema — when set, output is validated and mismatch yields `ok:false`. `worktree?` (round-17) spawns the agent in an isolated git worktree (default false; falls back to normal cwd + warning in a non-git repo). |
46
53
  | `ctx.fanOut(items, limit, fn)` | Bounded parallel fan-out (wraps `mapConcurrent`). |
54
+ | `ctx.pipeline(items, ...stages)` | **round-16.** Multi-stage pipeline: each item passes through all stages sequentially; different items run concurrently (bounded by `ctx.semaphore`). A failed stage yields `null` for that item (logged via `ctx.log`) and other items continue. Aborts propagate. Returns `(TResult\|null)[]`. |
47
55
  | `ctx.review(taskId, reviewerRole?)` | Run a reviewer; parse `{outcome, feedback}`. |
48
56
  | `ctx.retry(taskId, {feedback?})` | Re-run with feedback (wraps `executeWithRetry`). |
49
57
  | `ctx.mail(to, body, opts?)` | Mailbox message to another agent/leader. |
50
58
  | `ctx.gatherReplies(ids, deadlineMs)` | Block until N replies arrive or deadline. |
51
59
  | `ctx.renderTemplate(name, vars)` | Render a built-in plan template. |
52
- | `ctx.vars` | Script-local variables. |
60
+ | `ctx.vars` | Script-local variables. (round-18: hydrated from the last checkpoint on resume — see [Resume & Checkpoint](#resume--checkpoint-round-18-p2-3).) |
61
+ | `ctx.phase(title)` | Mark the start of a named workflow phase. Emits `dwf.phase_started` (and `dwf.phase_completed` for the previous phase, if any) to the run's events.jsonl. Idempotent on the same title. Phase events let downstream consumers (UI, log readers) group agents by logical phase. |
62
+ | `ctx.log(message)` | **round-14.** Append a workflow-level log line. Stringifies non-strings, keeps a bounded in-memory copy (capped at 1000), and always emits a `dwf.log` event (`{message}`) to `events.jsonl`. |
63
+ | `ctx.budget` | **round-14.** Frozen `{total, spent(), remaining()}` token-budget surface. `total` is `null` when unbounded (default). `ctx.agent()` auto-rejects with `ok:false` (`"workflow token budget exhausted"`) once exhausted. `spent()` accumulates each agent run's reported usage. Set via `workflow.maxTokenBudget` or the run `tokenBudget` param. |
64
+ | `ctx.args<T>()` | **round-14.** Typed workflow arguments (sourced from `manifest.args`, passed via the run `args` param). Defaults to `{}`. Narrow with a generic: `ctx.args<{target:string}>()`. |
53
65
  | `ctx.setResult(artifactPath, meta?)` | Mark the final result. ONLY this reaches the main context. |
54
66
 
55
67
  `ctx.agent({role})` resolves the role to an `AgentConfig` via 4-tier precedence:
56
68
  explicit `agent` name → `team.roles[].agent` → `discoverAgents` by name → synthesize
57
69
  minimal (`source:'dynamic'`).
58
70
 
71
+ ### Pipeline (round-16 P2-1)
72
+
73
+ `ctx.pipeline(items, ...stages)` runs a **multi-stage transform** over a list of
74
+ items. Unlike `ctx.fanOut()` (a single parallel map), the pipeline chains stages:
75
+
76
+ each item flows through **all stages in sequence** (stage 1 → stage 2 → …), while
77
+ **different items run concurrently**, bounded by `ctx.semaphore` (the workflow
78
+ concurrency). Each stage receives `(previous, original, index)` — `previous` is the
79
+ prior stage's output (the raw item for the first stage), `original` is the unchanged
80
+ input item, and `index` is the item position. This mirrors `reduce`, but parallelized
81
+ across items.
82
+
83
+ - A stage that **throws** yields `null` for that item, logs `pipeline[i] failed: <msg>`
84
+ via `ctx.log()`, and the **other items continue**.
85
+ - On **abort**, the error propagates (it is not swallowed into `null`).
86
+ - Returns `(TResult | null)[]` (order-preserving).
87
+
88
+ ```ts
89
+ // scan → analyze → review each shard, up to `concurrency` shards at a time.
90
+ const verdicts = await ctx.pipeline(
91
+ shards,
92
+ (s) => ctx.agent({ role: "scanner", prompt: `scan ${s}` }),
93
+ (prev) => ctx.agent({ role: "analyst", prompt: `analyze ${prev.text}` }),
94
+ (prev) => ctx.review(prev.taskId ?? "", "reviewer"),
95
+ );
96
+ // verdicts[i] is null if shard i failed at any stage; others are unaffected.
97
+ ```
98
+
99
+ `pipeline` uses the same bounded-concurrency primitive as `fanOut` (`mapConcurrent`),
100
+ so item-level parallelism respects the workflow's configured concurrency. Stages that
101
+ spawn agents additionally acquire `ctx.semaphore` for agent-level throttling.
102
+
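The semantics described above (stages in order per item, items in parallel up to a limit, `null` for a failed item) can be sketched as follows. This is an illustrative re-implementation, not pi-crew's code: `pipelineSketch`, `Stage`, and the explicit `limit`/`log` parameters are hypothetical stand-ins for `ctx.pipeline`, `ctx.semaphore`, and `ctx.log`, and abort propagation is not modeled.

```typescript
// Hypothetical names throughout; this only mirrors the documented behavior.
type Stage = (previous: unknown, original: unknown, index: number) => unknown | Promise<unknown>;

async function pipelineSketch(
  items: unknown[],
  limit: number,
  log: (msg: string) => void,
  ...stages: Stage[]
): Promise<unknown[]> {
  const results: unknown[] = new Array(items.length).fill(null);
  let next = 0;

  // Each worker claims the next unprocessed index, so at most `limit` items are in flight.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      try {
        let value: unknown = items[i];
        for (const stage of stages) {
          value = await stage(value, items[i], i); // stages run strictly in order per item
        }
        results[i] = value;
      } catch (err) {
        // A throwing stage nulls only this item; the worker moves on to the next one.
        log(`pipeline[${i}] failed: ${err instanceof Error ? err.message : String(err)}`);
        results[i] = null;
      }
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as `items`
}
```

Writing results by index (rather than pushing in completion order) is what keeps the output order-preserving even when items finish out of order.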
103
+ ### Phases (round-12)
104
+
105
+ `ctx.phase(title)` lets the script mark logical phases. Each call:
106
+
107
+ - Emits a `dwf.phase_started` event with `{phase: title}` to the run's `events.jsonl`.
108
+ - If a previous phase is still open, emits a `dwf.phase_completed` event for it
109
+ **before** opening the new one (so consumers never see two open phases at once).
110
+ - Is idempotent: calling `ctx.phase("Scan")` twice does not emit a duplicate event.
111
+ - Validates the title (non-empty string, otherwise `TypeError`).
112
+ - Caps the in-memory `phases[]` list at 100 distinct titles (events still flow past
113
+ the cap; the events log is the durable source of truth).
114
+ - The runner auto-closes the last open phase when the script returns, so
115
+ `dwf.completed` is always preceded by a matching `dwf.phase_completed`.
116
+
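The bookkeeping in the list above can be sketched as a small tracker. This is a hedged illustration, not the runner's implementation: `PhaseTracker` and its `events` array (standing in for appends to `events.jsonl`) are hypothetical, and "idempotent" is interpreted here as "a repeat of the currently open title is a no-op".

```typescript
// Hypothetical sketch; event names follow the text, everything else is assumed.
type PhaseEvent = { type: "dwf.phase_started" | "dwf.phase_completed"; phase: string };

class PhaseTracker {
  private current: string | null = null;
  private readonly maxPhases = 100;
  readonly phases: string[] = [];
  readonly events: PhaseEvent[] = []; // stands in for the durable events log

  phase(title: string): void {
    if (typeof title !== "string" || title.length === 0) {
      throw new TypeError("phase title must be a non-empty string");
    }
    if (title === this.current) return; // repeat of the open phase: no duplicate event
    this.close(); // complete the previous phase before opening the new one
    this.events.push({ type: "dwf.phase_started", phase: title });
    // Only the in-memory list is capped; events keep flowing past the cap.
    if (!this.phases.includes(title) && this.phases.length < this.maxPhases) {
      this.phases.push(title);
    }
    this.current = title;
  }

  // The runner would call this when the script returns, so no phase is left open.
  close(): void {
    if (this.current !== null) {
      this.events.push({ type: "dwf.phase_completed", phase: this.current });
      this.current = null;
    }
  }
}
```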
117
+ #### Phase UI display (round-15 P1-4)
118
+
119
+ The progress pane now **consumes** the `dwf.phase_started` / `dwf.phase_completed`
120
+ events and renders a phase overview with status markers:
121
+
122
+ ```
123
+ Progress pane: 2/4 completed · running=2 queued=0 failed=0
124
+ ── DWF Phases ──
125
+ ✓ Phase: Scan
126
+ ▶ Phase: Plan
127
+ ⏸ Phase: Review
128
+ ...
129
+ ```
130
+
131
+ - `▶ Phase: <name>` — the currently running phase.
132
+ - `✓ Phase: <name>` — a completed phase.
133
+ - `⏸ Phase: <name>` — a phase whose completion scrolled out of the recent-event
134
+ window and is not the current one (indeterminate).
135
+
136
+ Phase state is derived purely from the tailed `recentEvents` window (no extra
137
+ I/O), so this is **backward compatible**: non-DWF runs (static workflows,
138
+ goal-loops) produce no `dwf.phase_*` events and show no phase markers at all.
139
+ For terminals that mis-render the Unicode glyphs, ASCII fallbacks
140
+ (`[>]`/`[v]`/`[ ]`) are available via `renderDwfPhaseLines(state, { ascii: true })`.
141
+
142
+ ### Log API (round-14 P1-3)
143
+
144
+ `ctx.log(message)` appends a workflow-level log line. It stringifies non-string
145
+ values (`JSON.stringify`), keeps a bounded in-memory copy (capped at **1000**
146
+ entries), and always emits a durable `dwf.log` event (`{message}`) to the run's
147
+ `events.jsonl`. The events log is the source of truth; the in-memory buffer is
148
+ only for convenience/bounded telemetry.
149
+
150
+ ```ts
151
+ ctx.log("scan complete");
152
+ ctx.log({ findings: 3, warnings: [] }); // stringified to '{"findings":3,"warnings":[]}'
153
+ ```
154
+
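A minimal sketch of the "bounded buffer plus durable event" split described above. `createLog` and its `emit` callback are hypothetical, and dropping the oldest entry when the cap is hit is an assumption (the text only says the buffer is capped).

```typescript
// Hypothetical sketch; `emit` stands in for the append to events.jsonl.
type LogEvent = { type: "dwf.log"; message: string };

function createLog(emit: (event: LogEvent) => void, cap = 1000) {
  const buffer: string[] = [];
  return {
    buffer,
    log(message: unknown): void {
      // JSON.stringify returns undefined for some inputs (e.g. undefined), hence the fallback.
      const text = typeof message === "string" ? message : JSON.stringify(message) ?? String(message);
      emit({ type: "dwf.log", message: text }); // always emitted, even past the cap
      buffer.push(text);
      if (buffer.length > cap) buffer.shift(); // assumed policy: drop the oldest entry
    },
  };
}
```

Because the event is emitted before the buffer is trimmed, a reader of the events log sees every line even when the in-memory copy has rotated.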
155
+ ### Token budget (round-14 P1-2)
156
+
157
+ `ctx.budget` is a frozen `{total, spent(), remaining()}` surface. When a
158
+ per-workflow token budget is set, `ctx.agent()` auto-rejects with `ok:false`
159
+ (`"workflow token budget exhausted"`) once exhausted — **before** spawning a
160
+ child worker, so no tokens are wasted past the limit.
161
+
162
+ - `total` is `null` (unbounded) by default; `remaining()` is `Infinity` then.
163
+ - `spent()` accumulates each `ctx.agent()` run's reported `usage.input + usage.output`.
164
+ - Set it via the workflow's `maxTokenBudget` field, or the run `tokenBudget` param
165
+ (the param overrides the workflow value).
166
+
167
+ ```ts
168
+ if (ctx.budget.total !== null && ctx.budget.remaining() < 500) {
169
+ ctx.log("approaching budget limit");
170
+ }
171
+ ```
172
+
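The budget surface can be pictured as a frozen read-only view over a private counter. The sketch below is an assumption-laden illustration: `createBudget`, `record`, and `guard` are hypothetical names, and clamping `remaining()` at zero is a choice made here, not something the text specifies.

```typescript
// Hypothetical sketch of a frozen budget view plus the pre-spawn check.
interface Usage { input: number; output: number; }

function createBudget(total: number | null) {
  let spent = 0;
  const budget = Object.freeze({
    total,
    spent: () => spent,
    remaining: () => (total === null ? Infinity : Math.max(0, total - spent)),
  });
  return {
    budget,
    // Accumulate one agent run's reported usage.
    record(usage: Usage): void {
      spent += usage.input + usage.output;
    },
    // Evaluated before spawning, so an exhausted budget never starts a new worker.
    guard(): { ok: true } | { ok: false; error: string } {
      return budget.remaining() <= 0
        ? { ok: false, error: "workflow token budget exhausted" }
        : { ok: true };
    },
  };
}
```

Freezing the object prevents a script from overwriting `total` or the accessors, while the closure keeps the running count mutable for the runner.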
173
+ ### Typed args (round-14 P1-5)
174
+
175
+ `ctx.args<T>()` returns typed workflow arguments (sourced from `manifest.args`,
176
+ passed via the run `args` param). Defaults to `{}` when unset. Narrow with a
177
+ generic so the rest of your script is type-checked:
178
+
179
+ ```ts
180
+ const { target, retries } = ctx.args<{ target: string; retries: number }>();
181
+ ```
182
+
183
+ ### Authoring types / IDE IntelliSense (round-14 P1-1)
184
+
185
+ For TypeScript IntelliSense in `.dwf.ts` scripts, import the authoring types from
186
+ the package's `./workflow` export (`types/dwf.d.ts`):
187
+
188
+ ```ts
189
+ import type { WorkflowCtx } from "pi-crew/workflow";
190
+
191
+ export default async function run(ctx: WorkflowCtx): Promise<void> {
192
+ ctx.phase("scan");
193
+ ctx.log("starting");
194
+ const res = await ctx.agent({ role: "explorer", prompt: "survey" });
195
+ const { target } = ctx.args<{ target: string }>();
196
+ ctx.setResult(res.artifactPath ?? "", { target });
197
+ }
198
+ ```
199
+
200
+ The package self-references via its `exports` map, so this resolves from within
201
+ any project that depends on `pi-crew`. The interfaces mirror the runtime types in
202
+ `src/runtime/dynamic-workflow-context.ts` (authoring-only — no runtime values).
203
+
59
204
  ## Security model (IMPORTANT)
60
205
 
61
206
  `.dwf.ts` files are **postinstall-equivalent trust** — treat them as `node script.js`.
@@ -81,6 +226,114 @@ script can reach `process`/`require` directly or via constructor walking. The
81
226
  (e.g. `require('child'+'_process')`, `globalThis.process.mainModule.require`).
82
227
  The real boundary is commit-review + the path-allowlist, not the content check.
83
228
 
229
+ ## Determinism (round-13 P0-2)
230
+
231
+ Dynamic workflow scripts must be **deterministic** — the runner rejects
232
+ `Date.now()`, `Math.random()`, and `new Date()` at workflow-load time so that
233
+ two runs of the same script against the same inputs produce the same outputs.
234
+
235
+ The check uses an **AST walk** (not regex) so that:
236
+
237
+ - Prompts mentioning `Date.now()` as a string literal are accepted.
238
+ - Comments mentioning `Math.random()` are accepted.
239
+ - `Date.parse()`, `Date.UTC()`, `Math.floor()`, etc. are accepted (only `now`
240
+ and `random` are blocked).
241
+ - `Date["now"]()` is also blocked — the bracket-property is resolved to the
242
+ string `"now"` statically before the comparison.
243
+
244
+ **Escape hatch:** set `PI_CREW_DWF_SKIP_DETERMINISM_CHECK=1` to bypass the
245
+ check (intended for benchmark scripts that intentionally depend on time or
246
+ randomness). The check is **enabled by default**.
247
+
248
+ ```ts
249
+ // .crew/workflows/deterministic.dwf.ts
250
+ export default async function (ctx) {
251
+ // OK: Date.parse and Math.floor are permitted.
252
+ const ts = Date.parse("2024-01-01");
253
+ const rounded = Math.floor(3.14);
254
+
255
+ // OK: Date.now() in a string literal.
256
+ const label = "Date.now() is forbidden at runtime";
257
+
258
+ // REJECTED at load time:
259
+ // const t = Date.now();
260
+ // const r = Math.random();
261
+ // const d = new Date();
262
+ }
263
+ ```
264
+
265
+ When the check fails, the runner throws a clear error before `jiti` executes
266
+ the script:
267
+
268
+ ```
269
+ Workflow scripts must be deterministic: Date.now()/Math.random()/new Date() are
270
+ unavailable. These introduce non-reproducible behavior across runs. Use ctx.vars
271
+ for cached state, or pass a fixed seed via ctx.setArgs(). To bypass this check
272
+ (escape hatch), set PI_CREW_DWF_SKIP_DETERMINISM_CHECK=1.
273
+ ```
274
+
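The error text above points at a fixed seed as the alternative to `Math.random()`. One way to do that, sketched here under assumptions, is a small seeded generator: `mulberry32` is a well-known 32-bit PRNG (not cryptographically secure), while `seededShuffle` and the idea of passing `seed` through workflow args are illustrative choices, not part of pi-crew.

```typescript
// mulberry32: deterministic for a given seed; uses only Math.imul and bit ops.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Fisher-Yates shuffle driven by the seeded generator (hypothetical helper).
function seededShuffle<T>(items: readonly T[], seed: number): T[] {
  const rand = mulberry32(seed);
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```

In a script this might be wired as `const { seed } = ctx.args<{ seed: number }>();` followed by `seededShuffle(endpoints, seed)`, so two runs with the same args visit the endpoints in the same order.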
275
+ ## Structured output (round-13 P0-3)
276
+
277
+ Dynamic workflow scripts can request **typed JSON output** from `ctx.agent()` by
278
+ passing a TypeBox `schema` in the call opts. When set, the runner validates the
279
+ extracted JSON against the schema and returns `ok:false` with a clear error on
280
+ mismatch.
281
+
282
+ ```ts
283
+ // .crew/workflows/typed-agent.dwf.ts
284
+ import { Type, type Static } from "@sinclair/typebox";
285
+
286
+ const ReviewSchema = Type.Object({
287
+ outcome: Type.Union([
288
+ Type.Literal("accept"),
289
+ Type.Literal("reject"),
290
+ Type.Literal("changes_requested"),
291
+ ]),
292
+ feedback: Type.String(),
293
+ });
294
+ type Review = Static<typeof ReviewSchema>;
295
+
296
+ export default async function (ctx) {
297
+ const result = await ctx.agent({
298
+ role: "reviewer",
299
+ prompt: "Review the diff and judge.",
300
+ schema: ReviewSchema, // <-- new round-13 field
301
+ });
302
+ if (!result.ok) {
303
+ // result.error explains what didn't match.
304
+ ctx.setResult("/tmp/error.md", { error: result.error });
305
+ return;
306
+ }
307
+ const review = result.structured as Review;
308
+ // review is now type-checked as Review.
309
+ ctx.setResult("/tmp/review.md", { review });
310
+ }
311
+ ```
312
+
313
+ Backwards compatibility: when `schema` is **omitted**, behavior is identical to
314
+ the previous regex-based extractor. Existing scripts that don't pass a schema
315
+ continue to work unchanged.
316
+
317
+ **How it works:** the runner appends a JSON-output instruction to both the agent's
318
+ system prompt (so it knows the expected shape) and the user prompt (so the
319
+ output directive is the last thing the model reads). After the agent emits its
320
+ final text, the runner validates against the schema using `Value.Check`.
321
+ Validation failure surfaces as `ok:false, error: "structured output does not
322
+ match schema: ..."`.
323
+
324
+ ## Abort listener cleanup (round-13 P0-5)
325
+
326
+ `runChildPi` registers two abort listeners on the parent signal (the `abort`
327
+ handler that cancels the child process and the `onParentAbort` handler that
328
+ sets the internal `abortDueToParentSignal` flag). Both are removed in the
329
+ `settle()` function so they do not leak when many child-pi calls share one
330
+ AbortSignal (the common pattern under `background-runner`).
331
+
332
+ The fix was originally landed in round 27 (BUG 4). Round-13's audit confirmed
333
+ the cleanup is correct: both `input.signal?.removeEventListener("abort", ...)`
334
+ calls fire before `settle()` returns, regardless of whether the run completed
335
+ normally, hit a timeout, or was aborted. No code changes were needed.
336
+
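The register-then-remove pattern described above, reduced to its essentials. `watchAbort` is a hypothetical helper written for this illustration; only the two listener roles and the `settle()` cleanup come from the text.

```typescript
// Hypothetical reduction of the pattern; not the body of runChildPi.
function watchAbort(signal: AbortSignal, cancelChild: () => void) {
  let abortDueToParentSignal = false;
  let settled = false;

  const onAbort = () => cancelChild();
  const onParentAbort = () => { abortDueToParentSignal = true; };
  signal.addEventListener("abort", onAbort);
  signal.addEventListener("abort", onParentAbort);

  return {
    wasParentAbort: () => abortDueToParentSignal,
    // Runs on every exit path (normal, timeout, abort) and detaches both listeners,
    // so a long-lived shared signal does not accumulate handlers.
    settle(): void {
      if (settled) return;
      settled = true;
      signal.removeEventListener("abort", onAbort);
      signal.removeEventListener("abort", onParentAbort);
    },
  };
}
```

The same function references must be passed to `removeEventListener` as were passed to `addEventListener`; inline arrow functions at the call site would make the removal a silent no-op.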
84
337
  ## Isolation
85
338
 
86
339
  Worker output → artifact file (via `runChildPi` + `writeArtifact`). The dynamic runner
@@ -88,3 +341,63 @@ holds results only in JS variables + `ctx.vars`. Only `ctx.setResult(artifactPat
88
341
  read back into the tool result returned to the main context — mirroring the static
89
342
  workflow `summary.md` contract. The orchestrator's context never holds raw worker
90
343
  output.
344
+
345
+ ## Resume & Checkpoint (round-18 P2-3)
346
+
347
+ Previously, when a dynamic-workflow script crashed (timeout, OOM, agent error) between
348
+ `ctx.agent()` calls, all in-memory state (JS vars, phases, logs, budget) was lost and
349
+ the user had to re-run from scratch. Round-18 adds a durable checkpoint.
350
+
351
+ **How it works:**
352
+
353
+ 1. After **every** `ctx.agent()` call (success or failure), the runner persists a
354
+ checkpoint to `<stateRoot>/dwf-checkpoint.json` (atomic write via
355
+ `atomicWriteJson`). The checkpoint captures `ctx.vars`, the phase list + current
356
+ phase, the log buffer (capped at 1000), `ctx.budget.spent()`, and an `agentCount`.
357
+ 2. On a clean completion (`ctx.setResult()` + script returns normally), the checkpoint
358
+ is **deleted** so a re-run with the same `runId` starts fresh.
359
+ 3. `team action='resume' runId='X'` re-dispatches the run with `runKind='dynamic-workflow'`.
360
+ The runner detects the checkpoint, emits a `dwf.resumed` event, and **hydrates**
361
+ `ctx.vars`/phases/logs/spent/agentCount from it before re-executing the script.
362
+
363
+ A missing or corrupt checkpoint is treated as a fresh run — resuming is always safe.
364
+
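The save / load / clear cycle above can be sketched with Node's `fs` module. The `Checkpoint` field names and the three helpers are hypothetical, and write-to-temp-then-rename is an assumed technique: the text names `atomicWriteJson` but does not show how it works.

```typescript
import { mkdtempSync, readFileSync, renameSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical shape; the real checkpoint may carry more or differently named fields.
interface Checkpoint {
  vars: Record<string, unknown>;
  phases: string[];
  currentPhase: string | null;
  logs: string[];
  spent: number;
  agentCount: number;
}

const checkpointPath = (stateRoot: string) => join(stateRoot, "dwf-checkpoint.json");

function saveCheckpoint(stateRoot: string, cp: Checkpoint): void {
  const target = checkpointPath(stateRoot);
  const tmp = `${target}.${process.pid}.tmp`;
  writeFileSync(tmp, JSON.stringify(cp));
  renameSync(tmp, target); // readers see either the old file or the new one, never a partial write
}

function loadCheckpoint(stateRoot: string): Checkpoint | null {
  try {
    const parsed = JSON.parse(readFileSync(checkpointPath(stateRoot), "utf8"));
    const valid = parsed !== null && typeof parsed === "object" && typeof parsed.agentCount === "number";
    return valid ? (parsed as Checkpoint) : null;
  } catch {
    return null; // missing or corrupt: treat as a fresh run
  }
}

function clearCheckpoint(stateRoot: string): void {
  rmSync(checkpointPath(stateRoot), { force: true }); // no error if already absent
}

// Usage: round-trip through a throwaway directory.
const demoRoot = mkdtempSync(join(tmpdir(), "dwf-demo-"));
saveCheckpoint(demoRoot, {
  vars: { lastPhase: "scan" },
  phases: ["Scan"],
  currentPhase: "Scan",
  logs: [],
  spent: 1200,
  agentCount: 1,
});
const restored = loadCheckpoint(demoRoot);
```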
365
+ ### Writing defensive (resumable) scripts
366
+
367
+ Because the script re-runs **from the top** on resume (not from the crash point), you
368
+ should write it defensively: record progress in `ctx.vars` after each agent call and
369
+ skip work that already completed.
370
+
371
+ ```ts
372
+ export default async function run(ctx) {
373
+ // Phase 1 — scan (skipped on resume if it already ran)
374
+ if (ctx.vars.lastPhase !== "scan") {
375
+ const res = await ctx.agent({ role: "explorer", prompt: "scan the repo" });
376
+ ctx.vars.scanResult = res.text; // checkpointed after this ctx.agent() call
377
+ ctx.vars.lastPhase = "scan"; // marker for the defensive guard
378
+ }
379
+
380
+ // Phase 2 — analyze (re-uses the resumed/hydrated scanResult)
381
+ const analysis = await ctx.agent({
382
+ role: "analyst",
383
+ prompt: `Analyze: ${ctx.vars.scanResult ?? ""}`,
384
+ });
385
+ ctx.vars.lastPhase = "analyze";
386
+
387
+ ctx.setResult(analysis.artifactPath ?? "");
388
+ }
389
+ ```
390
+
391
+ Key constraints:
392
+
393
+ - **No partial-resume of an agent**: if the crash happens *mid-agent*, that agent
394
+ re-runs from scratch on resume. Agent results should be idempotent-ish.
395
+ - **Checkpoint AFTER the agent completes** (never before), so a failed/incomplete
396
+ agent call is never persisted as “done.”
397
+ - **Capped state**: logs are capped at 1000, phases at 100 — the checkpoint does not
398
+ grow unbounded.
399
+ - **Backward compatible**: fresh runs (no checkpoint) behave exactly as before; the
400
+ checkpoint file only appears when an agent call has run and the run hasn't completed.
401
+
402
+ See `src/runtime/dwf-state-store.ts` (`DwfStore`) and the runner wiring in
403
+ `src/runtime/dynamic-workflow-runner.ts`.
@@ -0,0 +1,219 @@
1
+ # Fix Plan — HB-003a: `ctx.agent({disableTools:true})` returns `exit null`
2
+
3
+ > **Status:** PROPOSED (planning only — not yet implemented)
4
+ > **Discovered:** 2026-06-24 real-world smoke testing (see `CHANGELOG.md` "Known issues")
5
+ > **Severity:** medium (blocks the `disableTools:true` verdict-judge pattern; workaround exists)
6
+ > **Owner:** TBD
7
+ > **Related:** HB-004 (smoke-test harness), commits `c55d3e2` + `ab481e6` (sibling bugs already fixed)
8
+
9
+ ## 1. Problem statement (confirmed evidence)
10
+
11
+ `ctx.agent({disableTools: true})` (and the equivalent direct `runChildPi({agent:{disableTools:true}})`)
12
+ returns `exitCode: null` (process killed by signal) instead of `0`, **only when the
13
+ calling process exits promptly after the promise resolves.** If the caller stays alive
14
+ ~10s after `runChildPi` resolves, `exitCode` comes back `0` correctly.
15
+
16
+ ### Repro matrix (all verified 2026-06-24)
17
+
18
+ | Scenario | disableTools | Caller keep-alive | Result |
19
+ |---|---|---|---|
20
+ | `pi --no-tools ...` standalone | yes | n/a | ✅ exit 0, correct answer |
21
+ | `runChildPi` + keep-alive 10s | yes | yes | ✅ exitCode=0, finalDrain=true |
22
+ | `runChildPi` + exit immediately | yes | no | ❌ exitCode=null |
23
+ | `runChildPi` (has tools) | no | either | ✅ exitCode=0 |
24
+ | DWF `ctx.agent({disableTools:true})` | yes | (workflow) | ❌ exit null |
25
+
26
+ ### What the lifecycle events show (failing case)
27
+
28
+ ```
29
+ spawned → exit(code=null) → close(code=null)
30
+ ```
31
+
32
+ **Notably ABSENT:** `final_drain`, `hard_kill`, `response_timeout` lifecycle events.
33
+ So the signal did NOT come from `child-pi.ts`'s own timers. `stdout` *does* contain
34
+ the model's answer — pi produced output, then died via signal.
35
+
36
+ ## 2. Root-cause hypotheses (to confirm in Phase 0)
37
+
38
+ ### ✅ PHASE 0 COMPLETE — root cause confirmed (2026-06-24)
39
+
40
+ **Root cause: erroneous steer-backpressure kill at `child-pi.ts:716-726`** (NOT the
41
+ final-drain race hypothesised in H1).
42
+
43
+ When `maxTurns` is reached on a `turn_end` event, the code injects a "wrap up"
44
+ steer by writing to `child.stdin`. Node's `writable.write()` returns `false` when
45
+ the internal buffer is above the high-water mark (normal backpressure) OR when
46
+ the stream is draining. The current code treats **any** `false` return as a
47
+ fatal injection failure and calls `killProcessTree(child.pid, child)` → SIGTERM.
48
+
49
+ This fires intermittently (3 of 5 runs in the repro below) for the `ctx.agent({maxTurns:1, disableTools:true})`
50
+ pattern (and the smoke-test repro): with `--no-tools`, pi finishes in exactly one
51
+ real turn, so `turn_end` arrives the instant the answer is ready; pi has nothing
52
+ more to read from stdin, the write returns `false`, and the worker is killed mid-
53
+ answer. The answer IS in stdout, but exit comes back `null` (SIGTERM).
54
+
55
+ **Repro confirmed via Phase-0 instrumentation** (`PI_TEAMS_DEBUG=1`):
56
+ ```
57
+ [pi-crew:child-pi.kill-process-tree-invoked] pid=783270 called from:
58
+ at killProcessTree (src/runtime/child-pi.ts:102:23)
59
+ at Object.onJsonEvent (src/runtime/child-pi.ts:731:11) ← steer-backpressure kill
60
+ ```
61
+ maxTurns=1 × 5 runs: 3/5 exit=null (flaky, depends on OS buffer state).
62
+ maxTurns=5 × 5 runs: 5/5 exit=0 (soft limit not hit on turn 1).
63
+
64
+ **The `disableTools` correlation was a red herring** — the real trigger is
65
+ `maxTurns:1` (the smoke workflow happened to combine both). Any single-turn
66
+ agent call hitting `maxTurns` on its first `turn_end` can reproduce this.
67
+
68
+ ### Original hypotheses (kept for the audit trail)
69
+
70
+ The killer was initially unidentified because the `signal` arg of the `exit`
71
+ event was discarded. The leading hypotheses, in priority order:
72
+
73
+ - **H1 (DISPROVEN): final-drain timer race.** Instrumentation showed
74
+ `forcedFinalDrain=false` on failing runs — the final-drain timer was armed but
75
+ never fired. The SIGTERM came from elsewhere.
76
+ - **H2 (DISPROVEN): pi self-terminates via signal.** stdout contains the answer;
77
+ the kill-process-tree caller stack points squarely at the steer-injection path.
78
+ - **H3 (already ruled out): external signal / parent-guard.** `startParentGuard`
79
+ is never invoked in `src/`; abort would set `cancelled:true` (it stayed false).
80
+
81
+ ## 3. Phased plan
82
+
83
+ ### Phase 0 — Diagnostic (READ-ONLY, no behavior change) [~0.5 day]
84
+
85
+ Goal: identify the exact signal and the code path that sent it. **No fix yet.**
86
+
87
+ 1. **Capture the signal.** In `child.on("exit", (code, signal) => ...)`, add `signal`
88
+ to the `exitStatus` record and to the `exit` lifecycle event payload. Also capture
89
+ `forcedFinalDrain`, `hardKilled`, `finalDrainTimer` truthiness, and a timestamp
90
+ relative to spawn. This is the single highest-value change — it turns the bug from
91
+ "exit null, cause unknown" into "exit null via SIGTERM at T+Xms while timer armed".
92
+ 2. **Extend the focused repro workflow.** `.crew/workflows/debug/dwf-disabletools.dwf.ts`
93
+ already exists; extend it to log `exitStatus` (signal, forcedFinalDrain, timing).
94
+ Re-run until the captured data distinguishes H1 vs H2.
95
+ 3. **Confirm H1 with a log-only check:** temporarily log the wall-clock time of
96
+ `forcedFinalDrain = true` vs the `exit` event. If `exit` precedes the assignment,
97
+ H1 is confirmed.
98
+ 4. **Exit Phase 0** with a written finding (append to this file's §2): which signal,
99
+ which path, deterministic repro steps.
100
+
101
+ **Deliverable:** amended §2 with the confirmed root cause; no commit to `main`
102
+ beyond the read-only instrumentation (kept behind a debug flag or reverted).
103
+
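Step 1 relies on the second argument of the child process `exit` event, which Node sets to the signal name when the process was terminated by a signal (and `code` is then `null`). A minimal sketch, assuming a POSIX host: `runAndCapture` and `ExitStatus` are hypothetical names, and the child here is a plain `node -e` script rather than `pi`.

```typescript
import { spawn } from "node:child_process";

// Hypothetical record; the plan's exitStatus would carry more fields.
interface ExitStatus {
  code: number | null;
  signal: string | null;
  elapsedMs: number; // monotonic time since spawn was requested
}

function runAndCapture(script: string, killOnSpawn: boolean): Promise<ExitStatus> {
  return new Promise((resolve, reject) => {
    const startedAt = performance.now();
    const child = spawn(process.execPath, ["-e", script], { stdio: "ignore" });
    child.once("error", reject);
    child.once("spawn", () => {
      if (killOnSpawn) child.kill("SIGTERM"); // simulate the unexpected kill
    });
    // Keeping `signal` is the whole point: it distinguishes "killed" from "crashed".
    child.once("exit", (code, signal) => {
      resolve({ code, signal, elapsedMs: performance.now() - startedAt });
    });
  });
}
```

On Windows the signal semantics differ, so the `signal` value observed there should not be assumed to match the POSIX behavior.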
104
+ ### Phase 1 — Fix [~0.5 day]
105
+
106
+ **Confirmed fix: stop killing the worker on a normal backpressure `write() === false`.**
107
+
108
+ At `child-pi.ts:723-726`:
109
+ ```ts
110
+ const writeSucceeded = child.stdin.write(steerPayload);
111
+ if (!writeSucceeded) {
112
+ logInternalError("child-pi.steer-backpressure", ...);
113
+ steerInjectionFailed = true;
114
+ killProcessTree(child.pid, child); // ← BUG: backpressure is not fatal
115
+ }
116
+ ```
117
+
118
+ `Writable.write()` returning `false` is **normal backpressure** — Node buffers the
119
+ write and emits `'drain'` later. It does NOT mean the write failed. Killing the
120
+ worker on it destroys a perfectly good answer (stdout already has it). The
121
+ original intent was to handle a genuinely unwritable stdin (the `else` branch at
122
+ line 727 logs `steer-not-writable` and ALSO kills — that one is more defensible
123
+ but still too aggressive).
124
+
125
+ **Proposed change:** keep the steer-injection best-effort. On `write() === false`,
126
+ simply wait for `'drain'` (or do nothing — the soft-limit steer is advisory). If
127
+ the worker ignores it and runs past `maxTurns + graceTurns`, the existing hard-
128
+ abort at line 735 (`turnCount >= maxTurns + graceTurns`) already terminates it.
129
+
130
+ ```ts
131
+ const writeSucceeded = child.stdin.write(steerPayload);
132
+ if (!writeSucceeded) {
133
+ // Backpressure: Node buffered the write and will flush on 'drain'. This is
134
+ // NOT a failure — do NOT kill the worker. The steer is advisory; if the worker
135
+ // keeps running, the hard-abort at maxTurns + graceTurns (line ~735) handles it.
136
+ logInternalError("child-pi.steer-backpressure", new Error("stdin write returned false (normal backpressure); steer buffered, worker NOT killed"), `pid=${child.pid}`);
137
+ }
138
+ ```
139
+
140
+ Keep the `else` branch (stdin not writable at all) as-is for now, but downgrade
141
+ it too in a follow-up — a closed stdin after the worker is done is also not fatal.
142
+
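The claim that `write() === false` does not mean the data was lost can be checked in isolation. The sketch below uses a plain `Writable` with a deliberately tiny high-water mark instead of a real child `stdin`; `demoBackpressure` and the payload string are placeholders for illustration.

```typescript
import { Writable } from "node:stream";

// No child process involved: a sink with highWaterMark 1 forces backpressure on any write.
function demoBackpressure(payload: string): { writeReturned: boolean; delivered: string[] } {
  const delivered: string[] = [];
  const sink = new Writable({
    highWaterMark: 1,
    write(chunk, _encoding, callback) {
      delivered.push(chunk.toString()); // the chunk still reaches the destination
      callback();
    },
  });
  const writeReturned = sink.write(payload);
  return { writeReturned, delivered };
}

const demo = demoBackpressure("wrap up\n");
// demo.writeReturned is false, yet demo.delivered already contains the payload.
```

This is why the return value is only a flow-control hint: a caller that wants to respect it should pause and wait for `'drain'`, not treat the write as failed.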
143
+ **Verification gate:** the repro matrix in §1 must go all-green with `exitCode=0`
144
+ and the answer present in stdout, run **10× consecutively** (the bug is flaky at
145
+ ~60%, so a single pass is insufficient). No regression in the existing
146
+ `test/unit/child-pi-*.test.ts` suites (5 files, ~85 tests). Add a unit test that
147
+ fakes a `child.stdin` whose `write()` returns `false` and asserts the worker is
148
+ NOT killed and the buffered write eventually flushes.
149
+
150
+ ### Phase 2 — Regression prevention (HB-004) [~1 day]
151
+
152
+ Land the smoke-test harness proposed in `HB-004` so this class of bug is caught by
153
+ CI, not only by live runs. Gate behind `PI_CREW_SMOKE=1` (token cost). Minimum:
154
+ one workflow per feature family that actually shells out to real `pi`
155
+ (`agent` plain, `agent`+schema, `agent`+disableTools, `pipeline`, `phase`/`log`).
156
+ Add a CI job (manual-dispatch workflow) that runs the smoke suite on
157
+ ubuntu/windows/macos × Node 22.
158
+
159
+ ## 4. Files touched (estimate)
160
+
161
+ | Phase | File | Change |
162
+ |---|---|---|
163
+ | 0 | `src/runtime/child-pi.ts` | capture `signal` + timing in `exit`/`exitStatus` (log-only) |
164
+ | 0 | `.crew/workflows/debug/dwf-disabletools.dwf.ts` | extend logging |
165
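+ | 1 | `src/runtime/child-pi.ts` | stop killing the worker when the steer `stdin.write()` returns `false` (log only; backpressure is not fatal) |
166
+ | 1 | `test/unit/child-pi-*.test.ts` | add unit test: fake `child.stdin` whose `write()` returns `false`; assert the worker is NOT killed and the buffered write flushes |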
167
+ | 2 | `test/smoke/*.dwf.ts` (new) | HB-004 harness |
168
+ | 2 | `.github/workflows/smoke.yml` (new) | manual-dispatch smoke CI |
169
+ | 1 | `CHANGELOG.md`, `docs/troubleshooting.md` | move "Known issues" entry to "Fixed"; remove workaround note |
170
+
171
+ ## 5. Test plan
172
+
173
+ - **Phase 0:** before/after instrumentation output showing the captured signal.
174
+ - **Phase 1 unit:** a unit test that fakes a `child.stdin` whose `write()` returns
175
+ `false` during steer injection, asserting that the worker is NOT killed, the buffered write flushes,
176
+ and that stdout content is preserved. This is the regression guard for the steer-backpressure kill.
177
+ - **Phase 1 real-binary (manual):** re-run the §1 repro matrix; all rows ✅.
178
+ - **Phase 1 regression:** `npm run test:unit` + `npm run test:integration` green;
179
+ typecheck + lazy-imports clean; TABS.
180
+ - **Phase 2 CI:** smoke workflow green on all 3 OSes.
181
+
182
+ ## 6. Risk analysis
183
+
184
+ | Risk | Likelihood | Mitigation |
185
+ |---|---|---|
186
+ | Fix A hides a *real* crash as a clean exit | medium | Telemetry log (`final-drain-zero-exit` style) on the override; only override when stdout is non-empty AND finalDrain timer was armed. Never override `responseTimeoutHit` or `abortRequested` paths. |
187
+ | Race fix changes behavior for the common (has-tools) path | low | The `finalDrainArmed` condition only adds to the existing `forcedFinalDrain` branch; has-tools path already sets `forcedFinalDrain=true` normally. Unit test covers both. |
188
+ | Cross-platform signal differences (Windows has no signals) | medium | Windows already uses `taskkill`/`undefined` signal semantics; Fix A keys off `exitCode === null` which is platform-consistent for signal/force-kill death. Verify on Windows CI. |
189
+ | Phase 0 instrumentation itself changes timing | low | Keep it log-only; use monotonic `performance.now()`; revert before merge if it perturbs the race. |
190
+
191
+ ## 7. Out of scope
192
+
193
+ - P2-2 VM sandbox / isolated-vm (separate, v1.5).
194
+ - Refactoring the 919-line `child-pi.ts` (tempting but out of scope; surgical fix only).
195
+ - Changing the final-drain timeout constants (`FINAL_DRAIN_MS=5s`, `HARD_KILL_MS=3s`).
196
+ - The `runKind:'goal-loop'` foreground-dispatch note from smoke testing (separate item).
197
+
198
+ ## 8. Open questions for Phase 0 — ANSWERED (2026-06-24)
199
+
200
+ All three resolved by the Phase-0 root-cause finding (steer-backpressure kill,
201
+ NOT the final-drain race). Kept for the audit trail.
202
+
203
+ 1. **ANSWERED: not the final-drain timer.** The keep-alive case returned `exitCode=0`
204
+ by coincidence — the SIGTERM came from `killProcessTree` on the steer-injection
205
+ path (`child-pi.ts:731`), not the final-drain timer (`forcedFinalDrain=false`
206
+ on failing runs). Keep-alive merely changed the OS-buffer state so the stdin
207
+ `write()` happened to return `true`. Red herring.
208
+ 2. **ANSWERED: yes — the steer-injection path (`child-pi.ts:716-726`).** Stack
209
+ capture under `PI_TEAMS_DEBUG=1` proved `killProcessTree` was invoked from
210
+ `onJsonEvent` on a `turn_end` where `maxTurns` was reached and `stdin.write()`
211
+ returned `false` (normal backpressure, mis-treated as fatal).
212
+ 3. **ANSWERED: no extension self-termination.** The kill stack is entirely
213
+ inside `child-pi.ts`; the prompt-runtime extension is not on the stack.
214
+
215
+ ---
216
+
217
+ **Recommendation:** execute Phase 0 first (cheap, read-only, removes all guesswork),
218
+ then pick Fix A or B based on the finding. Do NOT implement Phase 1 blind — the bug
219
+ is in core runtime and a wrong fix could mask real crashes across every agent call.
@@ -155,3 +155,79 @@ code + a help hint inline. Common ones:
155
155
  - `team action='summary' runId=…` — includes common failure-pattern detection
156
156
  ("4 of 5 failures share 2 root causes").
157
157
  - `team action='events' runId=…` — full event timeline for forensics.
158
+
159
+ ## Stuck / orphaned sub-agent processes ("zombies")
160
+
161
+ A pi-crew sub-agent whose parent crashed may linger as an orphaned process.
162
+ **Do NOT kill `pi` processes by eye** (uptime/RSS heuristics will match your
163
+ own interactive main session — that is unrecoverable). Use the safe scanner:
164
+
165
+ ```
166
+ team action='doctor' focus='zombies'
167
+ ```
168
+
169
+ This is **read-only**. It matches ONLY processes carrying the authoritative
170
+ `PI_CREW_KIND=subagent` env marker (set by every child-pi spawn) whose
171
+ `PI_CREW_PARENT_PID` is no longer alive. Your main session never carries the
172
+ marker, so it can never appear in the list. (The marker is an env var, not an
173
+ argv flag — pi's strict option parser rejects unknown flags, so we can't use
174
+ a `--crew-subagent` CLI flag.)
175
+
176
+ To kill a confirmed zombie: `kill <PID>` (the OS reaps it). The scanner never
177
+ kills on your behalf.
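
The matching rule described above can be sketched as a pure filter. This is an illustration only, not the code in `src/runtime/zombie-scanner.ts`: the `ProcInfo` shape and the `findZombies` / `isAlive` names are assumptions, and how each process's environment is read is left out.

```typescript
// Hypothetical process snapshot; reading another process's env is out of scope here.
interface ProcInfo {
  pid: number;
  env: Record<string, string | undefined>;
}

// process.kill(pid, 0) sends no signal; it only checks whether the pid exists.
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Read-only: returns candidates and never kills anything.
function findZombies(
  procs: ProcInfo[],
  alive: (pid: number) => boolean = isAlive,
): ProcInfo[] {
  return procs.filter((p) => {
    // Unmarked processes (such as the main session) are never candidates.
    if (p.env.PI_CREW_KIND !== "subagent") return false;
    const parent = Number(p.env.PI_CREW_PARENT_PID);
    // Missing or invalid parent pid: do not guess.
    if (!Number.isInteger(parent) || parent <= 0) return false;
    return !alive(parent);
  });
}
```

Because the filter requires the marker before it looks at anything else, a process without `PI_CREW_KIND=subagent` cannot be reported regardless of its uptime or memory use.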
178
+
179
+ ### Why the marker exists
180
+
181
+ Before `PI_CREW_KIND`, a heuristic zombie "cleanup" killed a live main session
182
+ by accident. The marker makes sub-agent identity authoritative rather than
183
+ guessed. See `src/runtime/zombie-scanner.ts` and `.crew/knowledge.md`.
184
+
185
+ ## `ctx.agent({disableTools: true})` — historical `exit null` (FIXED)
186
+
187
+ Previously, `ctx.agent({disableTools: true, maxTurns: 1})` could return
188
+ `exit null` because the steer-injection code mis-treated normal Node stdin
189
+ backpressure (`write() === false`) as a fatal failure and called `killProcessTree` on
190
+ the worker mid-answer. **Fixed**: steer injection is now advisory — a
191
+ backpressure return or non-writable stdin is logged, not fatal; the
192
+ hard-abort at `maxTurns + graceTurns` remains the safety net for genuine
193
+ runaways. The `disableTools` correlation was a red herring — the real trigger
194
+ was the `maxTurns: 1` limit being reached on the first turn. See CHANGELOG "Real-world smoke
195
+ testing findings" and `test/unit/child-pi-steer-backpressure.test.ts`.
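
The "advisory" behavior can be illustrated with Node's `Writable` API, where a `false` return from `write()` means the internal buffer is above `highWaterMark`, not that the write failed. The sketch below is a simplified assumption of what such a write path could look like; `injectSteer` and `SteerResult` are hypothetical names, not identifiers from `child-pi.ts`.

```typescript
import { Writable } from "node:stream";

type SteerResult = "sent" | "backpressure" | "not-writable";

// Advisory steer write: logs instead of killing the worker.
function injectSteer(
  stdin: Writable,
  message: string,
  log: (msg: string) => void,
): SteerResult {
  if (!stdin.writable) {
    log("steer skipped: stdin not writable");
    return "not-writable";
  }
  // A false return only signals backpressure; the chunk is still queued
  // and will be flushed once the stream drains.
  const belowHighWaterMark = stdin.write(message + "\n");
  if (!belowHighWaterMark) {
    log("steer queued under backpressure");
    return "backpressure";
  }
  return "sent";
}
```

Neither non-"sent" outcome terminates the child here; in the design described above, only the hard-abort at `maxTurns + graceTurns` stops a genuinely runaway worker.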
196
+
197
+ ## Running the real-binary smoke suite (HB-004)
198
+
199
+ The default `npm test` mocks child-pi (`PI_TEAMS_MOCK_CHILD_PI`), so it cannot
200
+ catch bugs that only manifest against the real `pi` binary. The smoke suite
201
+ shells out to the real `pi` binary and makes real LLM calls, so it bills tokens and is gated
202
+ behind `PI_CREW_SMOKE=1`.
203
+
204
+ ### Run locally
205
+
206
+ ```bash
207
+ # All smoke tests (~5 tests, ~1 min, bills tokens):
208
+ PI_CREW_SMOKE=1 npm run test:smoke
209
+
210
+ # One smoke test in isolation:
211
+ PI_CREW_SMOKE=1 npx tsx --test test/smoke/agent-disabletools.smoke.ts
212
+ ```
213
+
214
+ Smoke tests live in `test/smoke/*.smoke.ts` and are NOT picked up by the default
215
+ `npm test` glob (`test/unit/*` + `test/integration/*`). Each test self-skips
216
+ unless `PI_CREW_SMOKE=1`.
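
One way to implement the self-skip gate is with the `skip` option of Node's built-in test runner. This is a hedged sketch, not the actual contents of any `test/smoke/*.smoke.ts` file; the `smokeSkipReason` helper and the test name are assumptions for illustration.

```typescript
import { test } from "node:test";

// Returns a skip reason, or false when the smoke suite is enabled.
function smokeSkipReason(
  env: Record<string, string | undefined> = process.env,
): string | false {
  return env.PI_CREW_SMOKE === "1"
    ? false
    : "set PI_CREW_SMOKE=1 to run real-binary smoke tests";
}

// With the gate unset, the runner reports this test as skipped and never runs the body.
test("agent baseline (smoke)", { skip: smokeSkipReason() }, async () => {
  // A real smoke test would spawn pi and make a billed LLM call here.
});
```

Gating inside each test file, rather than only in the npm script, means a direct `npx tsx --test` invocation without the variable still cannot bill tokens by accident.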
217
+
218
+ ### What each covers
219
+
220
+ | File | Feature family | Catches |
221
+ |---|---|---|
222
+ | `argv-flags.smoke.ts` | buildPiWorkerArgs argv | unknown-flag rejection (e.g. `--crew-subagent`) |
223
+ | `agent-plain.smoke.ts` | ctx.agent() baseline | spawn-path breakage |
224
+ | `agent-schema.smoke.ts` | ctx.agent({schema, systemPrompt}) | persona-leak / schema-validation failures |
225
+ | `agent-disabletools.smoke.ts` | ctx.agent({disableTools, maxTurns:1}) ×5 | HB-003a steer-backpressure exit-null (flaky → 5×) |
226
+ | `dwf-workflow.smoke.ts` | full DWF end-to-end | phase/log/args/budget/pipeline/agent/setResult integration |
227
+
228
+ ### Run in CI (manual dispatch)
229
+
230
+ GitHub Actions → "Smoke (real-binary, manual)" → Run workflow → pick OS.
231
+ Requires the `PI_AUTH_JSON` repo secret (the contents of `~/.pi/agent/auth.json`)
232
+ so the spawned `pi` can authenticate with the model provider. If unset, the
233
+ LLM-calling smoke tests fail with a clear auth error.