npm - pi-crew - Versions diffs - 0.9.5 → 0.9.8 - Mend

pi-crew 0.9.5 → 0.9.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (37)

package/CHANGELOG.md +556 -0
package/README.md +10 -3
package/docs/HARNESS_BACKLOG.md +51 -3
package/docs/dynamic-workflows.md +315 -2
package/docs/fix-plan-disabletools-exit-null.md +219 -0
package/docs/troubleshooting.md +76 -0
package/package.json +10 -3
package/src/config/defaults.ts +8 -4
package/src/extension/team-tool/doctor.ts +14 -0
package/src/extension/team-tool/run.ts +2 -0
package/src/runtime/background-runner.ts +1 -1
package/src/runtime/capability-inventory.ts +20 -1
package/src/runtime/child-pi.ts +109 -11
package/src/runtime/deterministic-ast.ts +161 -0
package/src/runtime/dwf-state-store.ts +97 -0
package/src/runtime/dynamic-workflow-context.ts +381 -7
package/src/runtime/dynamic-workflow-runner.ts +93 -2
package/src/runtime/pi-args.ts +11 -0
package/src/runtime/result-extractor.ts +72 -7
package/src/runtime/task-output-context.ts +25 -9
package/src/runtime/team-runner.ts +8 -3
package/src/runtime/zombie-scanner.ts +297 -0
package/src/schema/team-tool-schema.ts +28 -0
package/src/skills/discover-skills.ts +61 -8
package/src/skills/validate.ts +267 -0
package/src/state/contracts.ts +1 -0
package/src/state/state-store.ts +3 -0
package/src/state/types.ts +9 -0
package/src/ui/dashboard-panes/progress-pane.ts +5 -0
package/src/ui/dwf-phase-display.ts +151 -0
package/src/ui/keybinding-map.ts +128 -41
package/src/ui/run-event-bus.ts +83 -0
package/src/ui/run-snapshot-cache.ts +4 -0
package/src/ui/snapshot-types.ts +3 -0
package/src/workflows/workflow-config.ts +3 -0
package/src/worktree/worktree-manager.ts +94 -0
package/types/dwf.d.ts +187 -0

package/docs/HARNESS_BACKLOG.md CHANGED

@@ -14,7 +14,13 @@ Use when an agent discovers a missing harness capability but should not change t
 **Risk**: normal
-**Status**: proposed
+**Status**: ✅ PARTIALLY DONE (2026-06-24). The bulk of HB-001 was already
+covered by 21 existing `test/integration/` files (team-runner path via
+`mock-child-run`, `full-feature-smoke`, `phase3-6-*`). The genuine remaining
+gap — interleaved manifest+task+event writes reloaded consistently (the
+realistic run-load pattern) — is now covered by
+`test/integration/state-durability-hb001.test.ts`. Child-process exit →
+state-store reconcile is covered by `async-restart-recovery.test.ts`.
 ### HB-002: Windows-specific test coverage
@@ -26,7 +32,12 @@ Use when an agent discovers a missing harness capability but should not change t
 **Risk**: normal
-**Status**: proposed
+**Status**: ✅ DONE (2026-06-24). `test/platform/` ships with two files:
+`windows-rename.test.ts` (EBUSY/EPERM rename retry path via `renameWithRetry`,
+self-skips off win32) and `posix-tools.test.ts` (BSD-vs-GNU grep, /var →
+/private/var realpath, POSIX-shell resolution — self-skips on win32).
+Runbook in `test/platform/README.md`. The CI OS matrix (ubuntu/windows/macos)
+exercises each platform's tests.
 ### HB-003: Performance regression baseline
@@ -38,4 +49,41 @@ Use when an agent discovers a missing harness capability but should not change t
 **Risk**: tiny
-**Status**: proposed
+**Status**: ✅ DONE (2026-06-24). `test/bench/` now has 6 benchmarks:
+the pre-existing `register-startup`, `render-flush`, `snapshot-cache`, plus
+three new ones covering the gaps HB-003 flagged — `atomic-write.bench.ts`
+(`atomicWriteJson` cold/warm), `event-append.bench.ts` (serial lock
+contention vs batch), `task-graph-scheduler.bench.ts` (DAG build/refresh/
+full-run). All run via `npm run bench` → `test/bench/results.json`; baseline
+via `npm run bench:capture`. Each prints min/p50/p95/p99/max percentiles.
+### HB-004: Real-binary smoke tests for ctx.agent() paths
+**Discovered while**: Real-world `team action='run'` smoke testing on 2026-06-24
+caught three bugs that the unit suite (which mocks child-pi) missed entirely.
+**Current pain**: The unit tests for `dynamic-workflow-context.ts` and
+`child-pi.ts` use `PI_TEAMS_MOCK_CHILD_PI` and never shell out to the real `pi`
+binary. As a result they cannot catch:
+  - argv flags the real `pi` rejects (e.g. the `--crew-subagent` regression),
+  - env/persona interactions that change real model output (e.g. the
+    schema+systemPrompt drop),
+  - exit-code races in the real spawn lifecycle (e.g. the
+    `disableTools:true` → `exit null` race).
+**Suggested improvement**: Add `test/smoke/` (gated behind a `PI_CREW_SMOKE=1`
+env so CI doesn't bill tokens by default) that runs real `.dwf.ts` workflows
+end-to-end via `team action='run'` and asserts on the resulting
+`events.jsonl` + `summary.md`. One workflow per feature family
+(phase/log/pipeline/agent/schema/worktree). Document the runbook in
+`docs/troubleshooting.md`.
+**Risk**: normal (token cost when run; otherwise read-only)
+**Status**: ✅ DONE (2026-06-24). `test/smoke/` shipped with 5 smoke tests
+(argv-flags, agent-plain, agent-schema, agent-disabletools, dwf-workflow),
+all gated behind `PI_CREW_SMOKE=1`. `npm run test:smoke` runs them. CI
+manual-dispatch workflow at `.github/workflows/smoke.yml` (requires
+`PI_AUTH_JSON` secret). Runbook in `docs/troubleshooting.md`. Each smoke test
+maps to a real bug it would have caught (HB-003a, the schema+systemPrompt
+drop, the `--crew-subagent` argv regression).
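
Editorial sketch (not part of the package diff): the HB-003 entry above says each benchmark prints min/p50/p95/p99/max. The snippet below shows one common way to derive such a summary, the nearest-rank percentile. The names `percentile` and `summarize` are hypothetical, and the method actually used by `npm run bench` is not shown in this diff, so treat this as an assumption.

```typescript
// Hypothetical helper: nearest-rank percentile over an ascending-sorted array.
// The package's real bench code may use a different interpolation method.
function percentile(sorted: number[], p: number): number {
  // Smallest rank such that at least p% of the samples are at or below it.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  const value = sorted[rank - 1];
  if (value === undefined) throw new RangeError("no samples");
  return value;
}

interface BenchSummary {
  min: number;
  p50: number;
  p95: number;
  p99: number;
  max: number;
}

// Summarize raw timing samples (any order) into the five reported figures.
function summarize(samples: number[]): BenchSummary {
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort, copy first
  return {
    min: percentile(sorted, 0),
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
    max: percentile(sorted, 100),
  };
}
```

Nearest-rank always returns an observed sample rather than an interpolated value, which keeps small benchmark runs easy to reason about.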

package/docs/dynamic-workflows.md CHANGED

@@ -15,15 +15,22 @@ and intermediate data out of the main context window.
 export default async function (ctx) {
   const endpoints = [/* ... */];
   const shards = chunk(endpoints, 3);
+  ctx.phase("Scan");  // round-12: mark the start of a logical phase
   const reports = await ctx.fanOut(shards, 3, (s) =>
     ctx.agent({ role: "explorer", prompt: `Audit ${s.join(",")} for auth + input validation` })
   );
+  ctx.phase("Synthesize");
   const synth = await ctx.agent({ role: "analyst", prompt: "Merge + dedupe findings", inputs: reports.map(r => r.artifactPath) });
+  ctx.phase("Review");
   for (let i = 0; i < 3; i++) {
     const review = await ctx.review(synth.taskId, "reviewer");
     if (review.outcome === "accept") break;
     await ctx.retry(synth.taskId, { feedback: review.feedback });
   }
   ctx.setResult(synth.artifactPath, { summary: "security audit complete" });
 }
 ```
@@ -42,20 +49,158 @@ Slash command: `/workflows` lists all workflows (static + dynamic).
 | Method | Purpose |
 |---|---|
-| `ctx.agent({role, prompt, model?, skill?, maxTurns?, inputs?})` | Spawn one agent, await `{ok, text, structured, artifactPath, usage}`. Concurrency enforced by `ctx.semaphore`. |
+| `ctx.agent({role, prompt, model?, skill?, maxTurns?, inputs?, schema?, worktree?})` | Spawn one agent, await `{ok, text, structured, artifactPath, usage}`. Concurrency enforced by `ctx.semaphore`. `schema?` (round-13) is a TypeBox schema — when set, output is validated and mismatch yields `ok:false`. `worktree?` (round-17) spawns the agent in an isolated git worktree (default false; falls back to normal cwd + warning in a non-git repo). |
 | `ctx.fanOut(items, limit, fn)` | Bounded parallel fan-out (wraps `mapConcurrent`). |
+| `ctx.pipeline(items, ...stages)` | **round-16.** Multi-stage pipeline: each item passes through all stages sequentially; different items run concurrently (bounded by `ctx.semaphore`). A failed stage yields `null` for that item (logged via `ctx.log`) and other items continue. Aborts propagate. Returns `(TResult\|null)[]`. |
 | `ctx.review(taskId, reviewerRole?)` | Run a reviewer; parse `{outcome, feedback}`. |
 | `ctx.retry(taskId, {feedback?})` | Re-run with feedback (wraps `executeWithRetry`). |
 | `ctx.mail(to, body, opts?)` | Mailbox message to another agent/leader. |
 | `ctx.gatherReplies(ids, deadlineMs)` | Block until N replies arrive or deadline. |
 | `ctx.renderTemplate(name, vars)` | Render a built-in plan template. |
-| `ctx.vars` | Script-local variables. |
+| `ctx.vars` | Script-local variables. (round-18: hydrated from the last checkpoint on resume — see [Resume & Checkpoint](#resume--checkpoint-round-18-p2-3).) |
+| `ctx.phase(title)` | Mark the start of a named workflow phase. Emits `dwf.phase_started` (and `dwf.phase_completed` for the previous phase, if any) to the run's events.jsonl. Idempotent on the same title. Phase events let downstream consumers (UI, log readers) group agents by logical phase. |
+| `ctx.log(message)` | **round-14.** Append a workflow-level log line. Stringifies non-strings, keeps a bounded in-memory copy (capped at 1000), and always emits a `dwf.log` event (`{message}`) to `events.jsonl`. |
+| `ctx.budget` | **round-14.** Frozen `{total, spent(), remaining()}` token-budget surface. `total` is `null` when unbounded (default). `ctx.agent()` auto-rejects with `ok:false` (`"workflow token budget exhausted"`) once exhausted. `spent()` accumulates each agent run's reported usage. Set via `workflow.maxTokenBudget` or the run `tokenBudget` param. |
+| `ctx.args<T>()` | **round-14.** Typed workflow arguments (sourced from `manifest.args`, passed via the run `args` param). Defaults to `{}`. Narrow with a generic: `ctx.args<{target:string}>()`. |
 | `ctx.setResult(artifactPath, meta?)` | Mark the final result. ONLY this reaches the main context. |
 `ctx.agent({role})` resolves the role to an `AgentConfig` via 4-tier precedence:
 explicit `agent` name → `team.roles[].agent` → `discoverAgents` by name → synthesize
 minimal (`source:'dynamic'`).
+### Pipeline (round-16 P2-1)
+`ctx.pipeline(items, ...stages)` runs a **multi-stage transform** over a list of
+items. Unlike `ctx.fanOut()` (a single parallel map), the pipeline chains stages:
+each item flows through **all stages in sequence** (stage 1 → stage 2 → …), while
+**different items run concurrently**, bounded by `ctx.semaphore` (the workflow
+concurrency). Each stage receives `(previous, original, index)` — `previous` is the
+prior stage's output (the raw item for the first stage), `original` is the unchanged
+input item, and `index` is the item position. This mirrors `reduce`, but parallelized
+across items.
+- A stage that **throws** yields `null` for that item, logs `pipeline[i] failed: <msg>`
+  via `ctx.log()`, and the **other items continue**.
+- On **abort**, the error propagates (it is not swallowed into `null`).
+- Returns `(TResult | null)[]` (order-preserving).
+```ts
+// scan → analyze → review each shard, up to `concurrency` shards at a time.
+const verdicts = await ctx.pipeline(
+  shards,
+  (s) => ctx.agent({ role: "scanner", prompt: `scan ${s}` }),
+  (prev) => ctx.agent({ role: "analyst", prompt: `analyze ${prev.text}` }),
+  (prev) => ctx.review(prev.taskId ?? "", "reviewer"),
+);
+// verdicts[i] is null if shard i failed at any stage; others are unaffected.
+```
+`pipeline` uses the same bounded-concurrency primitive as `fanOut` (`mapConcurrent`),
+so item-level parallelism respects the workflow's configured concurrency. Stages that
+spawn agents additionally acquire `ctx.semaphore` for agent-level throttling.
+### Phases (round-12)
+`ctx.phase(title)` lets the script mark logical phases. Each call:
+- Emits a `dwf.phase_started` event with `{phase: title}` to the run's `events.jsonl`.
+- If a previous phase is still open, emits a `dwf.phase_completed` event for it
+  **before** opening the new one (so consumers never see two open phases at once).
+- Is idempotent: calling `ctx.phase("Scan")` twice does not emit a duplicate event.
+- Validates the title (non-empty string, otherwise `TypeError`).
+- Caps the in-memory `phases[]` list at 100 distinct titles (events still flow past
+  the cap; the events log is the durable source of truth).
+- The runner auto-closes the last open phase when the script returns, so
+  `dwf.completed` is always preceded by a matching `dwf.phase_completed`.
+#### Phase UI display (round-15 P1-4)
+The progress pane now **consumes** the `dwf.phase_started` / `dwf.phase_completed`
+events and renders a phase overview with status markers:
+```
+Progress pane: 2/4 completed · running=2 queued=0 failed=0
+  ── DWF Phases ──
+  ✓ Phase: Scan
+  ▶ Phase: Plan
+  ⏸ Phase: Review
+  ...
+```
+- `▶ Phase: <name>` — the currently running phase.
+- `✓ Phase: <name>` — a completed phase.
+- `⏸ Phase: <name>` — a phase whose completion scrolled out of the recent-event
+  window and is not the current one (indeterminate).
+Phase state is derived purely from the tailed `recentEvents` window (no extra
+I/O), so this is **backward compatible**: non-DWF runs (static workflows,
+goal-loops) produce no `dwf.phase_*` events and show no phase markers at all.
+For terminals that mis-render the Unicode glyphs, ASCII fallbacks
+(`[>]`/`[v]`/`[ ]`) are available via `renderDwfPhaseLines(state, { ascii: true })`.
+### Log API (round-14 P1-3)
+`ctx.log(message)` appends a workflow-level log line. It stringifies non-string
+values (`JSON.stringify`), keeps a bounded in-memory copy (capped at **1000**
+entries), and always emits a durable `dwf.log` event (`{message}`) to the run's
+`events.jsonl`. The events log is the source of truth; the in-memory buffer is
+only for convenience/bounded telemetry.
+```ts
+ctx.log("scan complete");
+ctx.log({ findings: 3, warnings: [] }); // stringified to '{"findings":3,"warnings":[]}'
+```
+### Token budget (round-14 P1-2)
+`ctx.budget` is a frozen `{total, spent(), remaining()}` surface. When a
+per-workflow token budget is set, `ctx.agent()` auto-rejects with `ok:false`
+(`"workflow token budget exhausted"`) once exhausted — **before** spawning a
+child worker, so no tokens are wasted past the limit.
+- `total` is `null` (unbounded) by default; `remaining()` is `Infinity` then.
+- `spent()` accumulates each `ctx.agent()` run's reported `usage.input + usage.output`.
+- Set it via the workflow's `maxTokenBudget` field, or the run `tokenBudget` param
+  (the param overrides the workflow value).
+```ts
+if (ctx.budget.total !== null && ctx.budget.remaining() < 500) {
+  ctx.log("approaching budget limit");
+}
+```
+### Typed args (round-14 P1-5)
+`ctx.args<T>()` returns typed workflow arguments (sourced from `manifest.args`,
+passed via the run `args` param). Defaults to `{}` when unset. Narrow with a
+generic so the rest of your script is type-checked:
+```ts
+const { target, retries } = ctx.args<{ target: string; retries: number }>();
+```
+### Authoring types / IDE IntelliSense (round-14 P1-1)
+For TypeScript IntelliSense in `.dwf.ts` scripts, import the authoring types from
+the package's `./workflow` export (`types/dwf.d.ts`):
+```ts
+import type { WorkflowCtx } from "pi-crew/workflow";
+export default async function run(ctx: WorkflowCtx): Promise<void> {
+  ctx.phase("scan");
+  ctx.log("starting");
+  const res = await ctx.agent({ role: "explorer", prompt: "survey" });
+  const { target } = ctx.args<{ target: string }>();
+  ctx.setResult(res.artifactPath ?? "", { target });
+}
+```
+The package self-references via its `exports` map, so this resolves from within
+any project that depends on `pi-crew`. The interfaces mirror the runtime types in
+`src/runtime/dynamic-workflow-context.ts` (authoring-only — no runtime values).
 ## Security model (IMPORTANT)
 `.dwf.ts` files are **postinstall-equivalent trust** — treat them as `node script.js`.
@@ -81,6 +226,114 @@ script can reach `process`/`require` directly or via constructor walking. The
   (e.g. `require('child'+'_process')`, `globalThis.process.mainModule.require`).
   The real boundary is commit-review + the path-allowlist, not the content check.
+## Determinism (round-13 P0-2)
+Dynamic workflow scripts must be **deterministic** — the runner rejects
+`Date.now()`, `Math.random()`, and `new Date()` at workflow-load time so that
+two runs of the same script against the same inputs produce the same outputs.
+The check uses an **AST walk** (not regex) so that:
+- Prompts mentioning `Date.now()` as a string literal are accepted.
+- Comments mentioning `Math.random()` are accepted.
+- `Date.parse()`, `Date.UTC()`, `Math.floor()`, etc. are accepted (only `now`
+  and `random` are blocked).
+- `Date["now"]()` is also blocked — the bracket-property is resolved to the
+  string `"now"` statically before the comparison.
+**Escape hatch:** set `PI_CREW_DWF_SKIP_DETERMINISM_CHECK=1` to bypass the
+check (intended for benchmark scripts that intentionally depend on time or
+randomness). The check is **enabled by default**.
+```ts
+// .crew/workflows/deterministic.dwf.ts
+export default async function (ctx) {
+  // OK: Date.parse and Math.floor are permitted.
+  const ts = Date.parse("2024-01-01");
+  const rounded = Math.floor(3.14);
+  // OK: Date.now() in a string literal.
+  const label = "Date.now() is forbidden at runtime";
+  // REJECTED at load time:
+  // const t = Date.now();
+  // const r = Math.random();
+  // const d = new Date();
+}
+```
+When the check fails, the runner throws a clear error before `jiti` executes
+the script:
+```
+Workflow scripts must be deterministic: Date.now()/Math.random()/new Date() are
+unavailable. These introduce non-reproducible behavior across runs. Use ctx.vars
+for cached state, or pass a fixed seed via ctx.setArgs(). To bypass this check
+(escape hatch), set PI_CREW_DWF_SKIP_DETERMINISM_CHECK=1.
+```
+## Structured output (round-13 P0-3)
+Dynamic workflow scripts can request **typed JSON output** from `ctx.agent()` by
+passing a TypeBox `schema` in the call opts. When set, the runner validates the
+extracted JSON against the schema and returns `ok:false` with a clear error on
+mismatch.
+```ts
+// .crew/workflows/typed-agent.dwf.ts
+import { Type, type Static } from "@sinclair/typebox";
+const ReviewSchema = Type.Object({
+  outcome: Type.Union([
+    Type.Literal("accept"),
+    Type.Literal("reject"),
+    Type.Literal("changes_requested"),
+  ]),
+  feedback: Type.String(),
+});
+type Review = Static<typeof ReviewSchema>;
+export default async function (ctx) {
+  const result = await ctx.agent({
+    role: "reviewer",
+    prompt: "Review the diff and judge.",
+    schema: ReviewSchema, // <-- new round-13 field
+  });
+  if (!result.ok) {
+    // result.error explains what didn't match.
+    ctx.setResult("/tmp/error.md", { error: result.error });
+    return;
+  }
+  const review = result.structured as Review;
+  // review is now type-checked as Review.
+  ctx.setResult("/tmp/review.md", { review });
+}
+```
+Backwards compatibility: when `schema` is **omitted**, behavior is identical to
+the previous regex-based extractor. Existing scripts that don't pass a schema
+continue to work unchanged.
+**How it works:** the runner appends a JSON-output instruction to both the agent's
+system prompt (so it knows the expected shape) and the user prompt (so the
+output directive is the last thing the model reads). After the agent emits its
+final text, the runner validates against the schema using `Value.Check`.
+Validation failure surfaces as `ok:false, error: "structured output does not
+match schema: ..."`.
+## Abort listener cleanup (round-13 P0-5)
+`runChildPi` registers two abort listeners on the parent signal (the `abort`
+handler that cancels the child process and the `onParentAbort` handler that
+sets the internal `abortDueToParentSignal` flag). Both are removed in the
+`settle()` function so they do not leak when many child-pi calls share one
+AbortSignal (the common pattern under `background-runner`).
+The fix was originally landed in round 27 (BUG 4). Round-13's audit confirmed
+the cleanup is correct: both `input.signal?.removeEventListener("abort", ...)`
+calls fire before `settle()` returns, regardless of whether the run completed
+normally, hit a timeout, or was aborted. No code changes were needed.
 ## Isolation
 Worker output → artifact file (via `runChildPi` + `writeArtifact`). The dynamic runner
@@ -88,3 +341,63 @@ holds results only in JS variables + `ctx.vars`. Only `ctx.setResult(artifactPat
 read back into the tool result returned to the main context — mirroring the static
 workflow `summary.md` contract. The orchestrator's context never holds raw worker
 output.
+## Resume & Checkpoint (round-18 P2-3)
+When a dynamic-workflow script crashes (timeout, OOM, agent error) between
+`ctx.agent()` calls, all in-memory state (JS vars, phases, logs, budget) is lost and
+the user previously had to re-run from scratch. Round-18 adds a durable checkpoint.
+**How it works:**
+1. After **every** `ctx.agent()` call (success or failure), the runner persists a
+   checkpoint to `<stateRoot>/dwf-checkpoint.json` (atomic write via
+   `atomicWriteJson`). The checkpoint captures `ctx.vars`, the phase list + current
+   phase, the log buffer (capped at 1000), `ctx.budget.spent()`, and an `agentCount`.
+2. On a clean completion (`ctx.setResult()` + script returns normally), the checkpoint
+   is **deleted** so a re-run with the same `runId` starts fresh.
+3. `team action='resume' runId='X'` re-dispatches the run with `runKind='dynamic-workflow'`.
+   The runner detects the checkpoint, emits a `dwf.resumed` event, and **hydrates**
+   `ctx.vars`/phases/logs/spent/agentCount from it before re-executing the script.
+A missing or corrupt checkpoint is treated as a fresh run — resuming is always safe.
+### Writing defensive (resumable) scripts
+Because the script re-runs **from the top** on resume (not from the crash point), you
+should write it defensively: record progress in `ctx.vars` after each agent call and
+skip work that already completed.
+```ts
+export default async function run(ctx) {
+  // Phase 1 — scan (skipped on resume if it already ran)
+  if (ctx.vars.lastPhase !== "scan") {
+    const res = await ctx.agent({ role: "explorer", prompt: "scan the repo" });
+    ctx.vars.scanResult = res.text;      // checkpointed after this ctx.agent() call
+    ctx.vars.lastPhase = "scan";          // marker for the defensive guard
+  }
+  // Phase 2 — analyze (re-uses the resumed/hydrated scanResult)
+  const analysis = await ctx.agent({
+    role: "analyst",
+    prompt: `Analyze: ${ctx.vars.scanResult ?? ""}`,
+  });
+  ctx.vars.lastPhase = "analyze";
+  ctx.setResult(analysis.artifactPath ?? "");
+}
+```
+Key constraints:
+- **No partial-resume of an agent**: if the crash happens *mid-agent*, that agent
+  re-runs from scratch on resume. Agent results should be idempotent-ish.
+- **Checkpoint AFTER the agent completes** (never before), so a failed/incomplete
+  agent call is never persisted as “done.”
+- **Capped state**: logs are capped at 1000, phases at 100 — the checkpoint does not
+  grow unbounded.
+- **Backward compatible**: fresh runs (no checkpoint) behave exactly as before; the
+  checkpoint file only appears when an agent call has run and the run hasn't completed.
+See `src/runtime/dwf-state-store.ts` (`DwfStore`) and the runner wiring in
+`src/runtime/dynamic-workflow-runner.ts`.
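
Editorial sketch (not part of the package diff): the `ctx.phase()` rules listed in the Phases section above (close the previous phase before opening a new one, stay idempotent on the same title, reject empty titles, cap the in-memory list at 100) can be illustrated with a small tracker. `PhaseTracker`, `PhaseEvent`, and the `emit` callback are hypothetical names, and the sketch assumes "idempotent" means a repeat of the currently open title is a no-op. The real logic lives in `src/runtime/dynamic-workflow-context.ts` and may differ.

```typescript
// Hypothetical event shape; real events.jsonl payloads may carry more fields.
type PhaseEvent = {
  type: "dwf.phase_started" | "dwf.phase_completed";
  phase: string;
};

class PhaseTracker {
  private current: string | null = null;
  private readonly titles: string[] = [];
  private readonly emit: (event: PhaseEvent) => void;
  private readonly cap: number;

  constructor(emit: (event: PhaseEvent) => void, cap: number = 100) {
    this.emit = emit;
    this.cap = cap;
  }

  phase(title: string): void {
    if (typeof title !== "string" || title.length === 0) {
      throw new TypeError("phase title must be a non-empty string");
    }
    // Assumption: re-announcing the open phase emits nothing.
    if (title === this.current) return;
    // Close the previous phase first so two phases are never open at once.
    if (this.current !== null) {
      this.emit({ type: "dwf.phase_completed", phase: this.current });
    }
    this.current = title;
    // The in-memory list is capped; events keep flowing past the cap.
    if (!this.titles.includes(title) && this.titles.length < this.cap) {
      this.titles.push(title);
    }
    this.emit({ type: "dwf.phase_started", phase: title });
  }

  // What a runner could call when the script returns (auto-close).
  close(): void {
    if (this.current === null) return;
    this.emit({ type: "dwf.phase_completed", phase: this.current });
    this.current = null;
  }

  list(): readonly string[] {
    return this.titles;
  }
}
```

Keeping the events as the durable record and the `titles` array as a bounded convenience mirrors the split the document describes between `events.jsonl` and the in-memory `phases[]` list.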

package/docs/fix-plan-disabletools-exit-null.md ADDED

@@ -0,0 +1,219 @@
+# Fix Plan — HB-003a: `ctx.agent({disableTools:true})` returns `exit null`
+> **Status:** PROPOSED (planning only — not yet implemented)
+> **Discovered:** 2026-06-24 real-world smoke testing (see `CHANGELOG.md` "Known issues")
+> **Severity:** medium (blocks the `disableTools:true` verdict-judge pattern; workaround exists)
+> **Owner:** TBD
+> **Related:** HB-004 (smoke-test harness), commits `c55d3e2` + `ab481e6` (sibling bugs already fixed)
+## 1. Problem statement (confirmed evidence)
+`ctx.agent({disableTools: true})` (and the equivalent direct `runChildPi({agent:{disableTools:true}})`)
+returns `exitCode: null` (process killed by signal) instead of `0`, **only when the
+calling process exits promptly after the promise resolves.** If the caller stays alive
+~10s after `runChildPi` resolves, `exitCode` comes back `0` correctly.
+### Repro matrix (all verified 2026-06-24)
+| Scenario | disableTools | Caller keep-alive | Result |
+|---|---|---|---|
+| `pi --no-tools ...` standalone | yes | n/a | ✅ exit 0, correct answer |
+| `runChildPi` + keep-alive 10s | yes | yes | ✅ exitCode=0, finalDrain=true |
+| `runChildPi` + exit immediately | yes | no | ❌ exitCode=null |
+| `runChildPi` (has tools) | no | either | ✅ exitCode=0 |
+| DWF `ctx.agent({disableTools:true})` | yes | (workflow) | ❌ exit null |
+### What the lifecycle events show (failing case)
+```
+spawned → exit(code=null) → close(code=null)
+```
+**Notably ABSENT:** `final_drain`, `hard_kill`, `response_timeout` lifecycle events.
+So the signal did NOT come from `child-pi.ts`'s own timers. `stdout` *does* contain
+the model's answer — pi produced output, then died via signal.
+## 2. Root-cause hypotheses (to confirm in Phase 0)
+### ✅ PHASE 0 COMPLETE — root cause confirmed (2026-06-24)
+**Root cause: erroneous steer-backpressure kill at `child-pi.ts:716-726`** (NOT the
+final-drain race hypothesised in H1).
+When `maxTurns` is reached on a `turn_end` event, the code injects a "wrap up"
+steer by writing to `child.stdin`. Node's `writable.write()` returns `false` when
+the internal buffer is above the high-water mark (normal backpressure) OR when
+the stream is draining. The current code treats **any** `false` return as a
+fatal injection failure and calls `killProcessTree(child.pid, child)` → SIGTERM.
+This can fire for the `ctx.agent({maxTurns:1, disableTools:true})` pattern (and
+the smoke-test repro), though not on every run: with `--no-tools`, pi finishes in
+exactly one real turn, so `turn_end` arrives the instant the answer is ready; pi
+has nothing more to read from stdin, the write can return `false`, and the worker
+is then killed mid-answer. The answer IS in stdout, but exit comes back `null` (SIGTERM).
+**Repro confirmed via Phase-0 instrumentation** (`PI_TEAMS_DEBUG=1`):
+```
+[pi-crew:child-pi.kill-process-tree-invoked] pid=783270 called from:
+    at killProcessTree (src/runtime/child-pi.ts:102:23)
+    at Object.onJsonEvent (src/runtime/child-pi.ts:731:11)   ← steer-backpressure kill
+```
+maxTurns=1 × 5 runs: 3/5 exit=null (flaky, depends on OS buffer state).
+maxTurns=5 × 5 runs: 5/5 exit=0 (soft limit not hit on turn 1).
+**The `disableTools` correlation was a red herring** — the real trigger is
+`maxTurns:1` (the smoke workflow happened to combine both). Any single-turn
+agent call hitting `maxTurns` on its first `turn_end` can reproduce this.
+### Original hypotheses (kept for the audit trail)
+The killer was initially unidentified because the `signal` arg of the `exit`
+event was discarded. The leading hypotheses, in priority order:
+- **H1 (DISPROVEN): final-drain timer race.** Instrumentation showed
+  `forcedFinalDrain=false` on failing runs — the final-drain timer was armed but
+  never fired. The SIGTERM came from elsewhere.
+- **H2 (DISPROVEN): pi self-terminates via signal.** stdout contains the answer;
+  the kill-process-tree caller stack points squarely at the steer-injection path.
+- **H3 (already ruled out): external signal / parent-guard.** `startParentGuard`
+  is never invoked in `src/`; abort would set `cancelled:true` (it stayed false).
+## 3. Phased plan
+### Phase 0 — Diagnostic (READ-ONLY, no behavior change)  [~0.5 day]
+Goal: identify the exact signal and the code path that sent it. **No fix yet.**
+1. **Capture the signal.** In `child.on("exit", (code, signal) => ...)`, add `signal`
+   to the `exitStatus` record and to the `exit` lifecycle event payload. Also capture
+   `forcedFinalDrain`, `hardKilled`, `finalDrainTimer` truthiness, and a timestamp
+   relative to spawn. This is the single highest-value change — it turns the bug from
+   "exit null, cause unknown" into "exit null via SIGTERM at T+Xms while timer armed".
+2. **Extend the focused repro workflow.** `.crew/workflows/debug/dwf-disabletools.dwf.ts`
+   already exists; extend it to log `exitStatus` (signal, forcedFinalDrain, timing).
+   Re-run until the captured data distinguishes H1 from H2.
+3. **Confirm H1 with a log-only check:** temporarily log the wall-clock time of
+   `forcedFinalDrain = true` vs the `exit` event. If `exit` precedes the assignment,
+   H1 is confirmed.
+4. **Exit Phase 0** with a written finding (append to this file's §2): which signal,
+   which path, deterministic repro steps.
+**Deliverable:** amended §2 with the confirmed root cause; no commit to `main`
+beyond the read-only instrumentation (kept behind a debug flag or reverted).
+### Phase 1 — Fix  [~0.5 day]
+**Confirmed fix: stop killing the worker on a normal backpressure `write() === false`.**
+At `child-pi.ts:723-726`:
+```ts
+const writeSucceeded = child.stdin.write(steerPayload);
+if (!writeSucceeded) {
+  logInternalError("child-pi.steer-backpressure", ...);
+  steerInjectionFailed = true;
+  killProcessTree(child.pid, child);   // ← BUG: backpressure is not fatal
+}
+```
+`Writable.write()` returning `false` is **normal backpressure** — Node buffers the
+write and emits `'drain'` later. It does NOT mean the write failed. Killing the
+worker on it destroys a perfectly good answer (stdout already has it). The
+original intent was to handle a genuinely unwritable stdin (the `else` branch at
+line 727 logs `steer-not-writable` and ALSO kills — that one is more defensible
+but still too aggressive).
+**Proposed change:** keep the steer-injection best-effort. On `write() === false`,
+simply wait for `'drain'` (or do nothing — the soft-limit steer is advisory). If
+the worker ignores it and runs past `maxTurns + graceTurns`, the existing hard-
+abort at line 735 (`turnCount >= maxTurns + graceTurns`) already terminates it.
+```ts
+const writeSucceeded = child.stdin.write(steerPayload);
+if (!writeSucceeded) {
+  // Backpressure: Node buffered the write and will flush on 'drain'. This is
+  // NOT a failure — do NOT kill the worker. The steer is advisory; if the worker
+  // keeps running, the hard-abort at maxTurns + graceTurns (line ~735) handles it.
+  logInternalError("child-pi.steer-backpressure", new Error("stdin write returned false (normal backpressure); steer buffered, worker NOT killed"), `pid=${child.pid}`);
+}
+```
+Keep the `else` branch (stdin not writable at all) as-is for now, but downgrade
+it too in a follow-up — a closed stdin after the worker is done is also not fatal.
+**Verification gate:** the repro matrix in §1 must go all-green with `exitCode=0`
+and the answer present in stdout, run **10× consecutively** (the bug is flaky at
+~60%, so a single pass is insufficient). No regression in the existing
+`test/unit/child-pi-*.test.ts` suites (5 files, ~85 tests). Add a unit test that
+fakes a `child.stdin` whose `write()` returns `false` and asserts the worker is
+NOT killed and the buffered write eventually flushes.
+### Phase 2 — Regression prevention (HB-004)  [~1 day]
+Land the smoke-test harness proposed in `HB-004` so this class of bug is caught by
+CI, not only by live runs. Gate behind `PI_CREW_SMOKE=1` (token cost). Minimum:
+one workflow per feature family that actually shells out to real `pi`
+(`agent` plain, `agent`+schema, `agent`+disableTools, `pipeline`, `phase`/`log`).
+Add a CI job (manual-dispatch workflow) that runs the smoke suite on
+ubuntu/windows/macos × Node 22.
+## 4. Files touched (estimate)
+| Phase | File | Change |
+|---|---|---|
+| 0 | `src/runtime/child-pi.ts` | capture `signal` + timing in `exit`/`exitStatus` (log-only) |
+| 0 | `.crew/workflows/debug/dwf-disabletools.dwf.ts` | extend logging |
+| 1 | `src/runtime/child-pi.ts` | Fix A: `finalDrainArmed` + close-handler override |
+| 1 | `test/unit/child-pi-*.test.ts` | add race-simulation unit test (fake child emitting exit before forcedFinalDrain) |
+| 2 | `test/smoke/*.dwf.ts` (new) | HB-004 harness |
+| 2 | `.github/workflows/smoke.yml` (new) | manual-dispatch smoke CI |
+| 1 | `CHANGELOG.md`, `docs/troubleshooting.md` | move "Known issues" entry to "Fixed"; remove workaround note |
+## 5. Test plan
+- **Phase 0:** before/after instrumentation output showing the captured signal.
+- **Phase 1 unit:** a unit test that injects a fake child process whose `exit`
+  fires with `code=null` *before* the timer callback runs, asserting `finalExitCode === 0`
+  and that stdout content is preserved. This is the regression guard for the race.
+- **Phase 1 real-binary (manual):** re-run the §1 repro matrix; all rows ✅.
+- **Phase 1 regression:** `npm run test:unit` + `npm run test:integration` green;
+  typecheck + lazy-imports clean; TABS.
+- **Phase 2 CI:** smoke workflow green on all 3 OSes.
+## 6. Risk analysis
+| Risk | Likelihood | Mitigation |
+|---|---|---|
+| Fix A hides a *real* crash as a clean exit | medium | Telemetry log (`final-drain-zero-exit` style) on the override; only override when stdout is non-empty AND finalDrain timer was armed. Never override `responseTimeoutHit` or `abortRequested` paths. |
+| Race fix changes behavior for the common (has-tools) path | low | The `finalDrainArmed` condition only adds to the existing `forcedFinalDrain` branch; has-tools path already sets `forcedFinalDrain=true` normally. Unit test covers both. |
+| Cross-platform signal differences (Windows has no signals) | medium | Windows already uses `taskkill`/`undefined` signal semantics; Fix A keys off `exitCode === null` which is platform-consistent for signal/force-kill death. Verify on Windows CI. |
+| Phase 0 instrumentation itself changes timing | low | Keep it log-only; use monotonic `performance.now()`; revert before merge if it perturbs the race. |
+## 7. Out of scope
+- P2-2 VM sandbox / isolated-vm (separate, v1.5).
+- Refactoring the 919-line `child-pi.ts` (tempting but out of scope; surgical fix only).
+- Changing the final-drain timeout constants (`FINAL_DRAIN_MS=5s`, `HARD_KILL_MS=3s`).
+- The `runKind:'goal-loop'` foreground-dispatch note from smoke testing (separate item).
+## 8. Open questions for Phase 0 — ANSWERED (2026-06-24)
+All three resolved by the Phase-0 root-cause finding (steer-backpressure kill,
+NOT the final-drain race). Kept for the audit trail.
+1. **ANSWERED: not the final-drain timer.** The keep-alive case returned `exitCode=0`
+   by coincidence — the SIGTERM came from `killProcessTree` on the steer-injection
+   path (`child-pi.ts:731`), not the final-drain timer (`forcedFinalDrain=false`
+   on failing runs). Keep-alive merely changed the OS-buffer state so the stdin
+   `write()` happened to return `true`. Red herring.
+2. **ANSWERED: yes — the steer-injection path (`child-pi.ts:716-726`).** Stack
+   capture under `PI_TEAMS_DEBUG=1` proved `killProcessTree` was invoked from
+   `onJsonEvent` on a `turn_end` where `maxTurns` was reached and `stdin.write()`
+   returned `false` (normal backpressure, mis-treated as fatal).
+3. **ANSWERED: no extension self-termination.** The kill stack is entirely
+   inside `child-pi.ts`; the prompt-runtime extension is not on the stack.
+---
+**Recommendation:** execute Phase 0 first (cheap, read-only, removes all guesswork),
+then pick Fix A or B based on the finding. Do NOT implement Phase 1 blind — the bug
+is in core runtime and a wrong fix could mask real crashes across every agent call.
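
Editorial sketch (not part of the package diff): the Phase 1 proposal above treats `write() === false` as backpressure rather than failure, and its verification gate asks for a unit test with a fake `child.stdin`. The snippet below illustrates that idea in isolation. `SteerSink`, `injectSteer`, and `BackpressuredSink` are hypothetical names introduced here, not identifiers from `child-pi.ts`, and the real fix may be structured differently.

```typescript
// Hypothetical minimal view of a child's stdin (a subset of what a writable
// stream exposes); not the package's actual type.
interface SteerSink {
  readonly writable: boolean;
  write(chunk: string): boolean;
  once(event: "drain", listener: () => void): unknown;
}

type SteerOutcome = "flushed" | "buffered" | "not-writable";

// Best-effort steer injection. A `false` return from write() means the chunk
// was queued behind a full buffer, so the worker is left alone.
function injectSteer(
  sink: SteerSink,
  payload: string,
  onDrain: () => void = () => {},
): SteerOutcome {
  if (!sink.writable) return "not-writable"; // caller decides what to do here
  if (sink.write(payload)) return "flushed";
  sink.once("drain", onDrain); // queued; notify once the buffer empties
  return "buffered";
}

// Fake sink that always reports backpressure, for the unit-test scenario.
class BackpressuredSink implements SteerSink {
  writable = true;
  readonly chunks: string[] = [];
  private drainListeners: Array<() => void> = [];

  write(chunk: string): boolean {
    this.chunks.push(chunk); // the data is accepted even though we return false
    return false;
  }

  once(_event: "drain", listener: () => void): this {
    this.drainListeners.push(listener);
    return this;
  }

  // Simulate the buffer emptying: fire each one-shot listener exactly once.
  drain(): void {
    const listeners = this.drainListeners;
    this.drainListeners = [];
    for (const listener of listeners) listener();
  }
}
```

Returning an outcome instead of killing inside the helper leaves termination to the caller, in line with the plan's note that the hard abort at `maxTurns + graceTurns` remains the backstop.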