pi-crew 0.9.8 → 0.9.9
- package/CHANGELOG.md +33 -0
- package/README.md +2 -2
- package/package.json +1 -1
- package/src/extension/register.ts +94 -21
- package/src/extension/registration/subagent-helpers.ts +1 -0
- package/src/extension/registration/subagent-tools.ts +9 -0
- package/src/runtime/batch-barrier.ts +145 -0
- package/src/runtime/child-pi.ts +15 -2
- package/src/runtime/crash-classification.ts +208 -0
- package/src/runtime/custom-tools/irc-tool.ts +47 -7
- package/src/runtime/live-agent-manager.ts +185 -0
- package/src/runtime/process-lifecycle.ts +481 -0
- package/src/runtime/subagent-manager.ts +6 -0
- package/src/runtime/task-output-context.ts +52 -1
- package/src/runtime/tool-output-pruner.ts +334 -0
- package/src/state/types.ts +5 -0
package/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,38 @@
|
|
|
1
1
|
# Changelog
|
|
2
2
|
|
|
3
|
+
## [v0.9.9] — gajae-code distillation (4 P0) + notification race fix (2026-06-25)
|
|
4
|
+
|
|
5
|
+
Six changes: four high-impact/low-effort features distilled from researching [Yeachan-Heo/gajae-code](https://github.com/Yeachan-Heo/gajae-code) (full report: `research-findings/gajae-code-distill.md`), plus a fix for a redundant-notification bug the leader directly hit while running that research. Each was calibrated against real pi-crew code — two reported "gaps" turned out to be patterns pi-crew already implements (prompt-level stablePrefix, detached spawning), and four areas where pi-crew is already superior were deliberately left untouched (crash-recovery byte-offset cursor, declarative workflow + semaphores, run-snapshot-cache, event sourcing).
|
|
6
|
+
|
|
7
|
+
### P0 #1 — Crash classification taxonomy (commit fb8c4a8)
|
|
8
|
+
|
|
9
|
+
`child-pi.ts` captured stderr/exit codes but never bucketed failure modes. New pure `classifyProcessCrash()` (port of gajae-code's `crash-diagnostics.ts`) maps exits to 9 semantic classes (`clean_exit | non_zero_exit | signal_exit | timeout | cancelled | spawn_error | protocol_exit | native_panic | unknown`) with precedence timeout > cancelled > spawn_error > native_panic > signal. Attached to `WorkerExitStatus` at both settle paths; kill/drain/timeout logic untouched. 30 unit tests.
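The precedence above can be sketched as a small pure function. This is an illustration under assumptions, not the contents of `crash-classification.ts`: the input fields mirror the ones passed at the `child-pi.ts` call sites, while the name `classifyExitSketch` and the stderr hint list are hypothetical.

```typescript
type CrashClass =
  | "clean_exit" | "non_zero_exit" | "signal_exit" | "timeout" | "cancelled"
  | "spawn_error" | "protocol_exit" | "native_panic" | "unknown";

interface ExitInput {
  exitCode?: number | null;
  signal?: string | null;
  cancelled?: boolean;
  timedOut?: boolean;
  spawnError?: unknown;
  stderrSnippet?: string;
}

// Hypothetical signatures; the real module's patterns may differ.
const PANIC_HINTS = ["segmentation fault", "sigsegv", "panicked at"];

function classifyExitSketch(input: ExitInput): CrashClass {
  // Checks run in precedence order: timeout > cancelled > spawn_error > native_panic > signal.
  if (input.timedOut) return "timeout";
  if (input.cancelled) return "cancelled";
  if (input.spawnError) return "spawn_error";
  const stderr = (input.stderrSnippet ?? "").toLowerCase();
  if (PANIC_HINTS.some((hint) => stderr.includes(hint))) return "native_panic";
  if (input.signal) return "signal_exit";
  if (input.exitCode === 0) return "clean_exit";
  if (typeof input.exitCode === "number") return "non_zero_exit";
  if (input.exitCode === null) return "protocol_exit"; // no code and no signal observed
  return "unknown"; // defensive fallback for an empty input
}
```

Because the function only inspects its argument, it can be unit tested without spawning a process, which is presumably why the changelog stresses that the classifier is pure.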
|
|
10
|
+
|
|
11
|
+
### P0 #2 — Staleness-aware tool output pruning (commit 13acf37)
|
|
12
|
+
|
|
13
|
+
The L4 size-based compaction retained every copy of a re-read file until a size threshold tripped. New `tool-output-pruner.ts` (port of gajae-code's `pruning.ts`) drops superseded tool results — same-file re-reads and read-then-edit — **before** they are injected into a downstream worker's prompt via `task-output-context.ts`. Replaces stale content with a digest notice (first/last lines + count for bash/grep/search). OPT-IN via `DEFAULT_PRUNE_CONFIG`; does NOT regress the L4 head+tail(75/25) behavior. 25 unit tests.
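To make the idea of a superseded tool result concrete, here is a minimal sketch, assuming a simplified result shape. `ToolResult`, `pruneSuperseded`, and the notice text are hypothetical and are not taken from `tool-output-pruner.ts`; the real pruner also builds first/last-line digests for bash/grep/search output, which this sketch omits.

```typescript
interface ToolResult {
  tool: "read" | "edit" | "bash";
  path?: string;
  output: string;
}

// Hypothetical placeholder text for a pruned result.
const STALE_NOTICE = "[pruned: superseded by a later read or edit of this file]";

function pruneSuperseded(results: ToolResult[]): ToolResult[] {
  // Remember the index of the last read or edit that touched each path.
  const lastTouch = new Map<string, number>();
  results.forEach((r, i) => {
    if (r.path && (r.tool === "read" || r.tool === "edit")) lastTouch.set(r.path, i);
  });
  // A read is stale when a later read or edit touched the same path.
  return results.map((r, i) => {
    const stale =
      r.tool === "read" && r.path !== undefined && (lastTouch.get(r.path) ?? i) > i;
    return stale ? { ...r, output: STALE_NOTICE } : r;
  });
}
```

The sketch keeps every entry in place and only swaps the stale payload, so the downstream prompt still shows that the read happened without carrying the outdated file content.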
|
|
14
|
+
|
|
15
|
+
### P0 #3 — OwnedProcess abstraction (commit fe3bdde)
|
|
16
|
+
|
|
17
|
+
Background spawns used `detached:true` but had no unified ownership primitive for guaranteed teardown. New `process-lifecycle.ts` adds `OwnedProcess` (escalating SIGTERM → grace → SIGKILL; Windows `taskkill /F /T /PID` fallback; idempotent `dispose()`; bounded `awaitExit()`; `onExit`) plus `registerResourceOwner()`/`disposeAllOwners()` for non-process resources (timers, sockets, Workers) and root-exit drain reconciliation. **Incremental adoption** — deliberately NOT migrating `child-pi.ts`'s battle-tested `killProcessTree`/post-exit-stdio-guard/hard-kill-timer or `async-runner.ts`'s intentionally-detached background spawns; the primitive is available for future ownership-scoped spawns (MCP/LSP/DAP servers, eval workers). 22 unit tests.
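The escalation and idempotency described above can be sketched against a small interface instead of a real child process. This is an assumption-laden illustration: `Killable` and `OwnedProcessSketch` are hypothetical names, the Windows `taskkill` fallback is left out, and the real `OwnedProcess` in `process-lifecycle.ts` may be structured differently.

```typescript
interface Killable {
  kill(signal: "SIGTERM" | "SIGKILL"): void;
  /** Resolves once the process has exited. */
  exited: Promise<void>;
}

class OwnedProcessSketch {
  private readonly child: Killable;
  private readonly graceMs: number;
  private disposing: Promise<void> | undefined;

  constructor(child: Killable, graceMs: number) {
    this.child = child;
    this.graceMs = graceMs;
  }

  /** Idempotent: repeated calls share a single teardown promise. */
  dispose(): Promise<void> {
    if (!this.disposing) this.disposing = this.teardown();
    return this.disposing;
  }

  private async teardown(): Promise<void> {
    this.child.kill("SIGTERM");
    // Wait for either a clean exit or the end of the grace period.
    const outcome = await new Promise<"exited" | "timeout">((resolve) => {
      const timer = setTimeout(() => resolve("timeout"), this.graceMs);
      void this.child.exited.then(() => {
        clearTimeout(timer);
        resolve("exited");
      });
    });
    if (outcome === "timeout") {
      // Grace period elapsed: escalate and wait for the exit to be observed.
      this.child.kill("SIGKILL");
      await this.child.exited;
    }
  }
}
```

With Node's `child_process`, a `Killable` could be built by wrapping `child.kill(signal)` and a promise that resolves on the `exit` event; that adapter is left out here to keep the sketch testable without spawning anything.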
|
|
18
|
+
|
|
19
|
+
### P0 #4 — IRC reply support, side-channel Q&A (commit 43bcd65)
|
|
20
|
+
|
|
21
|
+
The `irc-tool` was fire-and-forget despite an `awaitReply` param marked "Not yet supported". New `respondAsBackground()` on `live-agent-manager.ts` delivers a DM to a recipient's session **without blocking its main loop** (`sendCustomMessage({triggerTurn:false})`) and awaits an event-driven, timeout-bounded reply via an in-memory pending-reply registry keyed by correlation id. `awaitReply:true` DMs now route through this side-channel and return reply content; broadcast stays fire-and-forget. Coexists with mailbox.ts's existing file-based reply fields (cross-process). 10 unit tests.
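A pending-reply registry keyed by correlation id might look roughly like the following. It is a sketch under assumptions, not the code in `live-agent-manager.ts`: the class name `PendingReplies` and its methods are hypothetical, and message delivery itself (`sendCustomMessage`) is out of scope.

```typescript
type Resolver = (reply: string) => void;

class PendingReplies {
  private readonly waiting = new Map<string, Resolver>();

  /** Wait for a reply to `correlationId`; reject once `timeoutMs` elapses. */
  awaitReply(correlationId: string, timeoutMs: number): Promise<string> {
    return new Promise<string>((resolve, reject) => {
      const timer = setTimeout(() => {
        this.waiting.delete(correlationId);
        reject(new Error(`reply timeout for ${correlationId}`));
      }, timeoutMs);
      this.waiting.set(correlationId, (reply) => {
        clearTimeout(timer);
        this.waiting.delete(correlationId);
        resolve(reply);
      });
    });
  }

  /** Deliver a reply; returns false when nobody is waiting (late or unknown id). */
  deliver(correlationId: string, reply: string): boolean {
    const resolver = this.waiting.get(correlationId);
    if (!resolver) return false;
    resolver(reply);
    return true;
  }

  get size(): number {
    return this.waiting.size;
  }
}
```

The sender would call `awaitReply` right after dispatching the DM, and the recipient's reply handler would call `deliver` with the same id. Both the timeout path and the reply path remove the entry, so the in-memory map does not grow when recipients never answer.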
|
|
22
|
+
|
|
23
|
+
### Notification race fix — Rule 2 + Rule 1 (commits 592d9ea, c22cbb9)
|
|
24
|
+
|
|
25
|
+
While running the gajae-code research, the leader observed redundant "background subagent changed state" notifications arriving a turn late, after results were already read. Root cause: the completion callback (`SubagentManager.onComplete`) fires from inside the `record.promise` IIFE `finally` block — **before the promise resolves** — so a leader calling `get_subagent_result(wait:true)` sets `resultConsumed=true` only afterward, and the synchronous `if (record.resultConsumed) return` guard always saw `false`. A latent test bug (`assert sentMessages.length === 0` on an array that was unconditionally empty because `sendAgentWakeUp` prefers `sendUserMessage`) masked it.
|
|
26
|
+
|
|
27
|
+
- **Rule 2** (592d9ea): defer notification emission to a `setTimeout(0)` **macrotask** (not `queueMicrotask` — microtasks queued in the finally run before the promise-resolution microtask), then recheck `resultConsumed` (in-memory `getRecord` + persisted `readPersistedSubagentRecord`) before emitting; suppress if already consumed. Covers all three `onComplete` call sites via the single emit point. Fixed the test assertion to `sentUserMessages` and added two explicit regression tests (notify still fires when leader does NOT pre-consume; notify suppressed when leader pre-consumes via `wait:true`).
|
|
28
|
+
- **Rule 1** (c22cbb9): new `BatchBarrier` registry + optional `batch_id` param on the Agent tool. Background agents sharing a `batch_id` never emit individual notifications; instead each completion is recorded in the barrier and **one consolidated** "All N background subagents in batch \"X\" have finished" notification fires exactly once when every member reaches a terminal state (`blocked` is NOT terminal — a blocked agent resumes later). Verified end-to-end with 1/2/5-agent batches (one with a queued member and staggered 10–25s sleeps): exactly 1 consolidated notification, 0 individual leaks. Composes with Rule 2 (a batched agent whose result was already consumed is still suppressed via the `resultConsumed` recheck). 10 unit tests + 2 integration tests.
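The microtask-versus-macrotask distinction behind Rule 2 can be reproduced in isolation. The snippet below is a standalone sketch, not pi-crew code: `work` stands in for `record.promise`, `consumed` for `resultConsumed`, and the `order` array exists only to record what each deferred callback observes.

```typescript
const order: string[] = [];
let consumed = false;

// Stand-in for record.promise: the completion hook runs in `finally`,
// before the promise returned by this async function has resolved.
const work = (async () => {
  try {
    await Promise.resolve(); // simulate asynchronous work
    return "result";
  } finally {
    queueMicrotask(() => order.push(`microtask sees consumed=${consumed}`));
    setTimeout(() => order.push(`macrotask sees consumed=${consumed}`), 0);
  }
})();

// Stand-in for the leader's `await record.promise` continuation,
// which marks the result as consumed.
const leader = work.then(() => {
  consumed = true;
});
```

The microtask queued inside `finally` runs ahead of the leader's continuation and so still observes `consumed=false`, while the `setTimeout(0)` callback runs after the microtask queue has drained and observes `consumed=true`. That ordering is why the fix defers to a macrotask before rechecking.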
|
|
29
|
+
|
|
30
|
+
Design doc: `research-findings/subagent-notification-race-fix.md`.
|
|
31
|
+
|
|
32
|
+
### What was NOT adopted (pi-crew already superior)
|
|
33
|
+
|
|
34
|
+
Crash recovery (`crash-recovery.ts`, 421 lines, byte-offset event-log cursor), declarative workflow + semaphores, `run-snapshot-cache` (disk rebuild, TTL 1500ms), and event sourcing (`readEventsCursor`) are all more sophisticated than gajae-code's equivalents and were left intact.
|
|
35
|
+
|
|
3
36
|
## [v0.9.8] — deer-flow learning integration: L1/L2/L3/L4 (2026-06-24)
|
|
4
37
|
|
|
5
38
|
Four improvements distilled from researching [bytedance/deer-flow](https://github.com/bytedance/deer-flow) and the wider Pi-ecosystem (pi-boomerang, pi-subagents, pi-dynamic-workflows). Each was calibrated against real pi-crew code (the research over-reported gaps — several patterns pi-crew already does *better* than deer-flow) and sized from measured data, not guesses.
|
package/README.md
CHANGED
|
@@ -39,9 +39,9 @@ npm: pi-crew
|
|
|
39
39
|
repo: https://github.com/baphuongna/pi-crew
|
|
40
40
|
```
|
|
41
41
|
|
|
42
|
-
**v0.9.4 / v0.9.5 / v0.9.8**: See [CHANGELOG.md](CHANGELOG.md).
|
|
42
|
+
**v0.9.4 / v0.9.5 / v0.9.8 / v0.9.9**: See [CHANGELOG.md](CHANGELOG.md).
|
|
43
43
|
|
|
44
|
-
### Highlights (v0.6.4 → v0.9.
|
|
44
|
+
### Highlights (v0.6.4 → v0.9.9)
|
|
45
45
|
|
|
46
46
|
A long arc of **trust, cliff-resilience, and robustness** work. Principle: *build
|
|
47
47
|
trust and cliff-resilience, stay lean, delete before adding.*
|
package/package.json
CHANGED
|
package/src/extension/register.ts
CHANGED
|
@@ -56,7 +56,11 @@ import { createManifestCache } from "../runtime/manifest-cache.ts";
|
|
|
56
56
|
import { CrewScheduler } from "../runtime/scheduler.ts";
|
|
57
57
|
import { loadRunManifestById, updateRunStatus } from "../state/state-store.ts";
|
|
58
58
|
import type { TeamRunManifest } from "../state/types.ts";
|
|
59
|
-
import {
|
|
59
|
+
import {
|
|
60
|
+
SubagentManager,
|
|
61
|
+
readPersistedSubagentRecord,
|
|
62
|
+
} from "../subagents/manager.ts";
|
|
63
|
+
import { BatchBarrier, type BatchMember } from "../runtime/batch-barrier.ts";
|
|
60
64
|
import { terminateActiveChildPiProcesses } from "../subagents/spawn.ts";
|
|
61
65
|
import {
|
|
62
66
|
type CrewWidgetState,
|
|
@@ -635,6 +639,7 @@ export function registerPiTeams(pi: ExtensionAPI): void {
|
|
|
635
639
|
!cleanedUp &&
|
|
636
640
|
currentCtx === ctx &&
|
|
637
641
|
sessionGeneration === ownerGeneration;
|
|
642
|
+
const batchBarrier = new BatchBarrier();
|
|
638
643
|
const subagentManager = new SubagentManager(
|
|
639
644
|
4,
|
|
640
645
|
(record) => {
|
|
@@ -651,22 +656,90 @@ export function registerPiTeams(pi: ExtensionAPI): void {
|
|
|
651
656
|
durationMs: record.durationMs,
|
|
652
657
|
});
|
|
653
658
|
}
|
|
654
|
-
if (!record.background
|
|
659
|
+
if (!record.background) return;
|
|
655
660
|
if (!isOwnerSessionCurrent(record.ownerSessionGeneration)) return;
|
|
656
661
|
if (
|
|
657
|
-
record.status
|
|
658
|
-
record.status
|
|
659
|
-
record.status
|
|
660
|
-
record.status
|
|
661
|
-
record.status
|
|
662
|
-
)
|
|
662
|
+
record.status !== "completed" &&
|
|
663
|
+
record.status !== "failed" &&
|
|
664
|
+
record.status !== "cancelled" &&
|
|
665
|
+
record.status !== "blocked" &&
|
|
666
|
+
record.status !== "error"
|
|
667
|
+
)
|
|
668
|
+
return;
|
|
669
|
+
// Rule 2 (consume-race fix): this callback fires from inside the
|
|
670
|
+
// record.promise IIFE `finally` block — BEFORE the promise resolves,
|
|
671
|
+
// i.e. before a leader calling `get_subagent_result(wait:true)` can
|
|
672
|
+
// set resultConsumed=true. The old synchronous guard always saw
|
|
673
|
+
// resultConsumed=false here. Defer emission to a MACROTASK
|
|
674
|
+
// (setTimeout, NOT queueMicrotask): macrotasks run only after the
|
|
675
|
+
// microtask queue drains — which includes the leader's
|
|
676
|
+
// `await record.promise` continuation that marks resultConsumed=true.
|
|
677
|
+
// Then recheck in-memory + persisted before emitting.
|
|
678
|
+
const agentId = record.id;
|
|
679
|
+
const ownerGen = record.ownerSessionGeneration;
|
|
680
|
+
const agentStatus = record.status;
|
|
681
|
+
const agentType = record.type;
|
|
682
|
+
const agentDescription = record.description;
|
|
683
|
+
const agentRunId = record.runId;
|
|
684
|
+
const agentBatchId = record.batchId;
|
|
685
|
+
setTimeout(() => {
|
|
686
|
+
if (cleanedUp) return;
|
|
687
|
+
const fresh = subagentManager.getRecord(agentId);
|
|
688
|
+
const persisted = currentCtx
|
|
689
|
+
? readPersistedSubagentRecord(currentCtx.cwd, agentId)
|
|
690
|
+
: undefined;
|
|
691
|
+
// Leader already joined the result -> suppress redundant notify.
|
|
692
|
+
if (fresh?.resultConsumed || persisted?.resultConsumed) return;
|
|
693
|
+
if (!isOwnerSessionCurrent(fresh?.ownerSessionGeneration ?? ownerGen))
|
|
694
|
+
return;
|
|
695
|
+
// Rule 1 (batch coalescing): if this agent belongs to a batch, never
|
|
696
|
+
// emit an individual notification. Instead record its terminal state
|
|
697
|
+
// in the barrier; emit ONE consolidated notification only when ALL
|
|
698
|
+
// members are terminal. Suppressed members wait silently.
|
|
699
|
+
if (agentBatchId) {
|
|
700
|
+
const member: BatchMember = {
|
|
701
|
+
id: agentId,
|
|
702
|
+
description: agentDescription,
|
|
703
|
+
type: agentType,
|
|
704
|
+
status: agentStatus,
|
|
705
|
+
};
|
|
706
|
+
const snap = batchBarrier.markTerminal(agentBatchId, member);
|
|
707
|
+
if (snap.allDone && !snap.notified) {
|
|
708
|
+
batchBarrier.markNotified(agentBatchId);
|
|
709
|
+
const roster = snap.terminal
|
|
710
|
+
.map(
|
|
711
|
+
(m) =>
|
|
712
|
+
`- ${m.id} [${m.status}] (${m.type ?? "agent"}): ${m.description ?? ""}`,
|
|
713
|
+
)
|
|
714
|
+
.join("\n");
|
|
715
|
+
const joinInstruction = [
|
|
716
|
+
`All ${snap.terminal.length} background subagents in batch "${agentBatchId}" have finished.`,
|
|
717
|
+
"Members:",
|
|
718
|
+
roster,
|
|
719
|
+
"",
|
|
720
|
+
`Call get_subagent_result for each agent_id above, read the outputs, then continue the user's original task.`,
|
|
721
|
+
].join("\n");
|
|
722
|
+
sendAgentWakeUp(pi, joinInstruction);
|
|
723
|
+
notifyOperator({
|
|
724
|
+
id: `subagent-batch:${agentBatchId}:completed`,
|
|
725
|
+
severity: "info",
|
|
726
|
+
source: "subagent-completed",
|
|
727
|
+
runId: agentRunId,
|
|
728
|
+
title: `pi-crew batch "${agentBatchId}" complete (${snap.terminal.length} agents).`,
|
|
729
|
+
body: `Members: ${snap.terminal.map((m) => m.id).join(", ")}`,
|
|
730
|
+
});
|
|
731
|
+
}
|
|
732
|
+
// Either we just emitted the consolidated notify, or we are still
|
|
733
|
+
// waiting for other members — in both cases do NOT emit individual.
|
|
734
|
+
return;
|
|
735
|
+
}
|
|
663
736
|
const metadata = JSON.stringify(
|
|
664
737
|
{
|
|
665
|
-
id:
|
|
666
|
-
status:
|
|
667
|
-
type:
|
|
668
|
-
runId:
|
|
669
|
-
description:
|
|
738
|
+
id: agentId,
|
|
739
|
+
status: agentStatus,
|
|
740
|
+
type: agentType,
|
|
741
|
+
runId: agentRunId,
|
|
742
|
+
description: agentDescription,
|
|
670
743
|
},
|
|
671
744
|
null,
|
|
672
745
|
2,
|
|
@@ -677,19 +750,18 @@ export function registerPiTeams(pi: ExtensionAPI): void {
|
|
|
677
750
|
"```json",
|
|
678
751
|
metadata,
|
|
679
752
|
"```",
|
|
680
|
-
`Call get_subagent_result with agent_id="${
|
|
753
|
+
`Call get_subagent_result with agent_id="${agentId}" now, read the output, then continue the user's original task without waiting for another user prompt.`,
|
|
681
754
|
].join("\n");
|
|
682
755
|
sendAgentWakeUp(pi, joinInstruction);
|
|
683
756
|
notifyOperator({
|
|
684
|
-
id: `subagent:${
|
|
685
|
-
severity:
|
|
686
|
-
record.status === "completed" ? "info" : "warning",
|
|
757
|
+
id: `subagent:${agentId}:${agentStatus}`,
|
|
758
|
+
severity: agentStatus === "completed" ? "info" : "warning",
|
|
687
759
|
source: "subagent-completed",
|
|
688
|
-
runId:
|
|
689
|
-
title: `pi-crew subagent ${
|
|
690
|
-
body: `Use get_subagent_result with agent_id=${
|
|
760
|
+
runId: agentRunId,
|
|
761
|
+
title: `pi-crew subagent ${agentId} ${agentStatus}.`,
|
|
762
|
+
body: `Use get_subagent_result with agent_id=${agentId} for output.`,
|
|
691
763
|
});
|
|
692
|
-
}
|
|
764
|
+
}, 0);
|
|
693
765
|
},
|
|
694
766
|
1000,
|
|
695
767
|
(event, payload) => {
|
|
@@ -2044,6 +2116,7 @@ export function registerPiTeams(pi: ExtensionAPI): void {
|
|
|
2044
2116
|
ownerSessionGeneration: captureSessionGeneration,
|
|
2045
2117
|
startForegroundRun: (ctx, runner, runId) =>
|
|
2046
2118
|
startForegroundRun(ctx as ExtensionContext, runner, runId),
|
|
2119
|
+
batchBarrier,
|
|
2047
2120
|
});
|
|
2048
2121
|
time("register.tools");
|
|
2049
2122
|
|
package/src/extension/registration/subagent-helpers.ts
CHANGED
|
@@ -98,5 +98,6 @@ export function __test__subagentSpawnParams(params: Record<string, unknown>, ctx
|
|
|
98
98
|
model: typeof params.model === "string" && params.model.trim() ? params.model.trim() : undefined,
|
|
99
99
|
skill: parseSkillParam(params.skill),
|
|
100
100
|
maxTurns: typeof params.max_turns === "number" && Number.isFinite(params.max_turns) ? params.max_turns : undefined,
|
|
101
|
+
batchId: typeof params.batch_id === "string" && params.batch_id.trim() ? params.batch_id.trim() : undefined,
|
|
101
102
|
};
|
|
102
103
|
}
|
package/src/extension/registration/subagent-tools.ts
CHANGED
|
@@ -15,6 +15,7 @@ async function handleTeamTool(params: Parameters<typeof HandleTeamToolFn>[0], ct
|
|
|
15
15
|
}
|
|
16
16
|
import { checkSubagentSpawnPermission, currentCrewRole } from "../../runtime/role-permission.ts";
|
|
17
17
|
import { readPersistedSubagentRecord, savePersistedSubagentRecord, type SubagentManager, type SubagentSpawnOptions } from "../../subagents/manager.ts";
|
|
18
|
+
import type { BatchBarrier } from "../../runtime/batch-barrier.ts";
|
|
18
19
|
import { loadConfig } from "../../config/config.ts";
|
|
19
20
|
import { logInternalError } from "../../utils/internal-error.ts";
|
|
20
21
|
import { __test__subagentSpawnParams, formatSubagentRecord, readSubagentRunResult, refreshPersistedSubagentRecord, subagentToolResult } from "./subagent-helpers.ts";
|
|
@@ -32,6 +33,9 @@ type OnUpdate = (chunk: { content: { type: "text"; text: string }[] }) => void;
|
|
|
32
33
|
export interface SubagentToolRegistrationOptions {
|
|
33
34
|
ownerSessionGeneration?: () => number;
|
|
34
35
|
startForegroundRun?: (ctx: unknown, runner: (signal?: AbortSignal) => Promise<void>, runId?: string) => void;
|
|
36
|
+
/** Rule 1 batch barrier. When present, agents spawned with a batchId are
|
|
37
|
+
* registered here so their completion notifications are coalesced. */
|
|
38
|
+
batchBarrier?: BatchBarrier;
|
|
35
39
|
}
|
|
36
40
|
|
|
37
41
|
export function registerSubagentTools(pi: ExtensionAPI, subagentManager: SubagentManager, options: SubagentToolRegistrationOptions = {}): void {
|
|
@@ -53,6 +57,7 @@ export function registerSubagentTools(pi: ExtensionAPI, subagentManager: Subagen
|
|
|
53
57
|
skill: Type.Optional(Type.Union([Type.String(), Type.Array(Type.String()), Type.Boolean()], { description: "Skill name(s) to inject for this subagent, or false to disable selected/default skills." })),
|
|
54
58
|
max_turns: Type.Optional(Type.Number({ description: "Reserved for live-session subagents; child-process runtime may ignore this." })),
|
|
55
59
|
run_in_background: Type.Optional(Type.Boolean({ description: "Run in background and return an agent ID immediately." })),
|
|
60
|
+
batch_id: Type.Optional(Type.String({ description: "Optional batch grouping id. Background agents sharing the same batch_id receive ONE consolidated completion notification when ALL members finish (instead of N individual notifications). Use this when launching several background agents in one turn and you do not join them immediately. Omit for the default individual-notification behavior." })),
|
|
56
61
|
}) as never,
|
|
57
62
|
async execute(_id, params, signal, onUpdate, ctx) {
|
|
58
63
|
// Diagnostic: detect pre-aborted signal before spawn
|
|
@@ -71,6 +76,10 @@ export function registerSubagentTools(pi: ExtensionAPI, subagentManager: Subagen
|
|
|
71
76
|
const ctxWithSession = withSessionId(ctx);
|
|
72
77
|
const runner = async (currentOptions: SubagentSpawnOptions, childSignal?: AbortSignal) => handleTeamTool({ action: "run", agent: currentOptions.type, goal: currentOptions.prompt, model: currentOptions.model, skill: currentOptions.skill, async: currentOptions.background, config: currentOptions.maxTurns ? { runtime: { maxTurns: currentOptions.maxTurns } } : undefined } as TeamToolParamsValue, { ...ctxWithSession, signal: childSignal, ...(options.startForegroundRun ? { startForegroundRun: (runRunner: (sig?: AbortSignal) => Promise<void>, runId?: string) => options.startForegroundRun!(ctxWithSession, runRunner, runId) } : {}) });
|
|
73
78
|
const record = subagentManager.spawn(spawnOptions, runner, spawnOptions.background ? undefined : signal);
|
|
79
|
+
// Rule 1: register batch membership so completions can be coalesced.
|
|
80
|
+
if (spawnOptions.batchId && spawnOptions.background) {
|
|
81
|
+
options.batchBarrier?.register(spawnOptions.batchId, record.id, { description: record.description, type: record.type });
|
|
82
|
+
}
|
|
74
83
|
if (spawnOptions.background || record.status === "queued") {
|
|
75
84
|
// Phase 1.1a: Terminate turn for background queued — no LLM follow-up needed.
|
|
76
85
|
// Phase 1.6: Record was terminated for telemetry.
|
package/src/runtime/batch-barrier.ts
ADDED
|
@@ -0,0 +1,145 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* BatchBarrier — Rule 1 (no-wait batch grouping).
|
|
3
|
+
*
|
|
4
|
+
* When a leader launches several background subagents with the SAME `batchId`
|
|
5
|
+
* and does NOT join them immediately (`get_subagent_result(wait:true)`), the
|
|
6
|
+
* completion notifications are coalesced: instead of N individual
|
|
7
|
+
* "changed state" wake-ups, the leader receives ONE consolidated notification
|
|
8
|
+
* once ALL members of the batch have reached a terminal state.
|
|
9
|
+
*
|
|
10
|
+
* Semantics:
|
|
11
|
+
* - `register(batchId, agentId)` is called at spawn time (synchronous within a
|
|
12
|
+
* leader turn). All members of a batch are therefore known by the time the
|
|
13
|
+
* first completion fires (completion is observed via the 1000ms poll loop).
|
|
14
|
+
* - `markTerminal(batchId, agentId)` returns whether THIS completion made every
|
|
15
|
+
* registered member terminal ("allDone"). When allDone, the caller emits a
|
|
16
|
+
* single consolidated notification and calls `markNotified`.
|
|
17
|
+
* - If a member reaches a terminal state after the batch has already notified
|
|
18
|
+
* (late-spawn edge case), the snapshot still reports `notified: true`, so no
|
|
19
|
+
* second consolidated notification fires, and `alreadyNotified` lets the
|
|
20
|
+
* caller suppress stray individual notifications for the straggler.
|
|
21
|
+
*
|
|
22
|
+
* Thread-safety: single-threaded JS event loop. No locks needed.
|
|
23
|
+
*/
|
|
24
|
+
|
|
25
|
+
export interface BatchMember {
|
|
26
|
+
id: string;
|
|
27
|
+
description?: string;
|
|
28
|
+
type?: string;
|
|
29
|
+
status: string;
|
|
30
|
+
}
|
|
31
|
+
|
|
32
|
+
export interface BatchSnapshot {
|
|
33
|
+
batchId: string;
|
|
34
|
+
members: BatchMember[];
|
|
35
|
+
terminal: BatchMember[];
|
|
36
|
+
/** true when every registered member has reached a terminal state. */
|
|
37
|
+
allDone: boolean;
|
|
38
|
+
/** true once the consolidated notification has been emitted. */
|
|
39
|
+
notified: boolean;
|
|
40
|
+
}
|
|
41
|
+
|
|
42
|
+
const TERMINAL_STATUSES = new Set([
|
|
43
|
+
"completed",
|
|
44
|
+
"failed",
|
|
45
|
+
"cancelled",
|
|
46
|
+
"error",
|
|
47
|
+
"stopped",
|
|
48
|
+
]);
|
|
49
|
+
|
|
50
|
+
export function isTerminalStatus(status: string): boolean {
|
|
51
|
+
return TERMINAL_STATUSES.has(status);
|
|
52
|
+
}
|
|
53
|
+
|
|
54
|
+
export class BatchBarrier {
|
|
55
|
+
private readonly batches = new Map<
|
|
56
|
+
string,
|
|
57
|
+
{
|
|
58
|
+
members: Map<string, BatchMember>;
|
|
59
|
+
terminal: Map<string, BatchMember>;
|
|
60
|
+
notified: boolean;
|
|
61
|
+
}
|
|
62
|
+
>();
|
|
63
|
+
|
|
64
|
+
/** Register a member at spawn time. Idempotent per (batchId, agentId). */
|
|
65
|
+
register(batchId: string, agentId: string, meta?: { description?: string; type?: string }): void {
|
|
66
|
+
let batch = this.batches.get(batchId);
|
|
67
|
+
if (!batch) {
|
|
68
|
+
batch = { members: new Map(), terminal: new Map(), notified: false };
|
|
69
|
+
this.batches.set(batchId, batch);
|
|
70
|
+
}
|
|
71
|
+
if (!batch.members.has(agentId)) {
|
|
72
|
+
batch.members.set(agentId, {
|
|
73
|
+
id: agentId,
|
|
74
|
+
description: meta?.description,
|
|
75
|
+
type: meta?.type,
|
|
76
|
+
status: "running",
|
|
77
|
+
});
|
|
78
|
+
}
|
|
79
|
+
}
|
|
80
|
+
|
|
81
|
+
/**
|
|
82
|
+
* Record that a member reached a terminal state. Returns the batch snapshot.
|
|
83
|
+
* `snapshot.allDone` is true iff every registered member is now terminal.
|
|
84
|
+
* If the batch was never seen (defensive edge case), the member is registered
|
|
85
|
+
* on-the-fly as a batch-of-one so its terminal state is not silently lost.
|
|
86
|
+
*/
|
|
87
|
+
markTerminal(batchId: string, member: BatchMember): BatchSnapshot {
|
|
88
|
+
let batch = this.batches.get(batchId);
|
|
89
|
+
if (!batch) {
|
|
90
|
+
batch = { members: new Map(), terminal: new Map(), notified: false };
|
|
91
|
+
this.batches.set(batchId, batch);
|
|
92
|
+
}
|
|
93
|
+
// Ensure the member is known (auto-register for the defensive case).
|
|
94
|
+
if (!batch.members.has(member.id)) {
|
|
95
|
+
batch.members.set(member.id, { ...member, status: member.status });
|
|
96
|
+
}
|
|
97
|
+
if (isTerminalStatus(member.status)) {
|
|
98
|
+
batch.terminal.set(member.id, { ...member });
|
|
99
|
+
const existing = batch.members.get(member.id);
|
|
100
|
+
if (existing) batch.members.set(member.id, { ...existing, status: member.status });
|
|
101
|
+
}
|
|
102
|
+
const allDone =
|
|
103
|
+
batch.members.size > 0 &&
|
|
104
|
+
[...batch.members.keys()].every((id) => batch.terminal.has(id));
|
|
105
|
+
return {
|
|
106
|
+
batchId,
|
|
107
|
+
members: [...batch.members.values()],
|
|
108
|
+
terminal: [...batch.terminal.values()],
|
|
109
|
+
allDone,
|
|
110
|
+
notified: batch.notified,
|
|
111
|
+
};
|
|
112
|
+
}
|
|
113
|
+
|
|
114
|
+
/** Has the consolidated notification already been emitted for this batch? */
|
|
115
|
+
alreadyNotified(batchId: string): boolean {
|
|
116
|
+
return this.batches.get(batchId)?.notified ?? false;
|
|
117
|
+
}
|
|
118
|
+
|
|
119
|
+
/** Mark the consolidated notification as emitted. No-op if already set. */
|
|
120
|
+
markNotified(batchId: string): void {
|
|
121
|
+
const batch = this.batches.get(batchId);
|
|
122
|
+
if (batch) batch.notified = true;
|
|
123
|
+
}
|
|
124
|
+
|
|
125
|
+
/** Read-only snapshot (for tests / debugging). */
|
|
126
|
+
snapshot(batchId: string): BatchSnapshot | undefined {
|
|
127
|
+
const batch = this.batches.get(batchId);
|
|
128
|
+
if (!batch) return undefined;
|
|
129
|
+
return {
|
|
130
|
+
batchId,
|
|
131
|
+
members: [...batch.members.values()],
|
|
132
|
+
terminal: [...batch.terminal.values()],
|
|
133
|
+
allDone:
|
|
134
|
+
batch.members.size > 0 &&
|
|
135
|
+
[...batch.members.keys()].every((id) => batch.terminal.has(id)),
|
|
136
|
+
notified: batch.notified,
|
|
137
|
+
};
|
|
138
|
+
}
|
|
139
|
+
|
|
140
|
+
/** Drop a batch (used on cleanup / test reset). */
|
|
141
|
+
dispose(batchId?: string): void {
|
|
142
|
+
if (batchId === undefined) this.batches.clear();
|
|
143
|
+
else this.batches.delete(batchId);
|
|
144
|
+
}
|
|
145
|
+
}
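A condensed, self-contained sketch of the barrier semantics may help when reading the class above. `MiniBarrier` is a hypothetical simplification written for this example: it folds `markTerminal` and `markNotified` into a single `complete()` call and drops the member metadata, so it is not a drop-in replacement for `BatchBarrier`.

```typescript
// Same terminal set as the diff: "blocked" is deliberately absent.
const TERMINAL = new Set(["completed", "failed", "cancelled", "error", "stopped"]);

class MiniBarrier {
  private readonly members = new Map<string, Set<string>>();
  private readonly finished = new Map<string, Set<string>>();
  private readonly notified = new Set<string>();

  register(batchId: string, agentId: string): void {
    if (!this.members.has(batchId)) {
      this.members.set(batchId, new Set());
      this.finished.set(batchId, new Set());
    }
    this.members.get(batchId)!.add(agentId);
  }

  /** Returns true only for the completion that makes the whole batch terminal. */
  complete(batchId: string, agentId: string, status: string): boolean {
    this.register(batchId, agentId); // defensive auto-register, as in markTerminal
    const done = this.finished.get(batchId)!;
    if (TERMINAL.has(status)) done.add(agentId);
    const allDone = Array.from(this.members.get(batchId)!).every((id) => done.has(id));
    if (!allDone || this.notified.has(batchId)) return false;
    this.notified.add(batchId); // ensure the consolidated notice fires once
    return true;
  }
}

// Usage: two agents in one batch produce a single "notify now" signal.
const barrier = new MiniBarrier();
barrier.register("b1", "a1");
barrier.register("b1", "a2");
const afterFirst = barrier.complete("b1", "a1", "completed"); // a2 still running
const afterBlocked = barrier.complete("b1", "a2", "blocked"); // not terminal
const afterLast = barrier.complete("b1", "a2", "failed"); // batch now terminal
```

In this sketch only `afterLast` is `true`; a repeated `complete()` for the same batch returns `false` again, which matches the "fires exactly once" behavior described in the changelog.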
|
package/src/runtime/child-pi.ts
CHANGED
|
@@ -12,6 +12,7 @@ import { attachPostExitStdioGuard, trySignalChild } from "./post-exit-stdio-guar
|
|
|
12
12
|
import { redactJsonLine } from "../utils/redaction.ts";
|
|
13
13
|
import { sanitizeEnvSecrets } from "../utils/env-filter.ts";
|
|
14
14
|
import { registerChildProcess, unregisterChildProcess } from "../extension/crew-cleanup.ts";
|
|
15
|
+
import { classifyProcessCrash } from "./crash-classification.ts";
|
|
15
16
|
import { resolveRealContainedPath } from "../utils/safe-paths.ts";
|
|
16
17
|
|
|
17
18
|
const POST_EXIT_STDIO_GUARD_MS = DEFAULT_CHILD_PI.postExitStdioGuardMs;
|
|
@@ -912,7 +913,7 @@ export async function runChildPi(input: ChildPiRunInput): Promise<ChildPiRunResu
|
|
|
912
913
|
} catch (err) {
|
|
913
914
|
logInternalError("child-pi.on-lifecycle-event", err, `event=error, pid=${child.pid}`);
|
|
914
915
|
}
|
|
915
|
-
settle({ exitCode: null, stdout, stderr, error: processError.message });
|
|
916
|
+
settle({ exitCode: null, stdout, stderr, error: processError.message, exitStatus: { exitCode: null, cancelled: abortRequested, timedOut: responseTimeoutHit, killed: false, cleanupErrors, finalDrainMs, crashClass: classifyProcessCrash({ exitCode: null, cancelled: abortRequested, timedOut: responseTimeoutHit, spawnError: error, stderrSnippet: stderr ? stderr.slice(-1000) : undefined }).crashClass } });
|
|
916
917
|
});
|
|
917
918
|
child.on("exit", (code, signal) => {
|
|
918
919
|
if (child.pid) {
|
|
@@ -1001,7 +1002,19 @@ export async function runChildPi(input: ChildPiRunInput): Promise<ChildPiRunResu
|
|
|
1001
1002
|
// is logged, not fatal). The steerError branch is retained for safety in
|
|
1002
1003
|
// case a future change reintroduces a fatal steer path.
|
|
1003
1004
|
const steerError = steerInjectionFailed ? "Steer injection failed due to stdin backpressure; process killed" : undefined;
|
|
1004
|
-
|
|
1005
|
+
// P0 crash taxonomy: classify the exit so callers/dashboards can bucket
|
|
1006
|
+
// failure modes (timeout vs cancel vs native panic vs signal …).
|
|
1007
|
+
// The classifier is a pure function; this is the single integration point.
|
|
1008
|
+
const crashClassification = classifyProcessCrash({
|
|
1009
|
+
exitCode: finalExitCode,
|
|
1010
|
+
signal: child.signalCode ?? undefined,
|
|
1011
|
+
cancelled: abortRequested,
|
|
1012
|
+
timedOut: responseTimeoutHit,
|
|
1013
|
+
killed: hardKilled,
|
|
1014
|
+
spawnError: undefined,
|
|
1015
|
+
stderrSnippet: stderr ? stderr.slice(-1000) : undefined,
|
|
1016
|
+
});
|
|
1017
|
+
settle({ exitCode: finalExitCode, stdout, stderr, ...(timeoutError ? { error: timeoutError.error } : {}), ...(steerError ? { error: steerError } : {}), aborted: wasGraceAborted || wasParentAborted, steered: softLimitReached && !wasGraceAborted, exitStatus: { exitCode: finalExitCode, cancelled: abortRequested, timedOut: responseTimeoutHit, killed: hardKilled, cleanupErrors, finalDrainMs, crashClass: crashClassification.crashClass } });
|
|
1005
1018
|
});
|
|
1006
1019
|
});
|
|
1007
1020
|
} finally {
|
package/src/runtime/crash-classification.ts
ADDED
|
@@ -0,0 +1,208 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* Crash Classification Taxonomy — pure function for categorizing worker exits.
|
|
3
|
+
*
|
|
4
|
+
* Distilled from gajae-code's `debug/crash-diagnostics.ts` (P0 item #1).
|
|
5
|
+
* Unlike the original, this module is pure: it does NOT write crash reports or
|
|
6
|
+
* touch the filesystem. The file-I/O layer is intentionally omitted; callers
|
|
7
|
+
* that want durable crash logs can layer them on top of {@link classifyProcessCrash}.
|
|
8
|
+
*
|
|
9
|
+
* The classification precedence (most-significant first) mirrors the
|
|
10
|
+
* reference implementation:
|
|
11
|
+
*
|
|
12
|
+
* 1. timeout — process was terminated by the response-timeout guard
|
|
13
|
+
* 2. cancelled — cooperative cancellation (AbortSignal) triggered exit
|
|
14
|
+
* 3. spawn_error — child_process emitted an `error` event before `exit`
|
|
15
|
+
* 4. native_panic — stderr indicates a native crash (SIGSEGV / abort / panic)
|
|
16
|
+
* 5. signal_exit — the process was terminated by an OS signal
|
|
17
|
+
* 6. clean_exit — exit code 0
|
|
18
|
+
* 7. non_zero_exit — exit code != 0 (and != null)
|
|
19
|
+
* 8. protocol_exit — exit code is null with no signal (protocol/stream
|
|
20
|
+
* ended before a normal exit was observed)
|
|
21
|
+
* 9. unknown — defensive fallback (should not occur in practice)
|
|
22
|
+
*
|
|
23
|
+
* NOTE on timeout-vs-cancel precedence: when BOTH `timedOut` and `cancelled`
|
|
24
|
+
* are true, `timeout` wins (the timeout terminated the process). This matches
|
|
25
|
+
* gajae-code and the existing child-pi.ts response-timeout guard, which fires
|
|
26
|
+
* the hard kill and is the proximate cause.
|
|
27
|
+
*/
|
|
28
|
+
|
|
29
|
+
/**
|
|
30
|
+
* Categorical classification of why a worker process ended.
|
|
31
|
+
*
|
|
32
|
+
* @see classifyProcessCrash
|
|
33
|
+
*/
|
|
34
|
+
export type CrashClass =
|
|
35
|
+
| "clean_exit"
|
|
36
|
+
| "non_zero_exit"
|
|
37
|
+
| "signal_exit"
|
|
38
|
+
| "timeout"
|
|
39
|
+
| "cancelled"
|
|
40
|
+
| "spawn_error"
|
|
41
|
+
| "protocol_exit"
|
|
42
|
+
| "native_panic"
|
|
43
|
+
| "unknown";
|
|
44
|
+
|
|
45
|
+
/**
|
|
46
|
+
* Inputs to {@link classifyProcessCrash}. All fields are optional/safe-defaulting
|
|
47
|
+
* so callers can pass a partial view (e.g. just `{ exitCode: 0 }`).
|
|
48
|
+
*
|
|
49
|
+
* Field semantics:
|
|
50
|
+
* - `exitCode` — the OS exit code, or `null` when no code was observed.
|
|
51
|
+
* - `signal` — the terminating signal name (e.g. `"SIGTERM"`) or `null`.
|
|
52
|
+
* - `cancelled` — true when cooperative cancellation (AbortSignal) was requested.
|
|
53
|
+
* - `timedOut` — true when the response-timeout guard fired (and likely killed the child).
|
|
54
|
+
* - `killed` — true when the parent explicitly killed the child (best-effort).
|
|
55
|
+
* - `spawnError` — truthy when the child emitted a spawn/process `error` event.
|
|
56
|
+
* - `stderrSnippet` — tail of captured stderr, used to detect native panics.
|
|
57
|
+
*/
|
|
58
|
+
export interface CrashClassificationInput {
|
|
59
|
+
exitCode?: number | null;
|
|
60
|
+
signal?: string | null;
|
|
61
|
+
cancelled?: boolean;
|
|
62
|
+
timedOut?: boolean;
|
|
63
|
+
killed?: boolean;
|
|
64
|
+
spawnError?: unknown;
|
|
65
|
+
stderrSnippet?: string;
|
|
66
|
+
}
|
|
67
|
+
|
|
68
|
+
/**
|
|
69
|
+
* Result of classifying an exit. `crashClass` is machine-readable;
|
|
70
|
+
* `reason` is a human-friendly one-liner suitable for logs/diagnostics.
|
|
71
|
+
*/
|
|
72
|
+
export interface CrashClassification {
|
|
73
|
+
crashClass: CrashClass;
|
|
74
|
+
reason: string;
|
|
75
|
+
}
|
|
76
|
+
|
|
77
|
+
// ── native-panic detection ──────────────────────────────────────────────────
|
|
78
|
+
//
|
|
79
|
+
// We look for a small, well-known set of native-crash signatures in the stderr
|
|
80
|
+
// tail. This is deliberately conservative: false positives would mislabel
|
|
81
|
+
// ordinary non-zero exits as native panics. The patterns are anchored on
|
|
82
|
+
// substrings that are unlikely to appear in normal application output.
|
|
83
|
+
|
|
84
|
+
interface NativePanicSignature {
|
|
85
|
+
/** Substring to search for (case-insensitive). */
|
|
86
|
+
pattern: string;
|
|
87
|
+
/** Human-readable class-specific reason suffix. */
|
|
88
|
+
label: string;
|
|
89
|
+
}
|
|
90
|
+
|
|
91
|
+
const NATIVE_PANIC_SIGNATURES: readonly NativePanicSignature[] = [
|
|
92
|
+
{ pattern: "sigsegv", label: "segmentation fault" },
|
|
93
|
+
{ pattern: "segfault", label: "segmentation fault" },
|
|
94
|
+
{ pattern: "segmentation fault", label: "segmentation fault" },
|
|
95
|
+
{ pattern: "sigabrt", label: "abort signal" },
|
|
96
|
+
{ pattern: "abort(", label: "abort" },
|
|
97
|
+
{ pattern: "fatal error", label: "V8/node fatal error" },
|
|
98
|
+
{ pattern: "panic:", label: "rust/go panic" },
|
|
99
|
+
{ pattern: "thread '", label: "rust panic (thread context)" },
|
|
100
|
+
{ pattern: "illegal instruction", label: "illegal instruction" },
|
|
101
|
+
{ pattern: "double free", label: "heap corruption (double free)" },
|
|
102
|
+
];
|
|
103
|
+
|
|
104
|
+
/**
|
|
105
|
+
* If the stderr tail contains a recognizable native-crash signature, return the
|
|
106
|
+
* matching label; otherwise `null`. Case-insensitive.
|
|
107
|
+
*/
|
|
108
|
+
function detectNativePanic(stderrSnippet: string | undefined): string | null {
|
|
109
|
+
if (!stderrSnippet) return null;
|
|
110
|
+
const lower = stderrSnippet.toLowerCase();
|
|
111
|
+
for (const sig of NATIVE_PANIC_SIGNATURES) {
|
|
112
|
+
if (lower.includes(sig.pattern)) return sig.label;
|
|
113
|
+
}
|
|
114
|
+
return null;
|
|
115
|
+
}
|
|
116
|
+
|
|
117
|
+
/** Normalize an optional/signal-ish value to `string | null`. */
|
|
118
|
+
function normalizeSignal(signal: string | null | undefined): string | null {
|
|
119
|
+
return signal ?? null;
|
|
120
|
+
}
|
|
121
|
+
|
|
122
|
+
/**
|
|
123
|
+
* Classify a worker exit into a {@link CrashClass}.
|
|
124
|
+
*
|
|
125
|
+
* Pure: no I/O, no globals, no side effects. Deterministic given the same input.
|
|
126
|
+
* Safe to call from any context (including signal handlers).
|
|
127
|
+
*
|
|
128
|
+
* @example
|
|
129
|
+
* classifyProcessCrash({ exitCode: 0 }) // → clean_exit
|
|
130
|
+
* classifyProcessCrash({ exitCode: 1 }) // → non_zero_exit
|
|
131
|
+
* classifyProcessCrash({ signal: "SIGTERM" }) // → signal_exit
|
|
132
|
+
* classifyProcessCrash({ timedOut: true, exitCode: null }) // → timeout
|
|
133
|
+
* classifyProcessCrash({ cancelled: true, exitCode: null }) // → cancelled
|
|
134
|
+
* classifyProcessCrash({ spawnError: new Error("ENOENT") }) // → spawn_error
|
|
135
|
+
* classifyProcessCrash({ exitCode: null }) // → protocol_exit
|
|
136
|
+
* classifyProcessCrash({ exitCode: 139, signal: "SIGSEGV" }) // → signal_exit
|
|
137
|
+
* classifyProcessCrash({ exitCode: 134, stderrSnippet: "abort()" }) // → native_panic
|
|
138
|
+
*/
|
|
139
|
+
export function classifyProcessCrash(input: CrashClassificationInput): CrashClassification {
|
|
140
|
+
const exitCode = input.exitCode ?? null;
|
|
141
|
+
const signal = normalizeSignal(input.signal);
|
|
142
|
+
|
|
143
|
+
// 1. Timeout takes precedence: the response-timeout guard is the proximate
|
|
144
|
+
// cause of death even if cancellation was also requested.
|
|
145
|
+
if (input.timedOut) {
|
|
146
|
+
return { crashClass: "timeout", reason: "process timed out (response timeout guard fired)" };
|
|
147
|
+
}
|
|
148
|
+
|
|
149
|
+
// 2. Cooperative cancellation.
|
|
150
|
+
if (input.cancelled) {
|
|
151
|
+
return { crashClass: "cancelled", reason: "process was cancelled (abort requested)" };
|
|
152
|
+
}
|
|
153
|
+
|
|
154
|
+
// 3. Spawn error: the child never started or emitted a process error.
|
|
155
|
+
if (input.spawnError !== undefined && input.spawnError !== null) {
|
|
156
|
+
return {
|
|
157
|
+
crashClass: "spawn_error",
|
|
158
|
+
reason: `spawn error: ${stringifyError(input.spawnError)}`,
|
|
159
|
+
};
|
|
160
|
+
}
|
|
161
|
+
|
|
162
|
+
// 4. Native panic from stderr (only when we have a signal/abnormal exit —
|
|
163
|
+
// never reclassify a clean exit as a panic based on stderr noise).
|
|
164
|
+
const abnormalExit = signal !== null || (exitCode !== null && exitCode !== 0);
|
|
165
|
+
if (abnormalExit) {
|
|
166
|
+
const panic = detectNativePanic(input.stderrSnippet);
|
|
167
|
+
if (panic !== null) {
|
|
168
|
+
return { crashClass: "native_panic", reason: `native panic detected: ${panic}` };
|
|
169
|
+
}
|
|
170
|
+
}
|
|
171
|
+
|
|
172
|
+
// 5. Signal exit.
|
|
173
|
+
if (signal !== null) {
|
|
174
|
+
return { crashClass: "signal_exit", reason: `process exited after signal ${signal}` };
|
|
175
|
+
}
|
|
176
|
+
|
|
177
|
+
// 6. Clean exit.
|
|
178
|
+
if (exitCode === 0) {
|
|
179
|
+
return { crashClass: "clean_exit", reason: "process exited cleanly" };
|
|
180
|
+
}
|
|
181
|
+
|
|
182
|
+
// 7. Non-zero exit.
|
|
183
|
+
if (exitCode !== null) {
|
|
184
|
+
return { crashClass: "non_zero_exit", reason: `process exited with code ${exitCode}` };
|
|
185
|
+
}
|
|
186
|
+
|
|
187
|
+
// 8. Protocol exit: exitCode is null with no signal — the process stream
|
|
188
|
+
// ended before a normal exit was observed (e.g. stdio closed unexpectedly).
|
|
189
|
+
// If `killed` is true but no signal was recorded, treat as protocol_exit
|
|
190
|
+
// (the kill may not have delivered a signal we could capture).
|
|
191
|
+
if (input.killed) {
|
|
192
|
+
return { crashClass: "protocol_exit", reason: "process was killed but no signal/exit code was captured" };
|
|
193
|
+
}
|
|
194
|
+
|
|
195
|
+
// 8b. Truly null exitCode with no other context — protocol/stream ended early.
|
|
196
|
+
return { crashClass: "protocol_exit", reason: "process exited before protocol completion (exit code unknown)" };
|
|
197
|
+
}
|
|
198
|
+
|
|
199
|
+
/** Render an unknown error value to a short message string. */
|
|
200
|
+
function stringifyError(error: unknown): string {
|
|
201
|
+
if (error instanceof Error) return error.message || error.name;
|
|
202
|
+
if (typeof error === "string") return error;
|
|
203
|
+
try {
|
|
204
|
+
return String(error);
|
|
205
|
+
} catch {
|
|
206
|
+
return "(unstringifiable error)";
|
|
207
|
+
}
|
|
208
|
+
}
|
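Since `crashClass` is documented above as the machine-readable half of the result, a consumer could branch on it with an exhaustive `switch`. The sketch below is only an illustration: `RetryDecision` and `retryDecisionFor` are hypothetical names, the policy itself is an assumption rather than anything pi-crew ships, and the `CrashClass` union is copied locally so the block stands alone.

```typescript
// Local copy of the CrashClass union from crash-classification.ts,
// repeated here only so this sketch compiles on its own.
type CrashClass =
  | "clean_exit"
  | "non_zero_exit"
  | "signal_exit"
  | "timeout"
  | "cancelled"
  | "spawn_error"
  | "protocol_exit"
  | "native_panic"
  | "unknown";

// Hypothetical outcome type (an assumption, not part of the package).
type RetryDecision = "none" | "retry" | "escalate";

// Hypothetical policy: maps each crash class to a follow-up action.
// The mapping is illustrative; a real caller would pick its own.
function retryDecisionFor(crashClass: CrashClass): RetryDecision {
  switch (crashClass) {
    case "clean_exit":
    case "cancelled":
      // Nothing went wrong, or the caller asked for the stop.
      return "none";
    case "timeout":
    case "signal_exit":
    case "protocol_exit":
      // Possibly transient: assume another attempt is reasonable.
      return "retry";
    case "non_zero_exit":
    case "spawn_error":
    case "native_panic":
    case "unknown":
      // Likely deterministic or unexplained: surface to a human.
      return "escalate";
    default: {
      // Compile-time exhaustiveness check: adding a new CrashClass
      // member without handling it here becomes a type error.
      const unreachable: never = crashClass;
      return unreachable;
    }
  }
}
```

The `never` assignment in the `default` branch is the reason to prefer a `switch` over a lookup table here: if the union gains a member, the compiler flags every consumer that has not decided how to handle it.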