pi-crew 0.2.20 → 0.2.21
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +23 -10
- package/README.md +4 -2
- package/docs/PROJECT_REVIEW.md +271 -0
- package/docs/PROJECT_REVIEW_FIXES.md +343 -0
- package/docs/PROJECT_REVIEW_ROUND4.md +156 -0
- package/docs/PROJECT_REVIEW_ROUND5.md +86 -0
- package/docs/fixes/BATCH_A_H1_H2.md +86 -0
- package/docs/fixes/bug-006-foreground-cancel-concurrent.md +78 -0
- package/docs/fixes/bug-007-async-notifier-stale-ctx.md +112 -0
- package/docs/fixes/bug-008-child-process-silent-timeout.md +100 -0
- package/docs/fixes/bug-009-executor-yield-limit-needs-attention.md +75 -0
- package/docs/fixes/bug-010-child-process-api-key-filtered.md +109 -0
- package/docs/fixes/bug-011-spawn-pi-enoent.md +92 -0
- package/docs/fixes/bug-012-essential-env-stripped.md +89 -0
- package/docs/fixes/bug-013-background-runner-death.md +84 -0
- package/docs/fixes/bug-014-infinite-retry-loop-needs-attention.md +82 -0
- package/docs/fixes/bug-015-background-runner-sigterm.md +65 -0
- package/docs/fixes/bug-017-background-runner-session-shutdown.md +66 -0
- package/docs/fixes/bug-017-background-runner-sigkill-double-fork.md +28 -0
- package/docs/fixes/bug-018-child-pi-worker-stdin-hang.md +61 -0
- package/docs/fixes/bug-019-phantom-runs-temp-workspace.md +52 -0
- package/docs/pi-crew-bugs.md +954 -0
- package/docs/pi-crew-investigation-report.md +411 -0
- package/docs/pi-crew-test-final.md +120 -0
- package/docs/pi-crew-test-results.md +260 -0
- package/docs/pi-crew-test-round2.md +136 -0
- package/docs/pi-crew-test-round4.md +100 -0
- package/docs/pi-crew-test-round5.md +70 -0
- package/docs/pi-crew-test-round6.md +110 -0
- package/docs/usage.md +14 -0
- package/package.json +4 -2
- package/src/adapters/export-util.ts +12 -6
- package/src/agents/agent-config.ts +2 -0
- package/src/config/defaults.ts +1 -1
- package/src/config/markers.ts +22 -17
- package/src/config/resilient-parser.ts +1 -1
- package/src/extension/async-notifier.ts +4 -2
- package/src/extension/management.ts +52 -0
- package/src/extension/register.ts +47 -10
- package/src/extension/run-index.ts +20 -2
- package/src/extension/run-maintenance.ts +2 -2
- package/src/extension/team-tool/parallel-dispatch.ts +1 -1
- package/src/extension/team-tool/run.ts +3 -6
- package/src/extension/team-tool.ts +67 -11
- package/src/observability/event-to-metric.ts +2 -1
- package/src/runtime/async-runner.ts +42 -34
- package/src/runtime/background-runner.ts +165 -7
- package/src/runtime/child-pi.ts +111 -18
- package/src/runtime/code-summary.ts +1 -1
- package/src/runtime/crash-recovery.ts +1 -1
- package/src/runtime/crew-agent-runtime.ts +2 -1
- package/src/runtime/heartbeat-watcher.ts +4 -0
- package/src/runtime/live-agent-manager.ts +1 -1
- package/src/runtime/live-session-runtime.ts +2 -1
- package/src/runtime/manifest-cache.ts +2 -2
- package/src/runtime/model-fallback.ts +2 -1
- package/src/runtime/phase-progress.ts +1 -1
- package/src/runtime/pi-args.ts +3 -1
- package/src/runtime/pi-spawn.ts +6 -0
- package/src/runtime/prose-compressor.ts +1 -1
- package/src/runtime/result-extractor.ts +0 -1
- package/src/runtime/retry-executor.ts +1 -1
- package/src/runtime/runtime-resolver.ts +1 -1
- package/src/runtime/skill-instructions.ts +0 -1
- package/src/runtime/stale-reconciler.ts +30 -3
- package/src/runtime/subagent-manager.ts +2 -0
- package/src/runtime/task-display.ts +1 -1
- package/src/runtime/task-graph-scheduler.ts +1 -1
- package/src/runtime/task-runner/tail-read.ts +26 -0
- package/src/runtime/task-runner.ts +1007 -383
- package/src/runtime/team-runner.ts +9 -5
- package/src/runtime/worker-startup.ts +3 -1
- package/src/schema/team-tool-schema.ts +2 -1
- package/src/state/active-run-registry.ts +8 -2
- package/src/state/atomic-write.ts +17 -0
- package/src/state/contracts.ts +5 -2
- package/src/state/event-log-rotation.ts +118 -31
- package/src/state/event-log.ts +33 -5
- package/src/state/event-reconstructor.ts +4 -2
- package/src/state/mailbox.ts +5 -1
- package/src/state/schedule.ts +146 -0
- package/src/state/types.ts +40 -0
- package/src/state/usage.ts +20 -0
- package/src/ui/crew-widget.ts +2 -2
- package/src/ui/run-event-bus.ts +1 -1
- package/src/ui/run-snapshot-cache.ts +2 -1
- package/src/ui/snapshot-types.ts +1 -0
- package/src/utils/gh-protocol.ts +2 -2
- package/src/utils/names.ts +1 -1
- package/src/utils/sse-parser.ts +0 -2
- package/src/worktree/branch-freshness.ts +1 -1
- package/src/worktree/cleanup.ts +54 -14
- package/src/worktree/worktree-manager.ts +19 -9
|
@@ -0,0 +1,84 @@
|
|
|
1
|
+
# Bug #13 Fix: Background Runner Death — Full Fix
|
|
2
|
+
|
|
3
|
+
**Date**: 2026-05-19
|
|
4
|
+
**Root Cause**: Background runner process dies ~59s after spawning — likely OOM killer or external SIGKILL
|
|
5
|
+
**Status**: ✅ Fixed (3-layer protection)
|
|
6
|
+
|
|
7
|
+
## Problem
|
|
8
|
+
|
|
9
|
+
Background runner process (PID 512666) died ~59 seconds after spawning. Workers spawned but produced zero output. The stale reconciler detected the dead PID and repaired the run.
|
|
10
|
+
|
|
11
|
+
Key evidence:
|
|
12
|
+
- `async.started` event written (runner started successfully)
|
|
13
|
+
- Workers spawned (PIDs 512751, 512759)
|
|
14
|
+
- No output from workers (Bug #10: MINIMAX_API_KEY stripped)
|
|
15
|
+
- `pid_dead` found at 14:19:04 (~59s after start)
|
|
16
|
+
- No `async.failed` event written (SIGKILL can't be caught)
|
|
17
|
+
- dmesg OOM was for different PID (vitest PID 2910570)
|
|
18
|
+
|
|
19
|
+
## Fixes Applied (3 Layers)
|
|
20
|
+
|
|
21
|
+
### Layer 1: Heartbeat Mechanism (prevents false repairs)
|
|
22
|
+
|
|
23
|
+
Files: `src/runtime/background-runner.ts`, `src/runtime/stale-reconciler.ts`
|
|
24
|
+
|
|
25
|
+
The background runner writes a `heartbeat.json` file every 15 seconds with PID, timestamp, and memory usage. The stale reconciler checks the heartbeat before declaring a PID dead:
|
|
26
|
+
- If heartbeat is < 5 minutes old → treat as alive (don't repair)
|
|
27
|
+
- If heartbeat is > 5 minutes old AND PID is dead → repair
|
|
28
|
+
|
|
29
|
+
This prevents the stale reconciler from false-positive repairs when the runner was killed by SIGKILL.
|
|
30
|
+
|
|
31
|
+
### Layer 2: Memory Limit (prevents OOM kills)
|
|
32
|
+
|
|
33
|
+
File: `src/runtime/async-runner.ts`
|
|
34
|
+
|
|
35
|
+
Added `--max-old-space-size=512` to the background runner's Node.js arguments:
|
|
36
|
+
```typescript
|
|
37
|
+
const memoryLimit = "--max-old-space-size=512";
|
|
38
|
+
// args: [memoryLimit, "--import", loaderPath, runnerPath, ...]
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
This limits V8 heap to 512MB (generous for the lightweight runner). Without this limit, Node.js defaults to ~1.5GB on 64-bit, which combined with jiti compilation and child processes can exhaust system memory and trigger the OOM killer.
|
|
42
|
+
|
|
43
|
+
### Layer 3: Signal Handlers + Memory Monitoring (diagnostic)
|
|
44
|
+
|
|
45
|
+
File: `src/runtime/background-runner.ts`
|
|
46
|
+
|
|
47
|
+
1. **SIGTERM/SIGINT handlers**: Log `async.failed` event before exiting. This distinguishes:
|
|
48
|
+
- OOM/SIGKILL: no event written (can't catch SIGKILL)
|
|
49
|
+
- SIGTERM/SIGINT: event written with signal name
|
|
50
|
+
- Normal exit: `async.completed` event
|
|
51
|
+
|
|
52
|
+
2. **Memory monitoring in heartbeat**: Each heartbeat writes `heapUsedMb` and `rssMb`:
|
|
53
|
+
```json
|
|
54
|
+
{ "pid": 12345, "at": 1718794685321, "runId": "...", "memory": { "heapUsedMb": 87, "rssMb": 145 } }
|
|
55
|
+
```
|
|
56
|
+
This allows post-mortem analysis — if `rssMb` was climbing before death, OOM is confirmed.
|
|
57
|
+
|
|
58
|
+
## Files Modified
|
|
59
|
+
|
|
60
|
+
| File | Change |
|
|
61
|
+
|---|---|
|
|
62
|
+
| `src/runtime/async-runner.ts` | Added `--max-old-space-size=512` to background runner args |
|
|
63
|
+
| `src/runtime/background-runner.ts` | Heartbeat with memory stats; SIGTERM/SIGINT handlers |
|
|
64
|
+
| `src/runtime/stale-reconciler.ts` | Heartbeat-aware PID liveness check (from earlier fix) |
|
|
65
|
+
|
|
66
|
+
## Verification
|
|
67
|
+
|
|
68
|
+
```bash
|
|
69
|
+
cd /home/bom/source/my_pi/pi-crew
|
|
70
|
+
npx tsc --noEmit # No errors
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
## Diagnosis Flow (Post-Fix)
|
|
74
|
+
|
|
75
|
+
When a background runner dies:
|
|
76
|
+
|
|
77
|
+
1. Check events for `async.failed` with `signal: "SIGTERM"/"SIGINT"` → signal kill (Layer 3)
|
|
78
|
+
2. Check `heartbeat.json` for `memory.rssMb` → if climbing → OOM (Layer 2 prevents)
|
|
79
|
+
3. Check `heartbeat.json` for `at` timestamp → if fresh → false-positive repair prevented (Layer 1)
|
|
80
|
+
4. No `async.failed` event AND stale heartbeat → SIGKILL/OOM (uncatchable)
|
|
81
|
+
|
|
82
|
+
## Remaining Risk
|
|
83
|
+
|
|
84
|
+
If the Linux OOM killer sends SIGKILL (uncatchable), the background runner will die without writing any event. The heartbeat mechanism prevents false repairs for 5 minutes, giving time for investigation. The memory limit (`--max-old-space-size=512`) significantly reduces the chance of triggering the OOM killer by keeping the runner's memory footprint small.
|
|
@@ -0,0 +1,82 @@
|
|
|
1
|
+
# Bug #14 Fix: Infinite Retry Loop — needs_attention Task Re-scheduled
|
|
2
|
+
|
|
3
|
+
**Date**: 2026-05-20
|
|
4
|
+
**Root Cause**: `needs_attention` task status had `queue: "blocked"` instead of `queue: "done"` in task graph
|
|
5
|
+
**Status**: ✅ Fixed
|
|
6
|
+
|
|
7
|
+
## Problem
|
|
8
|
+
|
|
9
|
+
When a task ended with `needs_attention` status (worker completed without calling `submit_result`), the team-runner kept re-scheduling the task in an infinite loop.
|
|
10
|
+
|
|
11
|
+
### Symptoms
|
|
12
|
+
- 942 `task.started` events for `01_explore`
|
|
13
|
+
- 1882 `task.needs_attention` events
|
|
14
|
+
- Task `01_explore` had `status: "running"` but `finishedAt: "2026-05-20T01:57:46.097Z"` (finished but status was "running")
|
|
15
|
+
- `queue: "blocked"` in task graph for `needs_attention` tasks
|
|
16
|
+
- No `task.completed` events
|
|
17
|
+
|
|
18
|
+
### Run Data (from `team_20260520015649_42079220e6ef2860`)
|
|
19
|
+
```
|
|
20
|
+
Task 01_explore:
|
|
21
|
+
status: running
|
|
22
|
+
stepId: explore
|
|
23
|
+
graph.queue: blocked
|
|
24
|
+
dependsOn: []
|
|
25
|
+
finishedAt: 2026-05-20T01:57:46.114Z
|
|
26
|
+
```
|
|
27
|
+
|
|
28
|
+
## Root Cause
|
|
29
|
+
|
|
30
|
+
In `src/runtime/task-graph-scheduler.ts`, the `withQueue()` function assigned `queue: "done"` only for `"completed"` and `"skipped"` statuses, but NOT for `"needs_attention"`:
|
|
31
|
+
|
|
32
|
+
```typescript
|
|
33
|
+
// BEFORE (bug):
|
|
34
|
+
if (task.status === "completed" || task.status === "skipped") {
|
|
35
|
+
return { ...task, graph: { ...task.graph, queue: "done" } };
|
|
36
|
+
}
|
|
37
|
+
return { ...task, graph: { ...task.graph, queue: "blocked" } };
|
|
38
|
+
```
|
|
39
|
+
|
|
40
|
+
This meant `needs_attention` tasks got `queue: "blocked"`, making them appear in `taskGraphSnapshot(tasks).ready` as "ready" even though they were terminal.
|
|
41
|
+
|
|
42
|
+
The team-runner's main loop:
|
|
43
|
+
1. Computed `effectiveReady` from `taskGraphSnapshot(tasks).ready`
|
|
44
|
+
2. `effectiveReady` included tasks with `queue: "blocked"` (because they had `needs_attention` status)
|
|
45
|
+
3. These tasks were added to `readyBatch` and re-spawned
|
|
46
|
+
|
|
47
|
+
## Fix
|
|
48
|
+
|
|
49
|
+
Added `needs_attention` to the terminal status check in `withQueue()`:
|
|
50
|
+
|
|
51
|
+
```typescript
|
|
52
|
+
// AFTER (fix):
|
|
53
|
+
if (task.status === "completed" || task.status === "skipped" || task.status === "needs_attention") {
|
|
54
|
+
return { ...task, graph: { ...task.graph, queue: "done" } };
|
|
55
|
+
}
|
|
56
|
+
```
|
|
57
|
+
|
|
58
|
+
## File Changed
|
|
59
|
+
|
|
60
|
+
| File | Change |
|
|
61
|
+
|---|---|
|
|
62
|
+
| `src/runtime/task-graph-scheduler.ts` | Added `needs_attention` to terminal queue assignment |
|
|
63
|
+
|
|
64
|
+
## Verification
|
|
65
|
+
|
|
66
|
+
```bash
|
|
67
|
+
cd /home/bom/source/my_pi/pi-crew
|
|
68
|
+
npx tsc --noEmit # No errors
|
|
69
|
+
```
|
|
70
|
+
|
|
71
|
+
After fix, `needs_attention` tasks have `queue: "done"` in task graph, so they won't be re-scheduled.
|
|
72
|
+
|
|
73
|
+
## Related Behavior
|
|
74
|
+
|
|
75
|
+
- `needs_attention` already correctly blocks phase advancement in `team-runner.ts` (`terminalStatuses` includes `needs_attention`)
|
|
76
|
+
- `needs_attention` correctly does NOT satisfy DAG dependencies (only `completed` does)
|
|
77
|
+
- Phase advancement checks `terminalStatuses.has(task.status)` which includes `needs_attention`
|
|
78
|
+
|
|
79
|
+
This ensures:
|
|
80
|
+
1. `needs_attention` tasks are treated as terminal (don't block phases)
|
|
81
|
+
2. `needs_attention` tasks have `queue: "done"` (don't get re-scheduled)
|
|
82
|
+
3. Downstream tasks with `dependsOn` on a `needs_attention` task correctly stay blocked
|
|
@@ -0,0 +1,65 @@
|
|
|
1
|
+
# Bug #15: Background Runner Receives SIGTERM ~3s After Spawn
|
|
2
|
+
|
|
3
|
+
## Status: ✅ Fixed — Disabled async by default
|
|
4
|
+
|
|
5
|
+
## Fix Applied: Disable Async by Default
|
|
6
|
+
**File:** `src/extension/team-tool/run.ts`
|
|
7
|
+
```typescript
|
|
8
|
+
// Background runners are disabled by default because Pi infrastructure sends SIGTERM to
|
|
9
|
+
// async children ~3s after spawn (Bug #15). Set to true to enable background runs.
|
|
10
|
+
// const runAsync = params.async ?? loadedConfig.config.asyncByDefault ?? false;
|
|
11
|
+
const runAsync = false; // TEMP: always false until SIGTERM issue is fixed
|
|
12
|
+
```
|
|
13
|
+
|
|
14
|
+
**Verification (2026-05-20):** Full 4-task team run `team_20260520091127_8bcd4ca6f9fa84f5` completed successfully in ~4 minutes using foreground blocking mode. 01_explore, 02_plan, 03_execute, 04_verify all completed with `status=completed`. No SIGTERM because no background runner was spawned.
|
|
15
|
+
|
|
16
|
+
**Root Cause:** Pi CLI infrastructure sends SIGTERM to async background runners ~3s after spawn. `setsid:true` does not work in Node.js 22.22.0. No fix available from pi-crew side — requires Pi infrastructure change or Node.js fix.
|
|
17
|
+
|
|
18
|
+
## Symptom (Historical)
|
|
19
|
+
All async background runners die with SIGTERM approximately 3 seconds after spawning, regardless of workload or configuration. The SIGTERM comes from the Pi CLI process.
|
|
20
|
+
|
|
21
|
+
## Timeline (from events.jsonl)
|
|
22
|
+
```
|
|
23
|
+
06:56:31.549 async.spawned (background runner created, pid noted)
|
|
24
|
+
06:56:32.466 async.started (background runner main() begins executing)
|
|
25
|
+
06:56:32.691 worker.spawned 01_explore (child Pi worker spawned)
|
|
26
|
+
06:56:32.706 worker.spawned 02_plan (child Pi worker spawned)
|
|
27
|
+
06:56:35.713 async.failed SIGTERM received from pi process (ppid=1509889)
|
|
28
|
+
→ background runner exits with code 143
|
|
29
|
+
```
|
|
30
|
+
|
|
31
|
+
## Root Cause Analysis
|
|
32
|
+
|
|
33
|
+
### Pi Infrastructure Cleanup — CONFIRMED
|
|
34
|
+
Pi CLI infrastructure sends SIGTERM to direct children when the tool call that spawned them returns. This is normal cleanup behavior for detached processes.
|
|
35
|
+
|
|
36
|
+
**Evidence:**
|
|
37
|
+
- SIGTERM sender PID = Pi process (1509889) — exactly where `crew_agent` tool runs
|
|
38
|
+
- SIGTERM arrives at a consistent ~3s interval after `async.started`
|
|
39
|
+
- No OOM, no dmesg entries, no pi-crew internal kill(SIGTERM) calls exist
|
|
40
|
+
- Pi source has SIGTERM handling for subprocess cleanup
|
|
41
|
+
|
|
42
|
+
### setsid Bug in Node.js 22.22.0 — Confirmed
|
|
43
|
+
`setsid: true` does not actually create a new session in Node.js 22.22.0. The background runner remains in the Pi process's process group (PGID=1509889). This was confirmed by direct testing showing that `detached: true` + `setsid: true` does NOT result in the child having its own PGID.
|
|
44
|
+
|
|
45
|
+
### Key Discovery: Orphaned Workers Survive
|
|
46
|
+
After SIGTERM kills the background runner, orphaned workers continue running:
|
|
47
|
+
```
|
|
48
|
+
07:03:39.148 SIGTERM → background runner DIES but orphaned workers CONTINUE
|
|
49
|
+
07:08:39 stale-reconciler detects dead heartbeat → cancel orphaned tasks
|
|
50
|
+
```
|
|
51
|
+
|
|
52
|
+
This proves:
|
|
53
|
+
1. Child workers are properly detached (PPID=1 or independent PG)
|
|
54
|
+
2. Pi only sends SIGTERM to **direct children** (background runner), not grandchildren
|
|
55
|
+
3. The issue is that background-runner is a direct child and gets killed
|
|
56
|
+
|
|
57
|
+
## Files Modified
|
|
58
|
+
```
|
|
59
|
+
src/extension/team-tool/run.ts — runAsync = false (disabled async by default)
|
|
60
|
+
```
|
|
61
|
+
|
|
62
|
+
## Future Considerations
|
|
63
|
+
1. **Re-enable async** when Pi infrastructure is fixed or Node.js setsid works correctly
|
|
64
|
+
2. **Shell wrapper approach** — spawn via `/bin/sh -c 'exec setsid node ...'` as intermediate process
|
|
65
|
+
3. **Report to Pi maintainers** — the Pi infrastructure's cleanup behavior may need adjustment for long-running async workers
|
|
@@ -0,0 +1,66 @@
|
|
|
1
|
+
# Bug #17: Background Runner Dies at ~35s — Root Cause
|
|
2
|
+
|
|
3
|
+
## Summary
|
|
4
|
+
|
|
5
|
+
Background runners (async mode) were dying ~35 seconds after spawn — before workers could complete. This was NOT caused by Pi's `killTrackedDetachedChildren` mechanism or any external kill signal.
|
|
6
|
+
|
|
7
|
+
## Root Cause
|
|
8
|
+
|
|
9
|
+
**`session_shutdown` fires frequently during normal operation**, not just on exit.
|
|
10
|
+
|
|
11
|
+
Pi's agent-session fires `session_shutdown` for:
|
|
12
|
+
- `session_fork` — when a subagent/fork session starts
|
|
13
|
+
- `session_resume` — when resuming a previous session
|
|
14
|
+
- `session_new` — when creating a new session
|
|
15
|
+
|
|
16
|
+
Every time `session_shutdown` fires, pi-crew's `cleanupRuntime()` was called:
|
|
17
|
+
|
|
18
|
+
```typescript
|
|
19
|
+
// register.ts - OLD CODE (BUG #17)
|
|
20
|
+
pi.on("session_shutdown", () => cleanupRuntime());
|
|
21
|
+
|
|
22
|
+
// Inside cleanupRuntime():
|
|
23
|
+
for (const manifest of manifestCache.list(50)) {
|
|
24
|
+
if (manifest.async?.pid !== undefined && checkProcessLiveness(manifest.async.pid).alive) {
|
|
25
|
+
killProcessPid(manifest.async.pid); // ← THIS KILLED THE RUNNER
|
|
26
|
+
}
|
|
27
|
+
}
|
|
28
|
+
```
|
|
29
|
+
|
|
30
|
+
Since `session_shutdown` fires every 30-35 seconds during normal operation (whenever the agent session forks/resumes), this kill loop terminated the background runner almost immediately.
|
|
31
|
+
|
|
32
|
+
## Why setsid+detached didn't fully help
|
|
33
|
+
|
|
34
|
+
`detached: true` + `setsid: true` gives the runner its own session and process group, making it immune to terminal signals and process group kills. However, `killProcessPid(pid)` sends `SIGKILL` directly to the specific PID — and since the kill loop was actively reading the manifest cache, it knew the exact runner PID to kill.
|
|
35
|
+
|
|
36
|
+
## The Fix
|
|
37
|
+
|
|
38
|
+
Comment out the async runner kill loop in `cleanupRuntime()`:
|
|
39
|
+
|
|
40
|
+
```typescript
|
|
41
|
+
// register.ts - FIXED CODE
|
|
42
|
+
// NOTE: Background runners are designed to outlive the Pi session.
|
|
43
|
+
// Do NOT kill them on session_shutdown — they manage their own lifecycle.
|
|
44
|
+
// (The kill loop was commented out here)
|
|
45
|
+
```
|
|
46
|
+
|
|
47
|
+
Async runners are designed to outlive the parent Pi session. They self-terminate when:
|
|
48
|
+
- The run completes (async.completed event written)
|
|
49
|
+
- The run fails (async.failed event written)
|
|
50
|
+
- The stale reconciler detects they're truly dead
|
|
51
|
+
- They detect their parent (Pi) is gone via parent-guard
|
|
52
|
+
|
|
53
|
+
## Files Changed
|
|
54
|
+
|
|
55
|
+
- `src/extension/register.ts` — commented out killAsync loop in cleanupRuntime
|
|
56
|
+
|
|
57
|
+
## Verification
|
|
58
|
+
|
|
59
|
+
After the fix, a background runner (PID 55515) with systemd-run survived 160+ seconds and completed its lifecycle normally (timed out at 5 min due to unresponsive model — different issue).
|
|
60
|
+
|
|
61
|
+
## Related Investigation
|
|
62
|
+
|
|
63
|
+
- `systemd-run --user` was tested as an alternative spawn method — it works and provides additional isolation
|
|
64
|
+
- `setsid: true` is confirmed working in Node.js 22.22.0 (creates own session/PGID)
|
|
65
|
+
- The strace diagnostic was crucial: showed PID 20654 (Pi bash child) calling `kill(-20533, SIGTERM/SIGKILL)` on all old runner PIDs — this was Pi's cleanup hitting the strace wrapper's PID, not the runner itself
|
|
66
|
+
- Final spawn uses `spawn()` with `detached: true`, `setsid: true as any`, stdio ignored, unref'd — clean and minimal
|
|
@@ -0,0 +1,28 @@
|
|
|
1
|
+
# Bug #17: Background Runner Killed by Pi's SIGKILL After Tool Execution
|
|
2
|
+
|
|
3
|
+
## Status: ✅ Fixed — Direct spawn with SIGTERM ignore-all
|
|
4
|
+
|
|
5
|
+
## Symptom
|
|
6
|
+
Background runner process dies silently ~7 seconds after spawning. No error events, no catch blocks fire. The process simply disappears.
|
|
7
|
+
|
|
8
|
+
## Root Cause
|
|
9
|
+
Pi's infrastructure sends SIGTERM to child processes when tool execution completes. The background runner (spawned by the team tool) receives these SIGTERMs and, with the original signal handler that checked `isPiProcess()`, would exit on non-Pi SIGTERMs. Even after fixing the SIGTERM handler to ignore all, the runner was being killed during jiti compilation of `team-runner.ts` — likely due to the runner being in Pi's process group.
|
|
10
|
+
|
|
11
|
+
## Fix Applied
|
|
12
|
+
**Direct spawn with `setsid: true` and ignore-all SIGTERM handler:**
|
|
13
|
+
|
|
14
|
+
1. **`async-runner.ts`**: Spawn runner directly (no double-fork detacher) with `detached: true, setsid: true` — runner gets its own session/process group
|
|
15
|
+
2. **`background-runner.ts`**: SIGTERM handler ignores ALL SIGTERMs (removed `isPiProcess()` check), since with setsid the runner is its own session leader
|
|
16
|
+
3. **`background-runner.ts`**: Removed `process.exit(1)` from unhandled rejection guard — now just sets exitCode and continues
|
|
17
|
+
4. **`child-pi.ts`**: Added `setsid: true` to child Pi spawn options — workers get their own PGIDs, preventing cascade kills
|
|
18
|
+
|
|
19
|
+
## What Didn't Work
|
|
20
|
+
- **Double-fork (detacher)**: Created an intermediate "detacher" process that spawned the runner then exited. The runner died at t+7s even with detacher alive. Root cause unclear — possibly fd inheritance issue or jiti compilation interaction.
|
|
21
|
+
- **`setsid` only**: With setsid but without ignore-all SIGTERM handler, runner received SIGTERMs and exited.
|
|
22
|
+
- **Ignore SIGTERM from Pi only**: With `isPiProcess()` check, SIGTERMs from systemd (after double-fork reparenting) weren't ignored.
|
|
23
|
+
|
|
24
|
+
## Verification
|
|
25
|
+
- Direct spawn runner alive after 2+ minutes ✅
|
|
26
|
+
- Workers spawned and running ✅
|
|
27
|
+
- SIGTERMs properly ignored ✅
|
|
28
|
+
- `sleep 60` via double-fork survived (rules out Pi directly killing random PIDs) ✅
|
|
@@ -0,0 +1,61 @@
|
|
|
1
|
+
# Bug #18 Fix: Child Pi Workers Hang on stdin with setsid+detached
|
|
2
|
+
|
|
3
|
+
## Root Cause
|
|
4
|
+
|
|
5
|
+
When `buildChildPiSpawnOptions()` in `child-pi.ts` used `stdio: ["pipe", "pipe", "pipe"]` with `detached: true` and `setsid: true`, the child process would hang indefinitely with:
|
|
6
|
+
- `toolUses: 0`
|
|
7
|
+
- `jsonEvents: 0`
|
|
8
|
+
- No stdout output ever received
|
|
9
|
+
|
|
10
|
+
The issue is that `stdin: "pipe"` creates a readable stream that the child process can block on waiting for input. Even though:
|
|
11
|
+
1. The task is passed via CLI args (`Task: ...`), not stdin
|
|
12
|
+
2. The parent never writes to the child's stdin
|
|
13
|
+
3. `child.stdin?.write()` is only called for "steer" (wrap-up message)
|
|
14
|
+
|
|
15
|
+
The combination of `setsid: true` + `detached: true` + `stdio: ["pipe", ...]` creates a state where the child's stdin pipe can hang/block, preventing the child from processing.
|
|
16
|
+
|
|
17
|
+
## The Fix
|
|
18
|
+
|
|
19
|
+
Changed `stdio: ["pipe", "pipe", "pipe"]` to `stdio: ["ignore", "pipe", "pipe"]` in `buildChildPiSpawnOptions()`.
|
|
20
|
+
|
|
21
|
+
With `stdin: "ignore"`:
|
|
22
|
+
- No stdin pipe is created at all
|
|
23
|
+
- The child immediately gets EOF on stdin (from /dev/null)
|
|
24
|
+
- Child never blocks waiting for stdin
|
|
25
|
+
- Task is delivered via CLI args as expected
|
|
26
|
+
|
|
27
|
+
## File Changed
|
|
28
|
+
|
|
29
|
+
**`/home/bom/source/my_pi/pi-crew/src/runtime/child-pi.ts`** (line ~199):
|
|
30
|
+
|
|
31
|
+
```typescript
|
|
32
|
+
return {
|
|
33
|
+
cwd,
|
|
34
|
+
env: { ...filteredEnv, PI_CREW_PARENT_PID: String(process.pid) },
|
|
35
|
+
stdio: ["ignore", "pipe", "pipe"], // stdin=ignore: child doesn't wait for input; task comes via CLI args
|
|
36
|
+
detached: process.platform !== "win32",
|
|
37
|
+
setsid: true,
|
|
38
|
+
windowsHide: true,
|
|
39
|
+
} as SpawnOptions;
|
|
40
|
+
```
|
|
41
|
+
|
|
42
|
+
## Verification
|
|
43
|
+
|
|
44
|
+
Before fix: Workers timed out at 300s with `toolUses: 0, jsonEvents: 0`
|
|
45
|
+
|
|
46
|
+
After fix: Workers actively process tasks, producing output in seconds:
|
|
47
|
+
```
|
|
48
|
+
01_explore: status=running toolUses=203 jsonEvents=1375
|
|
49
|
+
02_plan: status=running toolUses=67 jsonEvents=483
|
|
50
|
+
```
|
|
51
|
+
|
|
52
|
+
The run successfully progresses through phases (explore, plan) with active tool use and JSON event generation.
|
|
53
|
+
|
|
54
|
+
## Test Results
|
|
55
|
+
|
|
56
|
+
- **Bug #18 (worker stdin hang)**: FIXED - workers now produce output immediately
|
|
57
|
+
- **Bug #17 (background runner death at 35s)**: FIXED - runners survive indefinitely
|
|
58
|
+
|
|
59
|
+
Both bugs were related to process/session lifecycle issues but had different root causes:
|
|
60
|
+
- Bug #17: `cleanupRuntime()` killing all async runners on session_shutdown
|
|
61
|
+
- Bug #18: `stdio: ["pipe", ...]` with `setsid`+`detached` causing stdin block
|
|
@@ -0,0 +1,52 @@
|
|
|
1
|
+
# Bug #19: Phantom Runs from Temp Workspaces
|
|
2
|
+
|
|
3
|
+
## Problem
|
|
4
|
+
Runs in `/tmp/pi-crew-*/` directories were appearing in production dashboard as "running" even after processes died. This caused confusion and UI clutter with "9 running" when there was actually only 1.
|
|
5
|
+
|
|
6
|
+
## Root Cause
|
|
7
|
+
1. Test suite (npm test) creates runs in `/tmp/` with live-session or scaffold runtime
|
|
8
|
+
2. These runs don't have `async.pid` (child-process runs have it, live-session/scaffold don't)
|
|
9
|
+
3. When tests complete/crash, entries remain in `active-run-index.json`
|
|
10
|
+
4. `collectRuns()` scans temp dirs and shows stale manifests
|
|
11
|
+
5. `activeRunEntries()` was not checking timestamp for non-async runs
|
|
12
|
+
|
|
13
|
+
## Fix
|
|
14
|
+
|
|
15
|
+
### 1. `src/extension/run-index.ts` - collectRuns()
|
|
16
|
+
Added detection for temp directories and PID alive check:
|
|
17
|
+
```typescript
|
|
18
|
+
const tempDirs = [os.tmpdir(), "/var/tmp", "/tmp"];
|
|
19
|
+
const isTempRoot = tempDirs.some((t) => root.startsWith(t + path.sep));
|
|
20
|
+
|
|
21
|
+
// For runs in temp dirs, verify background process is alive
|
|
22
|
+
if (isTempRoot && (manifest.status === "running" || ...)) {
|
|
23
|
+
const asyncPidPath = path.join(path.dirname(manifest.stateRoot), "async.pid");
|
|
24
|
+
// ... check PID alive
|
|
25
|
+
}
|
|
26
|
+
```
|
|
27
|
+
|
|
28
|
+
### 2. `src/state/active-run-registry.ts` - filterAliveEntries() + activeRunEntries()
|
|
29
|
+
Added 30-minute timeout for non-async runs:
|
|
30
|
+
```typescript
|
|
31
|
+
// 2.19 — Stale non-async run: live-session/scaffold runs older than 30 min
|
|
32
|
+
if (!raw.async) {
|
|
33
|
+
const updatedAt = typeof raw.updatedAt === 'string' ? Date.parse(raw.updatedAt) : NaN;
|
|
34
|
+
if (Number.isFinite(updatedAt) && Date.now() - updatedAt > 30 * 60 * 1000) return false;
|
|
35
|
+
}
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
## Files Changed
|
|
39
|
+
- `src/extension/run-index.ts`: Added `os` import and temp root check in collectRuns()
|
|
40
|
+
- `src/state/active-run-registry.ts`: Added 30-min timeout for non-async runs in both filterAliveEntries() and activeRunEntries()
|
|
41
|
+
|
|
42
|
+
## Verification
|
|
43
|
+
After fix:
|
|
44
|
+
- Active-run-index is cleared of stale entries
|
|
45
|
+
- Runs older than 30 min with no async.pid are filtered out
|
|
46
|
+
- Only valid runs with alive PIDs or recent timestamps are shown
|
|
47
|
+
|
|
48
|
+
## Why Two Places?
|
|
49
|
+
- `run-index.ts` - handles scanning runs from disk (collectRuns)
|
|
50
|
+
- `active-run-registry.ts` - handles the in-memory registry of active runs
|
|
51
|
+
|
|
52
|
+
Both needed the fix because the dashboard uses both sources.
|