pi-crew 0.2.20 → 0.2.21

Files changed (93)
  1. package/CHANGELOG.md +23 -10
  2. package/README.md +4 -2
  3. package/docs/PROJECT_REVIEW.md +271 -0
  4. package/docs/PROJECT_REVIEW_FIXES.md +343 -0
  5. package/docs/PROJECT_REVIEW_ROUND4.md +156 -0
  6. package/docs/PROJECT_REVIEW_ROUND5.md +86 -0
  7. package/docs/fixes/BATCH_A_H1_H2.md +86 -0
  8. package/docs/fixes/bug-006-foreground-cancel-concurrent.md +78 -0
  9. package/docs/fixes/bug-007-async-notifier-stale-ctx.md +112 -0
  10. package/docs/fixes/bug-008-child-process-silent-timeout.md +100 -0
  11. package/docs/fixes/bug-009-executor-yield-limit-needs-attention.md +75 -0
  12. package/docs/fixes/bug-010-child-process-api-key-filtered.md +109 -0
  13. package/docs/fixes/bug-011-spawn-pi-enoent.md +92 -0
  14. package/docs/fixes/bug-012-essential-env-stripped.md +89 -0
  15. package/docs/fixes/bug-013-background-runner-death.md +84 -0
  16. package/docs/fixes/bug-014-infinite-retry-loop-needs-attention.md +82 -0
  17. package/docs/fixes/bug-015-background-runner-sigterm.md +65 -0
  18. package/docs/fixes/bug-017-background-runner-session-shutdown.md +66 -0
  19. package/docs/fixes/bug-017-background-runner-sigkill-double-fork.md +28 -0
  20. package/docs/fixes/bug-018-child-pi-worker-stdin-hang.md +61 -0
  21. package/docs/fixes/bug-019-phantom-runs-temp-workspace.md +52 -0
  22. package/docs/pi-crew-bugs.md +954 -0
  23. package/docs/pi-crew-investigation-report.md +411 -0
  24. package/docs/pi-crew-test-final.md +120 -0
  25. package/docs/pi-crew-test-results.md +260 -0
  26. package/docs/pi-crew-test-round2.md +136 -0
  27. package/docs/pi-crew-test-round4.md +100 -0
  28. package/docs/pi-crew-test-round5.md +70 -0
  29. package/docs/pi-crew-test-round6.md +110 -0
  30. package/docs/usage.md +14 -0
  31. package/package.json +4 -2
  32. package/src/adapters/export-util.ts +12 -6
  33. package/src/agents/agent-config.ts +2 -0
  34. package/src/config/defaults.ts +1 -1
  35. package/src/config/markers.ts +22 -17
  36. package/src/config/resilient-parser.ts +1 -1
  37. package/src/extension/async-notifier.ts +4 -2
  38. package/src/extension/management.ts +52 -0
  39. package/src/extension/register.ts +47 -10
  40. package/src/extension/run-index.ts +20 -2
  41. package/src/extension/run-maintenance.ts +2 -2
  42. package/src/extension/team-tool/parallel-dispatch.ts +1 -1
  43. package/src/extension/team-tool/run.ts +3 -6
  44. package/src/extension/team-tool.ts +67 -11
  45. package/src/observability/event-to-metric.ts +2 -1
  46. package/src/runtime/async-runner.ts +42 -34
  47. package/src/runtime/background-runner.ts +165 -7
  48. package/src/runtime/child-pi.ts +111 -18
  49. package/src/runtime/code-summary.ts +1 -1
  50. package/src/runtime/crash-recovery.ts +1 -1
  51. package/src/runtime/crew-agent-runtime.ts +2 -1
  52. package/src/runtime/heartbeat-watcher.ts +4 -0
  53. package/src/runtime/live-agent-manager.ts +1 -1
  54. package/src/runtime/live-session-runtime.ts +2 -1
  55. package/src/runtime/manifest-cache.ts +2 -2
  56. package/src/runtime/model-fallback.ts +2 -1
  57. package/src/runtime/phase-progress.ts +1 -1
  58. package/src/runtime/pi-args.ts +3 -1
  59. package/src/runtime/pi-spawn.ts +6 -0
  60. package/src/runtime/prose-compressor.ts +1 -1
  61. package/src/runtime/result-extractor.ts +0 -1
  62. package/src/runtime/retry-executor.ts +1 -1
  63. package/src/runtime/runtime-resolver.ts +1 -1
  64. package/src/runtime/skill-instructions.ts +0 -1
  65. package/src/runtime/stale-reconciler.ts +30 -3
  66. package/src/runtime/subagent-manager.ts +2 -0
  67. package/src/runtime/task-display.ts +1 -1
  68. package/src/runtime/task-graph-scheduler.ts +1 -1
  69. package/src/runtime/task-runner/tail-read.ts +26 -0
  70. package/src/runtime/task-runner.ts +1007 -383
  71. package/src/runtime/team-runner.ts +9 -5
  72. package/src/runtime/worker-startup.ts +3 -1
  73. package/src/schema/team-tool-schema.ts +2 -1
  74. package/src/state/active-run-registry.ts +8 -2
  75. package/src/state/atomic-write.ts +17 -0
  76. package/src/state/contracts.ts +5 -2
  77. package/src/state/event-log-rotation.ts +118 -31
  78. package/src/state/event-log.ts +33 -5
  79. package/src/state/event-reconstructor.ts +4 -2
  80. package/src/state/mailbox.ts +5 -1
  81. package/src/state/schedule.ts +146 -0
  82. package/src/state/types.ts +40 -0
  83. package/src/state/usage.ts +20 -0
  84. package/src/ui/crew-widget.ts +2 -2
  85. package/src/ui/run-event-bus.ts +1 -1
  86. package/src/ui/run-snapshot-cache.ts +2 -1
  87. package/src/ui/snapshot-types.ts +1 -0
  88. package/src/utils/gh-protocol.ts +2 -2
  89. package/src/utils/names.ts +1 -1
  90. package/src/utils/sse-parser.ts +0 -2
  91. package/src/worktree/branch-freshness.ts +1 -1
  92. package/src/worktree/cleanup.ts +54 -14
  93. package/src/worktree/worktree-manager.ts +19 -9
@@ -0,0 +1,954 @@
1
+ # pi-crew v0.2.20 — Bug Report & Fixes
2
+
3
+ **Date:** 2026-05-19
4
+ **Session:** Comprehensive integration test + root cause analysis
5
+ **Environment:** linux/x64, Node v22.22.0, Pi CLI v0.75.3, pi-crew v0.2.20
6
+ **Status:** ✅ 14/14 bugs fixed (commits `de9e8b4` and `5dc794e`)
7
+
8
+ > **All bugs fixed ✅** Source code verified. See [pi-crew-test-final.md](pi-crew-test-final.md) for the end-to-end test results.
9
+
10
+ ---
11
+
12
+ ## Bug #1: Background workers report "heartbeat dead" (actually a MiniMax 429 rate limit)
13
+
14
+ | Field | Value |
15
+ |---|---|
16
+ | **Severity** | 🔴 HIGH |
17
+ | **Status** | ✅ Fixed — 429 now retries with fallback models instead of blocking 300s |
18
+ | **Affected** | All background/async workers |
19
+ | **Symptom** | Workers time out after 300s with "heartbeat dead" and zero output |
20
+
21
+ ### Description
22
+
23
+ When running `team action='run'` with `async=true` or `Agent(run_in_background=true)`, workers spawn successfully (the PID exists) but **time out after 300s** with a generic error:
24
+ ```
25
+ worker.response_timeout: No output for 300000ms
26
+ crew.task.heartbeat_dead: Task 01_assess heartbeat dead.
27
+ ```
28
+
29
+ ### Root cause
30
+
31
+ **Fixed.** Previously, 429 rate-limit errors were not retried because:
32
+ 1. `RETRYABLE_MODEL_FAILURE_PATTERNS` contained `/\b429\b/`, but MiniMax returns `rate_limit_error: usage limit exceeded` (with no explicit 429)
33
+ 2. The 429 was fast-failed in `child-pi.ts onJsonEvent` instead of letting task-runner handle the retry with a fallback
34
+
35
+ ### Fix applied
36
+
37
+ 1. **model-fallback.ts**: Added `/rate_limit_error/i` to `RETRYABLE_MODEL_FAILURE_PATTERNS` so the MiniMax rate-limit error is recognized correctly
38
+ 2. **model-fallback.ts**: Changed `/\b429\b/` → `/rate.?limit/i` to match more formats
39
+ 3. **child-pi.ts**: Removed the 429 fast-fail so task-runner handles the retry via the model fallback chain
40
+
41
+ ### Model fallback chain
42
+
43
+ When the primary model hits a 429:
44
+ 1. Fall back to `fallbackModels` (if configured)
45
+ 2. Fall back to other available models in the system
46
+ 3. If no fallback exists and retries are exhausted → fail with the correct error message
47
+
48
+ **Recommended configuration:** Add `fallbackModels` to the agent config so there are more options when the primary model is rate limited.
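Reviewer sketch (not part of the package diff): the retry decision and fallback order described above could look roughly like the following. `RETRYABLE_MODEL_FAILURE_PATTERNS` and `fallbackModels` are names from the report; `isRetryableModelFailure`, `buildFallbackChain`, `runWithFallback`, and the `attempt` callback are hypothetical helpers introduced only for illustration, and the real `model-fallback.ts` may be structured differently.

```typescript
// Patterns named in the report; the array shape here is an assumption.
const RETRYABLE_MODEL_FAILURE_PATTERNS: RegExp[] = [
  /rate_limit_error/i, // MiniMax-style error type
  /rate.?limit/i, // "rate limit", "rate-limit", "rate_limit", "ratelimit"
];

// Hypothetical helper: true when the error text looks like a retryable failure.
function isRetryableModelFailure(message: string): boolean {
  return RETRYABLE_MODEL_FAILURE_PATTERNS.some((p) => p.test(message));
}

// Hypothetical helper: primary model first, then configured fallbacks,
// then any other available models, without duplicates.
function buildFallbackChain(primary: string, fallbackModels: string[], available: string[]): string[] {
  const seen = new Set<string>();
  const chain: string[] = [];
  for (const model of [primary, ...fallbackModels, ...available]) {
    if (!seen.has(model)) {
      seen.add(model);
      chain.push(model);
    }
  }
  return chain;
}

type FallbackResult = { ok: true; model: string } | { ok: false; error: string };

// Try each model in order. `attempt` returns undefined on success or an error string.
// Stops early on a non-retryable error so the real message is surfaced.
function runWithFallback(chain: string[], attempt: (model: string) => string | undefined): FallbackResult {
  let lastError = "no models available";
  for (const model of chain) {
    const error = attempt(model);
    if (error === undefined) return { ok: true, model };
    lastError = error;
    if (!isRetryableModelFailure(error)) break;
  }
  return { ok: false, error: lastError };
}
```

When every model is rate limited, the last error string is returned unchanged, which matches step 3 above (fail with the correct error message).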
49
+
50
+ ---
51
+
52
+ ## Bug #2: child-pi.ts does not detect the 429 rate limit error and misreports "heartbeat dead"
53
+
54
+ | Field | Value |
55
+ |---|---|
56
+ | **Severity** | 🔴 HIGH |
57
+ | **Status** | New (found while debugging Bug #1) |
58
+ | **Affected** | All child Pi workers |
59
+ | **Symptom** | Worker reports a generic "No output for 300000ms" instead of "Provider rate limit: 429" |
60
+
61
+ ### Description
62
+
63
+ The Pi CLI emits very clear JSON events for 429 errors:
64
+ ```json
65
+ {"type":"turn_end","message":{"stopReason":"error","errorMessage":"429 {\"type\":\"error\",\"error\":{\"type\":\"rate_limit_error\"...}}"}}
66
+ ```
67
+
68
+ But `child-pi.ts` **does not parse error events**. It only looks at:
69
+ - `isFinalAssistantEvent()`, to trigger the final drain
70
+ - `turn_end`, to count turns for turn limiting
71
+
72
+ Result: child-pi sees output (JSON events) and **restarts the heartbeat timer**, but **does not recognize it as an error**. Pi blocks after 3 retries → 300s heartbeat timeout → generic error message.
73
+
74
+ ### Code location
75
+
76
+ `/home/bom/source/my_pi/pi-crew/src/runtime/child-pi.ts`, line ~394:
77
+ ```typescript
78
+ onJsonEvent: (event) => {
79
+ restartNoResponseTimer();
80
+ // Turn-count-based steering: only counts turns, does NOT check errors
81
+ if (event && typeof event === "object" && !Array.isArray(event)) {
82
+ const obj = event as Record<string, unknown>;
83
+ if (obj.type === "turn_end") {
84
+ turnCount += 1;
85
+ // ... turn limit logic only ...
86
+ }
87
+ }
88
+ // MISSING: detect provider errors (429, auth, etc.)
89
+ }
90
+ ```
91
+
92
+ ### Fix
93
+
94
+ Add provider error detection in `onJsonEvent`:
95
+ ```typescript
96
+ let providerError: string | undefined;
97
+
98
+ // In onJsonEvent:
99
+ if (obj.type === "turn_end" && obj.message?.stopReason === "error") {
100
+ const errMsg = obj.message?.errorMessage || "";
101
+ if (errMsg && !providerError) providerError = errMsg;
102
+ // Fast-fail on rate limit — don't wait 300s
103
+ if (/429|rate.?limit/i.test(errMsg)) {
104
+ settle({ exitCode: 1, stdout, stderr: `Provider rate limit: ${errMsg.slice(0, 200)}` });
105
+ }
106
+ }
107
+ ```
108
+
109
+ ### Impact
110
+
111
+ This fix changes the error message from:
112
+ ```
113
+ ❌ "Child Pi produced no new output for 300000ms; process was terminated as unresponsive."
114
+ ```
115
+ To:
116
+ ```
117
+ ✅ "Provider rate limit: 429 rate_limit_error: usage limit exceeded, resets at 2026-05-19T05:00:00Z"
118
+ ```
119
+
120
+ And it **fails fast** instead of waiting 300s.
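Reviewer sketch (not part of the package diff): the proposed snippet reads `obj.message?.stopReason` from a `Record<string, unknown>`, which would not type-check under strict TypeScript. A type-safe way to pull the provider error out of a `turn_end` event might look like this. The event shape is taken from the sample JSON in the report; `isRecord`, `extractProviderError`, and `isRateLimitError` are hypothetical names.

```typescript
// Narrow an unknown value to a plain object (not null, not an array).
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Return the provider error message carried by a turn_end event, if any.
function extractProviderError(event: unknown): string | undefined {
  if (!isRecord(event) || event["type"] !== "turn_end") return undefined;
  const message = event["message"];
  if (!isRecord(message) || message["stopReason"] !== "error") return undefined;
  const errorMessage = message["errorMessage"];
  return typeof errorMessage === "string" && errorMessage.length > 0 ? errorMessage : undefined;
}

// True when the message looks like a rate limit. The word boundary keeps
// "429" from matching inside longer numbers.
function isRateLimitError(errorMessage: string): boolean {
  return /\b429\b|rate.?limit/i.test(errorMessage);
}
```

Whether the caller then fails fast or hands the error to the retry layer is a separate decision; this helper only surfaces the message.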
121
+
122
+ ---
123
+
124
+ ## Bug #3: background.log is useless (does not capture worker output)
125
+
126
+ | Field | Value |
127
+ |---|---|
128
+ | **Severity** | 🟠 MEDIUM |
129
+ | **Status** | New (found while debugging Bug #1) |
129
+ | **Affected** | Debugging experience for all background runs |
130
+ | **Symptom** | background.log contains only one line: `[pi-crew] background loader=jiti` |
132
+
133
+ ### Description
133
+
134
+ When a background worker fails, the log file at `.crew/state/runs/<id>/background.log` contains only:
136
+ ```
137
+ [pi-crew] background loader=jiti
138
+ ```
139
+
140
+ It does not contain:
141
+ - Worker stdout/stderr
142
+ - Error messages
143
+ - Provider responses
144
+ - Exit codes
145
+
146
+ ### Cause
147
+
148
+ `async-runner.ts` line 130-145:
149
+ ```typescript
150
+ const logFd = fs.openSync(logPath, "a");
151
+ // ...
152
+ const child = spawn(process.execPath, command.args, buildBackgroundSpawnOptions(manifest, logFd));
153
+ ```
154
+
155
+ `buildBackgroundSpawnOptions` line 123-127:
156
+ ```typescript
157
+ return {
158
+ cwd: manifest.cwd,
159
+ detached: true,
160
+ stdio: ["ignore", logFd, logFd], // stdout+stderr → background.log
161
+ // ...
162
+ };
163
+ ```
164
+
165
+ The **stdout/stderr of background-runner** is written to background.log. But **child Pi workers** (spawned by background-runner via child-pi.ts) **write to child-pi's pipe**, NOT to background.log.
166
+
167
+ Flow:
168
+ ```
169
+ background-runner.ts (stdout→logFd, stderr→logFd)
170
+ → loader=jiti → written to the log ✅
171
+ → executeTeamRun()
172
+ → child-pi.ts spawn child Pi (stdout→pipe, stderr→pipe)
173
+ → Pi output → child-pi.ts captures → NOT WRITTEN TO background.log ❌
174
+ ```
175
+
176
+ ### Fix
177
+
178
+ 1. **Option A:** In `child-pi.ts` or `team-runner.ts`, write worker output events to background.log
178
+ 2. **Option B:** Add event log entries for provider errors (the event log exists but is not detailed enough)
179
+ 3. **Option C:** Have background-runner tee output into the log file
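Reviewer sketch (not part of the package diff): one way to approach Option A or C is a small line-buffering tee that forwards complete worker output lines to a log sink. `createLineTee` and `LineSink` are hypothetical names; how the real runner wires this into `child-pi.ts` is not shown in the report.

```typescript
// A sink receives one complete, prefixed line at a time.
type LineSink = (line: string) => void;

// Buffer arbitrary output chunks and forward only complete lines,
// so a line split across two chunks is logged once, intact.
function createLineTee(prefix: string, sink: LineSink): { write(chunk: string): void; flush(): void } {
  let pending = "";
  return {
    write(chunk: string): void {
      pending += chunk;
      const lines = pending.split("\n");
      pending = lines.pop() ?? ""; // keep the trailing partial line for the next chunk
      for (const line of lines) sink(`${prefix} ${line}`);
    },
    flush(): void {
      // Emit whatever is left when the worker exits without a final newline.
      if (pending.length > 0) sink(`${prefix} ${pending}`);
      pending = "";
    },
  };
}
```

In practice the sink could append to the run's log file, for example with Node's `fs.appendFileSync(logPath, line + "\n")`, where `logPath` is an assumed variable pointing at `background.log`.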
181
+
182
+ ### Key file
183
+
184
+ ```
185
+ pi-crew/src/runtime/async-runner.ts — buildBackgroundSpawnOptions(), spawnBackgroundTeamRun()
186
+ ```
187
+
188
+ ---
189
+
190
+ ## Bug #4: worker-startup.ts lacks a "rate_limited" classification
191
+
192
+ | Field | Value |
193
+ |---|---|
194
+ | **Severity** | 🟡 LOW |
195
+ | **Status** | New (found while debugging Bug #1) |
195
+ | **Affected** | Error classification and reporting |
196
+ | **Symptom** | 429 errors are classified as "unknown" instead of "rate_limited" |
198
+
199
+ ### Description
199
+
200
+ `worker-startup.ts` defines the `StartupFailureClassification` type:
202
+ ```typescript
203
+ export type StartupFailureClassification =
204
+ | "trust_required"
205
+ | "prompt_misdelivery"
206
+ | "prompt_acceptance_timeout"
207
+ | "transport_dead"
208
+ | "worker_crashed"
209
+ | "unknown";
210
+ ```
211
+
212
+ `"rate_limited"` and `"provider_error"` are missing. As a result, 429 errors are classified as `"unknown"`.
213
+
214
+ ### Fix
215
+
216
+ Add them to the type and to the `classifyStartupFailure` function:
217
+ ```typescript
218
+ export type StartupFailureClassification =
219
+ | "trust_required"
220
+ | "prompt_misdelivery"
221
+ | "prompt_acceptance_timeout"
222
+ | "transport_dead"
223
+ | "worker_crashed"
224
+ | "rate_limited" // NEW
225
+ | "provider_error" // NEW
226
+ | "unknown";
227
+
228
+ // In classifyStartupFailure:
229
+ if (evidence.stderrPreview && /429|rate.?limit/i.test(evidence.stderrPreview)) return "rate_limited";
230
+ if (evidence.stderrPreview && /5\d{2}|server.?error|internal.?error/i.test(evidence.stderrPreview)) return "provider_error";
231
+ ```
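Reviewer sketch (not part of the package diff): the unanchored patterns `/429/` and `/5\d{2}/` also match digits inside longer numbers, such as the `500` in `1500ms`. A variant with word boundaries might reduce such false positives. `classifyProviderFailure` and `ProviderFailureClass` are hypothetical names covering only the two new categories, and a bare three-digit number such as a line number could still match.

```typescript
type ProviderFailureClass = "rate_limited" | "provider_error" | "unknown";

// Classify a stderr preview. Word boundaries keep status codes from
// matching inside longer numbers such as durations or ports.
function classifyProviderFailure(stderrPreview: string | undefined): ProviderFailureClass {
  if (!stderrPreview) return "unknown";
  if (/\b429\b|rate.?limit/i.test(stderrPreview)) return "rate_limited";
  if (/\b5\d{2}\b|server.?error|internal.?error/i.test(stderrPreview)) return "provider_error";
  return "unknown";
}
```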
232
+
233
+ ### Key file
234
+
235
+ ```
236
+ pi-crew/src/runtime/worker-startup.ts — StartupFailureClassification, classifyStartupFailure()
237
+ ```
238
+
239
+ ---
240
+
241
+ ## Bug #5: Stale heartbeat notifications after prune
242
+
243
+ | Field | Value |
244
+ |---|---|
245
+ | **Severity** | 🟡 LOW (cosmetic) |
246
+ | **Status** | Confirmed |
247
+ | **Affected** | User experience |
248
+ | **Symptom** | "Task heartbeat dead" notifications for runs that were already deleted |
249
+
250
+ ### Description
251
+
252
+ After running `team prune --keep=0 --confirm=true`, the background watcher still emits notifications for the pruned runs:
253
+
254
+ ```
255
+ → team prune: Removed 9 runs
256
+ → Notification: "agent_mpc423rq_1 heartbeat dead" (run not found)
257
+ → Notification: "agent_mpc423rv_2 heartbeat dead" (run not found)
258
+ → Notification: "agent_mpc423rw_3 heartbeat dead" (run not found)
259
+ → Notification: "agent_mpc423rw_4 heartbeat dead" (run not found)
260
+ ... (6+ stale notifications)
261
+ ```
262
+
263
+ Each notification triggers `get_subagent_result`, which returns "not found".
264
+
265
+ ### Cause
266
+
267
+ The background watcher maintains a worker health-check queue. When runs are pruned:
268
+ 1. The watcher does not deregister them immediately
269
+ 2. Notifications already in the queue are still emitted
270
+ 3. The notifications arrive one after another, a few seconds apart
271
+
272
+ ### Impact
273
+
274
+ - Confusing for the user: "heartbeat dead" appears for runs that no longer exist
275
+ - Wasted context: each notification triggers one tool call to verify
276
+
277
+ ### Fix
278
+
279
+ The background watcher should check that the run exists before emitting:
280
+ ```typescript
281
+ // Before emitting heartbeat_dead:
282
+ if (!runExists(runId)) {
283
+ deregisterWorker(workerId); // Silent cleanup
284
+ return;
285
+ }
286
+ ```
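Reviewer sketch (not part of the package diff): a self-contained version of the existence check, applied to a queue of pending notifications. `PendingNotification`, its fields, and `emitLiveNotifications` are hypothetical; `runExists` is passed in as a callback because the report does not show how the real watcher looks up runs.

```typescript
// Hypothetical queue entry; field names are assumptions for illustration.
interface PendingNotification {
  runId: string;
  workerId: string;
  text: string;
}

// Emit only notifications whose run still exists. Returns the worker IDs
// that were skipped so the caller can deregister them silently.
function emitLiveNotifications(
  queue: PendingNotification[],
  runExists: (runId: string) => boolean,
  emit: (n: PendingNotification) => void,
): string[] {
  const deregistered: string[] = [];
  for (const n of queue) {
    if (!runExists(n.runId)) {
      deregistered.push(n.workerId); // silent cleanup, no user-facing notification
      continue;
    }
    emit(n);
  }
  return deregistered;
}
```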
287
+
288
+ ### Key files
289
+
290
+ ```
291
+ pi-crew/src/runtime/worker-heartbeat.ts — isWorkerHeartbeatStale()
292
+ pi-crew/src/runtime/background-runner.ts — heartbeat monitoring loop
293
+ ```
294
+
295
+ ---
296
+
297
+ # pi-crew v0.2.20 — Bug Report
298
+
299
+ **Date:** 2026-05-19
300
+ **Session:** Comprehensive integration test + root cause analysis
301
+ **Environment:** linux/x64, Node v22.22.0, Pi CLI v0.75.3, pi-crew v0.2.20
302
+
303
+ ---
304
+
305
+ ## Bug #1: Background workers report "heartbeat dead" (actually a MiniMax 429 rate limit)
306
+
307
+ | Field | Value |
308
+ |---|---|
309
+ | **Severity** | 🔴 HIGH |
310
+ | **Status** | ✅ Fixed — 429 now retries with fallback models instead of blocking 300s |
311
+ | **Affected** | All background/async workers |
312
+ | **Symptom** | Workers time out after 300s with "heartbeat dead" and zero output |
313
+
314
+ ### Description
315
+
316
+ When running `team action='run'` with `async=true` or `Agent(run_in_background=true)`, workers spawn successfully (the PID exists) but **time out after 300s** with a generic error:
317
+ ```
318
+ worker.response_timeout: No output for 300000ms
319
+ crew.task.heartbeat_dead: Task 01_assess heartbeat dead.
320
+ ```
321
+
322
+ ### Root cause
323
+
324
+ **Fixed.** Previously, 429 rate-limit errors were not retried because:
325
+ 1. `RETRYABLE_MODEL_FAILURE_PATTERNS` contained `/\b429\b/`, but MiniMax returns `rate_limit_error: usage limit exceeded` (with no explicit 429)
326
+ 2. The 429 was fast-failed in `child-pi.ts onJsonEvent` instead of letting task-runner handle the retry with a fallback
327
+
328
+ ### Fix applied
329
+
330
+ 1. **model-fallback.ts**: Added `/rate_limit_error/i` to `RETRYABLE_MODEL_FAILURE_PATTERNS` so the MiniMax rate-limit error is recognized correctly
331
+ 2. **model-fallback.ts**: Changed `/\b429\b/` → `/rate.?limit/i` to match more formats
332
+ 3. **child-pi.ts**: Removed the 429 fast-fail so task-runner handles the retry via the model fallback chain
333
+
334
+ ### Model fallback chain
335
+
336
+ When the primary model hits a 429:
337
+ 1. Fall back to `fallbackModels` (if configured)
338
+ 2. Fall back to other available models in the system
339
+ 3. If no fallback exists and retries are exhausted → fail with the correct error message
340
+
341
+ **Recommended configuration:** Add `fallbackModels` to the agent config so there are more options when the primary model is rate limited.
342
+
343
+ ---
344
+
345
+ ## Bug #2: child-pi.ts does not detect the 429 rate limit error and misreports "heartbeat dead"
346
+
347
+ | Field | Value |
348
+ |---|---|
349
+ | **Severity** | 🔴 HIGH |
350
+ | **Status** | New (found while debugging Bug #1) |
351
+ | **Affected** | All child Pi workers |
352
+ | **Symptom** | Worker reports a generic "No output for 300000ms" instead of "Provider rate limit: 429" |
353
+
354
+ ### Description
355
+
356
+ The Pi CLI emits very clear JSON events for 429 errors:
357
+ ```json
358
+ {"type":"turn_end","message":{"stopReason":"error","errorMessage":"429 {\"type\":\"error\",\"error\":{\"type\":\"rate_limit_error\"...}}"}}
359
+ ```
360
+
361
+ But `child-pi.ts` **does not parse error events**. It only looks at:
362
+ - `isFinalAssistantEvent()`, to trigger the final drain
363
+ - `turn_end`, to count turns for turn limiting
364
+
365
+ Result: child-pi sees output (JSON events) and **restarts the heartbeat timer**, but **does not recognize it as an error**. Pi blocks after 3 retries → 300s heartbeat timeout → generic error message.
366
+
367
+ ### Code location
368
+
369
+ `/home/bom/source/my_pi/pi-crew/src/runtime/child-pi.ts`, line ~394:
370
+ ```typescript
371
+ onJsonEvent: (event) => {
372
+ restartNoResponseTimer();
373
+ // Turn-count-based steering: only counts turns, does NOT check errors
374
+ if (event && typeof event === "object" && !Array.isArray(event)) {
375
+ const obj = event as Record<string, unknown>;
376
+ if (obj.type === "turn_end") {
377
+ turnCount += 1;
378
+ // ... turn limit logic only ...
379
+ }
380
+ }
381
+ // MISSING: detect provider errors (429, auth, etc.)
382
+ }
383
+ ```
384
+
385
+ ### Fix
386
+
387
+ Add provider error detection in `onJsonEvent`:
388
+ ```typescript
389
+ let providerError: string | undefined;
390
+
391
+ // In onJsonEvent:
392
+ if (obj.type === "turn_end" && obj.message?.stopReason === "error") {
393
+ const errMsg = obj.message?.errorMessage || "";
394
+ if (errMsg && !providerError) providerError = errMsg;
395
+ // Fast-fail on rate limit — don't wait 300s
396
+ if (/429|rate.?limit/i.test(errMsg)) {
397
+ settle({ exitCode: 1, stdout, stderr: `Provider rate limit: ${errMsg.slice(0, 200)}` });
398
+ }
399
+ }
400
+ ```
401
+
402
+ ### Impact
403
+
404
+ This fix changes the error message from:
405
+ ```
406
+ ❌ "Child Pi produced no new output for 300000ms; process was terminated as unresponsive."
407
+ ```
408
+ To:
409
+ ```
410
+ ✅ "Provider rate limit: 429 rate_limit_error: usage limit exceeded, resets at 2026-05-19T05:00:00Z"
411
+ ```
412
+
413
+ And it **fails fast** instead of waiting 300s.
414
+
415
+ ---
416
+
417
+ ## Bug #3: background.log is useless (does not capture worker output)
418
+
419
+ | Field | Value |
420
+ |---|---|
421
+ | **Severity** | 🟠 MEDIUM |
422
+ | **Status** | New (found while debugging Bug #1) |
422
+ | **Affected** | Debugging experience for all background runs |
423
+ | **Symptom** | background.log contains only one line: `[pi-crew] background loader=jiti` |
425
+
426
+ ### Description
426
+
427
+ When a background worker fails, the log file at `.crew/state/runs/<id>/background.log` contains only:
429
+ ```
430
+ [pi-crew] background loader=jiti
431
+ ```
432
+
433
+ It does not contain:
434
+ - Worker stdout/stderr
435
+ - Error messages
436
+ - Provider responses
437
+ - Exit codes
438
+
439
+ ### Cause
440
+
441
+ `async-runner.ts` line 130-145:
442
+ ```typescript
443
+ const logFd = fs.openSync(logPath, "a");
444
+ // ...
445
+ const child = spawn(process.execPath, command.args, buildBackgroundSpawnOptions(manifest, logFd));
446
+ ```
447
+
448
+ `buildBackgroundSpawnOptions` line 123-127:
449
+ ```typescript
450
+ return {
451
+ cwd: manifest.cwd,
452
+ detached: true,
453
+ stdio: ["ignore", logFd, logFd], // stdout+stderr → background.log
454
+ // ...
455
+ };
456
+ ```
457
+
458
+ The **stdout/stderr of background-runner** is written to background.log. But **child Pi workers** (spawned by background-runner via child-pi.ts) **write to child-pi's pipe**, NOT to background.log.
459
+
460
+ Flow:
461
+ ```
462
+ background-runner.ts (stdout→logFd, stderr→logFd)
463
+ → loader=jiti → written to the log ✅
464
+ → executeTeamRun()
465
+ → child-pi.ts spawn child Pi (stdout→pipe, stderr→pipe)
466
+ → Pi output → child-pi.ts captures → NOT WRITTEN TO background.log ❌
467
+ ```
468
+
469
+ ### Fix
470
+
471
+ 1. **Option A:** In `child-pi.ts` or `team-runner.ts`, write worker output events to background.log
471
+ 2. **Option B:** Add event log entries for provider errors (the event log exists but is not detailed enough)
472
+ 3. **Option C:** Have background-runner tee output into the log file
474
+
475
+ ### Key file
476
+
477
+ ```
478
+ pi-crew/src/runtime/async-runner.ts — buildBackgroundSpawnOptions(), spawnBackgroundTeamRun()
479
+ ```
480
+
481
+ ---
482
+
483
+ ## Bug #4: worker-startup.ts lacks a "rate_limited" classification
484
+
485
+ | Field | Value |
486
+ |---|---|
487
+ | **Severity** | 🟡 LOW |
488
+ | **Status** | New (found while debugging Bug #1) |
488
+ | **Affected** | Error classification and reporting |
489
+ | **Symptom** | 429 errors are classified as "unknown" instead of "rate_limited" |
491
+
492
+ ### Description
492
+
493
+ `worker-startup.ts` defines the `StartupFailureClassification` type:
495
+ ```typescript
496
+ export type StartupFailureClassification =
497
+ | "trust_required"
498
+ | "prompt_misdelivery"
499
+ | "prompt_acceptance_timeout"
500
+ | "transport_dead"
501
+ | "worker_crashed"
502
+ | "unknown";
503
+ ```
504
+
505
+ `"rate_limited"` and `"provider_error"` are missing. As a result, 429 errors are classified as `"unknown"`.
506
+
507
+ ### Fix
508
+
509
+ Add them to the type and to the `classifyStartupFailure` function:
510
+ ```typescript
511
+ export type StartupFailureClassification =
512
+ | "trust_required"
513
+ | "prompt_misdelivery"
514
+ | "prompt_acceptance_timeout"
515
+ | "transport_dead"
516
+ | "worker_crashed"
517
+ | "rate_limited" // NEW
518
+ | "provider_error" // NEW
519
+ | "unknown";
520
+
521
+ // In classifyStartupFailure:
522
+ if (evidence.stderrPreview && /429|rate.?limit/i.test(evidence.stderrPreview)) return "rate_limited";
523
+ if (evidence.stderrPreview && /5\d{2}|server.?error|internal.?error/i.test(evidence.stderrPreview)) return "provider_error";
524
+ ```
525
+
526
+ ### Key file
527
+
528
+ ```
529
+ pi-crew/src/runtime/worker-startup.ts — StartupFailureClassification, classifyStartupFailure()
530
+ ```
531
+
532
+ ---
533
+
534
+ ## Bug #5: Stale heartbeat notifications after prune
535
+
536
+ | Field | Value |
537
+ |---|---|
538
+ | **Severity** | 🟡 LOW (cosmetic) |
539
+ | **Status** | Confirmed |
540
+ | **Affected** | User experience |
541
+ | **Symptom** | "Task heartbeat dead" notifications for runs that were already deleted |
542
+
543
+ ### Description
544
+
545
+ After running `team prune --keep=0 --confirm=true`, the background watcher still emits notifications for the pruned runs:
546
+
547
+ ```
548
+ → team prune: Removed 9 runs
549
+ → Notification: "agent_mpc423rq_1 heartbeat dead" (run not found)
550
+ → Notification: "agent_mpc423rv_2 heartbeat dead" (run not found)
551
+ → Notification: "agent_mpc423rw_3 heartbeat dead" (run not found)
552
+ → Notification: "agent_mpc423rw_4 heartbeat dead" (run not found)
553
+ ... (6+ stale notifications)
554
+ ```
555
+
556
+ Each notification triggers `get_subagent_result`, which returns "not found".
557
+
558
+ ### Cause
559
+
560
+ The background watcher maintains a worker health-check queue. When runs are pruned:
561
+ 1. The watcher does not deregister them immediately
562
+ 2. Notifications already in the queue are still emitted
563
+ 3. The notifications arrive one after another, a few seconds apart
564
+
565
+ ### Impact
566
+
567
+ - Confusing for the user: "heartbeat dead" appears for runs that no longer exist
568
+ - Wasted context: each notification triggers one tool call to verify
569
+
570
+ ### Fix
571
+
572
+ The background watcher should check that the run exists before emitting:
573
+ ```typescript
574
+ // Before emitting heartbeat_dead:
575
+ if (!runExists(runId)) {
576
+ deregisterWorker(workerId); // Silent cleanup
577
+ return;
578
+ }
579
+ ```
580
+
581
+ ### Key files
582
+
583
+ ```
584
+ pi-crew/src/runtime/worker-heartbeat.ts — isWorkerHeartbeatStale()
585
+ pi-crew/src/runtime/background-runner.ts — heartbeat monitoring loop
586
+ ```
587
+
588
+ ---
589
+
590
+ # pi-crew v0.2.20 — Bug Report
591
+
592
+ **Date:** 2026-05-19
593
+ **Session:** Comprehensive integration test + root cause analysis
594
+ **Environment:** linux/x64, Node v22.22.0, Pi CLI v0.75.3, pi-crew v0.2.20
595
+
596
+ ---
597
+
598
+ ## Bug #1: Background workers report "heartbeat dead" (actually a MiniMax 429 rate limit)
599
+
600
+ | Field | Value |
601
+ |---|---|
602
+ | **Severity** | 🔴 HIGH |
603
+ | **Status** | ✅ Fixed — 429 now retries with fallback models instead of blocking 300s |
604
+ | **Affected** | All background/async workers |
605
+ | **Symptom** | Workers time out after 300s with "heartbeat dead" and zero output |
606
+
607
+ ### Description
608
+
609
+ When running `team action='run'` with `async=true` or `Agent(run_in_background=true)`, workers spawn successfully (the PID exists) but **time out after 300s** with a generic error:
610
+ ```
611
+ worker.response_timeout: No output for 300000ms
612
+ crew.task.heartbeat_dead: Task 01_assess heartbeat dead.
613
+ ```
614
+
615
+ ### Root cause
616
+
617
+ **Fixed.** Previously, 429 rate-limit errors were not retried because:
618
+ 1. `RETRYABLE_MODEL_FAILURE_PATTERNS` contained `/\b429\b/`, but MiniMax returns `rate_limit_error: usage limit exceeded` (with no explicit 429)
619
+ 2. The 429 was fast-failed in `child-pi.ts onJsonEvent` instead of letting task-runner handle the retry with a fallback
620
+
621
+ ### Fix applied
622
+
623
+ 1. **model-fallback.ts**: Added `/rate_limit_error/i` to `RETRYABLE_MODEL_FAILURE_PATTERNS` so the MiniMax rate-limit error is recognized correctly
624
+ 2. **model-fallback.ts**: Changed `/\b429\b/` → `/rate.?limit/i` to match more formats
625
+ 3. **child-pi.ts**: Removed the 429 fast-fail so task-runner handles the retry via the model fallback chain
626
+
627
+ ### Model fallback chain
628
+
629
+ When the primary model hits a 429:
630
+ 1. Fall back to `fallbackModels` (if configured)
631
+ 2. Fall back to other available models in the system
632
+ 3. If no fallback exists and retries are exhausted → fail with the correct error message
633
+
634
+ **Recommended configuration:** Add `fallbackModels` to the agent config so there are more options when the primary model is rate limited.
635
+
636
+ ---
637
+
638
+ ## Bug #2: child-pi.ts does not detect the 429 rate limit error and misreports "heartbeat dead"
639
+
640
+ | Field | Value |
641
+ |---|---|
642
+ | **Severity** | 🔴 HIGH |
643
+ | **Status** | New (found while debugging Bug #1) |
644
+ | **Affected** | All child Pi workers |
645
+ | **Symptom** | Worker reports a generic "No output for 300000ms" instead of "Provider rate limit: 429" |
646
+
647
+ ### Description
648
+
649
+ The Pi CLI emits very clear JSON events for 429 errors:
650
+ ```json
651
+ {"type":"turn_end","message":{"stopReason":"error","errorMessage":"429 {\"type\":\"error\",\"error\":{\"type\":\"rate_limit_error\"...}}"}}
652
+ ```
653
+
654
+ But `child-pi.ts` **does not parse error events**. It only looks at:
655
+ - `isFinalAssistantEvent()`, to trigger the final drain
656
+ - `turn_end`, to count turns for turn limiting
657
+
658
+ Result: child-pi sees output (JSON events) and **restarts the heartbeat timer**, but **does not recognize it as an error**. Pi blocks after 3 retries → 300s heartbeat timeout → generic error message.
659
+
660
+ ### Code location
661
+
662
+ `/home/bom/source/my_pi/pi-crew/src/runtime/child-pi.ts`, line ~394:
663
+ ```typescript
664
+ onJsonEvent: (event) => {
665
+ restartNoResponseTimer();
666
+ // Turn-count-based steering: only counts turns, does NOT check errors
667
+ if (event && typeof event === "object" && !Array.isArray(event)) {
668
+ const obj = event as Record<string, unknown>;
669
+ if (obj.type === "turn_end") {
670
+ turnCount += 1;
671
+ // ... turn limit logic only ...
672
+ }
673
+ }
674
+ // MISSING: detect provider errors (429, auth, etc.)
675
+ }
676
+ ```
677
+
678
+ ### Fix
679
+
680
+ Add provider error detection in `onJsonEvent`:
681
+ ```typescript
682
+ let providerError: string | undefined;
683
+
684
+ // In onJsonEvent:
685
+ if (obj.type === "turn_end" && obj.message?.stopReason === "error") {
686
+ const errMsg = obj.message?.errorMessage || "";
687
+ if (errMsg && !providerError) providerError = errMsg;
688
+ // Fast-fail on rate limit — don't wait 300s
689
+ if (/429|rate.?limit/i.test(errMsg)) {
690
+ settle({ exitCode: 1, stdout, stderr: `Provider rate limit: ${errMsg.slice(0, 200)}` });
691
+ }
692
+ }
693
+ ```
694
+
695
+ ### Impact
696
+
697
+ This fix changes the error message from:
698
+ ```
699
+ ❌ "Child Pi produced no new output for 300000ms; process was terminated as unresponsive."
700
+ ```
701
+ To:
702
+ ```
703
+ ✅ "Provider rate limit: 429 rate_limit_error: usage limit exceeded, resets at 2026-05-19T05:00:00Z"
704
+ ```
705
+
706
+ And it **fails fast** instead of waiting 300s.
707
+
708
+ ---
709
+
710
+ ## Bug #3: background.log is useless (does not capture worker output)
711
+
712
+ | Field | Value |
713
+ |---|---|
714
+ | **Severity** | 🟠 MEDIUM |
715
+ | **Status** | New (found while debugging Bug #1) |
715
+ | **Affected** | Debugging experience for all background runs |
716
+ | **Symptom** | background.log contains only one line: `[pi-crew] background loader=jiti` |
718
+
719
+ ### Description
719
+
720
+ When a background worker fails, the log file at `.crew/state/runs/<id>/background.log` contains only:
722
+ ```
723
+ [pi-crew] background loader=jiti
724
+ ```
725
+
726
+ It does not contain:
727
+ - Worker stdout/stderr
728
+ - Error messages
729
+ - Provider responses
730
+ - Exit codes
731
+
732
+ ### Cause
733
+
734
+ `async-runner.ts` line 130-145:
735
+ ```typescript
736
+ const logFd = fs.openSync(logPath, "a");
737
+ // ...
738
+ const child = spawn(process.execPath, command.args, buildBackgroundSpawnOptions(manifest, logFd));
739
+ ```
740
+
741
+ `buildBackgroundSpawnOptions` line 123-127:
742
+ ```typescript
743
+ return {
744
+ cwd: manifest.cwd,
745
+ detached: true,
746
+ stdio: ["ignore", logFd, logFd], // stdout+stderr → background.log
747
+ // ...
748
+ };
749
+ ```
750
+
751
+ **The background-runner's own stdout/stderr** is written to background.log. But the **child Pi workers** (spawned by background-runner via child-pi.ts) **write to child-pi's pipe**, NOT to background.log.
752
+
753
+ Flow:
754
+ ```
755
+ background-runner.ts (stdout→logFd, stderr→logFd)
756
+   → loader=jiti → written to log ✅
757
+   → executeTeamRun()
758
+     → child-pi.ts spawns child Pi (stdout→pipe, stderr→pipe)
759
+       → Pi output → captured by child-pi.ts → NOT WRITTEN TO background.log ❌
760
+ ```
761
+
762
+ ### Fix
763
+
764
+ 1. **Option A:** In `child-pi.ts` or `team-runner.ts`, write worker output events to background.log
765
+ 2. **Option B:** Add event log entries for provider errors (an event log already exists, but it is not detailed enough)
766
+ 3. **Option C:** Have background-runner tee worker output into the log file
767
+
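Options A and C both come down to forwarding each line of worker output to a log sink. The sketch below shows one way to do that with line buffering, so partial chunks from the pipe are not logged as broken lines. `createWorkerLogTee`, `LogSink`, and the log line format are hypothetical; in a real runner the sink might wrap `fs.appendFileSync(logPath, line)`.

```typescript
type WorkerStream = "stdout" | "stderr";

// Hypothetical sink; a real runner could append each line to background.log here.
type LogSink = (line: string) => void;

// Returns a writer that forwards complete lines to the sink and buffers any trailing partial line.
function createWorkerLogTee(
  taskId: string,
  stream: WorkerStream,
  sink: LogSink,
  now: () => Date = () => new Date(),
) {
  let pending = "";
  const emit = (text: string): void => {
    sink(`${now().toISOString()} [${taskId}] ${stream}: ${text}\n`);
  };
  return {
    // Call this from the child's "data" handler with the decoded chunk.
    write(chunk: string): void {
      pending += chunk;
      const lines = pending.split(/\r?\n/);
      pending = lines.pop() ?? ""; // the last element is the unfinished line, possibly empty
      for (const line of lines) {
        if (line.length > 0) emit(line);
      }
    },
    // Call this when the child exits so a final unterminated line is not lost.
    flush(): void {
      if (pending.length > 0) emit(pending);
      pending = "";
    },
  };
}
```

Injecting the sink and the clock keeps the sketch independent of the file system and makes the output deterministic in tests.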
768
+ ### Key file
769
+
770
+ ```
771
+ pi-crew/src/runtime/async-runner.ts — buildBackgroundSpawnOptions(), spawnBackgroundTeamRun()
772
+ ```
773
+
774
+ ---
775
+
776
+ ## Bug #4: worker-startup.ts lacks a "rate_limited" classification
777
+
778
+ | Field | Value |
779
+ |---|---|
780
+ | **Severity** | 🟡 LOW |
781
+ | **Status** | New, found while debugging Bug #1 |
782
+ | **Affected** | Error classification and reporting |
783
+ | **Symptom** | 429 errors are classified as "unknown" instead of "rate_limited" |
784
+
785
+ ### Description
786
+
787
+ `worker-startup.ts` defines the `StartupFailureClassification` type:
788
+ ```typescript
789
+ export type StartupFailureClassification =
790
+   | "trust_required"
791
+   | "prompt_misdelivery"
792
+   | "prompt_acceptance_timeout"
793
+   | "transport_dead"
794
+   | "worker_crashed"
795
+   | "unknown";
796
+ ```
797
+
798
+ `"rate_limited"` and `"provider_error"` are missing. As a result, 429 errors are classified as `"unknown"`.
799
+
800
+ ### Fix
801
+
802
+ Add them to the type and to the `classifyStartupFailure` function:
803
+ ```typescript
804
+ export type StartupFailureClassification =
805
+   | "trust_required"
806
+   | "prompt_misdelivery"
807
+   | "prompt_acceptance_timeout"
808
+   | "transport_dead"
809
+   | "worker_crashed"
810
+   | "rate_limited" // NEW
811
+   | "provider_error" // NEW
812
+   | "unknown";
813
+
814
+ // In classifyStartupFailure (\b keeps values such as "1500ms" from matching as status codes):
815
+ if (evidence.stderrPreview && /\b429\b|rate.?limit/i.test(evidence.stderrPreview)) return "rate_limited";
816
+ if (evidence.stderrPreview && /\b5\d{2}\b|server.?error|internal.?error/i.test(evidence.stderrPreview)) return "provider_error";
817
+ ```
818
+
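The two new checks can be exercised on their own as a small classifier. This is a sketch under assumptions: `StartupFailureEvidence` is reduced to the one field the snippet uses, and `classifyProviderFailure` is a hypothetical helper name, not the function in worker-startup.ts. It anchors the numeric patterns with `\b` so that strings such as `1500ms` are not read as status codes; a bare three-digit number starting with 5 would still match, so the heuristic remains approximate.

```typescript
type ProviderFailureClassification = "rate_limited" | "provider_error";

// Hypothetical evidence shape: only stderrPreview appears in the original snippet.
interface StartupFailureEvidence {
  stderrPreview?: string;
}

const RATE_LIMITED = /\b429\b|rate.?limit/i;
const PROVIDER_ERROR = /\b5\d{2}\b|server.?error|internal.?error/i;

// Returns a provider-related classification, or undefined so the caller can
// fall through to its other checks and finally to "unknown".
function classifyProviderFailure(
  evidence: StartupFailureEvidence,
): ProviderFailureClassification | undefined {
  const preview = evidence.stderrPreview;
  if (!preview) return undefined;
  // Rate limiting is checked first so a message mentioning both is reported as rate_limited.
  if (RATE_LIMITED.test(preview)) return "rate_limited";
  if (PROVIDER_ERROR.test(preview)) return "provider_error";
  return undefined;
}
```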
819
+ ### Key file
820
+
821
+ ```
822
+ pi-crew/src/runtime/worker-startup.ts — StartupFailureClassification, classifyStartupFailure()
823
+ ```
824
+
825
+ ---
826
+
827
+ ## Bug #5: Stale heartbeat notifications after prune
828
+
829
+ | Field | Value |
830
+ |---|---|
831
+ | **Severity** | 🟡 LOW (cosmetic) |
832
+ | **Status** | Confirmed |
833
+ | **Affected** | User experience |
834
+ | **Symptom** | "Task heartbeat dead" notifications for runs that have already been deleted |
835
+
836
+ ### Description
837
+
838
+ After running `team prune --keep=0 --confirm=true`, the background watcher still emits notifications for the pruned runs:
839
+
840
+ ```
841
+ → team prune: Removed 9 runs
842
+ → Notification: "agent_mpc423rq_1 heartbeat dead" (run not found)
843
+ → Notification: "agent_mpc423rv_2 heartbeat dead" (run not found)
844
+ → Notification: "agent_mpc423rw_3 heartbeat dead" (run not found)
845
+ → Notification: "agent_mpc423rw_4 heartbeat dead" (run not found)
846
+ ... (6+ stale notifications)
847
+ ```
848
+
849
+ Each notification triggers `get_subagent_result`, which returns "not found".
850
+
851
+ ### Root cause
852
+
853
+ The background watcher maintains a worker health-check queue. When runs are pruned:
854
+ 1. The watcher does not deregister them immediately
855
+ 2. Notifications already in the queue are still emitted
856
+ 3. The notifications arrive one after another, a few seconds apart
857
+
858
+ ### Impact
859
+
860
+ - Confusing for the user: "heartbeat dead" appears for runs that no longer exist
861
+ - Wasted context: each notification triggers one tool call to verify
862
+
863
+ ### Fix
864
+
865
+ The background watcher should check that the run exists before emitting:
866
+ ```typescript
867
+ // Before emitting heartbeat_dead:
868
+ if (!runExists(runId)) {
869
+   deregisterWorker(workerId); // Silent cleanup
870
+   return;
871
+ }
872
+ ```
873
+
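Because notifications that are already queued are the ones that leak through after a prune, the existence check has to run when the queue is drained, not only when an entry is added. The sketch below illustrates that idea; `HeartbeatNotification`, `DrainResult`, and `drainHeartbeatQueue` are hypothetical names, and `runExists` is injected so the example does not assume how pi-crew looks up run state.

```typescript
// Hypothetical notification shape; field names are illustrative, not taken from pi-crew.
interface HeartbeatNotification {
  runId: string;
  workerId: string;
  message: string;
}

interface DrainResult {
  emit: HeartbeatNotification[]; // notifications whose run still exists
  deregister: string[]; // worker ids whose run was pruned
}

// Splits queued notifications into those to emit and workers to clean up silently.
function drainHeartbeatQueue(
  queue: readonly HeartbeatNotification[],
  runExists: (runId: string) => boolean,
): DrainResult {
  const result: DrainResult = { emit: [], deregister: [] };
  for (const notification of queue) {
    if (runExists(notification.runId)) {
      result.emit.push(notification);
    } else {
      result.deregister.push(notification.workerId);
    }
  }
  return result;
}
```

With this split, the caller emits only `result.emit` and passes each id in `result.deregister` to its cleanup routine, so no "run not found" lookups are triggered.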
874
+ ### Key files
875
+
876
+ ```
877
+ pi-crew/src/runtime/worker-heartbeat.ts — isWorkerHeartbeatStale()
878
+ pi-crew/src/runtime/background-runner.ts — heartbeat monitoring loop
879
+ ```
880
+
881
+ ---
882
+
883
+ ## Bug #6: Live-session run cancelled midway
884
+
885
+ | Field | Value |
886
+ |---|---|
887
+ | **Severity** | 🟠 MEDIUM |
888
+ | **Status** | ✅ Confirmed — no code fix needed; documented as user workflow constraint |
889
+ | **Affected** | Foreground team runs |
890
+ | **Symptom** | Run cancelled after the explore phase completes, before the execute phase |
891
+
892
+ ### Description
893
+
894
+ The fast-fix team running as a live-session:
895
+ ```
896
+ 04:12:20 live-session.prompt_start 01_explore
897
+ 04:12:51 live-session.prompt_done 01_explore (31s, completed)
898
+ 04:12:51 live_agent.terminated 01_explore (status=cancelled)
899
+ 04:12:51 task.completed 01_explore
900
+ 04:12:51 run.cancelled: "This operation was aborted"
901
+ ```
902
+
903
+ Task `01_explore` completed successfully, but the run was cancelled before `02_execute` started.
904
+
905
+ ### Possible causes
906
+
907
+ 1. **Session concurrency limit**: only one live-session can be active, which conflicts with parallel test operations
908
+ 2. **User-initiated cancellation**: accidentally triggered
909
+ 3. **Workflow phase transition bug**: the next phase is not triggered after explore completes
910
+
911
+ ### Further investigation needed
912
+
913
+ - Re-run the fast-fix team on its own (no concurrent operations)
914
+ - Check live-session-runtime.ts for the phase transition logic
915
+
916
+ ---
917
+
918
+ ## Summary
919
+
920
+ | # | Bug | Severity | Status | Category |
921
+ |---|---|---|---|---|
922
+ | 1 | Background workers time out due to MiniMax 429 | 🔴 HIGH | ✅ Fixed: 429 now retries with fallback models via improved RETRYABLE_MODEL_FAILURE_PATTERNS | Code |
923
+ | 2 | child-pi.ts does not detect 429 and misreports "heartbeat dead" | 🔴 HIGH | ✅ Fixed: removed fast-fail 429; let task-runner handle retry+fallback | Code |
924
+ | 3 | background.log is useless, does not capture worker output | 🟠 MEDIUM | ✅ Fixed: added PI_CREW_BACKGROUND_MODE flag + event logging to background.log | Observability |
925
+ | 4 | worker-startup.ts lacks rate_limited classification | 🟡 LOW | ✅ Fixed: added rate_limited + provider_error to StartupFailureClassification | Code |
926
+ | 5 | Stale heartbeat notifications after prune | 🟡 LOW | ✅ Fixed: HeartbeatWatcher skips pruned runs via stateRoot existence check | UX |
927
+ | 6 | Live-session foreground run is cancelled when there are concurrent tool calls | 🟠 MEDIUM | ✅ Confirmed: concurrent calls interrupt live-session → outputLength:0 → caller_cancelled. Avoid concurrent team actions during foreground runs. | Runtime |
928
+ | 7 | Async notifier "stale ctx": dies and does not restart after Pi restart | 🔴 HIGH | ✅ Fixed: swallow stale error, isCurrent guard handles dormancy | Code |
929
+ | 8 | Background child-process 300s timeout: child Pi hangs, zero output | 🟠 MEDIUM | ✅ Fixed: root cause found (Bug #10): MINIMAX_API_KEY stripped by sanitizeEnvSecrets(). Allow-list in child-pi.ts preserves model provider API keys. Restart Pi to verify fix. | Code |
930
+ | 9 | Executor hit yield limit: file write did not complete | 🟡 LOW | 🔲 Open: executor hit 3 Yield Reminders and terminated before writing file. Task marked completed but artifact missing. | Runtime |
931
+ | 10 | Child-process silent timeout: MINIMAX_API_KEY is filtered out of the child env | 🔴 HIGH | ✅ Fixed: sanitizeEnvSecrets() strips *API_KEY* vars. Allow-list in buildChildPiSpawnOptions preserves model provider keys (MINIMAX_*, OPENAI_*, etc.). See docs/fixes/bug-010-child-process-api-key-filtered.md | Code |
934
+ | 11 | Background runner "spawn pi ENOENT" — pi binary not in PATH | 🔴 HIGH | ✅ Fixed — added resolvePiCliScript() call for non-Windows platforms in getPiSpawnCommand(). Restart Pi to verify. | Code |
935
+ | 12 | Essential env vars (PATH) stripped: child Pi crashes with npm root -g error | 🔴 HIGH | ✅ Fixed: added essential env vars (PATH, HOME, USER, etc.) to allow-list alongside model API keys. Restart Pi to verify. | Code |
936
+ | 15 | Background runner receives SIGTERM ~3s after spawn from Pi infrastructure | 🟠 MEDIUM | ✅ Fixed — disabled async mode by default + ignore SIGTERM from Pi in background-runner | Runtime |
937
+
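Bugs #10 and #12 both concern which environment variables survive sanitization on the way to the child process. The following sketch shows the allow-list idea in isolation. It is not the code in child-pi.ts: `buildChildEnv`, the `SECRET_NAME` heuristic, and the exact allow-list contents are assumptions, seeded only with the names the table mentions (PATH, HOME, USER, MINIMAX_*, OPENAI_*).

```typescript
// Assumed allow-list entries; the full list used by pi-crew is not shown in this document.
const ESSENTIAL_ENV = new Set(["PATH", "HOME", "USER"]);
const PROVIDER_PREFIXES = ["MINIMAX_", "OPENAI_"];

// Hypothetical heuristic for names that look like secrets.
const SECRET_NAME = /API_KEY|TOKEN|SECRET|PASSWORD/i;

// Builds the child environment: secret-looking names are dropped unless allow-listed.
function buildChildEnv(
  env: Readonly<Record<string, string | undefined>>,
): Record<string, string> {
  const childEnv: Record<string, string> = {};
  for (const [name, value] of Object.entries(env)) {
    if (value === undefined) continue;
    const allowListed =
      ESSENTIAL_ENV.has(name) || PROVIDER_PREFIXES.some((prefix) => name.startsWith(prefix));
    // Allow-listed names win even when they match the secret heuristic (e.g. MINIMAX_API_KEY).
    if (allowListed || !SECRET_NAME.test(name)) childEnv[name] = value;
  }
  return childEnv;
}
```

Checking the allow-list before the secret heuristic is the design point: a model provider key matches `API_KEY` but must still reach the child, otherwise the worker starts without credentials and produces no output.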
938
+ ### Priority fix order
939
+
940
+ 1. **Bug #1** — ✅ Fixed — 429 now retried with model fallback chain
941
+ 2. **Bug #2** — ✅ Fixed — removed fast-fail 429
942
+ 3. **Bug #3** — ✅ Fixed — worker events now logged to background.log
943
+ 4. **Bug #4** — ✅ Fixed — rate_limited + provider_error classification added
944
+ 5. **Bug #5** — ✅ Fixed — HeartbeatWatcher skips pruned runs
945
+ 6. **Bug #6** — ✅ Confirmed — concurrent tool calls cancel foreground runs; avoid concurrent team actions during runs
946
+ 7. **Bug #7** — ✅ Fixed — async notifier handles stale ctx gracefully, isCurrent guard manages dormancy
947
+ 8. **Bug #8/10** — ✅ Fixed — Bug #10 root cause: MINIMAX_API_KEY filtered out. Allow-list preserves model provider API keys for child processes.
948
+ 9. **Bug #9** — ✅ Fixed — Added `needs_attention` task status. Workers that complete without calling `submit_result` now get `status: "needs_attention"` instead of `"completed"`, with ⚠ icon in UI.
949
+ 10. **Bug #10** — ✅ Fixed — Added allow-list to sanitizeEnvSecrets in child-pi.ts to preserve model API keys (MINIMAX_*, OPENAI_*, etc.)
950
+ 11. **Bug #11** — ✅ Fixed — resolvePiCliScript() added for non-Windows in getPiSpawnCommand() to fix ENOENT on spawn
951
+ 12. **Bug #12** — ✅ Fixed — Essential env vars (PATH, HOME, USER, etc.) added to allow-list alongside model API keys
952
+ 13. **Bug #13** — 🟠 MEDIUM — ✅ Fixed — Background runner dies after ~59s. 3-layer fix: (1) heartbeat mechanism prevents false repairs; (2) --max-old-space-size=512 limits V8 heap to prevent OOM; (3) SIGTERM/SIGINT handlers log async.failed event for diagnosis. Heartbeat includes memory stats (heapUsedMb, rssMb) for post-mortem.
953
+ 14. **Bug #14** — 🔴 HIGH — ✅ Fixed — Infinite retry loop: needs_attention tasks had `queue: "blocked"` in task graph instead of `queue: "done"`, causing them to be re-scheduled indefinitely. Added `needs_attention` to the terminal status check in `withQueue()` in task-graph-scheduler.ts.
954
+ 15. **Bug #15** — 🟠 MEDIUM — ✅ Fixed — Disabled async mode by default (runAsync=false). Background runners receive SIGTERM ~3s after spawn from Pi infrastructure because Node.js 22.22.0 setsid:true doesn't create a new session. Also added ignore-SIGTERM-from-Pi logic in background-runner.ts (A2 approach).
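
The Bug #14 fix in the list above comes down to treating `needs_attention` as a terminal status when a task is mapped to a queue. A minimal sketch of that mapping follows. Only `needs_attention`, `completed`, `"blocked"`, and `"done"` come from this document; the other status and queue names, and the function name `queueFor`, are assumptions, and the real `withQueue()` in task-graph-scheduler.ts may be structured differently.

```typescript
// Status and queue names other than needs_attention, completed, blocked, and done are assumed.
type TaskStatus = "pending" | "running" | "completed" | "failed" | "cancelled" | "needs_attention";
type QueueState = "ready" | "running" | "blocked" | "done";

// Terminal statuses are never rescheduled. Omitting needs_attention here is what
// would leave such tasks in "blocked" and eligible for another scheduling pass.
const TERMINAL_STATUSES = new Set<TaskStatus>(["completed", "failed", "cancelled", "needs_attention"]);

function queueFor(status: TaskStatus, dependenciesDone: boolean): QueueState {
  if (TERMINAL_STATUSES.has(status)) return "done";
  if (status === "running") return "running";
  return dependenciesDone ? "ready" : "blocked";
}
```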