pi-crew 0.8.14 → 0.9.0

Files changed (80)
  1. package/CHANGELOG.md +271 -0
  2. package/README.md +112 -2
  3. package/docs/FEATURE_INTAKE.md +1 -1
  4. package/docs/HARNESS.md +20 -19
  5. package/docs/PROJECT_REVIEW.md +132 -133
  6. package/docs/PROJECT_REVIEW_FIXES.md +130 -131
  7. package/docs/actions-reference.md +127 -121
  8. package/docs/architecture.md +1 -1
  9. package/docs/code-review-2026-05-11.md +134 -134
  10. package/docs/commands-reference.md +108 -106
  11. package/docs/comparison-pi-subagents-vs-pi-crew.md +105 -105
  12. package/docs/deep-review-report.md +1 -1
  13. package/docs/dynamic-workflows.md +90 -0
  14. package/docs/fixes/BATCH_A_H1_H2.md +17 -17
  15. package/docs/fixes/bug-007-async-notifier-stale-ctx.md +23 -23
  16. package/docs/followup-plan-2026-05-12.md +135 -135
  17. package/docs/followup-review-2026-05-12.md +86 -86
  18. package/docs/followup-review-round3-2026-05-12.md +123 -123
  19. package/docs/goals.md +59 -0
  20. package/docs/implementation-plan-top3.md +4 -4
  21. package/docs/issue-29-analysis.md +2 -2
  22. package/docs/oh-my-pi-research.md +154 -154
  23. package/docs/optimization-plan.md +2 -0
  24. package/docs/perf/baseline-2026-05.md +9 -9
  25. package/docs/perf/final-report-2026-05.md +2 -2
  26. package/docs/perf/sprint-1-report.md +2 -2
  27. package/docs/perf/sprint-2-report.md +1 -1
  28. package/docs/perf/upgrade-plan-2026-05.md +72 -72
  29. package/docs/pi-crew-bugs.md +230 -230
  30. package/docs/pi-crew-investigation-report.md +102 -102
  31. package/docs/pi-crew-test-round5.md +4 -4
  32. package/docs/runtime-analysis-child-vs-live.md +57 -57
  33. package/docs/runtime-migration-in-process-analysis.md +97 -97
  34. package/package.json +2 -4
  35. package/skills/orchestration/SKILL.md +11 -11
  36. package/src/agents/agent-config.ts +4 -0
  37. package/src/config/config.ts +39 -0
  38. package/src/config/types.ts +11 -0
  39. package/src/extension/action-suggestions.ts +2 -1
  40. package/src/extension/async-notifier.ts +10 -0
  41. package/src/extension/help.ts +14 -0
  42. package/src/extension/registration/commands.ts +27 -0
  43. package/src/extension/team-tool/destructive-gate.ts +1 -1
  44. package/src/extension/team-tool/goal-wrap.ts +288 -0
  45. package/src/extension/team-tool/goal.ts +405 -0
  46. package/src/extension/team-tool/run.ts +103 -4
  47. package/src/extension/team-tool/workflow-manage.ts +194 -0
  48. package/src/extension/team-tool.ts +20 -0
  49. package/src/hooks/types.ts +3 -1
  50. package/src/runtime/async-runner.ts +24 -2
  51. package/src/runtime/background-runner.ts +68 -19
  52. package/src/runtime/child-pi.ts +6 -1
  53. package/src/runtime/completion-guard.ts +1 -1
  54. package/src/runtime/dynamic-workflow-context.ts +450 -0
  55. package/src/runtime/dynamic-workflow-runner.ts +180 -0
  56. package/src/runtime/global-worker-cap.ts +96 -0
  57. package/src/runtime/goal-evaluator.ts +294 -0
  58. package/src/runtime/goal-loop-runner.ts +612 -0
  59. package/src/runtime/goal-state-store.ts +209 -0
  60. package/src/runtime/pi-args.ts +10 -2
  61. package/src/runtime/result-extractor.ts +32 -0
  62. package/src/runtime/team-runner.ts +11 -1
  63. package/src/runtime/verification-gates.ts +85 -5
  64. package/src/runtime/verification-integrity.ts +110 -0
  65. package/src/runtime/verification-worktree.ts +136 -0
  66. package/src/runtime/workspace-lock.ts +448 -0
  67. package/src/schema/config-schema.ts +26 -0
  68. package/src/schema/team-tool-schema.ts +39 -4
  69. package/src/state/atomic-write.ts +9 -0
  70. package/src/state/contracts.ts +14 -0
  71. package/src/state/crew-init.ts +18 -5
  72. package/src/state/event-log.ts +7 -1
  73. package/src/state/state-store.ts +2 -0
  74. package/src/state/types.ts +82 -0
  75. package/src/state/worker-atomic-writer.ts +176 -0
  76. package/src/utils/redaction.ts +104 -24
  77. package/src/workflows/discover-workflows.ts +25 -1
  78. package/src/workflows/workflow-config.ts +13 -0
  79. package/teams/parallel-research.team.md +1 -1
  80. package/workflows/examples/hello.dwf.ts +24 -0
@@ -1,33 +1,33 @@
1
1
  # Historical Bug Reports (v0.2.x)
2
2
 
3
- > **Current version: v0.5.22** — See [CHANGELOG.md](../CHANGELOG.md) for all bug fixes.
3
+ > **Current version: v0.9.0** — See [CHANGELOG.md](../CHANGELOG.md) for all bug fixes.
4
4
  > This page tracks historical bugs from v0.2.x. All listed bugs are fixed.
5
5
 
6
6
  ---
7
7
 
8
8
  # pi-crew v0.2.20 — Bug Report & Fixes
9
9
 
10
- **Ngày:** 2026-05-19
10
+ **Date:** 2026-05-19
11
11
  **Session:** Comprehensive integration test + root cause analysis
12
- **Environment:** linux/x64, Node v22.22.0, Pi CLI v0.75.3, pi-crew v0.2.20
13
- **Trạng thái:** ✅ 14/14 bugs fixed (commits `de9e8b4` `5dc794e`)
12
+ **Environment:** linux/x64, Node v22.22.0, Pi CLI v0.75.3, pi-crew v0.2.20
13
+ **Status:** ✅ 14/14 bugs fixed (commits `de9e8b4` and `5dc794e`)
14
14
 
15
- > **All bugs fixed ✅** — Source code verified. Xem [pi-crew-test-final.md](pi-crew-test-final.md) cho kết quả end-to-end test.
15
+ > **All bugs fixed ✅** — Source code verified. See [pi-crew-test-final.md](pi-crew-test-final.md) for end-to-end test results.
16
16
 
17
17
  ---
18
18
 
19
- ## Bug #1: Background workers "heartbeat dead" — thực chất MiniMax 429 Rate Limit
19
+ ## Bug #1: Background workers "heartbeat dead" — actually a MiniMax 429 Rate Limit
20
20
 
21
21
  | Field | Value |
22
22
  |---|---|
23
23
  | **Severity** | 🔴 HIGH |
24
24
  | **Status** | ✅ Fixed — 429 now retries with fallback models instead of blocking 300s |
25
- | **Affected** | Tất cả background/async workers |
26
- | **Symptom** | Workers timeout sau 300s với "heartbeat dead", zero output |
25
+ | **Affected** | All background/async workers |
26
+ | **Symptom** | Workers time out after 300s with "heartbeat dead", zero output |
27
27
 
28
- ### Mô tả
28
+ ### Description
29
29
 
30
- Khi chạy `team action='run'` với `async=true` hoặc `Agent(run_in_background=true)`, workers spawn thành công (PID tồn tại) nhưng **timeout sau 300s** với generic error:
30
+ When running `team action='run'` with `async=true` or `Agent(run_in_background=true)`, workers spawn successfully (PID exists) but **time out after 300s** with a generic error:
31
31
  ```
32
32
  worker.response_timeout: No output for 300000ms
33
33
  crew.task.heartbeat_dead: Task 01_assess heartbeat dead.
@@ -35,48 +35,48 @@ crew.task.heartbeat_dead: Task 01_assess heartbeat dead.
35
35
 
36
36
  ### Root cause
37
37
 
38
- **Đã fix.** Trước đây 429 rate limit không được retry vì:
39
- 1. `RETRYABLE_MODEL_FAILURE_PATTERNS` `/\b429\b/` nhưng MiniMax trả về `rate_limit_error: usage limit exceeded` (không số 429 rõ ràng)
40
- 2. 429 được fast-fail trong `child-pi.ts onJsonEvent` thay để task-runner xử retry với fallback
38
+ **Fixed.** Previously the 429 rate limit was not retried because:
39
+ 1. `RETRYABLE_MODEL_FAILURE_PATTERNS` had `/\b429\b/` but MiniMax returns `rate_limit_error: usage limit exceeded` (no clear "429" number)
40
+ 2. 429 was fast-failed in `child-pi.ts onJsonEvent` instead of letting the task-runner handle retry with fallback
41
41
 
42
42
  ### Fix applied
43
43
 
44
- 1. **model-fallback.ts**: Thêm `/rate_limit_error/i` vào `RETRYABLE_MODEL_FAILURE_PATTERNS` để nhận diện đúng MiniMax rate limit error
45
- 2. **model-fallback.ts**: Sửa `/\b429\b/` → `/rate.?limit/i` để match nhiều format hơn
46
- 3. **child-pi.ts**: Bỏ fast-fail 429 để task-runner xử retry với model fallback chain
44
+ 1. **model-fallback.ts**: Added `/rate_limit_error/i` to `RETRYABLE_MODEL_FAILURE_PATTERNS` to correctly identify the MiniMax rate limit error
45
+ 2. **model-fallback.ts**: Changed `/\b429\b/` → `/rate.?limit/i` to match more formats
46
+ 3. **child-pi.ts**: Removed the 429 fast-fail — let the task-runner handle retry with the model fallback chain
47
47
 
48
48
  ### Model fallback chain
49
49
 
50
- Khi model chính bị 429:
51
- 1. Fallback sang `fallbackModels` (nếu có cấu hình)
52
- 2. Fallback sang available models khác trong hệ thống
53
- 3. Nếu không fallback retry hết → fail với đúng error message
50
+ When the main model gets a 429:
51
+ 1. Fall back to `fallbackModels` (if configured)
52
+ 2. Fall back to other available models in the system
53
+ 3. If there is no fallback and retries are exhausted → fail with the correct error message
54
54
 
55
- **Cấu hình khuyến nghị:** Thêm `fallbackModels` vào agent config để nhiều lựa chọn khi model chính bị rate limit.
55
+ **Recommended config:** Add `fallbackModels` to the agent config to have more options when the main model is rate-limited.
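The fallback order described above could look roughly like this. It is a sketch under stated assumptions: only `fallbackModels` comes from the report, while `AttemptResult`, `buildModelChain`, `runWithFallback`, and the model names are hypothetical and do not reflect the actual task-runner API.

```typescript
// Hypothetical result type for one attempt against one model.
type AttemptResult =
  | { ok: true; output: string }
  | { ok: false; retryable: boolean; message: string };

// Order candidates: primary first, then configured fallbacks, then other available models.
// Duplicates are dropped so no model is tried twice.
function buildModelChain(primary: string, fallbackModels: string[], available: string[]): string[] {
  const all = [primary].concat(fallbackModels, available);
  return all.filter((model, index) => all.indexOf(model) === index);
}

// Try each model in order; stop on success or on a non-retryable failure.
// When every candidate fails with a retryable error, the last failure is returned
// so the caller sees the real error message.
function runWithFallback(chain: string[], attempt: (model: string) => AttemptResult): AttemptResult {
  let last: AttemptResult = { ok: false, retryable: false, message: "no models configured" };
  for (const model of chain) {
    last = attempt(model);
    if (last.ok || !last.retryable) return last;
  }
  return last;
}

// Usage: the primary model is rate-limited, the first fallback succeeds.
const chain = buildModelChain("primary-model", ["fallback-a", "primary-model"], ["other-b", "fallback-a"]);
const result = runWithFallback(chain, (model) =>
  model === "primary-model"
    ? { ok: false, retryable: true, message: "rate_limit_error: usage limit exceeded" }
    : { ok: true, output: `answered by ${model}` }
);
console.log(chain, result);
```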
56
56
 
57
57
  ---
58
58
 
59
- ## Bug #2: child-pi.ts không phát hiện 429 rate limit error — báo sai "heartbeat dead"
59
+ ## Bug #2: child-pi.ts does not detect 429 rate limit error — reports wrong "heartbeat dead"
60
60
 
61
61
  | Field | Value |
62
62
  |---|---|
63
63
  | **Severity** | 🔴 HIGH |
64
- | **Status** | New — phát hiện trong quá trình debug Bug #1 |
65
- | **Affected** | Tất cả child Pi workers |
66
- | **Symptom** | Worker báo generic "No output for 300000ms" thay "Provider rate limit: 429" |
64
+ | **Status** | New — discovered while debugging Bug #1 |
65
+ | **Affected** | All child Pi workers |
66
+ | **Symptom** | Worker reports generic "No output for 300000ms" instead of "Provider rate limit: 429" |
67
67
 
68
- ### Mô tả
68
+ ### Description
69
69
 
70
- Pi CLI output JSON events cho 429 errors rất rõ ràng:
70
+ The Pi CLI outputs JSON events for 429 errors very clearly:
71
71
  ```json
72
72
  {"type":"turn_end","message":{"stopReason":"error","errorMessage":"429 {\"type\":\"error\",\"error\":{\"type\":\"rate_limit_error\"...}}"}}
73
73
  ```
74
74
 
75
- Nhưng `child-pi.ts` **không parse error events** — chỉ quan tâm đến:
76
- - `isFinalAssistantEvent()` — để trigger final drain
77
- - `turn_end` — để đếm turns cho turn limiting
75
+ But `child-pi.ts` **does not parse error events** — it only cares about:
76
+ - `isFinalAssistantEvent()` — to trigger the final drain
77
+ - `turn_end` — to count turns for turn limiting
78
78
 
79
- Kết quả: child-pi thấy output (JSON events), **restart heartbeat timer**, nhưng **không nhận ra đây error**. Pi block sau 3 retries → heartbeat timeout 300s → generic error message.
79
+ Result: child-pi sees output (JSON events), **restarts the heartbeat timer**, but **does not recognize it as an error**. Pi blocks after 3 retries → heartbeat times out at 300s → generic error message.
80
80
 
81
81
  ### Code location
82
82
 
@@ -84,7 +84,7 @@ Kết quả: child-pi thấy output (JSON events), **restart heartbeat timer**,
84
84
  ```typescript
85
85
  onJsonEvent: (event) => {
86
86
  restartNoResponseTimer();
87
- // Turn-count-based steering: chỉ đếm turns, KHÔNG check errors
87
+ // Turn-count-based steering: only counts turns, does NOT check errors
88
88
  if (event && typeof event === "object" && !Array.isArray(event)) {
89
89
  const obj = event as Record<string, unknown>;
90
90
  if (obj.type === "turn_end") {
@@ -98,7 +98,7 @@ onJsonEvent: (event) => {
98
98
 
99
99
  ### Fix
100
100
 
101
- Thêm provider error detection trong `onJsonEvent`:
101
+ Add provider error detection in `onJsonEvent`:
102
102
  ```typescript
103
103
  let providerError: string | undefined;
104
104
 
@@ -115,42 +115,42 @@ if (obj.type === "turn_end" && obj.message?.stopReason === "error") {
115
115
 
116
116
  ### Impact
117
117
 
118
- Fix này sẽ chuyển error message từ:
118
+ This fix changes the error message from:
119
119
  ```
120
120
  ❌ "Child Pi produced no new output for 300000ms; process was terminated as unresponsive."
121
121
  ```
122
- Thành:
122
+ to:
123
123
  ```
124
124
  ✅ "Provider rate limit: 429 rate_limit_error: usage limit exceeded, resets at 2026-05-19T05:00:00Z"
125
125
  ```
126
126
 
127
- **fail fast** thay đợi 300s.
127
+ And it **fails fast** instead of waiting 300s.
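For illustration, the detection step might be isolated as a small pure function. The field names `type`, `message`, `stopReason`, and `errorMessage` come from the JSON event quoted in the report. The function name `extractProviderError`, the fallback string, and the sample payload are assumptions made for this sketch.

```typescript
// Hypothetical helper: return the provider error text carried by a turn_end event, if any.
function extractProviderError(event: unknown): string | undefined {
  if (!event || typeof event !== "object" || Array.isArray(event)) return undefined;
  const obj = event as Record<string, unknown>;
  if (obj.type !== "turn_end") return undefined;

  const message = obj.message;
  if (!message || typeof message !== "object") return undefined;
  const { stopReason, errorMessage } = message as Record<string, unknown>;
  if (stopReason !== "error") return undefined;

  // Fall back to a placeholder when the event has no usable message text.
  return typeof errorMessage === "string" ? errorMessage : "unknown provider error";
}

// Sample event with an illustrative payload (not a captured log line).
const sampleEvent: unknown = JSON.parse(
  '{"type":"turn_end","message":{"stopReason":"error","errorMessage":"429 rate_limit_error: usage limit exceeded"}}'
);
console.log(extractProviderError(sampleEvent));
```

Keeping the check in a pure function makes it easy to unit-test separately from the timer and process handling in `onJsonEvent`.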
128
128
 
129
129
  ---
130
130
 
131
- ## Bug #3: background.log dụngkhông capture worker output
131
+ ## Bug #3: background.log is useless (does not capture worker output)
132
132
 
133
133
  | Field | Value |
134
134
  |---|---|
135
135
  | **Severity** | 🟠 MEDIUM |
136
- | **Status** | New — phát hiện trong quá trình debug Bug #1 |
137
- | **Affected** | Debugging experience cho tất cả background runs |
138
- | **Symptom** | background.log chỉ chứa 1 dòng: `[pi-crew] background loader=jiti` |
136
+ | **Status** | New — discovered while debugging Bug #1 |
137
+ | **Affected** | Debugging experience for all background runs |
138
+ | **Symptom** | background.log contains only 1 line: `[pi-crew] background loader=jiti` |
139
139
 
140
- ### Mô tả
140
+ ### Description
141
141
 
142
- Khi background worker fail, log file tại `.crew/state/runs/<id>/background.log` chỉ chứa:
142
+ When a background worker fails, the log file at `.crew/state/runs/<id>/background.log` contains only:
143
143
  ```
144
144
  [pi-crew] background loader=jiti
145
145
  ```
146
146
 
147
- Không có:
147
+ Missing:
148
148
  - Worker stdout/stderr
149
149
  - Error messages
150
150
  - Provider responses
151
151
  - Exit codes
152
152
 
153
- ### Nguyên nhân
153
+ ### Cause
154
154
 
155
155
  `async-runner.ts` line 130-145:
156
156
  ```typescript
@@ -169,22 +169,22 @@ return {
169
169
  };
170
170
  ```
171
171
 
172
- **stdout/stderr của background-runner** được ghi vào background.log. Nhưng **child Pi workers** (spawn bởi background-runner qua child-pi.ts) **output vào child-pi's pipe**, KHÔNG vào background.log.
172
+ **The stdout/stderr of the background-runner** is written to background.log. But **child Pi workers** (spawned by the background-runner via child-pi.ts) **output to child-pi's pipe**, NOT to background.log.
173
173
 
174
174
  Flow:
175
175
  ```
176
176
  background-runner.ts (stdout→logFd, stderr→logFd)
177
- → loader=jiti → ghi vào log ✅
177
+ → loader=jiti → writes to log ✅
178
178
  → executeTeamRun()
179
- → child-pi.ts spawn child Pi (stdout→pipe, stderr→pipe)
180
- → Pi output → child-pi.ts captures →KHÔNG GHI VÀO background.log ❌
179
+ → child-pi.ts spawns child Pi (stdout→pipe, stderr→pipe)
180
+ → Pi output → child-pi.ts captures → DOES NOT WRITE TO background.log ❌
181
181
  ```
182
182
 
183
183
  ### Fix
184
184
 
185
- 1. **Option A:** Trong `child-pi.ts` hoặc `team-runner.ts`, ghi worker output events vào background.log
186
- 2. **Option B:** Thêm event log entries cho provider errors (đã event log, nhưng không đủ chi tiết)
187
- 3. **Option C:** Background-runner tee output vào log file
185
+ 1. **Option A:** In `child-pi.ts` or `team-runner.ts`, write worker output events to background.log
186
+ 2. **Option B:** Add event log entries for provider errors (there is an event log, but not detailed enough)
187
+ 3. **Option C:** Background-runner tees output to a log file
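Options A and C both amount to forwarding each captured chunk to a second destination. A minimal sketch of that idea follows. `Sink`, `tee`, and the `[worker]` prefix are hypothetical, and in practice the log sink would presumably wrap a file append on background.log rather than an in-memory array.

```typescript
// Hypothetical sink type: anything that accepts a chunk of text.
type Sink = (chunk: string) => void;

// Combine several sinks into one, so a single callback feeds all of them.
function tee(...sinks: Sink[]): Sink {
  return (chunk: string): void => {
    for (const sink of sinks) {
      sink(chunk);
    }
  };
}

// In-memory stand-ins for the existing capture buffer and the log file.
const captured: string[] = [];
const logLines: string[] = [];

const onWorkerOutput = tee(
  (chunk) => { captured.push(chunk); },
  (chunk) => { logLines.push(`[worker] ${chunk}`); }
);

onWorkerOutput("rate_limit_error: usage limit exceeded");
console.log(captured, logLines);
```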
188
188
 
189
189
  ### Key file
190
190
 
@@ -194,18 +194,18 @@ pi-crew/src/runtime/async-runner.ts — buildBackgroundSpawnOptions(), spawnBac
194
194
 
195
195
  ---
196
196
 
197
- ## Bug #4: worker-startup.ts thiếu "rate_limited" classification
197
+ ## Bug #4: worker-startup.ts missing "rate_limited" classification
198
198
 
199
199
  | Field | Value |
200
200
  |---|---|
201
201
  | **Severity** | 🟡 LOW |
202
- | **Status** | New — phát hiện trong quá trình debug Bug #1 |
203
- | **Affected** | Error classification reporting |
204
- | **Symptom** | 429 errors classified "unknown" thay "rate_limited" |
202
+ | **Status** | New — discovered while debugging Bug #1 |
203
+ | **Affected** | Error classification and reporting |
204
+ | **Symptom** | 429 errors classified as "unknown" instead of "rate_limited" |
205
205
 
206
- ### Mô tả
206
+ ### Description
207
207
 
208
- `worker-startup.ts` `StartupFailureClassification` type:
208
+ `worker-startup.ts` has the `StartupFailureClassification` type:
209
209
  ```typescript
210
210
  export type StartupFailureClassification =
211
211
  | "trust_required"
@@ -216,11 +216,11 @@ export type StartupFailureClassification =
216
216
  | "unknown";
217
217
  ```
218
218
 
219
- Thiếu `"rate_limited"` `"provider_error"`. Kết quả: 429 errors bị classify `"unknown"`.
219
+ Missing `"rate_limited"` and `"provider_error"`. Result: 429 errors are classified as `"unknown"`.
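A classifier that covers the two missing categories could be sketched like this. It is an illustration only: the real `classifyStartupFailure` signature is not shown in the report, so `classifyProviderFailure`, the reduced `ProviderFailureClass` union, and the matching heuristics are all assumptions.

```typescript
// Reduced, hypothetical union: only the categories discussed here plus the existing "unknown".
type ProviderFailureClass = "rate_limited" | "provider_error" | "unknown";

// Hypothetical classifier working on the raw error text.
function classifyProviderFailure(message: string): ProviderFailureClass {
  // Rate limits: either the textual marker or a standalone 429 status.
  if (/rate.?limit/i.test(message) || /\b429\b/.test(message)) {
    return "rate_limited";
  }
  // Placeholder heuristic for other provider-side failures.
  if (/provider error/i.test(message)) {
    return "provider_error";
  }
  return "unknown";
}

console.log(classifyProviderFailure("rate_limit_error: usage limit exceeded")); // "rate_limited"
console.log(classifyProviderFailure("something else went wrong")); // "unknown"
```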
220
220
 
221
221
  ### Fix
222
222
 
223
- Thêm vào type `classifyStartupFailure` function:
223
+ Add to the type and `classifyStartupFailure` function:
224
224
  ```typescript
225
225
  export type StartupFailureClassification =
226
226
  | "trust_required"
@@ -245,18 +245,18 @@ pi-crew/src/runtime/worker-startup.ts — StartupFailureClassification, classif
245
245
 
246
246
  ---
247
247
 
248
- ## Bug #5: Stale heartbeat notifications sau prune
248
+ ## Bug #5: Stale heartbeat notifications after prune
249
249
 
250
250
  | Field | Value |
251
251
  |---|---|
252
252
  | **Severity** | 🟡 LOW (cosmetic) |
253
253
  | **Status** | Confirmed |
254
254
  | **Affected** | User experience |
255
- | **Symptom** | "Task heartbeat dead" notifications cho runs đã bị xóa |
255
+ | **Symptom** | "Task heartbeat dead" notifications for already-removed runs |
256
256
 
257
- ### Mô tả
257
+ ### Description
258
258
 
259
- Sau khi chạy `team prune --keep=0 --confirm=true`, background watcher vẫn emit notifications cho runs đã prune:
259
+ After running `team prune --keep=0 --confirm=true`, the background watcher still emits notifications for pruned runs:
260
260
 
261
261
  ```
262
262
  → team prune: Removed 9 runs
@@ -267,23 +267,23 @@ Sau khi chạy `team prune --keep=0 --confirm=true`, background watcher vẫn em
267
267
  ... (6+ stale notifications)
268
268
  ```
269
269
 
270
- Mỗi notification trigger `get_subagent_result` → trả về "not found".
270
+ Each notification triggers `get_subagent_result` → returns "not found".
271
271
 
272
- ### Nguyên nhân
272
+ ### Cause
273
273
 
274
- Background watcher duy trì worker health check queue. Khi runs bị prune:
275
- 1. Watcher không deregister ngay
276
- 2. Notifications đã trong queue vẫn emit
277
- 3. Các notifications đến lần lượt, cách nhau vài giây
274
+ The background watcher maintains a worker health-check queue. When runs are pruned:
275
+ 1. The watcher does not deregister immediately
276
+ 2. Notifications already in the queue still emit
277
+ 3. The notifications arrive one by one, a few seconds apart
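One way to address points 1 and 2 is to purge the queue at prune time, or right before emitting. The sketch below assumes a simple array-backed queue. `QueuedNotification` and `dropStaleNotifications` are hypothetical names, and the `runExists` callback mirrors the check named in the Fix section without claiming its real signature.

```typescript
// Hypothetical shape of a queued health-check notification.
interface QueuedNotification {
  runId: string;
  text: string;
}

// Keep only notifications whose run still exists.
function dropStaleNotifications(
  queue: QueuedNotification[],
  runExists: (runId: string) => boolean
): QueuedNotification[] {
  return queue.filter((notification) => runExists(notification.runId));
}

// Usage: run-1 was pruned, run-2 is still on disk.
const liveRuns = ["run-2"];
const queue: QueuedNotification[] = [
  { runId: "run-1", text: "Task 01_assess heartbeat dead." },
  { runId: "run-2", text: "Task 02_build heartbeat dead." },
];
const remaining = dropStaleNotifications(queue, (runId) => liveRuns.indexOf(runId) !== -1);
console.log(remaining);
```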
278
278
 
279
279
  ### Impact
280
280
 
281
- - Confusing cho user: thấy "heartbeat dead" cho runs không còn tồn tại
282
- - Wasted context: mỗi notification trigger 1 tool call để verify
281
+ - Confusing for the user: seeing "heartbeat dead" for runs that no longer exist
282
+ - Wasted context: each notification triggers 1 tool call to verify
283
283
 
284
284
  ### Fix
285
285
 
286
- Background watcher nên check run existence trước khi emit:
286
+ The background watcher should check run existence before emitting:
287
287
  ```typescript
288
288
  // Before emitting heartbeat_dead:
289
289
  if (!runExists(runId)) {
@@ -303,24 +303,24 @@ pi-crew/src/runtime/background-runner.ts — heartbeat monitoring loop
303
303
 
304
304
  # pi-crew v0.2.20 — Bug Report
305
305
 
306
- **Ngày:** 2026-05-19
306
+ **Date:** 2026-05-19
307
307
  **Session:** Comprehensive integration test + root cause analysis
308
308
  **Environment:** linux/x64, Node v22.22.0, Pi CLI v0.75.3, pi-crew v0.2.20
309
309
 
310
310
  ---
311
311
 
312
- ## Bug #1: Background workers "heartbeat dead" — thực chất MiniMax 429 Rate Limit
312
+ ## Bug #1: Background workers "heartbeat dead" — actually a MiniMax 429 Rate Limit
313
313
 
314
314
  | Field | Value |
315
315
  |---|---|
316
316
  | **Severity** | 🔴 HIGH |
317
317
  | **Status** | ✅ Fixed — 429 now retries with fallback models instead of blocking 300s |
318
- | **Affected** | Tất cả background/async workers |
319
- | **Symptom** | Workers timeout sau 300s với "heartbeat dead", zero output |
318
+ | **Affected** | All background/async workers |
319
+ | **Symptom** | Workers time out after 300s with "heartbeat dead", zero output |
320
320
 
321
- ### Mô tả
321
+ ### Description
322
322
 
323
- Khi chạy `team action='run'` với `async=true` hoặc `Agent(run_in_background=true)`, workers spawn thành công (PID tồn tại) nhưng **timeout sau 300s** với generic error:
323
+ When running `team action='run'` with `async=true` or `Agent(run_in_background=true)`, workers spawn successfully (PID exists) but **time out after 300s** with a generic error:
324
324
  ```
325
325
  worker.response_timeout: No output for 300000ms
326
326
  crew.task.heartbeat_dead: Task 01_assess heartbeat dead.
@@ -328,48 +328,48 @@ crew.task.heartbeat_dead: Task 01_assess heartbeat dead.
328
328
 
329
329
  ### Root cause
330
330
 
331
- **Đã fix.** Trước đây 429 rate limit không được retry vì:
332
- 1. `RETRYABLE_MODEL_FAILURE_PATTERNS` `/\b429\b/` nhưng MiniMax trả về `rate_limit_error: usage limit exceeded` (không số 429 rõ ràng)
333
- 2. 429 được fast-fail trong `child-pi.ts onJsonEvent` thay để task-runner xử retry với fallback
331
+ **Fixed.** Previously the 429 rate limit was not retried because:
332
+ 1. `RETRYABLE_MODEL_FAILURE_PATTERNS` had `/\b429\b/` but MiniMax returns `rate_limit_error: usage limit exceeded` (no clear "429" number)
333
+ 2. 429 was fast-failed in `child-pi.ts onJsonEvent` instead of letting the task-runner handle retry with fallback
334
334
 
335
335
  ### Fix applied
336
336
 
337
- 1. **model-fallback.ts**: Thêm `/rate_limit_error/i` vào `RETRYABLE_MODEL_FAILURE_PATTERNS` để nhận diện đúng MiniMax rate limit error
338
- 2. **model-fallback.ts**: Sửa `/\b429\b/` → `/rate.?limit/i` để match nhiều format hơn
339
- 3. **child-pi.ts**: Bỏ fast-fail 429 để task-runner xử retry với model fallback chain
337
+ 1. **model-fallback.ts**: Added `/rate_limit_error/i` to `RETRYABLE_MODEL_FAILURE_PATTERNS` to correctly identify the MiniMax rate limit error
338
+ 2. **model-fallback.ts**: Changed `/\b429\b/` → `/rate.?limit/i` to match more formats
339
+ 3. **child-pi.ts**: Removed the 429 fast-fail — let the task-runner handle retry with the model fallback chain
340
340
 
341
341
  ### Model fallback chain
342
342
 
343
- Khi model chính bị 429:
344
- 1. Fallback sang `fallbackModels` (nếu có cấu hình)
345
- 2. Fallback sang available models khác trong hệ thống
346
- 3. Nếu không fallback retry hết → fail với đúng error message
343
+ When the main model gets a 429:
344
+ 1. Fall back to `fallbackModels` (if configured)
345
+ 2. Fall back to other available models in the system
346
+ 3. If there is no fallback and retries are exhausted → fail with the correct error message
347
347
 
348
- **Cấu hình khuyến nghị:** Thêm `fallbackModels` vào agent config để nhiều lựa chọn khi model chính bị rate limit.
348
+ **Recommended config:** Add `fallbackModels` to the agent config to have more options when the main model is rate-limited.
349
349
 
350
350
  ---
351
351
 
352
- ## Bug #2: child-pi.ts không phát hiện 429 rate limit error — báo sai "heartbeat dead"
352
+ ## Bug #2: child-pi.ts does not detect 429 rate limit error — reports wrong "heartbeat dead"
353
353
 
354
354
  | Field | Value |
355
355
  |---|---|
356
356
  | **Severity** | 🔴 HIGH |
357
- | **Status** | New — phát hiện trong quá trình debug Bug #1 |
358
- | **Affected** | Tất cả child Pi workers |
359
- | **Symptom** | Worker báo generic "No output for 300000ms" thay "Provider rate limit: 429" |
357
+ | **Status** | New — discovered while debugging Bug #1 |
358
+ | **Affected** | All child Pi workers |
359
+ | **Symptom** | Worker reports generic "No output for 300000ms" instead of "Provider rate limit: 429" |
360
360
 
361
- ### Mô tả
361
+ ### Description
362
362
 
363
- Pi CLI output JSON events cho 429 errors rất rõ ràng:
363
+ The Pi CLI outputs JSON events for 429 errors very clearly:
364
364
  ```json
365
365
  {"type":"turn_end","message":{"stopReason":"error","errorMessage":"429 {\"type\":\"error\",\"error\":{\"type\":\"rate_limit_error\"...}}"}}
366
366
  ```
367
367
 
368
- Nhưng `child-pi.ts` **không parse error events** — chỉ quan tâm đến:
369
- - `isFinalAssistantEvent()` — để trigger final drain
370
- - `turn_end` — để đếm turns cho turn limiting
368
+ But `child-pi.ts` **does not parse error events** — it only cares about:
369
+ - `isFinalAssistantEvent()` — to trigger the final drain
370
+ - `turn_end` — to count turns for turn limiting
371
371
 
372
- Kết quả: child-pi thấy output (JSON events), **restart heartbeat timer**, nhưng **không nhận ra đây error**. Pi block sau 3 retries → heartbeat timeout 300s → generic error message.
372
+ Result: child-pi sees output (JSON events), **restarts the heartbeat timer**, but **does not recognize it as an error**. Pi blocks after 3 retries → heartbeat times out at 300s → generic error message.
373
373
 
374
374
  ### Code location
375
375
 
@@ -377,7 +377,7 @@ Kết quả: child-pi thấy output (JSON events), **restart heartbeat timer**,
377
377
  ```typescript
378
378
  onJsonEvent: (event) => {
379
379
  restartNoResponseTimer();
380
- // Turn-count-based steering: chỉ đếm turns, KHÔNG check errors
380
+ // Turn-count-based steering: only counts turns, does NOT check errors
381
381
  if (event && typeof event === "object" && !Array.isArray(event)) {
382
382
  const obj = event as Record<string, unknown>;
383
383
  if (obj.type === "turn_end") {
@@ -391,7 +391,7 @@ onJsonEvent: (event) => {
391
391
 
392
392
  ### Fix
393
393
 
394
- Thêm provider error detection trong `onJsonEvent`:
394
+ Add provider error detection in `onJsonEvent`:
395
395
  ```typescript
396
396
  let providerError: string | undefined;
397
397
 
@@ -408,42 +408,42 @@ if (obj.type === "turn_end" && obj.message?.stopReason === "error") {
408
408
 
409
409
  ### Impact
410
410
 
411
- Fix này sẽ chuyển error message từ:
411
+ This fix changes the error message from:
412
412
  ```
413
413
  ❌ "Child Pi produced no new output for 300000ms; process was terminated as unresponsive."
414
414
  ```
415
- Thành:
415
+ to:
416
416
  ```
417
417
  ✅ "Provider rate limit: 429 rate_limit_error: usage limit exceeded, resets at 2026-05-19T05:00:00Z"
418
418
  ```
419
419
 
420
- **fail fast** thay đợi 300s.
420
+ And it **fails fast** instead of waiting 300s.
421
421
 
422
422
  ---
423
423
 
424
- ## Bug #3: background.log dụngkhông capture worker output
424
+ ## Bug #3: background.log is useless (does not capture worker output)
425
425
 
426
426
  | Field | Value |
427
427
  |---|---|
428
428
  | **Severity** | 🟠 MEDIUM |
429
- | **Status** | New — phát hiện trong quá trình debug Bug #1 |
430
- | **Affected** | Debugging experience cho tất cả background runs |
431
- | **Symptom** | background.log chỉ chứa 1 dòng: `[pi-crew] background loader=jiti` |
429
+ | **Status** | New — discovered while debugging Bug #1 |
430
+ | **Affected** | Debugging experience for all background runs |
431
+ | **Symptom** | background.log contains only 1 line: `[pi-crew] background loader=jiti` |
432
432
 
433
- ### Mô tả
433
+ ### Description
434
434
 
435
- Khi background worker fail, log file tại `.crew/state/runs/<id>/background.log` chỉ chứa:
435
+ When a background worker fails, the log file at `.crew/state/runs/<id>/background.log` contains only:
436
436
  ```
437
437
  [pi-crew] background loader=jiti
438
438
  ```
439
439
 
440
- Không có:
440
+ Missing:
441
441
  - Worker stdout/stderr
442
442
  - Error messages
443
443
  - Provider responses
444
444
  - Exit codes
445
445
 
446
- ### Nguyên nhân
446
+ ### Cause
447
447
 
448
448
  `async-runner.ts` line 130-145:
449
449
  ```typescript
@@ -462,22 +462,22 @@ return {
462
462
  };
463
463
  ```
464
464
 
465
- **stdout/stderr của background-runner** được ghi vào background.log. Nhưng **child Pi workers** (spawn bởi background-runner qua child-pi.ts) **output vào child-pi's pipe**, KHÔNG vào background.log.
465
+ **The stdout/stderr of the background-runner** is written to background.log. But **child Pi workers** (spawned by the background-runner via child-pi.ts) **output to child-pi's pipe**, NOT to background.log.
466
466
 
467
467
  Flow:
468
468
  ```
469
469
  background-runner.ts (stdout→logFd, stderr→logFd)
470
- → loader=jiti → ghi vào log ✅
470
+ → loader=jiti → writes to log ✅
471
471
  → executeTeamRun()
472
- → child-pi.ts spawn child Pi (stdout→pipe, stderr→pipe)
473
- → Pi output → child-pi.ts captures →KHÔNG GHI VÀO background.log ❌
472
+ → child-pi.ts spawns child Pi (stdout→pipe, stderr→pipe)
473
+ → Pi output → child-pi.ts captures → DOES NOT WRITE TO background.log ❌
474
474
  ```
475
475
 
476
476
  ### Fix
477
477
 
478
- 1. **Option A:** Trong `child-pi.ts` hoặc `team-runner.ts`, ghi worker output events vào background.log
479
- 2. **Option B:** Thêm event log entries cho provider errors (đã event log, nhưng không đủ chi tiết)
480
- 3. **Option C:** Background-runner tee output vào log file
478
+ 1. **Option A:** In `child-pi.ts` or `team-runner.ts`, write worker output events to background.log
479
+ 2. **Option B:** Add event log entries for provider errors (there is an event log, but not detailed enough)
480
+ 3. **Option C:** Background-runner tees output to a log file
481
481
 
482
482
  ### Key file
483
483
 
@@ -487,18 +487,18 @@ pi-crew/src/runtime/async-runner.ts — buildBackgroundSpawnOptions(), spawnBac
487
487
 
488
488
  ---
489
489
 
490
- ## Bug #4: worker-startup.ts thiếu "rate_limited" classification
490
+ ## Bug #4: worker-startup.ts missing "rate_limited" classification
491
491
 
492
492
  | Field | Value |
493
493
  |---|---|
494
494
  | **Severity** | 🟡 LOW |
495
- | **Status** | New — phát hiện trong quá trình debug Bug #1 |
496
- | **Affected** | Error classification reporting |
497
- | **Symptom** | 429 errors classified "unknown" thay "rate_limited" |
495
+ | **Status** | New — discovered while debugging Bug #1 |
496
+ | **Affected** | Error classification and reporting |
497
+ | **Symptom** | 429 errors classified as "unknown" instead of "rate_limited" |
498
498
 
499
- ### Mô tả
499
+ ### Description
500
500
 
501
- `worker-startup.ts` `StartupFailureClassification` type:
501
+ `worker-startup.ts` has the `StartupFailureClassification` type:
502
502
  ```typescript
503
503
  export type StartupFailureClassification =
504
504
  | "trust_required"
@@ -509,11 +509,11 @@ export type StartupFailureClassification =
509
509
  | "unknown";
510
510
  ```
511
511
 
512
- Thiếu `"rate_limited"` `"provider_error"`. Kết quả: 429 errors bị classify `"unknown"`.
512
+ Missing `"rate_limited"` and `"provider_error"`. Result: 429 errors are classified as `"unknown"`.
513
513
 
514
514
  ### Fix
515
515
 
516
- Thêm vào type `classifyStartupFailure` function:
516
+ Add to the type and `classifyStartupFailure` function:
517
517
  ```typescript
518
518
  export type StartupFailureClassification =
519
519
  | "trust_required"
@@ -538,18 +538,18 @@ pi-crew/src/runtime/worker-startup.ts — StartupFailureClassification, classif
538
538
 
539
539
  ---
540
540
 
541
- ## Bug #5: Stale heartbeat notifications sau prune
541
+ ## Bug #5: Stale heartbeat notifications after prune
542
542
 
543
543
  | Field | Value |
544
544
  |---|---|
545
545
  | **Severity** | 🟡 LOW (cosmetic) |
546
546
  | **Status** | Confirmed |
547
547
  | **Affected** | User experience |
548
- | **Symptom** | "Task heartbeat dead" notifications cho runs đã bị xóa |
548
+ | **Symptom** | "Task heartbeat dead" notifications for already-removed runs |
549
549
 
550
- ### Mô tả
550
+ ### Description
551
551
 
552
- Sau khi chạy `team prune --keep=0 --confirm=true`, background watcher vẫn emit notifications cho runs đã prune:
552
+ After running `team prune --keep=0 --confirm=true`, the background watcher still emits notifications for pruned runs:
553
553
 
554
554
  ```
555
555
  → team prune: Removed 9 runs
@@ -560,23 +560,23 @@ Sau khi chạy `team prune --keep=0 --confirm=true`, background watcher vẫn em
560
560
  ... (6+ stale notifications)
561
561
  ```
562
562
 
563
- Mỗi notification trigger `get_subagent_result` → trả về "not found".
563
+ Each notification triggers `get_subagent_result` → returns "not found".
564
564
 
565
- ### Nguyên nhân
565
+ ### Cause
566
566
 
567
- Background watcher duy trì worker health check queue. Khi runs bị prune:
568
- 1. Watcher không deregister ngay
569
- 2. Notifications đã trong queue vẫn emit
570
- 3. Các notifications đến lần lượt, cách nhau vài giây
567
+ The background watcher maintains a worker health-check queue. When runs are pruned:
568
+ 1. The watcher does not deregister immediately
569
+ 2. Notifications already in the queue still emit
570
+ 3. The notifications arrive one by one, a few seconds apart
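The three steps above suggest filtering the queue at emit time rather than relying on immediate deregistration. A rough sketch, assuming a hypothetical `HeartbeatNotification` shape and a caller-supplied `runExists` predicate (the real watcher's data structures are not shown in this report):

```typescript
// Hypothetical notification shape, for illustration only.
interface HeartbeatNotification {
  runId: string;
  taskId: string;
}

// Keep only notifications whose run still exists at emit time.
function dropStaleNotifications(
  queue: HeartbeatNotification[],
  runExists: (runId: string) => boolean,
): HeartbeatNotification[] {
  return queue.filter((notification) => runExists(notification.runId));
}

// Example: run_a was pruned, run_b is still present.
const liveRuns = new Set<string>(["run_b"]);
const queued: HeartbeatNotification[] = [
  { runId: "run_a", taskId: "01_assess" },
  { runId: "run_b", taskId: "01_assess" },
];
const toEmit = dropStaleNotifications(queued, (id) => liveRuns.has(id));
```

Because the check runs when a notification is about to be emitted, entries queued before the prune are dropped as well.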
571
571
 
572
572
  ### Impact
573
573
 
574
- - Confusing cho user: thấy "heartbeat dead" cho runs không còn tồn tại
575
- - Wasted context: mỗi notification trigger 1 tool call để verify
574
+ - Confusing for the user: seeing "heartbeat dead" for runs that no longer exist
575
+ - Wasted context: each notification triggers 1 tool call to verify
576
576
 
577
577
  ### Fix
578
578
 
579
- Background watcher nên check run existence trước khi emit:
579
+ The background watcher should check run existence before emitting:
580
580
  ```typescript
581
581
  // Before emitting heartbeat_dead:
582
582
  if (!runExists(runId)) {
@@ -596,24 +596,24 @@ pi-crew/src/runtime/background-runner.ts — heartbeat monitoring loop
596
596
 
597
597
  # pi-crew v0.2.20 — Bug Report
598
598
 
599
- **Ngày:** 2026-05-19
599
+ **Date:** 2026-05-19
600
600
  **Session:** Comprehensive integration test + root cause analysis
601
601
  **Environment:** linux/x64, Node v22.22.0, Pi CLI v0.75.3, pi-crew v0.2.20
602
602
 
603
603
  ---
604
604
 
605
- ## Bug #1: Background workers "heartbeat dead" — thực chất MiniMax 429 Rate Limit
605
+ ## Bug #1: Background workers "heartbeat dead" — actually a MiniMax 429 Rate Limit
606
606
 
607
607
  | Field | Value |
608
608
  |---|---|
609
609
  | **Severity** | 🔴 HIGH |
610
610
  | **Status** | ✅ Fixed — 429 now retries with fallback models instead of blocking 300s |
611
- | **Affected** | Tất cả background/async workers |
612
- | **Symptom** | Workers timeout sau 300s với "heartbeat dead", zero output |
611
+ | **Affected** | All background/async workers |
612
+ | **Symptom** | Workers time out after 300s with "heartbeat dead", zero output |
613
613
 
614
- ### Mô tả
614
+ ### Description
615
615
 
616
- Khi chạy `team action='run'` với `async=true` hoặc `Agent(run_in_background=true)`, workers spawn thành công (PID tồn tại) nhưng **timeout sau 300s** với generic error:
616
+ When running `team action='run'` with `async=true` or `Agent(run_in_background=true)`, workers spawn successfully (PID exists) but **time out after 300s** with a generic error:
617
617
  ```
618
618
  worker.response_timeout: No output for 300000ms
619
619
  crew.task.heartbeat_dead: Task 01_assess heartbeat dead.
@@ -621,48 +621,48 @@ crew.task.heartbeat_dead: Task 01_assess heartbeat dead.
621
621
 
622
622
  ### Root cause
623
623
 
624
- **Đã fix.** Trước đây 429 rate limit không được retry vì:
625
- 1. `RETRYABLE_MODEL_FAILURE_PATTERNS` `/\b429\b/` nhưng MiniMax trả về `rate_limit_error: usage limit exceeded` (không số 429 rõ ràng)
626
- 2. 429 được fast-fail trong `child-pi.ts onJsonEvent` thay để task-runner xử retry với fallback
624
+ **Fixed.** Previously the 429 rate limit was not retried because:
625
+ 1. `RETRYABLE_MODEL_FAILURE_PATTERNS` had `/\b429\b/` but MiniMax returns `rate_limit_error: usage limit exceeded` (no clear "429" number)
626
+ 2. 429 was fast-failed in `child-pi.ts onJsonEvent` instead of letting the task-runner handle retry with fallback
627
627
 
628
628
  ### Fix applied
629
629
 
630
- 1. **model-fallback.ts**: Thêm `/rate_limit_error/i` vào `RETRYABLE_MODEL_FAILURE_PATTERNS` để nhận diện đúng MiniMax rate limit error
631
- 2. **model-fallback.ts**: Sửa `/\b429\b/` → `/rate.?limit/i` để match nhiều format hơn
632
- 3. **child-pi.ts**: Bỏ fast-fail 429 để task-runner xử retry với model fallback chain
630
+ 1. **model-fallback.ts**: Added `/rate_limit_error/i` to `RETRYABLE_MODEL_FAILURE_PATTERNS` to correctly identify the MiniMax rate limit error
631
+ 2. **model-fallback.ts**: Changed `/\b429\b/` → `/rate.?limit/i` to match more formats
632
+ 3. **child-pi.ts**: Removed the 429 fast-fail — let the task-runner handle retry with the model fallback chain
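The effect of the pattern change can be checked directly against the MiniMax message quoted above. This is an illustrative sketch only: the variable names are hypothetical and the full contents of `RETRYABLE_MODEL_FAILURE_PATTERNS` are not shown in this report.

```typescript
// Error text as quoted in the report (no literal "429" in it).
const minimaxMessage = "rate_limit_error: usage limit exceeded";

// Pattern before the fix: requires the number 429 as a standalone token.
const oldPattern = /\b429\b/;
// Patterns named in the fix.
const newPatterns = [/rate_limit_error/i, /rate.?limit/i];

const matchedBefore = oldPattern.test(minimaxMessage);
const matchedAfter = newPatterns.some((pattern) => pattern.test(minimaxMessage));

// The looser pattern also covers common spelling variants.
const variants = ["Rate limit reached", "ratelimit exceeded", "rate-limit hit"];
const allVariantsMatch = variants.every((text) => /rate.?limit/i.test(text));
```

Under these two patterns alone, a message that contains only "429" with no "rate limit" wording would not match. Whether the real list keeps a numeric pattern as well is not stated here.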
633
633
 
634
634
  ### Model fallback chain
635
635
 
636
- Khi model chính bị 429:
637
- 1. Fallback sang `fallbackModels` (nếu có cấu hình)
638
- 2. Fallback sang available models khác trong hệ thống
639
- 3. Nếu không fallback retry hết → fail với đúng error message
636
+ When the main model gets a 429:
637
+ 1. Fall back to `fallbackModels` (if configured)
638
+ 2. Fall back to other available models in the system
639
+ 3. If there is no fallback and retries are exhausted → fail with the correct error message
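The fallback order above could be sketched as follows. `buildCandidateChain`, `runWithFallback` and the model names are hypothetical and only illustrate the ordering; they are not the task-runner's actual API.

```typescript
// Order candidates: primary first, then configured fallbacks, then any
// other available models. Duplicates keep their first position.
function buildCandidateChain(
  primary: string,
  fallbackModels: string[],
  availableModels: string[],
): string[] {
  const ordered = [primary, ...fallbackModels, ...availableModels];
  return ordered.filter((model, index) => ordered.indexOf(model) === index);
}

// Try each model in turn; move on only when the failure is retryable.
async function runWithFallback(
  chain: string[],
  attempt: (model: string) => Promise<string>,
  isRetryable: (error: Error) => boolean,
): Promise<string> {
  let lastError: Error | undefined;
  for (const model of chain) {
    try {
      return await attempt(model);
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err));
      if (!isRetryable(lastError)) throw lastError;
    }
  }
  // Chain exhausted: surface the last real error instead of a generic timeout.
  throw lastError ?? new Error("no models configured");
}

const chain = buildCandidateChain(
  "primary-model",
  ["fallback-a", "primary-model"],
  ["fallback-b", "fallback-a"],
);
```

Rethrowing the last provider error once the chain is exhausted is what lets the run fail with the correct message.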
640
640
 
641
- **Cấu hình khuyến nghị:** Thêm `fallbackModels` vào agent config để nhiều lựa chọn khi model chính bị rate limit.
641
+ **Recommended config:** Add `fallbackModels` to the agent config to have more options when the main model is rate-limited.
642
642
 
643
643
  ---
644
644
 
645
- ## Bug #2: child-pi.ts không phát hiện 429 rate limit error — báo sai "heartbeat dead"
645
+ ## Bug #2: child-pi.ts does not detect 429 rate limit error — reports wrong "heartbeat dead"
646
646
 
647
647
  | Field | Value |
648
648
  |---|---|
649
649
  | **Severity** | 🔴 HIGH |
650
- | **Status** | New — phát hiện trong quá trình debug Bug #1 |
651
- | **Affected** | Tất cả child Pi workers |
652
- | **Symptom** | Worker báo generic "No output for 300000ms" thay "Provider rate limit: 429" |
650
+ | **Status** | New — discovered while debugging Bug #1 |
651
+ | **Affected** | All child Pi workers |
652
+ | **Symptom** | Worker reports generic "No output for 300000ms" instead of "Provider rate limit: 429" |
653
653
 
654
- ### Mô tả
654
+ ### Description
655
655
 
656
- Pi CLI output JSON events cho 429 errors rất rõ ràng:
656
+ The Pi CLI outputs JSON events for 429 errors very clearly:
657
657
  ```json
658
658
  {"type":"turn_end","message":{"stopReason":"error","errorMessage":"429 {\"type\":\"error\",\"error\":{\"type\":\"rate_limit_error\"...}}"}}
659
659
  ```
660
660
 
661
- Nhưng `child-pi.ts` **không parse error events** — chỉ quan tâm đến:
662
- - `isFinalAssistantEvent()` — để trigger final drain
663
- - `turn_end` — để đếm turns cho turn limiting
661
+ But `child-pi.ts` **does not parse error events** — it only cares about:
662
+ - `isFinalAssistantEvent()` — to trigger the final drain
663
+ - `turn_end` — to count turns for turn limiting
664
664
 
665
- Kết quả: child-pi thấy output (JSON events), **restart heartbeat timer**, nhưng **không nhận ra đây error**. Pi block sau 3 retries → heartbeat timeout 300s → generic error message.
665
+ Result: child-pi sees output (JSON events), **restarts the heartbeat timer**, but **does not recognize it as an error**. Pi blocks after 3 retries → heartbeat times out at 300s → generic error message.
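For illustration, the missing check could look roughly like this. `extractProviderError` is a hypothetical helper; the event shape is taken from the sample JSON above, with the error message shortened.

```typescript
// Return the provider error carried by a turn_end event, if any.
function extractProviderError(event: unknown): string | undefined {
  if (!event || typeof event !== "object" || Array.isArray(event)) return undefined;
  const obj = event as Record<string, unknown>;
  if (obj.type !== "turn_end") return undefined;

  const message = obj.message;
  if (!message || typeof message !== "object") return undefined;
  const msg = message as Record<string, unknown>;
  if (msg.stopReason !== "error") return undefined;

  // Fall back to a placeholder when the provider gave no message text.
  return typeof msg.errorMessage === "string" ? msg.errorMessage : "unknown provider error";
}

// Shortened version of the sample event shown above.
const sampleEvent: unknown = JSON.parse(
  '{"type":"turn_end","message":{"stopReason":"error","errorMessage":"429 rate_limit_error: usage limit exceeded"}}',
);
const detected = extractProviderError(sampleEvent);
const normalTurn = extractProviderError({ type: "turn_end", message: { stopReason: "stop" } });
```

A caller could record the returned string and use it in the failure message instead of the generic no-output timeout.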
666
666
 
667
667
  ### Code location
668
668
 
@@ -670,7 +670,7 @@ Kết quả: child-pi thấy output (JSON events), **restart heartbeat timer**,
670
670
  ```typescript
671
671
  onJsonEvent: (event) => {
672
672
  restartNoResponseTimer();
673
- // Turn-count-based steering: chỉ đếm turns, KHÔNG check errors
673
+ // Turn-count-based steering: only counts turns, does NOT check errors
674
674
  if (event && typeof event === "object" && !Array.isArray(event)) {
675
675
  const obj = event as Record<string, unknown>;
676
676
  if (obj.type === "turn_end") {
@@ -684,7 +684,7 @@ onJsonEvent: (event) => {
684
684
 
685
685
  ### Fix
686
686
 
687
- Thêm provider error detection trong `onJsonEvent`:
687
+ Add provider error detection in `onJsonEvent`:
688
688
  ```typescript
689
689
  let providerError: string | undefined;
690
690
 
@@ -701,42 +701,42 @@ if (obj.type === "turn_end" && obj.message?.stopReason === "error") {
701
701
 
702
702
  ### Impact
703
703
 
704
- Fix này sẽ chuyển error message từ:
704
+ This fix changes the error message from:
705
705
  ```
706
706
  ❌ "Child Pi produced no new output for 300000ms; process was terminated as unresponsive."
707
707
  ```
708
- Thành:
708
+ to:
709
709
  ```
710
710
  ✅ "Provider rate limit: 429 rate_limit_error: usage limit exceeded, resets at 2026-05-19T05:00:00Z"
711
711
  ```
712
712
 
713
- **fail fast** thay đợi 300s.
713
+ And it **fails fast** instead of waiting 300s.
714
714
 
715
715
  ---
716
716
 
717
- ## Bug #3: background.log vô dụng, không capture worker output
717
+ ## Bug #3: background.log is useless, does not capture worker output
718
718
 
719
719
  | Field | Value |
720
720
  |---|---|
721
721
  | **Severity** | 🟠 MEDIUM |
722
- | **Status** | New — phát hiện trong quá trình debug Bug #1 |
723
- | **Affected** | Debugging experience cho tất cả background runs |
724
- | **Symptom** | background.log chỉ chứa 1 dòng: `[pi-crew] background loader=jiti` |
722
+ | **Status** | New — discovered while debugging Bug #1 |
723
+ | **Affected** | Debugging experience for all background runs |
724
+ | **Symptom** | background.log contains only 1 line: `[pi-crew] background loader=jiti` |
725
725
 
726
- ### Mô tả
726
+ ### Description
727
727
 
728
- Khi background worker fail, log file tại `.crew/state/runs/<id>/background.log` chỉ chứa:
728
+ When a background worker fails, the log file at `.crew/state/runs/<id>/background.log` contains only:
729
729
  ```
730
730
  [pi-crew] background loader=jiti
731
731
  ```
732
732
 
733
- Không có:
733
+ Missing:
734
734
  - Worker stdout/stderr
735
735
  - Error messages
736
736
  - Provider responses
737
737
  - Exit codes
738
738
 
739
- ### Nguyên nhân
739
+ ### Cause
740
740
 
741
741
  `async-runner.ts` line 130-145:
742
742
  ```typescript
@@ -755,22 +755,22 @@ return {
755
755
  };
756
756
  ```
757
757
 
758
- **stdout/stderr của background-runner** được ghi vào background.log. Nhưng **child Pi workers** (spawn bởi background-runner qua child-pi.ts) **output vào child-pi's pipe**, KHÔNG vào background.log.
758
+ **The stdout/stderr of the background-runner** is written to background.log. But **child Pi workers** (spawned by the background-runner via child-pi.ts) **output to child-pi's pipe**, NOT to background.log.
759
759
 
760
760
  Flow:
761
761
  ```
762
762
  background-runner.ts (stdout→logFd, stderr→logFd)
763
- → loader=jiti → ghi vào log ✅
763
+ → loader=jiti → writes to log ✅
764
764
  → executeTeamRun()
765
- → child-pi.ts spawn child Pi (stdout→pipe, stderr→pipe)
766
- → Pi output → child-pi.ts captures → KHÔNG GHI VÀO background.log ❌
765
+ → child-pi.ts spawns child Pi (stdout→pipe, stderr→pipe)
766
+ → Pi output → child-pi.ts captures → DOES NOT WRITE TO background.log ❌
767
767
  ```
768
768
 
769
769
  ### Fix
770
770
 
771
- 1. **Option A:** Trong `child-pi.ts` hoặc `team-runner.ts`, ghi worker output events vào background.log
772
- 2. **Option B:** Thêm event log entries cho provider errors (đã event log, nhưng không đủ chi tiết)
773
- 3. **Option C:** Background-runner tee output vào log file
771
+ 1. **Option A:** In `child-pi.ts` or `team-runner.ts`, write worker output events to background.log
772
+ 2. **Option B:** Add event log entries for provider errors (there is an event log, but not detailed enough)
773
+ 3. **Option C:** Background-runner tees output to a log file
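Option A might be sketched as below, assuming a hypothetical `logWorkerEvent` helper and an injected sink. The log line format is invented for illustration and is not the format pi-crew writes.

```typescript
// A sink receives one formatted log line at a time.
type LogSink = (line: string) => void;

// Format a worker JSON event as a single line and hand it to the sink.
function logWorkerEvent(
  sink: LogSink,
  taskId: string,
  event: Record<string, unknown>,
): void {
  const type = typeof event.type === "string" ? event.type : "unknown";
  sink(`[pi-crew] worker=${taskId} event=${type} ${JSON.stringify(event)}`);
}

// In-memory sink for demonstration purposes.
const lines: string[] = [];
logWorkerEvent((line) => lines.push(line), "01_assess", {
  type: "turn_end",
  message: { stopReason: "error" },
});
```

In a real runner the sink could append each line to `background.log`, for example with Node's `fs.appendFileSync`. Injecting the sink keeps the formatting logic testable without touching the file system.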
774
774
 
775
775
  ### Key file
776
776
 
@@ -780,18 +780,18 @@ pi-crew/src/runtime/async-runner.ts — buildBackgroundSpawnOptions(), spawnBac
780
780
 
781
781
  ---
782
782
 
783
- ## Bug #4: worker-startup.ts thiếu "rate_limited" classification
783
+ ## Bug #4: worker-startup.ts missing "rate_limited" classification
784
784
 
785
785
  | Field | Value |
786
786
  |---|---|
787
787
  | **Severity** | 🟡 LOW |
788
- | **Status** | New — phát hiện trong quá trình debug Bug #1 |
789
- | **Affected** | Error classification reporting |
790
- | **Symptom** | 429 errors classified "unknown" thay "rate_limited" |
788
+ | **Status** | New — discovered while debugging Bug #1 |
789
+ | **Affected** | Error classification and reporting |
790
+ | **Symptom** | 429 errors classified as "unknown" instead of "rate_limited" |
791
791
 
792
- ### Mô tả
792
+ ### Description
793
793
 
794
- `worker-startup.ts` `StartupFailureClassification` type:
794
+ `worker-startup.ts` has the `StartupFailureClassification` type:
795
795
  ```typescript
796
796
  export type StartupFailureClassification =
797
797
  | "trust_required"
@@ -802,11 +802,11 @@ export type StartupFailureClassification =
802
802
  | "unknown";
803
803
  ```
804
804
 
805
- Thiếu `"rate_limited"` `"provider_error"`. Kết quả: 429 errors bị classify `"unknown"`.
805
+ Missing `"rate_limited"` and `"provider_error"`. Result: 429 errors are classified as `"unknown"`.
806
806
 
807
807
  ### Fix
808
808
 
809
- Thêm vào type `classifyStartupFailure` function:
809
+ Add to the type and `classifyStartupFailure` function:
810
810
  ```typescript
811
811
  export type StartupFailureClassification =
812
812
  | "trust_required"
@@ -831,18 +831,18 @@ pi-crew/src/runtime/worker-startup.ts — StartupFailureClassification, classif
831
831
 
832
832
  ---
833
833
 
834
- ## Bug #5: Stale heartbeat notifications sau prune
834
+ ## Bug #5: Stale heartbeat notifications after prune
835
835
 
836
836
  | Field | Value |
837
837
  |---|---|
838
838
  | **Severity** | 🟡 LOW (cosmetic) |
839
839
  | **Status** | Confirmed |
840
840
  | **Affected** | User experience |
841
- | **Symptom** | "Task heartbeat dead" notifications cho runs đã bị xóa |
841
+ | **Symptom** | "Task heartbeat dead" notifications for already-removed runs |
842
842
 
843
- ### Mô tả
843
+ ### Description
844
844
 
845
- Sau khi chạy `team prune --keep=0 --confirm=true`, background watcher vẫn emit notifications cho runs đã prune:
845
+ After running `team prune --keep=0 --confirm=true`, the background watcher still emits notifications for pruned runs:
846
846
 
847
847
  ```
848
848
  → team prune: Removed 9 runs
@@ -853,23 +853,23 @@ Sau khi chạy `team prune --keep=0 --confirm=true`, background watcher vẫn em
853
853
  ... (6+ stale notifications)
854
854
  ```
855
855
 
856
- Mỗi notification trigger `get_subagent_result` → trả về "not found".
856
+ Each notification triggers `get_subagent_result` → returns "not found".
857
857
 
858
- ### Nguyên nhân
858
+ ### Cause
859
859
 
860
- Background watcher duy trì worker health check queue. Khi runs bị prune:
861
- 1. Watcher không deregister ngay
862
- 2. Notifications đã trong queue vẫn emit
863
- 3. Các notifications đến lần lượt, cách nhau vài giây
860
+ The background watcher maintains a worker health-check queue. When runs are pruned:
861
+ 1. The watcher does not deregister immediately
862
+ 2. Notifications already in the queue still emit
863
+ 3. The notifications arrive one by one, a few seconds apart
864
864
 
865
865
  ### Impact
866
866
 
867
- - Confusing cho user: thấy "heartbeat dead" cho runs không còn tồn tại
868
- - Wasted context: mỗi notification trigger 1 tool call để verify
867
+ - Confusing for the user: seeing "heartbeat dead" for runs that no longer exist
868
+ - Wasted context: each notification triggers 1 tool call to verify
869
869
 
870
870
  ### Fix
871
871
 
872
- Background watcher nên check run existence trước khi emit:
872
+ The background watcher should check run existence before emitting:
873
873
  ```typescript
874
874
  // Before emitting heartbeat_dead:
875
875
  if (!runExists(runId)) {
@@ -887,18 +887,18 @@ pi-crew/src/runtime/background-runner.ts — heartbeat monitoring loop
887
887
 
888
888
  ---
889
889
 
890
- ## Bug #6: Live-session run bị cancel giữa chừng
890
+ ## Bug #6: Live-session run cancelled mid-execution
891
891
 
892
892
  | Field | Value |
893
893
  |---|---|
894
894
  | **Severity** | 🟠 MEDIUM |
895
- | **Status** | ✅ Confirmed — no code fix needed; documented as user workflow constraint |
895
+ | **Status** | ✅ Confirmed — no code fix needed; documented as a user workflow constraint |
896
896
  | **Affected** | Foreground team runs |
897
- | **Symptom** | Run cancelled sau khi explore phase hoàn thành, trước execute phase |
897
+ | **Symptom** | Run cancelled after the explore phase completes, before the execute phase |
898
898
 
899
- ### Mô tả
899
+ ### Description
900
900
 
901
- Fast-fix team chạy live-session:
901
+ A fast-fix team ran in a live-session:
902
902
  ```
903
903
  04:12:20 live-session.prompt_start 01_explore
904
904
  04:12:51 live-session.prompt_done 01_explore (31s, completed)
@@ -907,18 +907,18 @@ Fast-fix team chạy live-session:
907
907
  04:12:51 run.cancelled: "This operation was aborted"
908
908
  ```
909
909
 
910
- Task `01_explore` hoàn thành thành công, nhưng run bị cancelled trước khi `02_execute` bắt đầu.
910
+ Task `01_explore` completed successfully, but the run was cancelled before `02_execute` started.
911
911
 
912
- ### Nguyên nhân có thể
912
+ ### Possible causes
913
913
 
914
- 1. **Session concurrency limit** — chỉ 1 live-session active, conflict với parallel test operations
914
+ 1. **Session concurrency limit** — only 1 active live-session, conflicting with parallel test operations
915
915
  2. **User-initiated cancellation** — accidentally triggered
916
- 3. **Workflow phase transition bug** — không trigger next phase sau explore completes
916
+ 3. **Workflow phase transition bug** — does not trigger the next phase after explore completes
917
917
 
918
- ### Cần thêm investigation
918
+ ### Needs further investigation
919
919
 
920
- - Chạy lại fast-fix team đơn lẻ (không concurrent operations)
921
- - Check live-session-runtime.ts cho phase transition logic
920
+ - Run the fast-fix team standalone (no concurrent operations)
921
+ - Check live-session-runtime.ts for phase-transition logic
922
922
 
923
923
  ---
924
924
 
@@ -926,16 +926,16 @@ Task `01_explore` hoàn thành thành công, nhưng run bị cancelled trước
926
926
 
927
927
  | # | Bug | Severity | Status | Category |
928
928
  |---|---|---|---|---|
929
- | 1 | Background workers timeout do MiniMax 429 | 🔴 HIGH | ✅ Fixed — 429 now retries with fallback models via improved RETRYABLE_MODEL_FAILURE_PATTERNS | Code |
930
- | 2 | child-pi.ts không phát hiện 429, báo sai "heartbeat dead" | 🔴 HIGH | ✅ Fixed — removed fast-fail 429; let task-runner handle retry+fallback | Code |
931
- | 3 | background.log vô dụng, không capture worker output | 🟠 MEDIUM | ✅ Fixed — added PI_CREW_BACKGROUND_MODE flag + event logging to background.log | Observability |
932
- | 4 | worker-startup.ts thiếu rate_limited classification | 🟡 LOW | ✅ Fixed — added rate_limited + provider_error to StartupFailureClassification | Code |
933
- | 5 | Stale heartbeat notifications sau prune | 🟡 LOW | ✅ Fixed — HeartbeatWatcher skips pruned runs via stateRoot existence check | UX |
934
- | 6 | Live-session foreground run bị cancel khi concurrent tool calls | 🟠 MEDIUM | ✅ Confirmed — concurrent calls interrupt live-session → outputLength:0 → caller_cancelled. Avoid concurrent team actions during foreground runs. | Runtime |
935
- | 7 | Async notifier "stale ctx" — dies, không restart sau Pi restart | 🔴 HIGH | ✅ Fixed — swallow stale error, isCurrent guard handles dormancy | Code |
929
+ | 1 | Background workers timeout due to MiniMax 429 | 🔴 HIGH | ✅ Fixed — 429 now retries with fallback models via improved RETRYABLE_MODEL_FAILURE_PATTERNS | Code |
930
+ | 2 | child-pi.ts does not detect 429, reports wrong "heartbeat dead" | 🔴 HIGH | ✅ Fixed — removed 429 fast-fail; let task-runner handle retry+fallback | Code |
931
+ | 3 | background.log useless, does not capture worker output | 🟠 MEDIUM | ✅ Fixed — added PI_CREW_BACKGROUND_MODE flag + event logging to background.log | Observability |
932
+ | 4 | worker-startup.ts missing rate_limited classification | 🟡 LOW | ✅ Fixed — added rate_limited + provider_error to StartupFailureClassification | Code |
933
+ | 5 | Stale heartbeat notifications after prune | 🟡 LOW | ✅ Fixed — HeartbeatWatcher skips pruned runs via stateRoot existence check | UX |
934
+ | 6 | Live-session foreground run cancelled when there are concurrent tool calls | 🟠 MEDIUM | ✅ Confirmed — concurrent calls interrupt live-session → outputLength:0 → caller_cancelled. Avoid concurrent team actions during foreground runs. | Runtime |
935
+ | 7 | Async notifier "stale ctx" — dies, does not restart after Pi restart | 🔴 HIGH | ✅ Fixed — swallow stale error, isCurrent guard handles dormancy | Code |
936
936
  | 8 | Background child-process 300s timeout — child Pi hangs, zero output | 🟠 MEDIUM | ✅ Fixed — Root cause found (Bug #10): MINIMAX_API_KEY stripped by sanitizeEnvSecrets(). Allow-list in child-pi.ts preserves model provider API keys. Restart Pi to verify fix. | Code |
937
- | 9 | Executor hit yield limit — file write không hoàn thành | 🟡 LOW | 🔲 Open — executor hit 3 Yield Reminders and terminated before writing file. Task marked completed but artifact missing. | Runtime |
938
- | 10 | Child-process silent timeout — MINIMAX_API_KEY bị filter ra khỏi child env | 🔴 HIGH | ✅ Fixed — sanitizeEnvSecrets() strips *API_KEY* vars. Allow-list in buildChildPiSpawnOptions preserves model provider keys (MINIMAX_*, OPENAI_*, etc.). See docs/fixes/bug-010-child-process-api-key-filtered.md | Code |
937
+ | 9 | Executor hit yield limit — file write not completed | 🟡 LOW | 🔲 Open — executor hit 3 Yield Reminders and terminated before writing file. Task marked completed but artifact missing. | Runtime |
938
+ | 10 | Child-process silent timeout — MINIMAX_API_KEY filtered out of child env | 🔴 HIGH | ✅ Fixed — sanitizeEnvSecrets() strips *API_KEY* vars. Allow-list in buildChildPiSpawnOptions preserves model provider keys (MINIMAX_*, OPENAI_*, etc.). See docs/fixes/bug-010-child-process-api-key-filtered.md | Code |
939
939
 
940
940
 
941
941
  | 11 | Background runner "spawn pi ENOENT" — pi binary not in PATH | 🔴 HIGH | ✅ Fixed — added resolvePiCliScript() call for non-Windows platforms in getPiSpawnCommand(). Restart Pi to verify. | Code |
@@ -945,7 +945,7 @@ Task `01_explore` hoàn thành thành công, nhưng run bị cancelled trước
945
945
  ### Priority fix order
946
946
 
947
947
  1. **Bug #1** — ✅ Fixed — 429 now retried with model fallback chain
948
- 2. **Bug #2** — ✅ Fixed — removed fast-fail 429
948
+ 2. **Bug #2** — ✅ Fixed — removed 429 fast-fail
949
949
  3. **Bug #3** — ✅ Fixed — worker events now logged to background.log
950
950
  4. **Bug #4** — ✅ Fixed — rate_limited + provider_error classification added
951
951
  5. **Bug #5** — ✅ Fixed — HeartbeatWatcher skips pruned runs