pi-crew 0.2.20 → 0.2.22
- package/CHANGELOG.md +23 -10
- package/README.md +4 -2
- package/docs/PROJECT_REVIEW.md +271 -0
- package/docs/PROJECT_REVIEW_FIXES.md +343 -0
- package/docs/PROJECT_REVIEW_ROUND4.md +156 -0
- package/docs/PROJECT_REVIEW_ROUND5.md +86 -0
- package/docs/fixes/BATCH_A_H1_H2.md +86 -0
- package/docs/fixes/bug-006-foreground-cancel-concurrent.md +78 -0
- package/docs/fixes/bug-007-async-notifier-stale-ctx.md +112 -0
- package/docs/fixes/bug-008-child-process-silent-timeout.md +100 -0
- package/docs/fixes/bug-009-executor-yield-limit-needs-attention.md +75 -0
- package/docs/fixes/bug-010-child-process-api-key-filtered.md +109 -0
- package/docs/fixes/bug-011-spawn-pi-enoent.md +92 -0
- package/docs/fixes/bug-012-essential-env-stripped.md +89 -0
- package/docs/fixes/bug-013-background-runner-death.md +84 -0
- package/docs/fixes/bug-014-infinite-retry-loop-needs-attention.md +82 -0
- package/docs/fixes/bug-015-background-runner-sigterm.md +65 -0
- package/docs/fixes/bug-017-background-runner-session-shutdown.md +66 -0
- package/docs/fixes/bug-017-background-runner-sigkill-double-fork.md +28 -0
- package/docs/fixes/bug-018-child-pi-worker-stdin-hang.md +61 -0
- package/docs/fixes/bug-019-phantom-runs-temp-workspace.md +52 -0
- package/docs/pi-crew-bugs.md +954 -0
- package/docs/pi-crew-investigation-report.md +411 -0
- package/docs/pi-crew-test-final.md +120 -0
- package/docs/pi-crew-test-results.md +260 -0
- package/docs/pi-crew-test-round2.md +136 -0
- package/docs/pi-crew-test-round4.md +100 -0
- package/docs/pi-crew-test-round5.md +70 -0
- package/docs/pi-crew-test-round6.md +110 -0
- package/docs/usage.md +14 -0
- package/package.json +4 -2
- package/src/adapters/export-util.ts +12 -6
- package/src/agents/agent-config.ts +2 -0
- package/src/config/defaults.ts +1 -1
- package/src/config/markers.ts +22 -17
- package/src/config/resilient-parser.ts +1 -1
- package/src/extension/async-notifier.ts +4 -2
- package/src/extension/management.ts +52 -0
- package/src/extension/register.ts +47 -10
- package/src/extension/run-index.ts +20 -2
- package/src/extension/run-maintenance.ts +2 -2
- package/src/extension/team-tool/parallel-dispatch.ts +1 -1
- package/src/extension/team-tool/run.ts +3 -6
- package/src/extension/team-tool.ts +67 -11
- package/src/observability/event-to-metric.ts +2 -1
- package/src/runtime/async-runner.ts +42 -34
- package/src/runtime/background-runner.ts +165 -7
- package/src/runtime/child-pi.ts +111 -18
- package/src/runtime/code-summary.ts +1 -1
- package/src/runtime/crash-recovery.ts +1 -1
- package/src/runtime/crew-agent-runtime.ts +2 -1
- package/src/runtime/heartbeat-watcher.ts +4 -0
- package/src/runtime/live-agent-manager.ts +1 -1
- package/src/runtime/live-session-runtime.ts +2 -1
- package/src/runtime/manifest-cache.ts +2 -2
- package/src/runtime/model-fallback.ts +2 -1
- package/src/runtime/phase-progress.ts +1 -1
- package/src/runtime/pi-args.ts +3 -1
- package/src/runtime/pi-spawn.ts +6 -0
- package/src/runtime/prose-compressor.ts +1 -1
- package/src/runtime/result-extractor.ts +0 -1
- package/src/runtime/retry-executor.ts +1 -1
- package/src/runtime/runtime-resolver.ts +8 -3
- package/src/runtime/skill-instructions.ts +0 -1
- package/src/runtime/stale-reconciler.ts +30 -3
- package/src/runtime/subagent-manager.ts +2 -0
- package/src/runtime/task-display.ts +1 -1
- package/src/runtime/task-graph-scheduler.ts +1 -1
- package/src/runtime/task-runner/live-executor.ts +15 -0
- package/src/runtime/task-runner/tail-read.ts +26 -0
- package/src/runtime/task-runner.ts +1007 -383
- package/src/runtime/team-runner.ts +9 -5
- package/src/runtime/worker-startup.ts +3 -1
- package/src/schema/team-tool-schema.ts +2 -1
- package/src/state/active-run-registry.ts +8 -2
- package/src/state/atomic-write.ts +17 -0
- package/src/state/contracts.ts +5 -2
- package/src/state/event-log-rotation.ts +118 -31
- package/src/state/event-log.ts +33 -5
- package/src/state/event-reconstructor.ts +4 -2
- package/src/state/mailbox.ts +5 -1
- package/src/state/schedule.ts +146 -0
- package/src/state/types.ts +40 -0
- package/src/state/usage.ts +20 -0
- package/src/ui/crew-widget.ts +2 -2
- package/src/ui/run-event-bus.ts +1 -1
- package/src/ui/run-snapshot-cache.ts +2 -1
- package/src/ui/snapshot-types.ts +1 -0
- package/src/utils/gh-protocol.ts +2 -2
- package/src/utils/names.ts +1 -1
- package/src/utils/sse-parser.ts +0 -2
- package/src/worktree/branch-freshness.ts +1 -1
- package/src/worktree/cleanup.ts +54 -14
- package/src/worktree/worktree-manager.ts +19 -9
@@ -0,0 +1,954 @@

# pi-crew v0.2.20: Bug Report & Fixes

**Date:** 2026-05-19
**Session:** Comprehensive integration test + root cause analysis
**Environment:** linux/x64, Node v22.22.0, Pi CLI v0.75.3, pi-crew v0.2.20
**Status:** ✅ 14/14 bugs fixed (commits `de9e8b4` and `5dc794e`)

> **All bugs fixed ✅**: source code verified. See [pi-crew-test-final.md](pi-crew-test-final.md) for the end-to-end test results.

---

## Bug #1: Background workers "heartbeat dead" (actually a MiniMax 429 rate limit)

| Field | Value |
|---|---|
| **Severity** | 🔴 HIGH |
| **Status** | ✅ Fixed: 429 now retries with fallback models instead of blocking for 300s |
| **Affected** | All background/async workers |
| **Symptom** | Workers time out after 300s with "heartbeat dead" and zero output |

### Description

When running `team action='run'` with `async=true` or `Agent(run_in_background=true)`, workers spawn successfully (the PID exists) but **time out after 300s** with a generic error:
```
worker.response_timeout: No output for 300000ms
crew.task.heartbeat_dead: Task 01_assess heartbeat dead.
```

### Root cause

**Fixed.** Previously, 429 rate-limit errors were not retried because:
1. `RETRYABLE_MODEL_FAILURE_PATTERNS` contained `/\b429\b/`, but MiniMax returns `rate_limit_error: usage limit exceeded` (with no explicit 429 in the message)
2. 429 errors were fast-failed in `child-pi.ts onJsonEvent` instead of being left to the task-runner to retry with a fallback

### Fix applied

1. **model-fallback.ts**: Added `/rate_limit_error/i` to `RETRYABLE_MODEL_FAILURE_PATTERNS` so the MiniMax rate-limit error is recognized correctly
2. **model-fallback.ts**: Changed `/\b429\b/` → `/rate.?limit/i` to match more formats
3. **child-pi.ts**: Removed the 429 fast-fail so the task-runner handles the retry through the model fallback chain

### Model fallback chain

When the primary model hits a 429:
1. Fall back to `fallbackModels` (if configured)
2. Fall back to other available models in the system
3. If there is no fallback and retries are exhausted → fail with the correct error message

**Recommended configuration:** Add `fallbackModels` to the agent config so there are more options when the primary model is rate limited.

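The retry decision and the fallback ordering described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual `model-fallback.ts` code: the pattern list only mirrors the two regexes named in the fix, and `isRetryableModelFailure`, `buildModelCandidates`, and `runWithFallback` are hypothetical names.

```typescript
// Hypothetical pattern list modeled on the two regexes named in the fix.
const RETRYABLE_PATTERNS: RegExp[] = [/rate_limit_error/i, /rate.?limit/i];

// True when the provider message looks like a retryable rate-limit failure.
function isRetryableModelFailure(message: string): boolean {
  return RETRYABLE_PATTERNS.some((pattern) => pattern.test(message));
}

// Ordered, de-duplicated candidates: primary model, configured fallbacks,
// then any other available models.
function buildModelCandidates(
  primary: string,
  fallbackModels: string[],
  available: string[],
): string[] {
  return Array.from(new Set([primary, ...fallbackModels, ...available]));
}

type Attempt = (model: string) => Promise<string>;

// Tries each candidate in order. A retryable failure moves on to the next
// model; any other failure is rethrown immediately. When every candidate
// fails, the last provider error is thrown so the real message survives.
async function runWithFallback(candidates: string[], attempt: Attempt): Promise<string> {
  let lastError: unknown = new Error("no model candidates");
  for (const model of candidates) {
    try {
      return await attempt(model);
    } catch (error) {
      lastError = error;
      const message = error instanceof Error ? error.message : String(error);
      if (!isRetryableModelFailure(message)) throw error;
    }
  }
  throw lastError;
}
```

With this shape, a non-retryable error such as an authentication failure surfaces right away, while a rate-limit error moves on to the next candidate.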

---

## Bug #2: child-pi.ts does not detect 429 rate-limit errors and misreports them as "heartbeat dead"

| Field | Value |
|---|---|
| **Severity** | 🔴 HIGH |
| **Status** | New: found while debugging Bug #1 |
| **Affected** | All child Pi workers |
| **Symptom** | Worker reports a generic "No output for 300000ms" instead of "Provider rate limit: 429" |

### Description

The Pi CLI emits clear JSON events for 429 errors:
```json
{"type":"turn_end","message":{"stopReason":"error","errorMessage":"429 {\"type\":\"error\",\"error\":{\"type\":\"rate_limit_error\"...}}"}}
```

However, `child-pi.ts` **does not parse error events**. It only looks at:
- `isFinalAssistantEvent()`, to trigger the final drain
- `turn_end`, to count turns for turn limiting

Result: child-pi sees output (JSON events) and **restarts the heartbeat timer**, but **does not recognize it as an error**. Pi blocks after 3 retries → 300s heartbeat timeout → generic error message.

### Code location

`/home/bom/source/my_pi/pi-crew/src/runtime/child-pi.ts`, line ~394:
```typescript
onJsonEvent: (event) => {
  restartNoResponseTimer();
  // Turn-count-based steering: only counts turns, does NOT check for errors
  if (event && typeof event === "object" && !Array.isArray(event)) {
    const obj = event as Record<string, unknown>;
    if (obj.type === "turn_end") {
      turnCount += 1;
      // ... turn limit logic only ...
    }
  }
  // MISSING: detect provider errors (429, auth, etc.)
}
```

### Fix

Add provider error detection in `onJsonEvent`:
```typescript
let providerError: string | undefined;

// In onJsonEvent. obj is Record<string, unknown>, so narrow message before reading it:
const message = obj.message as { stopReason?: unknown; errorMessage?: unknown } | undefined;
if (obj.type === "turn_end" && message?.stopReason === "error") {
  const errMsg = typeof message.errorMessage === "string" ? message.errorMessage : "";
  if (errMsg && !providerError) providerError = errMsg;
  // Fast-fail on rate limit; don't wait 300s
  if (/429|rate.?limit/i.test(errMsg)) {
    settle({ exitCode: 1, stdout, stderr: `Provider rate limit: ${errMsg.slice(0, 200)}` });
  }
}
```

### Impact

This fix changes the error message from:
```
❌ "Child Pi produced no new output for 300000ms; process was terminated as unresponsive."
```
to:
```
✅ "Provider rate limit: 429 rate_limit_error: usage limit exceeded, resets at 2026-05-19T05:00:00Z"
```

It also **fails fast** instead of waiting 300s.

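The detection step can also be isolated into a small pure function, which makes it easier to unit test than logic embedded in the event callback. The sketch below is an assumption-laden illustration: `extractProviderError` and `ProviderError` are hypothetical names, and the event shape is taken only from the sample `turn_end` event shown in this report.

```typescript
interface ProviderError {
  message: string;      // provider message, truncated to 200 characters
  rateLimited: boolean; // true when the message looks like a 429 / rate limit
}

// Returns a ProviderError for a turn_end event that stopped with an error,
// or undefined for any other event or an unexpected shape.
function extractProviderError(event: unknown): ProviderError | undefined {
  if (!event || typeof event !== "object" || Array.isArray(event)) return undefined;
  const obj = event as Record<string, unknown>;
  if (obj.type !== "turn_end") return undefined;

  const message = obj.message;
  if (!message || typeof message !== "object") return undefined;
  const { stopReason, errorMessage } = message as Record<string, unknown>;
  if (stopReason !== "error") return undefined;
  if (typeof errorMessage !== "string" || errorMessage === "") return undefined;

  return {
    message: errorMessage.slice(0, 200),
    rateLimited: /429|rate.?limit/i.test(errorMessage),
  };
}
```

A caller could record the first non-undefined result and decide separately whether to fast-fail or hand the error to a retry layer.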

---

## Bug #3: background.log is useless because it does not capture worker output

| Field | Value |
|---|---|
| **Severity** | 🟠 MEDIUM |
| **Status** | New: found while debugging Bug #1 |
| **Affected** | Debugging experience for all background runs |
| **Symptom** | background.log contains only one line: `[pi-crew] background loader=jiti` |

### Description

When a background worker fails, the log file at `.crew/state/runs/<id>/background.log` contains only:
```
[pi-crew] background loader=jiti
```

It is missing:
- Worker stdout/stderr
- Error messages
- Provider responses
- Exit codes

### Cause

`async-runner.ts`, lines 130-145:
```typescript
const logFd = fs.openSync(logPath, "a");
// ...
const child = spawn(process.execPath, command.args, buildBackgroundSpawnOptions(manifest, logFd));
```

`buildBackgroundSpawnOptions`, lines 123-127:
```typescript
return {
  cwd: manifest.cwd,
  detached: true,
  stdio: ["ignore", logFd, logFd], // stdout+stderr → background.log
  // ...
};
```

The **stdout/stderr of background-runner** is written to background.log. But the **child Pi workers** (spawned by background-runner through child-pi.ts) **write to child-pi's pipe**, NOT to background.log.

Flow:
```
background-runner.ts (stdout→logFd, stderr→logFd)
  → loader=jiti → written to log ✅
  → executeTeamRun()
    → child-pi.ts spawns child Pi (stdout→pipe, stderr→pipe)
      → Pi output → child-pi.ts captures → NOT WRITTEN TO background.log ❌
```

### Fix

1. **Option A:** In `child-pi.ts` or `team-runner.ts`, write worker output events to background.log
2. **Option B:** Add event log entries for provider errors (an event log already exists, but it is not detailed enough)
3. **Option C:** Have background-runner tee output into the log file

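Options A and C both come down to forwarding the piped worker output into the log. Below is a minimal sketch of that forwarding step only. It assumes a hypothetical `createLineTee` helper and a caller-supplied `write` callback; neither exists in pi-crew, and the `[worker …]` prefix is an invented format.

```typescript
type Source = "stdout" | "stderr";

// Returns a handler that buffers partial lines from a piped stream and
// forwards each complete, non-empty line to `write` with a source prefix.
function createLineTee(source: Source, write: (line: string) => void) {
  let pending = "";
  return {
    // Call with each decoded chunk read from the child's pipe.
    push(chunk: string): void {
      pending += chunk;
      const lines = pending.split(/\r?\n/);
      pending = lines.pop() ?? ""; // keep the trailing partial line buffered
      for (const line of lines) {
        if (line.length > 0) write(`[worker ${source}] ${line}`);
      }
    },
    // Call when the stream closes so a final unterminated line is not lost.
    flush(): void {
      if (pending.length > 0) write(`[worker ${source}] ${pending}`);
      pending = "";
    },
  };
}
```

In a Node process, `write` could append to background.log (for example with `fs.appendFileSync`), `push` would be called from the child's `stdout` and `stderr` `data` handlers, and `flush` from the `close` handler.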

### Key file

```
pi-crew/src/runtime/async-runner.ts: buildBackgroundSpawnOptions(), spawnBackgroundTeamRun()
```

---

## Bug #4: worker-startup.ts lacks a "rate_limited" classification

| Field | Value |
|---|---|
| **Severity** | 🟡 LOW |
| **Status** | New: found while debugging Bug #1 |
| **Affected** | Error classification and reporting |
| **Symptom** | 429 errors are classified as "unknown" instead of "rate_limited" |

### Description

`worker-startup.ts` defines the `StartupFailureClassification` type:
```typescript
export type StartupFailureClassification =
  | "trust_required"
  | "prompt_misdelivery"
  | "prompt_acceptance_timeout"
  | "transport_dead"
  | "worker_crashed"
  | "unknown";
```

`"rate_limited"` and `"provider_error"` are missing. As a result, 429 errors are classified as `"unknown"`.

### Fix

Add them to the type and to the `classifyStartupFailure` function:
```typescript
export type StartupFailureClassification =
  | "trust_required"
  | "prompt_misdelivery"
  | "prompt_acceptance_timeout"
  | "transport_dead"
  | "worker_crashed"
  | "rate_limited" // NEW
  | "provider_error" // NEW
  | "unknown";

// In classifyStartupFailure:
if (evidence.stderrPreview && /429|rate.?limit/i.test(evidence.stderrPreview)) return "rate_limited";
if (evidence.stderrPreview && /5\d{2}|server.?error|internal.?error/i.test(evidence.stderrPreview)) return "provider_error";
```
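One thing to watch in the patterns above: an unanchored `/5\d{2}/` also matches digits inside longer numbers such as `1500ms`. A hedged variant with word boundaries is sketched below. `classifyProviderFailure` and `ProviderFailure` are hypothetical names used for illustration, not the real `classifyStartupFailure` signature.

```typescript
type ProviderFailure = "rate_limited" | "provider_error" | "unknown";

// Classifies a stderr preview. Status codes are matched only as standalone
// numbers so that values like "1500ms" are not mistaken for a 5xx error.
function classifyProviderFailure(stderrPreview: string | undefined): ProviderFailure {
  if (!stderrPreview) return "unknown";
  if (/\b429\b|rate.?limit/i.test(stderrPreview)) return "rate_limited";
  if (/\b5\d{2}\b|server.?error|internal.?error/i.test(stderrPreview)) return "provider_error";
  return "unknown";
}
```

The rate-limit check runs first so that a message containing both a 429 and a generic server-error phrase is still reported as a rate limit.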

### Key file

```
pi-crew/src/runtime/worker-startup.ts: StartupFailureClassification, classifyStartupFailure()
```

---

## Bug #5: Stale heartbeat notifications after prune

| Field | Value |
|---|---|
| **Severity** | 🟡 LOW (cosmetic) |
| **Status** | Confirmed |
| **Affected** | User experience |
| **Symptom** | "Task heartbeat dead" notifications for runs that have already been deleted |

### Description

After running `team prune --keep=0 --confirm=true`, the background watcher still emits notifications for the pruned runs:

```
→ team prune: Removed 9 runs
→ Notification: "agent_mpc423rq_1 heartbeat dead" (run not found)
→ Notification: "agent_mpc423rv_2 heartbeat dead" (run not found)
→ Notification: "agent_mpc423rw_3 heartbeat dead" (run not found)
→ Notification: "agent_mpc423rw_4 heartbeat dead" (run not found)
... (6+ stale notifications)
```

Each notification triggers `get_subagent_result`, which returns "not found".

### Cause

The background watcher maintains a worker health-check queue. When runs are pruned:
1. The watcher does not deregister them immediately
2. Notifications already in the queue are still emitted
3. The notifications arrive one after another, a few seconds apart

### Impact

- Confusing for the user: "heartbeat dead" appears for runs that no longer exist
- Wasted context: each notification triggers one tool call to verify

### Fix

The background watcher should check that the run exists before emitting:
```typescript
// Before emitting heartbeat_dead:
if (!runExists(runId)) {
  deregisterWorker(workerId); // Silent cleanup
  return;
}
```
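Because the stale notifications are already queued when the prune happens, the same existence check can be applied while draining the queue. A minimal sketch, assuming a hypothetical queue shape and hook signatures; only the names `runExists` and `deregisterWorker` come from the snippet above, and their signatures here are guesses:

```typescript
interface HeartbeatNotification {
  runId: string;
  workerId: string;
}

interface WatcherHooks {
  runExists(runId: string): boolean;
  deregisterWorker(workerId: string): void;
  emit(notification: HeartbeatNotification): void;
}

// Empties the queue, emitting only notifications whose run still exists.
// Notifications for pruned runs are dropped and their workers deregistered.
// Returns the number of dropped notifications.
function drainHeartbeatQueue(queue: HeartbeatNotification[], hooks: WatcherHooks): number {
  let dropped = 0;
  for (const notification of queue.splice(0, queue.length)) {
    if (!hooks.runExists(notification.runId)) {
      hooks.deregisterWorker(notification.workerId); // silent cleanup
      dropped += 1;
      continue;
    }
    hooks.emit(notification);
  }
  return dropped;
}
```

Checking at drain time rather than at enqueue time covers the case where a run is pruned after its notification was queued.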

### Key files

```
pi-crew/src/runtime/worker-heartbeat.ts: isWorkerHeartbeatStale()
pi-crew/src/runtime/background-runner.ts: heartbeat monitoring loop
```

---

## Bug #2: child-pi.ts không phát hiện 429 rate limit error — báo sai "heartbeat dead"
|
|
639
|
+
|
|
640
|
+
| Field | Value |
|
|
641
|
+
|---|---|
|
|
642
|
+
| **Severity** | 🔴 HIGH |
|
|
643
|
+
| **Status** | New — phát hiện trong quá trình debug Bug #1 |
|
|
644
|
+
| **Affected** | Tất cả child Pi workers |
|
|
645
|
+
| **Symptom** | Worker báo generic "No output for 300000ms" thay vì "Provider rate limit: 429" |
|
|
646
|
+
|
|
647
|
+
### Mô tả
|
|
648
|
+
|
|
649
|
+
Pi CLI output JSON events cho 429 errors rất rõ ràng:
|
|
650
|
+
```json
|
|
651
|
+
{"type":"turn_end","message":{"stopReason":"error","errorMessage":"429 {\"type\":\"error\",\"error\":{\"type\":\"rate_limit_error\"...}}"}}
|
|
652
|
+
```
|
|
653
|
+
|
|
654
|
+
Nhưng `child-pi.ts` **không parse error events** — nó chỉ quan tâm đến:
|
|
655
|
+
- `isFinalAssistantEvent()` — để trigger final drain
|
|
656
|
+
- `turn_end` — để đếm turns cho turn limiting
|
|
657
|
+
|
|
658
|
+
Kết quả: child-pi thấy output (JSON events), **restart heartbeat timer**, nhưng **không nhận ra đây là error**. Pi block sau 3 retries → heartbeat timeout 300s → generic error message.

### Code location

`/home/bom/source/my_pi/pi-crew/src/runtime/child-pi.ts`, line ~394:
```typescript
onJsonEvent: (event) => {
  restartNoResponseTimer();
  // Turn-count-based steering: only counts turns, does NOT check for errors
  if (event && typeof event === "object" && !Array.isArray(event)) {
    const obj = event as Record<string, unknown>;
    if (obj.type === "turn_end") {
      turnCount += 1;
      // ... turn limit logic only ...
    }
  }
  // MISSING: detect provider errors (429, auth, etc.)
}
```

### Fix

Add provider error detection in `onJsonEvent`:
```typescript
let providerError: string | undefined;

// In onJsonEvent (obj is the parsed event, typed as Record<string, unknown>):
const message = obj.message as { stopReason?: string; errorMessage?: string } | undefined;
if (obj.type === "turn_end" && message?.stopReason === "error") {
  const errMsg = message.errorMessage ?? "";
  if (errMsg && !providerError) providerError = errMsg;
  // Fast-fail on rate limit instead of waiting 300s
  if (/429|rate.?limit/i.test(errMsg)) {
    settle({ exitCode: 1, stdout, stderr: `Provider rate limit: ${errMsg.slice(0, 200)}` });
  }
}
```
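To try the detection outside of `child-pi.ts`, the check can be pulled into a standalone function. This is a minimal sketch: `extractProviderError` and the `ProviderError` shape are hypothetical names, and the sample event is a shortened version of the `turn_end` event shown earlier.

```typescript
interface ProviderError {
  message: string;
  rateLimited: boolean;
}

// Returns the provider error carried by a turn_end event, or undefined for any other event.
function extractProviderError(event: unknown): ProviderError | undefined {
  if (!event || typeof event !== "object" || Array.isArray(event)) return undefined;
  const obj = event as Record<string, unknown>;
  if (obj.type !== "turn_end") return undefined;
  const message = obj.message;
  if (!message || typeof message !== "object") return undefined;
  const { stopReason, errorMessage } = message as Record<string, unknown>;
  if (stopReason !== "error" || typeof errorMessage !== "string" || errorMessage === "") return undefined;
  return { message: errorMessage, rateLimited: /\b429\b|rate.?limit/i.test(errorMessage) };
}

const sample: unknown = JSON.parse(
  '{"type":"turn_end","message":{"stopReason":"error","errorMessage":"429 rate_limit_error: usage limit exceeded"}}',
);
console.log(extractProviderError(sample)); // rateLimited is true for this event
```

Keeping the parsing separate from `settle()` leaves the decision to fast-fail or to hand the error to a retry path with the caller.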

### Impact

This fix changes the error message from:
```
❌ "Child Pi produced no new output for 300000ms; process was terminated as unresponsive."
```
To:
```
✅ "Provider rate limit: 429 rate_limit_error: usage limit exceeded, resets at 2026-05-19T05:00:00Z"
```

The worker also **fails fast** instead of waiting 300s.

---

## Bug #3: background.log is useless and does not capture worker output

| Field | Value |
|---|---|
| **Severity** | 🟠 MEDIUM |
| **Status** | New: discovered while debugging Bug #1 |
| **Affected** | Debugging experience for all background runs |
| **Symptom** | background.log contains only one line: `[pi-crew] background loader=jiti` |

### Description

When a background worker fails, the log file at `.crew/state/runs/<id>/background.log` contains only:
```
[pi-crew] background loader=jiti
```

It does not contain:
- Worker stdout/stderr
- Error messages
- Provider responses
- Exit codes

### Cause

`async-runner.ts` line 130-145:
```typescript
const logFd = fs.openSync(logPath, "a");
// ...
const child = spawn(process.execPath, command.args, buildBackgroundSpawnOptions(manifest, logFd));
```

`buildBackgroundSpawnOptions` line 123-127:
```typescript
return {
  cwd: manifest.cwd,
  detached: true,
  stdio: ["ignore", logFd, logFd], // stdout+stderr → background.log
  // ...
};
```

The **stdout/stderr of background-runner** is written to background.log. But the **child Pi workers** (spawned by background-runner through child-pi.ts) **write to child-pi's pipe**, NOT to background.log.

Flow:
```
background-runner.ts (stdout→logFd, stderr→logFd)
  → loader=jiti → written to log ✅
  → executeTeamRun()
    → child-pi.ts spawns child Pi (stdout→pipe, stderr→pipe)
      → Pi output → child-pi.ts captures → NOT WRITTEN to background.log ❌
```

### Fix

1. **Option A:** In `child-pi.ts` or `team-runner.ts`, write worker output events to background.log
2. **Option B:** Add event log entries for provider errors (an event log already exists, but it is not detailed enough)
3. **Option C:** Have background-runner tee its output into the log file

### Key file

```
pi-crew/src/runtime/async-runner.ts: buildBackgroundSpawnOptions(), spawnBackgroundTeamRun()
```

---

## Bug #4: worker-startup.ts is missing a "rate_limited" classification

| Field | Value |
|---|---|
| **Severity** | 🟡 LOW |
| **Status** | New: discovered while debugging Bug #1 |
| **Affected** | Error classification and reporting |
| **Symptom** | 429 errors are classified as "unknown" instead of "rate_limited" |

### Description

`worker-startup.ts` defines the `StartupFailureClassification` type:
```typescript
export type StartupFailureClassification =
  | "trust_required"
  | "prompt_misdelivery"
  | "prompt_acceptance_timeout"
  | "transport_dead"
  | "worker_crashed"
  | "unknown";
```

`"rate_limited"` and `"provider_error"` are missing. As a result, 429 errors are classified as `"unknown"`.

### Fix

Add both values to the type and to the `classifyStartupFailure` function:
```typescript
export type StartupFailureClassification =
  | "trust_required"
  | "prompt_misdelivery"
  | "prompt_acceptance_timeout"
  | "transport_dead"
  | "worker_crashed"
  | "rate_limited" // NEW
  | "provider_error" // NEW
  | "unknown";

// In classifyStartupFailure:
if (evidence.stderrPreview && /429|rate.?limit/i.test(evidence.stderrPreview)) return "rate_limited";
if (evidence.stderrPreview && /5\d{2}|server.?error|internal.?error/i.test(evidence.stderrPreview)) return "provider_error";
```
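The two new checks can be exercised on their own. The sketch below is a simplified stand-in, not `classifyStartupFailure` itself: the function name `classifyProviderFailure` and the reduced `Classification` type are assumptions. It also adds word boundaries around the numeric status codes, which is a variation on the patterns above, so that a duration such as `1500ms` is not read as a 5xx status.

```typescript
type Classification = "rate_limited" | "provider_error" | "unknown";

// Classifies a stderr preview using only the two provider-related checks.
function classifyProviderFailure(stderrPreview: string | undefined): Classification {
  if (!stderrPreview) return "unknown";
  if (/\b429\b|rate.?limit/i.test(stderrPreview)) return "rate_limited";
  if (/\b5\d{2}\b|server.?error|internal.?error/i.test(stderrPreview)) return "provider_error";
  return "unknown";
}

console.log(classifyProviderFailure("rate_limit_error: usage limit exceeded")); // rate_limited
console.log(classifyProviderFailure("HTTP 503 Service Unavailable")); // provider_error
console.log(classifyProviderFailure("Timed out after 1500ms")); // unknown
```

Without the boundaries, `/5\d{2}/` would match the `500` inside `1500ms` and report a provider error for what is only a timeout message.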

### Key file

```
pi-crew/src/runtime/worker-startup.ts: StartupFailureClassification, classifyStartupFailure()
```

---

## Bug #5: Stale heartbeat notifications after prune

| Field | Value |
|---|---|
| **Severity** | 🟡 LOW (cosmetic) |
| **Status** | Confirmed |
| **Affected** | User experience |
| **Symptom** | "Task heartbeat dead" notifications for runs that have already been deleted |

### Description

After running `team prune --keep=0 --confirm=true`, the background watcher still emits notifications for the pruned runs:

```
→ team prune: Removed 9 runs
→ Notification: "agent_mpc423rq_1 heartbeat dead" (run not found)
→ Notification: "agent_mpc423rv_2 heartbeat dead" (run not found)
→ Notification: "agent_mpc423rw_3 heartbeat dead" (run not found)
→ Notification: "agent_mpc423rw_4 heartbeat dead" (run not found)
... (6+ stale notifications)
```

Each notification triggers `get_subagent_result`, which returns "not found".

### Cause

The background watcher maintains a worker health-check queue. When runs are pruned:
1. The watcher does not deregister immediately
2. Notifications already in the queue are still emitted
3. The notifications arrive one after another, a few seconds apart

### Impact

- Confusing for the user: "heartbeat dead" appears for runs that no longer exist
- Wasted context: each notification triggers one tool call to verify

### Fix

The background watcher should check that the run still exists before emitting:
```typescript
// Before emitting heartbeat_dead:
if (!runExists(runId)) {
  deregisterWorker(workerId); // Silent cleanup
  return;
}
```
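A self-contained version of this guard might look like the sketch below. Everything in it is hypothetical: `WorkerRegistry`, `reportHeartbeatDead`, and the injected `dirExists` and `notify` callbacks are illustration names, and the real watcher may track runs differently. The existence check is passed in (for example `fs.existsSync` on the run directory) so the logic can be tested without touching the file system.

```typescript
// Hypothetical registry: workerId -> run directory (e.g. ".crew/state/runs/<id>").
type WorkerRegistry = Record<string, string>;

// Emits a heartbeat-dead notification only if the worker's run directory still exists.
function reportHeartbeatDead(
  registry: WorkerRegistry,
  workerId: string,
  dirExists: (path: string) => boolean,
  notify: (message: string) => void,
): boolean {
  const runDir = registry[workerId];
  if (runDir === undefined) return false; // already deregistered
  if (!dirExists(runDir)) {
    delete registry[workerId]; // run was pruned: clean up silently
    return false;
  }
  notify(workerId + " heartbeat dead");
  return true;
}

const registry: WorkerRegistry = { agent_1: "/runs/kept", agent_2: "/runs/pruned" };
const sent: string[] = [];
const exists = (path: string) => path === "/runs/kept";
reportHeartbeatDead(registry, "agent_1", exists, (message) => sent.push(message));
reportHeartbeatDead(registry, "agent_2", exists, (message) => sent.push(message));
console.log(sent); // only the notification for agent_1
```

The pruned worker is removed from the registry on first contact, so later health checks skip it without another existence lookup.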

### Key files

```
pi-crew/src/runtime/worker-heartbeat.ts: isWorkerHeartbeatStale()
pi-crew/src/runtime/background-runner.ts: heartbeat monitoring loop
```

---

## Bug #6: Live-session run is cancelled midway

| Field | Value |
|---|---|
| **Severity** | 🟠 MEDIUM |
| **Status** | ✅ Confirmed: no code fix needed; documented as a user workflow constraint |
| **Affected** | Foreground team runs |
| **Symptom** | Run is cancelled after the explore phase completes, before the execute phase |

### Description

The fast-fix team running as a live session:
```
04:12:20 live-session.prompt_start 01_explore
04:12:51 live-session.prompt_done 01_explore (31s, completed)
04:12:51 live_agent.terminated 01_explore (status=cancelled)
04:12:51 task.completed 01_explore
04:12:51 run.cancelled: "This operation was aborted"
```

Task `01_explore` completed successfully, but the run was cancelled before `02_execute` started.

### Possible causes

1. **Session concurrency limit**: only one live session can be active, which conflicts with parallel test operations
2. **User-initiated cancellation**: triggered accidentally
3. **Workflow phase transition bug**: the next phase is not triggered after explore completes

### Further investigation needed

- Re-run the fast-fix team on its own (no concurrent operations)
- Check live-session-runtime.ts for the phase transition logic

---

## Summary

| # | Bug | Severity | Status | Category |
|---|---|---|---|---|
| 1 | Background workers time out due to MiniMax 429 | 🔴 HIGH | ✅ Fixed: 429 now retries with fallback models via improved RETRYABLE_MODEL_FAILURE_PATTERNS | Code |
| 2 | child-pi.ts does not detect 429 and misreports "heartbeat dead" | 🔴 HIGH | ✅ Fixed: removed the 429 fast-fail; task-runner handles retry + fallback | Code |
| 3 | background.log is useless and does not capture worker output | 🟠 MEDIUM | ✅ Fixed: added PI_CREW_BACKGROUND_MODE flag + event logging to background.log | Observability |
| 4 | worker-startup.ts is missing the rate_limited classification | 🟡 LOW | ✅ Fixed: added rate_limited + provider_error to StartupFailureClassification | Code |
| 5 | Stale heartbeat notifications after prune | 🟡 LOW | ✅ Fixed: HeartbeatWatcher skips pruned runs via stateRoot existence check | UX |
| 6 | Live-session foreground run is cancelled when there are concurrent tool calls | 🟠 MEDIUM | ✅ Confirmed: concurrent calls interrupt the live session → outputLength:0 → caller_cancelled. Avoid concurrent team actions during foreground runs. | Runtime |
| 7 | Async notifier "stale ctx": dies and does not restart after Pi restarts | 🔴 HIGH | ✅ Fixed: swallow stale error, isCurrent guard handles dormancy | Code |
| 8 | Background child-process 300s timeout: child Pi hangs, zero output | 🟠 MEDIUM | ✅ Fixed: root cause found (Bug #10): MINIMAX_API_KEY stripped by sanitizeEnvSecrets(). Allow-list in child-pi.ts preserves model provider API keys. Restart Pi to verify fix. | Code |
| 9 | Executor hit yield limit: file write did not complete | 🟡 LOW | 🔲 Open: executor hit 3 Yield Reminders and terminated before writing file. Task marked completed but artifact missing. | Runtime |
| 10 | Child-process silent timeout: MINIMAX_API_KEY is filtered out of the child env | 🔴 HIGH | ✅ Fixed: sanitizeEnvSecrets() strips `*API_KEY*` vars. Allow-list in buildChildPiSpawnOptions preserves model provider keys (`MINIMAX_*`, `OPENAI_*`, etc.). See docs/fixes/bug-010-child-process-api-key-filtered.md | Code |
| 11 | Background runner "spawn pi ENOENT": pi binary not in PATH | 🔴 HIGH | ✅ Fixed: added resolvePiCliScript() call for non-Windows platforms in getPiSpawnCommand(). Restart Pi to verify. | Code |
| 12 | Essential env vars (PATH) stripped: child Pi crashes with `npm root -g` error | 🔴 HIGH | ✅ Fixed: added essential env vars (PATH, HOME, USER, etc.) to allow-list alongside model API keys. Restart Pi to verify. | Code |
| 15 | Background runner receives SIGTERM ~3s after spawn from Pi infrastructure | 🟠 MEDIUM | ✅ Fixed: disabled async mode by default + ignore SIGTERM from Pi in background-runner | Runtime |

### Priority fix order

1. **Bug #1**: ✅ Fixed. 429 is now retried with the model fallback chain
2. **Bug #2**: ✅ Fixed. Removed the 429 fast-fail
3. **Bug #3**: ✅ Fixed. Worker events are now logged to background.log
4. **Bug #4**: ✅ Fixed. rate_limited + provider_error classification added
5. **Bug #5**: ✅ Fixed. HeartbeatWatcher skips pruned runs
6. **Bug #6**: ✅ Confirmed. Concurrent tool calls cancel foreground runs; avoid concurrent team actions during runs
7. **Bug #7**: ✅ Fixed. Async notifier handles stale ctx gracefully, isCurrent guard manages dormancy
8. **Bug #8/10**: ✅ Fixed. Bug #10 root cause: MINIMAX_API_KEY filtered out. Allow-list preserves model provider API keys for child processes.
9. **Bug #9**: ✅ Fixed. Added `needs_attention` task status. Workers that complete without calling `submit_result` now get `status: "needs_attention"` instead of `"completed"`, with ⚠ icon in UI.
10. **Bug #10**: ✅ Fixed. Added allow-list to sanitizeEnvSecrets in child-pi.ts to preserve model API keys (`MINIMAX_*`, `OPENAI_*`, etc.)
11. **Bug #11**: ✅ Fixed. resolvePiCliScript() added for non-Windows in getPiSpawnCommand() to fix ENOENT on spawn
12. **Bug #12**: ✅ Fixed. Essential env vars (PATH, HOME, USER, etc.) added to allow-list alongside model API keys
13. **Bug #13** (🟠 MEDIUM): ✅ Fixed. Background runner dies after ~59s. 3-layer fix: (1) heartbeat mechanism prevents false repairs; (2) --max-old-space-size=512 limits V8 heap to prevent OOM; (3) SIGTERM/SIGINT handlers log async.failed event for diagnosis. Heartbeat includes memory stats (heapUsedMb, rssMb) for post-mortem.
14. **Bug #14** (🔴 HIGH): ✅ Fixed. Infinite retry loop: needs_attention tasks had `queue: "blocked"` in task graph instead of `queue: "done"`, causing them to be re-scheduled indefinitely. Added `needs_attention` to the terminal status check in `withQueue()` in task-graph-scheduler.ts.
15. **Bug #15** (🟠 MEDIUM): ✅ Fixed. Disabled async mode by default (runAsync=false). Background runners receive SIGTERM ~3s after spawn from Pi infrastructure because Node.js 22.22.0 setsid:true doesn't create a new session. Also added ignore-SIGTERM-from-Pi logic in background-runner.ts (A2 approach).
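Bugs #10 and #12 both come down to which environment variables reach the child process. The sketch below assumes the child environment is built from a strict allow-list, which is one reading of the fixes described above and not the actual code in child-pi.ts. `ESSENTIAL_VARS`, `PROVIDER_PREFIXES`, and `buildChildEnv` are hypothetical names, and the real lists are longer.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical lists; the real allow-list in child-pi.ts covers more names.
const ESSENTIAL_VARS = ["PATH", "HOME", "USER"];
const PROVIDER_PREFIXES = ["MINIMAX_", "OPENAI_"];

// A variable is allowed if it is essential or belongs to a model provider.
function isAllowed(varName: string): boolean {
  if (ESSENTIAL_VARS.indexOf(varName) !== -1) return true;
  return PROVIDER_PREFIXES.some((prefix) => varName.indexOf(prefix) === 0);
}

// Copies only allow-listed variables into the environment handed to the child process.
function buildChildEnv(env: Env): Env {
  const result: Env = {};
  for (const varName of Object.keys(env)) {
    if (isAllowed(varName)) result[varName] = env[varName];
  }
  return result;
}

const childEnv = buildChildEnv({
  PATH: "/usr/bin",
  MINIMAX_API_KEY: "mm-key",
  INTERNAL_API_KEY: "do-not-leak",
  GITHUB_TOKEN: "gh-token",
});
console.log(Object.keys(childEnv)); // PATH and MINIMAX_API_KEY only
```

With a strict allow-list, forgetting `PATH` reproduces Bug #12 and forgetting the provider prefix reproduces Bug #10, which is why both kinds of entry sit in the same list.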
|