pi-crew 0.8.13 → 0.9.0
- package/CHANGELOG.md +296 -0
- package/README.md +118 -2
- package/docs/FEATURE_INTAKE.md +1 -1
- package/docs/HARNESS.md +20 -19
- package/docs/PROJECT_REVIEW.md +132 -133
- package/docs/PROJECT_REVIEW_FIXES.md +130 -131
- package/docs/actions-reference.md +127 -121
- package/docs/architecture.md +1 -1
- package/docs/code-review-2026-05-11.md +134 -134
- package/docs/commands-reference.md +108 -106
- package/docs/comparison-pi-subagents-vs-pi-crew.md +105 -105
- package/docs/deep-review-report.md +1 -1
- package/docs/dynamic-workflows.md +90 -0
- package/docs/fixes/BATCH_A_H1_H2.md +17 -17
- package/docs/fixes/bug-007-async-notifier-stale-ctx.md +23 -23
- package/docs/followup-plan-2026-05-12.md +135 -135
- package/docs/followup-review-2026-05-12.md +86 -86
- package/docs/followup-review-round3-2026-05-12.md +123 -123
- package/docs/goals.md +59 -0
- package/docs/implementation-plan-top3.md +4 -4
- package/docs/issue-29-analysis.md +2 -2
- package/docs/oh-my-pi-research.md +154 -154
- package/docs/optimization-plan.md +2 -0
- package/docs/perf/baseline-2026-05.md +9 -9
- package/docs/perf/final-report-2026-05.md +2 -2
- package/docs/perf/sprint-1-report.md +2 -2
- package/docs/perf/sprint-2-report.md +1 -1
- package/docs/perf/upgrade-plan-2026-05.md +72 -72
- package/docs/pi-crew-bugs.md +230 -230
- package/docs/pi-crew-investigation-report.md +102 -102
- package/docs/pi-crew-test-round5.md +4 -4
- package/docs/runtime-analysis-child-vs-live.md +57 -57
- package/docs/runtime-migration-in-process-analysis.md +97 -97
- package/install.mjs +3 -2
- package/package.json +2 -4
- package/skills/orchestration/SKILL.md +11 -11
- package/src/agents/agent-config.ts +4 -0
- package/src/config/config.ts +39 -0
- package/src/config/types.ts +11 -0
- package/src/extension/action-suggestions.ts +2 -1
- package/src/extension/async-notifier.ts +10 -0
- package/src/extension/help.ts +14 -0
- package/src/extension/project-init.ts +7 -20
- package/src/extension/registration/commands.ts +27 -0
- package/src/extension/team-tool/destructive-gate.ts +1 -1
- package/src/extension/team-tool/goal-wrap.ts +288 -0
- package/src/extension/team-tool/goal.ts +405 -0
- package/src/extension/team-tool/run.ts +103 -4
- package/src/extension/team-tool/workflow-manage.ts +194 -0
- package/src/extension/team-tool.ts +20 -0
- package/src/hooks/types.ts +3 -1
- package/src/runtime/async-runner.ts +24 -2
- package/src/runtime/background-runner.ts +68 -19
- package/src/runtime/child-pi.ts +6 -1
- package/src/runtime/completion-guard.ts +1 -1
- package/src/runtime/dynamic-workflow-context.ts +450 -0
- package/src/runtime/dynamic-workflow-runner.ts +180 -0
- package/src/runtime/global-worker-cap.ts +96 -0
- package/src/runtime/goal-evaluator.ts +294 -0
- package/src/runtime/goal-loop-runner.ts +612 -0
- package/src/runtime/goal-state-store.ts +209 -0
- package/src/runtime/pi-args.ts +10 -2
- package/src/runtime/result-extractor.ts +32 -0
- package/src/runtime/team-runner.ts +11 -1
- package/src/runtime/verification-gates.ts +85 -5
- package/src/runtime/verification-integrity.ts +110 -0
- package/src/runtime/verification-worktree.ts +136 -0
- package/src/runtime/workspace-lock.ts +448 -0
- package/src/schema/config-schema.ts +26 -0
- package/src/schema/team-tool-schema.ts +39 -4
- package/src/state/atomic-write.ts +9 -0
- package/src/state/contracts.ts +14 -0
- package/src/state/crew-init.ts +18 -5
- package/src/state/event-log.ts +7 -1
- package/src/state/state-store.ts +2 -0
- package/src/state/types.ts +82 -0
- package/src/state/worker-atomic-writer.ts +176 -0
- package/src/utils/redaction.ts +104 -24
- package/src/workflows/discover-workflows.ts +25 -1
- package/src/workflows/workflow-config.ts +13 -0
- package/teams/parallel-research.team.md +1 -1
- package/workflows/examples/hello.dwf.ts +24 -0
package/docs/pi-crew-bugs.md
CHANGED
|
@@ -1,33 +1,33 @@
|
|
|
1
1
|
# Historical Bug Reports (v0.2.x)
|
|
2
2
|
|
|
3
|
-
> **Current version: v0.
|
|
3
|
+
> **Current version: v0.9.0** — See [CHANGELOG.md](../CHANGELOG.md) for all bug fixes.
|
|
4
4
|
> This page tracks historical bugs from v0.2.x. All listed bugs are fixed.
|
|
5
5
|
|
|
6
6
|
---
|
|
7
7
|
|
|
8
8
|
# pi-crew v0.2.20 — Bug Report & Fixes
|
|
9
9
|
|
|
10
|
-
**
|
|
10
|
+
**Date:** 2026-05-19
|
|
11
11
|
**Session:** Comprehensive integration test + root cause analysis
|
|
12
|
-
**Environment:** linux/x64, Node v22.22.0, Pi CLI v0.75.3, pi-crew v0.2.20
|
|
13
|
-
**
|
|
12
|
+
**Environment:** linux/x64, Node v22.22.0, Pi CLI v0.75.3, pi-crew v0.2.20
|
|
13
|
+
**Status:** ✅ 14/14 bugs fixed (commits `de9e8b4` and `5dc794e`)
|
|
14
14
|
|
|
15
|
-
> **All bugs fixed ✅** — Source code verified.
|
|
15
|
+
> **All bugs fixed ✅** — Source code verified. See [pi-crew-test-final.md](pi-crew-test-final.md) for end-to-end test results.
|
|
16
16
|
|
|
17
17
|
---
|
|
18
18
|
|
|
19
|
-
## Bug #1: Background workers "heartbeat dead" —
|
|
19
|
+
## Bug #1: Background workers "heartbeat dead" — actually a MiniMax 429 Rate Limit
|
|
20
20
|
|
|
21
21
|
| Field | Value |
|
|
22
22
|
|---|---|
|
|
23
23
|
| **Severity** | 🔴 HIGH |
|
|
24
24
|
| **Status** | ✅ Fixed — 429 now retries with fallback models instead of blocking 300s |
|
|
25
|
-
| **Affected** |
|
|
26
|
-
| **Symptom** | Workers
|
|
25
|
+
| **Affected** | All background/async workers |
|
|
26
|
+
| **Symptom** | Workers time out after 300s with "heartbeat dead", zero output |
|
|
27
27
|
|
|
28
|
-
###
|
|
28
|
+
### Description
|
|
29
29
|
|
|
30
|
-
|
|
30
|
+
When running `team action='run'` with `async=true` or `Agent(run_in_background=true)`, workers spawn successfully (PID exists) but **time out after 300s** with a generic error:
|
|
31
31
|
```
|
|
32
32
|
worker.response_timeout: No output for 300000ms
|
|
33
33
|
crew.task.heartbeat_dead: Task 01_assess heartbeat dead.
|
|
@@ -35,48 +35,48 @@ crew.task.heartbeat_dead: Task 01_assess heartbeat dead.
|
|
|
35
35
|
|
|
36
36
|
### Root cause
|
|
37
37
|
|
|
38
|
-
|
|
39
|
-
1. `RETRYABLE_MODEL_FAILURE_PATTERNS`
|
|
40
|
-
2. 429
|
|
38
|
+
**Fixed.** Previously the 429 rate limit was not retried because:
|
|
39
|
+
1. `RETRYABLE_MODEL_FAILURE_PATTERNS` had `/\b429\b/` but MiniMax returns `rate_limit_error: usage limit exceeded` (no clear "429" number)
|
|
40
|
+
2. 429 was fast-failed in `child-pi.ts onJsonEvent` instead of letting the task-runner handle retry with fallback
|
|
41
41
|
|
|
42
42
|
### Fix applied
|
|
43
43
|
|
|
44
|
-
1. **model-fallback.ts**:
|
|
45
|
-
2. **model-fallback.ts**:
|
|
46
|
-
3. **child-pi.ts**:
|
|
44
|
+
1. **model-fallback.ts**: Added `/rate_limit_error/i` to `RETRYABLE_MODEL_FAILURE_PATTERNS` to correctly identify the MiniMax rate limit error
|
|
45
|
+
2. **model-fallback.ts**: Changed `/\b429\b/` → `/rate.?limit/i` to match more formats
|
|
46
|
+
3. **child-pi.ts**: Removed the 429 fast-fail — let the task-runner handle retry with the model fallback chain
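
The pattern change in the list above can be sketched as follows. This is a minimal illustration, not the code in `model-fallback.ts`: it assumes `RETRYABLE_MODEL_FAILURE_PATTERNS` is a plain array of regular expressions, and the helper name `isRetryableModelFailure` and the `OLD_PATTERNS`/`NEW_PATTERNS` arrays are hypothetical.

```typescript
// Sketch only: array contents and the helper name are assumptions,
// not the actual contents of model-fallback.ts.
const OLD_PATTERNS: RegExp[] = [/\b429\b/];
const NEW_PATTERNS: RegExp[] = [/rate_limit_error/i, /rate.?limit/i];

// Hypothetical helper: true when any pattern matches the provider error text.
function isRetryableModelFailure(message: string, patterns: RegExp[]): boolean {
  return patterns.some((pattern) => pattern.test(message));
}

// Error text in the shape quoted in this report (no literal "429").
const minimaxError = "rate_limit_error: usage limit exceeded";

console.log(isRetryableModelFailure(minimaxError, OLD_PATTERNS)); // false
console.log(isRetryableModelFailure(minimaxError, NEW_PATTERNS)); // true
```

Under these assumptions, the old pattern misses the MiniMax message because it contains no standalone `429` token, while either of the new patterns matches it.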
|
|
47
47
|
|
|
48
48
|
### Model fallback chain
|
|
49
49
|
|
|
50
|
-
|
|
51
|
-
1.
|
|
52
|
-
2.
|
|
53
|
-
3.
|
|
50
|
+
When the main model gets a 429:
|
|
51
|
+
1. Fall back to `fallbackModels` (if configured)
|
|
52
|
+
2. Fall back to other available models in the system
|
|
53
|
+
3. If there is no fallback and retries are exhausted → fail with the correct error message
|
|
54
54
|
|
|
55
|
-
**
|
|
55
|
+
**Recommended config:** Add `fallbackModels` to the agent config to have more options when the main model is rate-limited.
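
The fallback order described above might look roughly like the sketch below. Everything here is an assumption made for illustration: the `AgentModelConfig` shape, the function names, and the placeholder model names are hypothetical, and the attempt is synchronous for brevity, whereas a real task-runner would presumably be asynchronous.

```typescript
// Sketch only: names and shapes are illustrative assumptions, not pi-crew's API.
interface AgentModelConfig {
  model: string;
  fallbackModels?: string[];
}

const RATE_LIMIT = /rate.?limit/i;

// Order of attempts: main model, then configured fallbacks, then any other
// available models, with duplicates removed (first occurrence wins).
function buildModelAttemptOrder(config: AgentModelConfig, availableModels: string[]): string[] {
  const ordered = [config.model, ...(config.fallbackModels ?? []), ...availableModels];
  return ordered.filter((model, index) => ordered.indexOf(model) === index);
}

// Try each model in turn; move on only when the failure looks like a rate
// limit, and rethrow the last error once every model has been tried.
function runWithFallback<T>(models: string[], attempt: (model: string) => T): T {
  let lastError: unknown = new Error("no models to try");
  for (const model of models) {
    try {
      return attempt(model);
    } catch (error) {
      if (!(error instanceof Error) || !RATE_LIMIT.test(error.message)) throw error;
      lastError = error;
    }
  }
  throw lastError;
}

// Placeholder model names; "model-a" simulates a rate-limited main model.
const order = buildModelAttemptOrder(
  { model: "model-a", fallbackModels: ["model-b"] },
  ["model-b", "model-c"],
);
const result = runWithFallback(order, (model) => {
  if (model === "model-a") throw new Error("rate_limit_error: usage limit exceeded");
  return `ok from ${model}`;
});
console.log(order);  // ["model-a", "model-b", "model-c"]
console.log(result); // "ok from model-b"
```

When no fallback succeeds, the original rate-limit error is the one that surfaces, which corresponds to step 3 of the chain.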
|
|
56
56
|
|
|
57
57
|
---
|
|
58
58
|
|
|
59
|
-
## Bug #2: child-pi.ts
|
|
59
|
+
## Bug #2: child-pi.ts does not detect 429 rate limit error — reports wrong "heartbeat dead"
|
|
60
60
|
|
|
61
61
|
| Field | Value |
|
|
62
62
|
|---|---|
|
|
63
63
|
| **Severity** | 🔴 HIGH |
|
|
64
|
-
| **Status** | New —
|
|
65
|
-
| **Affected** |
|
|
66
|
-
| **Symptom** | Worker
|
|
64
|
+
| **Status** | New — discovered while debugging Bug #1 |
|
|
65
|
+
| **Affected** | All child Pi workers |
|
|
66
|
+
| **Symptom** | Worker reports generic "No output for 300000ms" instead of "Provider rate limit: 429" |
|
|
67
67
|
|
|
68
|
-
###
|
|
68
|
+
### Description
|
|
69
69
|
|
|
70
|
-
Pi CLI
|
|
70
|
+
The Pi CLI outputs JSON events for 429 errors very clearly:
|
|
71
71
|
```json
|
|
72
72
|
{"type":"turn_end","message":{"stopReason":"error","errorMessage":"429 {\"type\":\"error\",\"error\":{\"type\":\"rate_limit_error\"...}}"}}
|
|
73
73
|
```
|
|
74
74
|
|
|
75
|
-
|
|
76
|
-
- `isFinalAssistantEvent()` —
|
|
77
|
-
- `turn_end` —
|
|
75
|
+
But `child-pi.ts` **does not parse error events** — it only cares about:
|
|
76
|
+
- `isFinalAssistantEvent()` — to trigger the final drain
|
|
77
|
+
- `turn_end` — to count turns for turn limiting
|
|
78
78
|
|
|
79
|
-
|
|
79
|
+
Result: child-pi sees output (JSON events), **restarts the heartbeat timer**, but **does not recognize it as an error**. Pi blocks after 3 retries → heartbeat times out at 300s → generic error message.
|
|
80
80
|
|
|
81
81
|
### Code location
|
|
82
82
|
|
|
@@ -84,7 +84,7 @@ Result: child-pi sees output (JSON events), **restarts the heartbeat timer**,
|
|
|
84
84
|
```typescript
|
|
85
85
|
onJsonEvent: (event) => {
|
|
86
86
|
restartNoResponseTimer();
|
|
87
|
-
// Turn-count-based steering:
|
|
87
|
+
// Turn-count-based steering: only counts turns, does NOT check errors
|
|
88
88
|
if (event && typeof event === "object" && !Array.isArray(event)) {
|
|
89
89
|
const obj = event as Record<string, unknown>;
|
|
90
90
|
if (obj.type === "turn_end") {
|
|
@@ -98,7 +98,7 @@ onJsonEvent: (event) => {
|
|
|
98
98
|
|
|
99
99
|
### Fix
|
|
100
100
|
|
|
101
|
-
|
|
101
|
+
Add provider error detection in `onJsonEvent`:
|
|
102
102
|
```typescript
|
|
103
103
|
let providerError: string | undefined;
|
|
104
104
|
|
|
@@ -115,42 +115,42 @@ if (obj.type === "turn_end" && obj.message?.stopReason === "error") {
|
|
|
115
115
|
|
|
116
116
|
### Impact
|
|
117
117
|
|
|
118
|
-
|
|
118
|
+
This fix changes the error message from:
|
|
119
119
|
```
|
|
120
120
|
❌ "Child Pi produced no new output for 300000ms; process was terminated as unresponsive."
|
|
121
121
|
```
|
|
122
|
-
|
|
122
|
+
to:
|
|
123
123
|
```
|
|
124
124
|
✅ "Provider rate limit: 429 rate_limit_error: usage limit exceeded, resets at 2026-05-19T05:00:00Z"
|
|
125
125
|
```
|
|
126
126
|
|
|
127
|
-
|
|
127
|
+
And it **fails fast** instead of waiting 300s.
|
|
128
128
|
|
|
129
129
|
---
|
|
130
130
|
|
|
131
|
-
## Bug #3: background.log
|
|
131
|
+
## Bug #3: background.log is useless — does not capture worker output
|
|
132
132
|
|
|
133
133
|
| Field | Value |
|
|
134
134
|
|---|---|
|
|
135
135
|
| **Severity** | 🟠 MEDIUM |
|
|
136
|
-
| **Status** | New —
|
|
137
|
-
| **Affected** | Debugging experience
|
|
138
|
-
| **Symptom** | background.log
|
|
136
|
+
| **Status** | New — discovered while debugging Bug #1 |
|
|
137
|
+
| **Affected** | Debugging experience for all background runs |
|
|
138
|
+
| **Symptom** | background.log contains only 1 line: `[pi-crew] background loader=jiti` |
|
|
139
139
|
|
|
140
|
-
###
|
|
140
|
+
### Description
|
|
141
141
|
|
|
142
|
-
|
|
142
|
+
When a background worker fails, the log file at `.crew/state/runs/<id>/background.log` contains only:
|
|
143
143
|
```
|
|
144
144
|
[pi-crew] background loader=jiti
|
|
145
145
|
```
|
|
146
146
|
|
|
147
|
-
|
|
147
|
+
Missing:
|
|
148
148
|
- Worker stdout/stderr
|
|
149
149
|
- Error messages
|
|
150
150
|
- Provider responses
|
|
151
151
|
- Exit codes
|
|
152
152
|
|
|
153
|
-
###
|
|
153
|
+
### Cause
|
|
154
154
|
|
|
155
155
|
`async-runner.ts` line 130-145:
|
|
156
156
|
```typescript
|
|
@@ -169,22 +169,22 @@ return {
|
|
|
169
169
|
};
|
|
170
170
|
```
|
|
171
171
|
|
|
172
|
-
**stdout/stderr
|
|
172
|
+
**The stdout/stderr of the background-runner** is written to background.log. But **child Pi workers** (spawned by the background-runner via child-pi.ts) **output to child-pi's pipe**, NOT to background.log.
|
|
173
173
|
|
|
174
174
|
Flow:
|
|
175
175
|
```
|
|
176
176
|
background-runner.ts (stdout→logFd, stderr→logFd)
|
|
177
|
-
→ loader=jiti →
|
|
177
|
+
→ loader=jiti → writes to log ✅
|
|
178
178
|
→ executeTeamRun()
|
|
179
|
-
→ child-pi.ts
|
|
180
|
-
→ Pi output → child-pi.ts captures →
|
|
179
|
+
→ child-pi.ts spawns child Pi (stdout→pipe, stderr→pipe)
|
|
180
|
+
→ Pi output → child-pi.ts captures → DOES NOT WRITE TO background.log ❌
|
|
181
181
|
```
|
|
182
182
|
|
|
183
183
|
### Fix
|
|
184
184
|
|
|
185
|
-
1. **Option A:**
|
|
186
|
-
2. **Option B:**
|
|
187
|
-
3. **Option C:** Background-runner
|
|
185
|
+
1. **Option A:** In `child-pi.ts` or `team-runner.ts`, write worker output events to background.log
|
|
186
|
+
2. **Option B:** Add event log entries for provider errors (there is an event log, but not detailed enough)
|
|
187
|
+
3. **Option C:** Background-runner tees output to a log file
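
As a rough illustration of the tee idea behind Options A and C, the sketch below forwards each captured chunk to more than one destination. It does not reflect how `child-pi.ts` or the background-runner is actually structured: the `Sink` type, the `tee` helper, and the in-memory arrays standing in for the capture buffer and `background.log` are all hypothetical.

```typescript
// Sketch only: a sink is any function that accepts a text chunk.
type Sink = (chunk: string) => void;

// Return one sink that forwards every chunk to all of the given sinks,
// so output captured from a worker can also reach the log.
function tee(...sinks: Sink[]): Sink {
  return (chunk) => {
    for (const sink of sinks) sink(chunk);
  };
}

const captured: string[] = []; // stands in for the in-memory capture of worker output
const logLines: string[] = []; // stands in for lines appended to background.log

const onWorkerOutput = tee(
  (chunk) => { captured.push(chunk); },
  (chunk) => { logLines.push(`[worker] ${chunk}`); },
);

onWorkerOutput("rate_limit_error: usage limit exceeded");
console.log(captured); // ["rate_limit_error: usage limit exceeded"]
console.log(logLines); // ["[worker] rate_limit_error: usage limit exceeded"]
```

In a real runner the second sink would append to the log file instead of an array; the point is only that the capture path and the log path receive the same chunks.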
|
|
188
188
|
|
|
189
189
|
### Key file
|
|
190
190
|
|
|
@@ -194,18 +194,18 @@ pi-crew/src/runtime/async-runner.ts — buildBackgroundSpawnOptions(), spawnBac
|
|
|
194
194
|
|
|
195
195
|
---
|
|
196
196
|
|
|
197
|
-
## Bug #4: worker-startup.ts
|
|
197
|
+
## Bug #4: worker-startup.ts missing "rate_limited" classification
|
|
198
198
|
|
|
199
199
|
| Field | Value |
|
|
200
200
|
|---|---|
|
|
201
201
|
| **Severity** | 🟡 LOW |
|
|
202
|
-
| **Status** | New —
|
|
203
|
-
| **Affected** | Error classification
|
|
204
|
-
| **Symptom** | 429 errors classified
|
|
202
|
+
| **Status** | New — discovered while debugging Bug #1 |
|
|
203
|
+
| **Affected** | Error classification and reporting |
|
|
204
|
+
| **Symptom** | 429 errors classified as "unknown" instead of "rate_limited" |
|
|
205
205
|
|
|
206
|
-
###
|
|
206
|
+
### Description
|
|
207
207
|
|
|
208
|
-
`worker-startup.ts`
|
|
208
|
+
`worker-startup.ts` has the `StartupFailureClassification` type:
|
|
209
209
|
```typescript
|
|
210
210
|
export type StartupFailureClassification =
|
|
211
211
|
| "trust_required"
|
|
@@ -216,11 +216,11 @@ export type StartupFailureClassification =
|
|
|
216
216
|
| "unknown";
|
|
217
217
|
```
|
|
218
218
|
|
|
219
|
-
|
|
219
|
+
Missing `"rate_limited"` and `"provider_error"`. Result: 429 errors are classified as `"unknown"`.
|
|
220
220
|
|
|
221
221
|
### Fix
|
|
222
222
|
|
|
223
|
-
|
|
223
|
+
Add to the type and `classifyStartupFailure` function:
|
|
224
224
|
```typescript
|
|
225
225
|
export type StartupFailureClassification =
|
|
226
226
|
| "trust_required"
|
|
@@ -245,18 +245,18 @@ pi-crew/src/runtime/worker-startup.ts — StartupFailureClassification, classif
|
|
|
245
245
|
|
|
246
246
|
---
|
|
247
247
|
|
|
248
|
-
## Bug #5: Stale heartbeat notifications
|
|
248
|
+
## Bug #5: Stale heartbeat notifications after prune
|
|
249
249
|
|
|
250
250
|
| Field | Value |
|
|
251
251
|
|---|---|
|
|
252
252
|
| **Severity** | 🟡 LOW (cosmetic) |
|
|
253
253
|
| **Status** | Confirmed |
|
|
254
254
|
| **Affected** | User experience |
|
|
255
|
-
| **Symptom** | "Task heartbeat dead" notifications
|
|
255
|
+
| **Symptom** | "Task heartbeat dead" notifications for already-removed runs |
|
|
256
256
|
|
|
257
|
-
###
|
|
257
|
+
### Description
|
|
258
258
|
|
|
259
|
-
|
|
259
|
+
After running `team prune --keep=0 --confirm=true`, the background watcher still emits notifications for pruned runs:
|
|
260
260
|
|
|
261
261
|
```
|
|
262
262
|
→ team prune: Removed 9 runs
|
|
@@ -267,23 +267,23 @@ After running `team prune --keep=0 --confirm=true`, the background watcher still em
|
|
|
267
267
|
... (6+ stale notifications)
|
|
268
268
|
```
|
|
269
269
|
|
|
270
|
-
|
|
270
|
+
Each notification triggers `get_subagent_result` → returns "not found".
|
|
271
271
|
|
|
272
|
-
###
|
|
272
|
+
### Cause
|
|
273
273
|
|
|
274
|
-
|
|
275
|
-
1.
|
|
276
|
-
2. Notifications
|
|
277
|
-
3.
|
|
274
|
+
The background watcher maintains a worker health-check queue. When runs are pruned:
|
|
275
|
+
1. The watcher does not deregister immediately
|
|
276
|
+
2. Notifications already in the queue still emit
|
|
277
|
+
3. The notifications arrive one by one, a few seconds apart
|
|
278
278
|
|
|
279
279
|
### Impact
|
|
280
280
|
|
|
281
|
-
- Confusing
|
|
282
|
-
- Wasted context:
|
|
281
|
+
- Confusing for the user: seeing "heartbeat dead" for runs that no longer exist
|
|
282
|
+
- Wasted context: each notification triggers 1 tool call to verify
|
|
283
283
|
|
|
284
284
|
### Fix
|
|
285
285
|
|
|
286
|
-
|
|
286
|
+
The background watcher should check run existence before emitting:
|
|
287
287
|
```typescript
|
|
288
288
|
// Before emitting heartbeat_dead:
|
|
289
289
|
if (!runExists(runId)) {
|
|
@@ -303,24 +303,24 @@ pi-crew/src/runtime/background-runner.ts — heartbeat monitoring loop
|
|
|
303
303
|
|
|
304
304
|
# pi-crew v0.2.20 — Bug Report
|
|
305
305
|
|
|
306
|
-
**
|
|
306
|
+
**Date:** 2026-05-19
|
|
307
307
|
**Session:** Comprehensive integration test + root cause analysis
|
|
308
308
|
**Environment:** linux/x64, Node v22.22.0, Pi CLI v0.75.3, pi-crew v0.2.20
|
|
309
309
|
|
|
310
310
|
---
|
|
311
311
|
|
|
312
|
-
## Bug #1: Background workers "heartbeat dead" —
|
|
312
|
+
## Bug #1: Background workers "heartbeat dead" — actually a MiniMax 429 Rate Limit
|
|
313
313
|
|
|
314
314
|
| Field | Value |
|
|
315
315
|
|---|---|
|
|
316
316
|
| **Severity** | 🔴 HIGH |
|
|
317
317
|
| **Status** | ✅ Fixed — 429 now retries with fallback models instead of blocking 300s |
|
|
318
|
-
| **Affected** |
|
|
319
|
-
| **Symptom** | Workers
|
|
318
|
+
| **Affected** | All background/async workers |
|
|
319
|
+
| **Symptom** | Workers time out after 300s with "heartbeat dead", zero output |
|
|
320
320
|
|
|
321
|
-
###
|
|
321
|
+
### Description
|
|
322
322
|
|
|
323
|
-
|
|
323
|
+
When running `team action='run'` with `async=true` or `Agent(run_in_background=true)`, workers spawn successfully (PID exists) but **time out after 300s** with a generic error:
|
|
324
324
|
```
|
|
325
325
|
worker.response_timeout: No output for 300000ms
|
|
326
326
|
crew.task.heartbeat_dead: Task 01_assess heartbeat dead.
|
|
@@ -328,48 +328,48 @@ crew.task.heartbeat_dead: Task 01_assess heartbeat dead.
|
|
|
328
328
|
|
|
329
329
|
### Root cause
|
|
330
330
|
|
|
331
|
-
|
|
332
|
-
1. `RETRYABLE_MODEL_FAILURE_PATTERNS`
|
|
333
|
-
2. 429
|
|
331
|
+
**Fixed.** Previously the 429 rate limit was not retried because:
|
|
332
|
+
1. `RETRYABLE_MODEL_FAILURE_PATTERNS` had `/\b429\b/` but MiniMax returns `rate_limit_error: usage limit exceeded` (no clear "429" number)
|
|
333
|
+
2. 429 was fast-failed in `child-pi.ts onJsonEvent` instead of letting the task-runner handle retry with fallback
|
|
334
334
|
|
|
335
335
|
### Fix applied
|
|
336
336
|
|
|
337
|
-
1. **model-fallback.ts**:
|
|
338
|
-
2. **model-fallback.ts**:
|
|
339
|
-
3. **child-pi.ts**:
|
|
337
|
+
1. **model-fallback.ts**: Added `/rate_limit_error/i` to `RETRYABLE_MODEL_FAILURE_PATTERNS` to correctly identify the MiniMax rate limit error
|
|
338
|
+
2. **model-fallback.ts**: Changed `/\b429\b/` → `/rate.?limit/i` to match more formats
|
|
339
|
+
3. **child-pi.ts**: Removed the 429 fast-fail — let the task-runner handle retry with the model fallback chain
|
|
340
340
|
|
|
341
341
|
### Model fallback chain
|
|
342
342
|
|
|
343
|
-
|
|
344
|
-
1.
|
|
345
|
-
2.
|
|
346
|
-
3.
|
|
343
|
+
When the main model gets a 429:
|
|
344
|
+
1. Fall back to `fallbackModels` (if configured)
|
|
345
|
+
2. Fall back to other available models in the system
|
|
346
|
+
3. If there is no fallback and retries are exhausted → fail with the correct error message
|
|
347
347
|
|
|
348
|
-
**
|
|
348
|
+
**Recommended config:** Add `fallbackModels` to the agent config to have more options when the main model is rate-limited.
|
|
349
349
|
|
|
350
350
|
---
|
|
351
351
|
|
|
352
|
-
## Bug #2: child-pi.ts
|
|
352
|
+
## Bug #2: child-pi.ts does not detect 429 rate limit error — reports wrong "heartbeat dead"
|
|
353
353
|
|
|
354
354
|
| Field | Value |
|
|
355
355
|
|---|---|
|
|
356
356
|
| **Severity** | 🔴 HIGH |
|
|
357
|
-
| **Status** | New —
|
|
358
|
-
| **Affected** |
|
|
359
|
-
| **Symptom** | Worker
|
|
357
|
+
| **Status** | New — discovered while debugging Bug #1 |
|
|
358
|
+
| **Affected** | All child Pi workers |
|
|
359
|
+
| **Symptom** | Worker reports generic "No output for 300000ms" instead of "Provider rate limit: 429" |
|
|
360
360
|
|
|
361
|
-
###
|
|
361
|
+
### Description
|
|
362
362
|
|
|
363
|
-
Pi CLI
|
|
363
|
+
The Pi CLI outputs JSON events for 429 errors very clearly:
|
|
364
364
|
```json
|
|
365
365
|
{"type":"turn_end","message":{"stopReason":"error","errorMessage":"429 {\"type\":\"error\",\"error\":{\"type\":\"rate_limit_error\"...}}"}}
|
|
366
366
|
```
|
|
367
367
|
|
|
368
|
-
|
|
369
|
-
- `isFinalAssistantEvent()` —
|
|
370
|
-
- `turn_end` —
|
|
368
|
+
But `child-pi.ts` **does not parse error events** — it only cares about:
|
|
369
|
+
- `isFinalAssistantEvent()` — to trigger the final drain
|
|
370
|
+
- `turn_end` — to count turns for turn limiting
|
|
371
371
|
|
|
372
|
-
|
|
372
|
+
Result: child-pi sees output (JSON events), **restarts the heartbeat timer**, but **does not recognize it as an error**. Pi blocks after 3 retries → heartbeat times out at 300s → generic error message.
|
|
373
373
|
|
|
374
374
|
### Code location
|
|
375
375
|
|
|
@@ -377,7 +377,7 @@ Result: child-pi sees output (JSON events), **restarts the heartbeat timer**,
|
|
|
377
377
|
```typescript
|
|
378
378
|
onJsonEvent: (event) => {
|
|
379
379
|
restartNoResponseTimer();
|
|
380
|
-
// Turn-count-based steering:
|
|
380
|
+
// Turn-count-based steering: only counts turns, does NOT check errors
|
|
381
381
|
if (event && typeof event === "object" && !Array.isArray(event)) {
|
|
382
382
|
const obj = event as Record<string, unknown>;
|
|
383
383
|
if (obj.type === "turn_end") {
|
|
@@ -391,7 +391,7 @@ onJsonEvent: (event) => {
|
|
|
391
391
|
|
|
392
392
|
### Fix
|
|
393
393
|
|
|
394
|
-
|
|
394
|
+
Add provider error detection in `onJsonEvent`:
|
|
395
395
|
```typescript
|
|
396
396
|
let providerError: string | undefined;
|
|
397
397
|
|
|
@@ -408,42 +408,42 @@ if (obj.type === "turn_end" && obj.message?.stopReason === "error") {
|
|
|
408
408
|
|
|
409
409
|
### Impact
|
|
410
410
|
|
|
411
|
-
|
|
411
|
+
This fix changes the error message from:
|
|
412
412
|
```
|
|
413
413
|
❌ "Child Pi produced no new output for 300000ms; process was terminated as unresponsive."
|
|
414
414
|
```
|
|
415
|
-
|
|
415
|
+
to:
|
|
416
416
|
```
|
|
417
417
|
✅ "Provider rate limit: 429 rate_limit_error: usage limit exceeded, resets at 2026-05-19T05:00:00Z"
|
|
418
418
|
```
|
|
419
419
|
|
|
420
|
-
|
|
420
|
+
And it **fails fast** instead of waiting 300s.
|
|
421
421
|
|
|
422
422
|
---
|
|
423
423
|
|
|
424
|
-
## Bug #3: background.log
|
|
424
|
+
## Bug #3: background.log is useless — does not capture worker output
|
|
425
425
|
|
|
426
426
|
| Field | Value |
|
|
427
427
|
|---|---|
|
|
428
428
|
| **Severity** | 🟠 MEDIUM |
|
|
429
|
-
| **Status** | New —
|
|
430
|
-
| **Affected** | Debugging experience
|
|
431
|
-
| **Symptom** | background.log
|
|
429
|
+
| **Status** | New — discovered while debugging Bug #1 |
|
|
430
|
+
| **Affected** | Debugging experience for all background runs |
|
|
431
|
+
| **Symptom** | background.log contains only 1 line: `[pi-crew] background loader=jiti` |
|
|
432
432
|
|
|
433
|
-
###
|
|
433
|
+
### Description
|
|
434
434
|
|
|
435
|
-
|
|
435
|
+
When a background worker fails, the log file at `.crew/state/runs/<id>/background.log` contains only:
|
|
436
436
|
```
|
|
437
437
|
[pi-crew] background loader=jiti
|
|
438
438
|
```
|
|
439
439
|
|
|
440
|
-
|
|
440
|
+
Missing:
|
|
441
441
|
- Worker stdout/stderr
|
|
442
442
|
- Error messages
|
|
443
443
|
- Provider responses
|
|
444
444
|
- Exit codes
|
|
445
445
|
|
|
446
|
-
###
|
|
446
|
+
### Cause
|
|
447
447
|
|
|
448
448
|
`async-runner.ts` line 130-145:
|
|
449
449
|
```typescript
|
|
@@ -462,22 +462,22 @@ return {
|
|
|
462
462
|
};
|
|
463
463
|
```
|
|
464
464
|
|
|
465
|
-
**stdout/stderr
|
|
465
|
+
**The stdout/stderr of the background-runner** is written to background.log. But **child Pi workers** (spawned by the background-runner via child-pi.ts) **output to child-pi's pipe**, NOT to background.log.
|
|
466
466
|
|
|
467
467
|
Flow:
|
|
468
468
|
```
|
|
469
469
|
background-runner.ts (stdout→logFd, stderr→logFd)
|
|
470
|
-
→ loader=jiti →
|
|
470
|
+
→ loader=jiti → writes to log ✅
|
|
471
471
|
→ executeTeamRun()
|
|
472
|
-
→ child-pi.ts
|
|
473
|
-
→ Pi output → child-pi.ts captures →
|
|
472
|
+
→ child-pi.ts spawns child Pi (stdout→pipe, stderr→pipe)
|
|
473
|
+
→ Pi output → child-pi.ts captures → DOES NOT WRITE TO background.log ❌
|
|
474
474
|
```
|
|
475
475
|
|
|
476
476
|
### Fix
|
|
477
477
|
|
|
478
|
-
1. **Option A:**
|
|
479
|
-
2. **Option B:**
|
|
480
|
-
3. **Option C:** Background-runner
|
|
478
|
+
1. **Option A:** In `child-pi.ts` or `team-runner.ts`, write worker output events to background.log
|
|
479
|
+
2. **Option B:** Add event log entries for provider errors (there is an event log, but not detailed enough)
|
|
480
|
+
3. **Option C:** Background-runner tees output to a log file
|
|
481
481
|
|
|
482
482
|
### Key file
|
|
483
483
|
|
|
@@ -487,18 +487,18 @@ pi-crew/src/runtime/async-runner.ts — buildBackgroundSpawnOptions(), spawnBac
|
|
|
487
487
|
|
|
488
488
|
---
|
|
489
489
|
|
|
490
|
-
## Bug #4: worker-startup.ts
|
|
490
|
+
## Bug #4: worker-startup.ts missing "rate_limited" classification
|
|
491
491
|
|
|
492
492
|
| Field | Value |
|
|
493
493
|
|---|---|
|
|
494
494
|
| **Severity** | 🟡 LOW |
|
|
495
|
-
| **Status** | New —
|
|
496
|
-
| **Affected** | Error classification
|
|
497
|
-
| **Symptom** | 429 errors classified
|
|
495
|
+
| **Status** | New — discovered while debugging Bug #1 |
|
|
496
|
+
| **Affected** | Error classification and reporting |
|
|
497
|
+
| **Symptom** | 429 errors classified as "unknown" instead of "rate_limited" |
|
|
498
498
|
|
|
499
|
-
###
|
|
499
|
+
### Description
|
|
500
500
|
|
|
501
|
-
`worker-startup.ts`
|
|
501
|
+
`worker-startup.ts` has the `StartupFailureClassification` type:
|
|
502
502
|
```typescript
|
|
503
503
|
export type StartupFailureClassification =
|
|
504
504
|
| "trust_required"
|
|
@@ -509,11 +509,11 @@ export type StartupFailureClassification =
|
|
|
509
509
|
| "unknown";
|
|
510
510
|
```
|
|
511
511
|
|
|
512
|
-
|
|
512
|
+
Missing `"rate_limited"` and `"provider_error"`. Result: 429 errors are classified as `"unknown"`.
|
|
513
513
|
|
|
514
514
|
### Fix
|
|
515
515
|
|
|
516
|
-
|
|
516
|
+
Add to the type and `classifyStartupFailure` function:
|
|
517
517
|
```typescript
|
|
518
518
|
export type StartupFailureClassification =
|
|
519
519
|
| "trust_required"
|
|
@@ -538,18 +538,18 @@ pi-crew/src/runtime/worker-startup.ts — StartupFailureClassification, classif
|
|
|
538
538
|
|
|
539
539
|
---
|
|
540
540
|
|
|
541
|
-
## Bug #5: Stale heartbeat notifications
|
|
541
|
+
## Bug #5: Stale heartbeat notifications after prune
|
|
542
542
|
|
|
543
543
|
| Field | Value |
|
|
544
544
|
|---|---|
|
|
545
545
|
| **Severity** | 🟡 LOW (cosmetic) |
|
|
546
546
|
| **Status** | Confirmed |
|
|
547
547
|
| **Affected** | User experience |
|
|
548
|
-
| **Symptom** | "Task heartbeat dead" notifications
|
|
548
|
+
| **Symptom** | "Task heartbeat dead" notifications for already-removed runs |
|
|
549
549
|
|
|
550
|
-
###
|
|
550
|
+
### Description
|
|
551
551
|
|
|
552
|
-
|
|
552
|
+
After running `team prune --keep=0 --confirm=true`, the background watcher still emits notifications for pruned runs:
|
|
553
553
|
|
|
554
554
|
```
|
|
555
555
|
→ team prune: Removed 9 runs
|
|
@@ -560,23 +560,23 @@ After running `team prune --keep=0 --confirm=true`, the background watcher still em
|
|
|
560
560
|
... (6+ stale notifications)
|
|
561
561
|
```
|
|
562
562
|
|
|
563
|
-
|
|
563
|
+
Each notification triggers `get_subagent_result` → returns "not found".
|
|
564
564
|
|
|
565
|
-
###
|
|
565
|
+
### Cause
|
|
566
566
|
|
|
567
|
-
|
|
568
|
-
1.
|
|
569
|
-
2. Notifications
|
|
570
|
-
3.
|
|
567
|
+
The background watcher maintains a worker health-check queue. When runs are pruned:
|
|
568
|
+
1. The watcher does not deregister immediately
|
|
569
|
+
2. Notifications already in the queue still emit
|
|
570
|
+
3. The notifications arrive one by one, a few seconds apart
|
|
571
571
|
|
|
572
572
|
### Impact
|
|
573
573
|
|
|
574
|
-
- Confusing
|
|
575
|
-
- Wasted context:
|
|
574
|
+
- Confusing for the user: seeing "heartbeat dead" for runs that no longer exist
|
|
575
|
+
- Wasted context: each notification triggers 1 tool call to verify
|
|
576
576
|
|
|
577
577
|
### Fix
|
|
578
578
|
|
|
579
|
-
|
|
579
|
+
The background watcher should check run existence before emitting:
|
|
580
580
|
```typescript
|
|
581
581
|
// Before emitting heartbeat_dead:
|
|
582
582
|
if (!runExists(runId)) {
|
|
@@ -596,24 +596,24 @@ pi-crew/src/runtime/background-runner.ts — heartbeat monitoring loop
|
|
|
596
596
|
|
|
597
597
|
# pi-crew v0.2.20 — Bug Report
|
|
598
598
|
|
|
599
|
-
**
|
|
599
|
+
**Date:** 2026-05-19
|
|
600
600
|
**Session:** Comprehensive integration test + root cause analysis
|
|
601
601
|
**Environment:** linux/x64, Node v22.22.0, Pi CLI v0.75.3, pi-crew v0.2.20
|
|
602
602
|
|
|
603
603
|
---
|
|
604
604
|
|
|
605
|
-
## Bug #1: Background workers "heartbeat dead" —
|
|
605
|
+
## Bug #1: Background workers "heartbeat dead" — actually a MiniMax 429 Rate Limit
|
|
606
606
|
|
|
607
607
|
| Field | Value |
|
|
608
608
|
|---|---|
|
|
609
609
|
| **Severity** | 🔴 HIGH |
|
|
610
610
|
| **Status** | ✅ Fixed — 429 now retries with fallback models instead of blocking 300s |
|
|
611
|
-
| **Affected** |
|
|
612
|
-
| **Symptom** | Workers
|
|
611
|
+
| **Affected** | All background/async workers |
|
|
612
|
+
| **Symptom** | Workers time out after 300s with "heartbeat dead", zero output |
|
|
613
613
|
|
|
614
|
-
###
|
|
614
|
+
### Description
|
|
615
615
|
|
|
616
|
-
|
|
616
|
+
When running `team action='run'` with `async=true` or `Agent(run_in_background=true)`, workers spawn successfully (PID exists) but **time out after 300s** with a generic error:
|
|
617
617
|
```
|
|
618
618
|
worker.response_timeout: No output for 300000ms
|
|
619
619
|
crew.task.heartbeat_dead: Task 01_assess heartbeat dead.
|
|
@@ -621,48 +621,48 @@ crew.task.heartbeat_dead: Task 01_assess heartbeat dead.
|
|
|
621
621
|
|
|
622
622
|
### Root cause
|
|
623
623
|
|
|
624
|
-
|
|
625
|
-
1. `RETRYABLE_MODEL_FAILURE_PATTERNS`
|
|
626
|
-
2. 429
|
|
624
|
+
**Fixed.** Previously the 429 rate limit was not retried because:
|
|
625
|
+
1. `RETRYABLE_MODEL_FAILURE_PATTERNS` had `/\b429\b/` but MiniMax returns `rate_limit_error: usage limit exceeded` (no clear "429" number)
|
|
626
|
+
2. 429 was fast-failed in `child-pi.ts onJsonEvent` instead of letting the task-runner handle retry with fallback
|
|
627
627
|
|
|
628
628
|
### Fix applied
|
|
629
629
|
|
|
630
|
-
1. **model-fallback.ts**:
|
|
631
|
-
2. **model-fallback.ts**:
|
|
632
|
-
3. **child-pi.ts**:
|
|
630
|
+
1. **model-fallback.ts**: Added `/rate_limit_error/i` to `RETRYABLE_MODEL_FAILURE_PATTERNS` to correctly identify the MiniMax rate limit error
|
|
631
|
+
2. **model-fallback.ts**: Changed `/\b429\b/` → `/rate.?limit/i` to match more formats
|
|
632
|
+
3. **child-pi.ts**: Removed the 429 fast-fail — let the task-runner handle retry with the model fallback chain
|
|
633
633
|
|
|
634
634
|
### Model fallback chain
|
|
635
635
|
|
|
636
|
-
|
|
637
|
-
1.
|
|
638
|
-
2.
|
|
639
|
-
3.
|
|
636
|
+
When the main model gets a 429:
|
|
637
|
+
1. Fall back to `fallbackModels` (if configured)
|
|
638
|
+
2. Fall back to other available models in the system
|
|
639
|
+
3. If there is no fallback and retries are exhausted → fail with the correct error message
|
|
640
640
|
|
|
641
|
-
**
|
|
641
|
+
**Recommended config:** Add `fallbackModels` to the agent config so the task-runner has alternatives to try when the main model is rate-limited.
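
As a rough sketch of the ordering described in the list above: the function below builds the sequence of models to try. The `AgentModelConfig` shape, the `buildModelAttemptOrder` name, and the model names are assumptions for illustration; only the `fallbackModels` field name comes from this document.

```typescript
// Hypothetical config shape; the real agent config may name or nest
// these fields differently (only `fallbackModels` appears in the source).
interface AgentModelConfig {
  model: string;
  fallbackModels?: string[];
}

// Order: main model, then configured fallbacks, then any other available
// models. Duplicates are dropped so no model is attempted twice.
function buildModelAttemptOrder(
  config: AgentModelConfig,
  availableModels: string[],
): string[] {
  const ordered = [config.model, ...(config.fallbackModels || []), ...availableModels];
  return ordered.filter((name, index) => ordered.indexOf(name) === index);
}

// Placeholder model names, not real identifiers.
const attemptOrder = buildModelAttemptOrder(
  { model: "main-model", fallbackModels: ["fallback-a", "fallback-b"] },
  ["fallback-b", "other-model", "main-model"],
);
console.log(attemptOrder);
// [ 'main-model', 'fallback-a', 'fallback-b', 'other-model' ]
```

If every model in the resulting order is rate-limited, the run would fail with the provider's error rather than a generic timeout, matching step 3.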
|
|
642
642
|
|
|
643
643
|
---
|
|
644
644
|
|
|
645
|
-
## Bug #2: child-pi.ts
|
|
645
|
+
## Bug #2: child-pi.ts does not detect 429 rate limit error — reports wrong "heartbeat dead"
|
|
646
646
|
|
|
647
647
|
| Field | Value |
|
|
648
648
|
|---|---|
|
|
649
649
|
| **Severity** | 🔴 HIGH |
|
|
650
|
-
| **Status** | New —
|
|
651
|
-
| **Affected** |
|
|
652
|
-
| **Symptom** | Worker
|
|
650
|
+
| **Status** | New — discovered while debugging Bug #1 |
|
|
651
|
+
| **Affected** | All child Pi workers |
|
|
652
|
+
| **Symptom** | Worker reports generic "No output for 300000ms" instead of "Provider rate limit: 429" |
|
|
653
653
|
|
|
654
|
-
###
|
|
654
|
+
### Description
|
|
655
655
|
|
|
656
|
-
Pi CLI
|
|
656
|
+
The Pi CLI outputs JSON events for 429 errors very clearly:
|
|
657
657
|
```json
|
|
658
658
|
{"type":"turn_end","message":{"stopReason":"error","errorMessage":"429 {\"type\":\"error\",\"error\":{\"type\":\"rate_limit_error\"...}}"}}
|
|
659
659
|
```
|
|
660
660
|
|
|
661
|
-
|
|
662
|
-
- `isFinalAssistantEvent()` —
|
|
663
|
-
- `turn_end` —
|
|
661
|
+
But `child-pi.ts` **does not parse error events** — it only cares about:
|
|
662
|
+
- `isFinalAssistantEvent()` — to trigger the final drain
|
|
663
|
+
- `turn_end` — to count turns for turn limiting
|
|
664
664
|
|
|
665
|
-
|
|
665
|
+
Result: child-pi sees output (JSON events) and **restarts the heartbeat timer**, but **does not recognize the event as an error**. Pi blocks after 3 retries → the heartbeat times out at 300s → the worker reports a generic error message.
|
|
666
666
|
|
|
667
667
|
### Code location
|
|
668
668
|
|
|
@@ -670,7 +670,7 @@ Result: child-pi sees output (JSON events), **restarts the heartbeat timer**,
|
|
|
670
670
|
```typescript
|
|
671
671
|
onJsonEvent: (event) => {
|
|
672
672
|
restartNoResponseTimer();
|
|
673
|
-
// Turn-count-based steering:
|
|
673
|
+
// Turn-count-based steering: only counts turns, does NOT check errors
|
|
674
674
|
if (event && typeof event === "object" && !Array.isArray(event)) {
|
|
675
675
|
const obj = event as Record<string, unknown>;
|
|
676
676
|
if (obj.type === "turn_end") {
|
|
@@ -684,7 +684,7 @@ onJsonEvent: (event) => {
|
|
|
684
684
|
|
|
685
685
|
### Fix
|
|
686
686
|
|
|
687
|
-
|
|
687
|
+
Add provider error detection in `onJsonEvent`:
|
|
688
688
|
```typescript
|
|
689
689
|
let providerError: string | undefined;
|
|
690
690
|
|
|
@@ -701,42 +701,42 @@ if (obj.type === "turn_end" && obj.message?.stopReason === "error") {
|
|
|
701
701
|
|
|
702
702
|
### Impact
|
|
703
703
|
|
|
704
|
-
|
|
704
|
+
This fix changes the error message from:
|
|
705
705
|
```
|
|
706
706
|
❌ "Child Pi produced no new output for 300000ms; process was terminated as unresponsive."
|
|
707
707
|
```
|
|
708
|
-
|
|
708
|
+
to:
|
|
709
709
|
```
|
|
710
710
|
✅ "Provider rate limit: 429 rate_limit_error: usage limit exceeded, resets at 2026-05-19T05:00:00Z"
|
|
711
711
|
```
|
|
712
712
|
|
|
713
|
-
|
|
713
|
+
The worker also **fails fast** instead of waiting 300s.
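
A self-contained sketch of the detection step, assuming the event shape from the sample `turn_end` event in the Description. `extractProviderError` is a hypothetical helper name; the actual change in `child-pi.ts` may be structured differently.

```typescript
// Returns the provider error message when a JSON event is a failed turn,
// otherwise undefined. Field names follow the sample event in the Description.
function extractProviderError(event: unknown): string | undefined {
  if (!event || typeof event !== "object" || Array.isArray(event)) return undefined;
  const obj = event as Record<string, unknown>;
  if (obj.type !== "turn_end") return undefined;

  const message = obj.message;
  if (!message || typeof message !== "object") return undefined;
  const msg = message as Record<string, unknown>;
  if (msg.stopReason !== "error") return undefined;

  // Fall back to a placeholder when the provider gave no message text.
  return typeof msg.errorMessage === "string" ? msg.errorMessage : "unknown provider error";
}

const failedTurn: unknown = JSON.parse(
  '{"type":"turn_end","message":{"stopReason":"error","errorMessage":"429 rate_limit_error"}}',
);
const normalTurn: unknown = JSON.parse('{"type":"turn_end","message":{"stopReason":"stop"}}');

console.log(extractProviderError(failedTurn)); // "429 rate_limit_error"
console.log(extractProviderError(normalTurn)); // undefined
```

Narrowing through `Record<string, unknown>` avoids optional chaining on an `unknown` value, which TypeScript would reject.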
|
|
714
714
|
|
|
715
715
|
---
|
|
716
716
|
|
|
717
|
-
## Bug #3: background.log
|
|
717
|
+
## Bug #3: background.log is useless — does not capture worker output
|
|
718
718
|
|
|
719
719
|
| Field | Value |
|
|
720
720
|
|---|---|
|
|
721
721
|
| **Severity** | 🟠 MEDIUM |
|
|
722
|
-
| **Status** | New —
|
|
723
|
-
| **Affected** | Debugging experience
|
|
724
|
-
| **Symptom** | background.log
|
|
722
|
+
| **Status** | New — discovered while debugging Bug #1 |
|
|
723
|
+
| **Affected** | Debugging experience for all background runs |
|
|
724
|
+
| **Symptom** | background.log contains only 1 line: `[pi-crew] background loader=jiti` |
|
|
725
725
|
|
|
726
|
-
###
|
|
726
|
+
### Description
|
|
727
727
|
|
|
728
|
-
|
|
728
|
+
When a background worker fails, the log file at `.crew/state/runs/<id>/background.log` contains only:
|
|
729
729
|
```
|
|
730
730
|
[pi-crew] background loader=jiti
|
|
731
731
|
```
|
|
732
732
|
|
|
733
|
-
|
|
733
|
+
Missing:
|
|
734
734
|
- Worker stdout/stderr
|
|
735
735
|
- Error messages
|
|
736
736
|
- Provider responses
|
|
737
737
|
- Exit codes
|
|
738
738
|
|
|
739
|
-
###
|
|
739
|
+
### Cause
|
|
740
740
|
|
|
741
741
|
`async-runner.ts` line 130-145:
|
|
742
742
|
```typescript
|
|
@@ -755,22 +755,22 @@ return {
|
|
|
755
755
|
};
|
|
756
756
|
```
|
|
757
757
|
|
|
758
|
-
**stdout/stderr
|
|
758
|
+
**The stdout/stderr of the background-runner** is written to background.log. However, **child Pi workers** (spawned by the background-runner via child-pi.ts) **write to child-pi's pipe**, not to background.log.
|
|
759
759
|
|
|
760
760
|
Flow:
|
|
761
761
|
```
|
|
762
762
|
background-runner.ts (stdout→logFd, stderr→logFd)
|
|
763
|
-
→ loader=jiti →
|
|
763
|
+
→ loader=jiti → writes to log ✅
|
|
764
764
|
→ executeTeamRun()
|
|
765
|
-
→ child-pi.ts
|
|
766
|
-
→ Pi output → child-pi.ts captures →
|
|
765
|
+
→ child-pi.ts spawns child Pi (stdout→pipe, stderr→pipe)
|
|
766
|
+
→ Pi output → child-pi.ts captures → DOES NOT WRITE TO background.log ❌
|
|
767
767
|
```
|
|
768
768
|
|
|
769
769
|
### Fix
|
|
770
770
|
|
|
771
|
-
1. **Option A:**
|
|
772
|
-
2. **Option B:**
|
|
773
|
-
3. **Option C:** Background-runner
|
|
771
|
+
1. **Option A:** In `child-pi.ts` or `team-runner.ts`, write worker output events to background.log
|
|
772
|
+
2. **Option B:** Add event log entries for provider errors (an event log exists, but it is not detailed enough)
|
|
773
|
+
3. **Option C:** Background-runner tees output to a log file
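
Option A could look roughly like the sketch below. Both function names and the line format are hypothetical; `append` stands in for whatever actually writes to background.log (for example a thin wrapper around Node's `fs.appendFileSync`).

```typescript
// Hypothetical helper: turns one worker event into a single log line.
// The line format is an assumption, loosely modelled on the existing
// "[pi-crew] background loader=jiti" line.
function formatWorkerLogLine(taskId: string, event: unknown, now: Date = new Date()): string {
  return `[pi-crew] ${now.toISOString()} task=${taskId} event=${JSON.stringify(event)}`;
}

// Hypothetical sink: the caller decides where lines go, which keeps the
// formatting logic testable without touching the file system.
function logWorkerEvent(
  append: (line: string) => void,
  taskId: string,
  event: unknown,
  now: Date = new Date(),
): void {
  append(formatWorkerLogLine(taskId, event, now) + "\n");
}

// Usage with an in-memory sink instead of a real file.
const captured: string[] = [];
logWorkerEvent((line) => captured.push(line), "01_assess", { type: "turn_end" }, new Date(0));
console.log(captured[0]);
```

Passing the sink as a parameter is a design choice for the sketch only; the real code may write to the log file descriptor directly.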
|
|
774
774
|
|
|
775
775
|
### Key file
|
|
776
776
|
|
|
@@ -780,18 +780,18 @@ pi-crew/src/runtime/async-runner.ts — buildBackgroundSpawnOptions(), spawnBac
|
|
|
780
780
|
|
|
781
781
|
---
|
|
782
782
|
|
|
783
|
-
## Bug #4: worker-startup.ts
|
|
783
|
+
## Bug #4: worker-startup.ts missing "rate_limited" classification
|
|
784
784
|
|
|
785
785
|
| Field | Value |
|
|
786
786
|
|---|---|
|
|
787
787
|
| **Severity** | 🟡 LOW |
|
|
788
|
-
| **Status** | New —
|
|
789
|
-
| **Affected** | Error classification
|
|
790
|
-
| **Symptom** | 429 errors classified
|
|
788
|
+
| **Status** | New — discovered while debugging Bug #1 |
|
|
789
|
+
| **Affected** | Error classification and reporting |
|
|
790
|
+
| **Symptom** | 429 errors classified as "unknown" instead of "rate_limited" |
|
|
791
791
|
|
|
792
|
-
###
|
|
792
|
+
### Description
|
|
793
793
|
|
|
794
|
-
`worker-startup.ts`
|
|
794
|
+
`worker-startup.ts` has the `StartupFailureClassification` type:
|
|
795
795
|
```typescript
|
|
796
796
|
export type StartupFailureClassification =
|
|
797
797
|
| "trust_required"
|
|
@@ -802,11 +802,11 @@ export type StartupFailureClassification =
|
|
|
802
802
|
| "unknown";
|
|
803
803
|
```
|
|
804
804
|
|
|
805
|
-
|
|
805
|
+
The type is missing `"rate_limited"` and `"provider_error"`, so 429 errors are classified as `"unknown"`.
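
To make the gap concrete, here is a minimal sketch of a classifier that distinguishes the two missing cases. It uses a reduced, hypothetical union (`SketchClassification`) rather than the real `StartupFailureClassification`, and the `provider_error` patterns are placeholders, not values taken from `worker-startup.ts`.

```typescript
// Reduced, hypothetical union: only the two new members plus "unknown".
// The real StartupFailureClassification has more members than shown here.
type SketchClassification = "rate_limited" | "provider_error" | "unknown";

// Hypothetical function name; the real logic lives in classifyStartupFailure.
function classifyProviderFailure(errorText: string): SketchClassification {
  // Rate limits are checked first so a 429 is never reported as generic.
  if (/rate.?limit/i.test(errorText) || /\b429\b/.test(errorText)) {
    return "rate_limited";
  }
  // Placeholder patterns for other provider-side failures (assumption).
  if (/api_error|server_error|overloaded/i.test(errorText)) {
    return "provider_error";
  }
  return "unknown";
}

console.log(classifyProviderFailure("429 rate_limit_error: usage limit exceeded")); // "rate_limited"
console.log(classifyProviderFailure("api_error: internal failure"));                // "provider_error"
console.log(classifyProviderFailure("segmentation fault"));                         // "unknown"
```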
|
|
806
806
|
|
|
807
807
|
### Fix
|
|
808
808
|
|
|
809
|
-
|
|
809
|
+
Add both values to the type and handle them in the `classifyStartupFailure` function:
|
|
810
810
|
```typescript
|
|
811
811
|
export type StartupFailureClassification =
|
|
812
812
|
| "trust_required"
|
|
@@ -831,18 +831,18 @@ pi-crew/src/runtime/worker-startup.ts — StartupFailureClassification, classif
|
|
|
831
831
|
|
|
832
832
|
---
|
|
833
833
|
|
|
834
|
-
## Bug #5: Stale heartbeat notifications
|
|
834
|
+
## Bug #5: Stale heartbeat notifications after prune
|
|
835
835
|
|
|
836
836
|
| Field | Value |
|
|
837
837
|
|---|---|
|
|
838
838
|
| **Severity** | 🟡 LOW (cosmetic) |
|
|
839
839
|
| **Status** | Confirmed |
|
|
840
840
|
| **Affected** | User experience |
|
|
841
|
-
| **Symptom** | "Task heartbeat dead" notifications
|
|
841
|
+
| **Symptom** | "Task heartbeat dead" notifications for already-removed runs |
|
|
842
842
|
|
|
843
|
-
###
|
|
843
|
+
### Description
|
|
844
844
|
|
|
845
|
-
|
|
845
|
+
After running `team prune --keep=0 --confirm=true`, the background watcher still emits notifications for pruned runs:
|
|
846
846
|
|
|
847
847
|
```
|
|
848
848
|
→ team prune: Removed 9 runs
|
|
@@ -853,23 +853,23 @@ After running `team prune --keep=0 --confirm=true`, the background watcher still em
|
|
|
853
853
|
... (6+ stale notifications)
|
|
854
854
|
```
|
|
855
855
|
|
|
856
|
-
|
|
856
|
+
Each notification triggers a `get_subagent_result` call, which returns "not found".
|
|
857
857
|
|
|
858
|
-
###
|
|
858
|
+
### Cause
|
|
859
859
|
|
|
860
|
-
|
|
861
|
-
1.
|
|
862
|
-
2. Notifications
|
|
863
|
-
3.
|
|
860
|
+
The background watcher maintains a worker health-check queue. When runs are pruned:
|
|
861
|
+
1. The watcher does not deregister the pruned runs immediately
|
|
862
|
+
2. Notifications already in the queue still emit
|
|
863
|
+
3. The notifications arrive one by one, a few seconds apart
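
The queue behaviour above can be sketched as follows: notifications that were enqueued before the prune are still pending, and only a check at emit time can drop them. `PendingNotification` and `dropStaleNotifications` are hypothetical names, and `runExists` stands in for whatever existence check the watcher uses.

```typescript
// Hypothetical shape of a queued heartbeat notification.
interface PendingNotification {
  runId: string;
  taskId: string;
}

// Keeps only notifications whose run still exists at emit time.
// `runExists` is injected so the sketch does not depend on the file system.
function dropStaleNotifications(
  pending: PendingNotification[],
  runExists: (runId: string) => boolean,
): PendingNotification[] {
  return pending.filter((notification) => runExists(notification.runId));
}

// Two notifications were queued; run "run-a" was pruned afterwards.
const queued: PendingNotification[] = [
  { runId: "run-a", taskId: "01_assess" },
  { runId: "run-b", taskId: "01_assess" },
];
const liveRuns = ["run-b"];
const toEmit = dropStaleNotifications(queued, (runId) => liveRuns.indexOf(runId) !== -1);

console.log(toEmit.map((n) => n.runId)); // [ 'run-b' ]
```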
|
|
864
864
|
|
|
865
865
|
### Impact
|
|
866
866
|
|
|
867
|
-
- Confusing
|
|
868
|
-
- Wasted context:
|
|
867
|
+
- Confusing for the user: seeing "heartbeat dead" for runs that no longer exist
|
|
868
|
+
- Wasted context: each notification triggers 1 tool call to verify
|
|
869
869
|
|
|
870
870
|
### Fix
|
|
871
871
|
|
|
872
|
-
|
|
872
|
+
The background watcher should check run existence before emitting:
|
|
873
873
|
```typescript
|
|
874
874
|
// Before emitting heartbeat_dead:
|
|
875
875
|
if (!runExists(runId)) {
|
|
@@ -887,18 +887,18 @@ pi-crew/src/runtime/background-runner.ts — heartbeat monitoring loop
|
|
|
887
887
|
|
|
888
888
|
---
|
|
889
889
|
|
|
890
|
-
## Bug #6: Live-session run
|
|
890
|
+
## Bug #6: Live-session run cancelled mid-execution
|
|
891
891
|
|
|
892
892
|
| Field | Value |
|
|
893
893
|
|---|---|
|
|
894
894
|
| **Severity** | 🟠 MEDIUM |
|
|
895
|
-
| **Status** | ✅ Confirmed — no code fix needed; documented as user workflow constraint |
|
|
895
|
+
| **Status** | ✅ Confirmed — no code fix needed; documented as a user workflow constraint |
|
|
896
896
|
| **Affected** | Foreground team runs |
|
|
897
|
-
| **Symptom** | Run cancelled
|
|
897
|
+
| **Symptom** | Run cancelled after the explore phase completes, before the execute phase |
|
|
898
898
|
|
|
899
|
-
###
|
|
899
|
+
### Description
|
|
900
900
|
|
|
901
|
-
|
|
901
|
+
A fast-fix team ran in a live-session:
|
|
902
902
|
```
|
|
903
903
|
04:12:20 live-session.prompt_start 01_explore
|
|
904
904
|
04:12:51 live-session.prompt_done 01_explore (31s, completed)
|
|
@@ -907,18 +907,18 @@ A fast-fix team ran in a live-session:
|
|
|
907
907
|
04:12:51 run.cancelled: "This operation was aborted"
|
|
908
908
|
```
|
|
909
909
|
|
|
910
|
-
Task `01_explore`
|
|
910
|
+
Task `01_explore` completed successfully, but the run was cancelled before `02_execute` started.
|
|
911
911
|
|
|
912
|
-
###
|
|
912
|
+
### Possible causes
|
|
913
913
|
|
|
914
|
-
1. **Session concurrency limit** —
|
|
914
|
+
1. **Session concurrency limit** — only 1 active live-session, conflicting with parallel test operations
|
|
915
915
|
2. **User-initiated cancellation** — accidentally triggered
|
|
916
|
-
3. **Workflow phase transition bug** —
|
|
916
|
+
3. **Workflow phase transition bug** — does not trigger the next phase after explore completes
|
|
917
917
|
|
|
918
|
-
###
|
|
918
|
+
### Needs further investigation
|
|
919
919
|
|
|
920
|
-
-
|
|
921
|
-
- Check live-session-runtime.ts
|
|
920
|
+
- Run the fast-fix team standalone (no concurrent operations)
|
|
921
|
+
- Check live-session-runtime.ts for phase-transition logic
|
|
922
922
|
|
|
923
923
|
---
|
|
924
924
|
|
|
@@ -926,16 +926,16 @@ Task `01_explore` completed successfully, but the run was cancelled before
|
|
|
926
926
|
|
|
927
927
|
| # | Bug | Severity | Status | Category |
|
|
928
928
|
|---|---|---|---|---|
|
|
929
|
-
| 1 | Background workers timeout
|
|
930
|
-
| 2 | child-pi.ts
|
|
931
|
-
| 3 | background.log
|
|
932
|
-
| 4 | worker-startup.ts
|
|
933
|
-
| 5 | Stale heartbeat notifications
|
|
934
|
-
| 6 | Live-session foreground run
|
|
935
|
-
| 7 | Async notifier "stale ctx" — dies,
|
|
929
|
+
| 1 | Background workers timeout due to MiniMax 429 | 🔴 HIGH | ✅ Fixed — 429 now retries with fallback models via improved RETRYABLE_MODEL_FAILURE_PATTERNS | Code |
|
|
930
|
+
| 2 | child-pi.ts does not detect 429, reports wrong "heartbeat dead" | 🔴 HIGH | ✅ Fixed — removed 429 fast-fail; let task-runner handle retry+fallback | Code |
|
|
931
|
+
| 3 | background.log useless, does not capture worker output | 🟠 MEDIUM | ✅ Fixed — added PI_CREW_BACKGROUND_MODE flag + event logging to background.log | Observability |
|
|
932
|
+
| 4 | worker-startup.ts missing rate_limited classification | 🟡 LOW | ✅ Fixed — added rate_limited + provider_error to StartupFailureClassification | Code |
|
|
933
|
+
| 5 | Stale heartbeat notifications after prune | 🟡 LOW | ✅ Fixed — HeartbeatWatcher skips pruned runs via stateRoot existence check | UX |
|
|
934
|
+
| 6 | Live-session foreground run cancelled when there are concurrent tool calls | 🟠 MEDIUM | ✅ Confirmed — concurrent calls interrupt live-session → outputLength:0 → caller_cancelled. Avoid concurrent team actions during foreground runs. | Runtime |
|
|
935
|
+
| 7 | Async notifier "stale ctx" — dies, does not restart after Pi restart | 🔴 HIGH | ✅ Fixed — swallow stale error, isCurrent guard handles dormancy | Code |
|
|
936
936
|
| 8 | Background child-process 300s timeout — child Pi hangs, zero output | 🟠 MEDIUM | ✅ Fixed — Root cause found (Bug #10): MINIMAX_API_KEY stripped by sanitizeEnvSecrets(). Allow-list in child-pi.ts preserves model provider API keys. Restart Pi to verify fix. | Code |
|
|
937
|
-
| 9 | Executor hit yield limit — file write
|
|
938
|
-
| 10 | Child-process silent timeout — MINIMAX_API_KEY
|
|
937
|
+
| 9 | Executor hit yield limit — file write not completed | 🟡 LOW | 🔲 Open — executor hit 3 Yield Reminders and terminated before writing file. Task marked completed but artifact missing. | Runtime |
|
|
938
|
+
| 10 | Child-process silent timeout — MINIMAX_API_KEY filtered out of child env | 🔴 HIGH | ✅ Fixed — sanitizeEnvSecrets() strips *API_KEY* vars. Allow-list in buildChildPiSpawnOptions preserves model provider keys (MINIMAX_*, OPENAI_*, etc.). See docs/fixes/bug-010-child-process-api-key-filtered.md | Code |
|
|
939
939
|
|
|
940
940
|
|
|
941
941
|
| 11 | Background runner "spawn pi ENOENT" — pi binary not in PATH | 🔴 HIGH | ✅ Fixed — added resolvePiCliScript() call for non-Windows platforms in getPiSpawnCommand(). Restart Pi to verify. | Code |
|
|
@@ -945,7 +945,7 @@ Task `01_explore` completed successfully, but the run was cancelled before
|
|
|
945
945
|
### Priority fix order
|
|
946
946
|
|
|
947
947
|
1. **Bug #1** — ✅ Fixed — 429 now retried with model fallback chain
|
|
948
|
-
2. **Bug #2** — ✅ Fixed — removed fast-fail
|
|
948
|
+
2. **Bug #2** — ✅ Fixed — removed 429 fast-fail
|
|
949
949
|
3. **Bug #3** — ✅ Fixed — worker events now logged to background.log
|
|
950
950
|
4. **Bug #4** — ✅ Fixed — rate_limited + provider_error classification added
|
|
951
951
|
5. **Bug #5** — ✅ Fixed — HeartbeatWatcher skips pruned runs
|