pi-crew 0.7.4 → 0.7.6
- package/CHANGELOG.md +79 -0
- package/README.md +11 -11
- package/docs/commands-reference.md +14 -10
- package/docs/troubleshooting.md +131 -0
- package/docs/usage.md +9 -4
- package/package.json +1 -1
- package/src/config/config.ts +11 -4
- package/src/config/types.ts +2 -0
- package/src/errors.ts +66 -0
- package/src/extension/action-suggestions.ts +71 -0
- package/src/extension/context-status-injection.ts +174 -0
- package/src/extension/knowledge-injection.ts +29 -1
- package/src/extension/register.ts +81 -65
- package/src/extension/team-tool/api.ts +3 -2
- package/src/extension/team-tool/cancel.ts +5 -4
- package/src/extension/team-tool/explain.ts +2 -1
- package/src/extension/team-tool/failure-patterns.ts +124 -0
- package/src/extension/team-tool/inspect.ts +10 -6
- package/src/extension/team-tool/lifecycle-actions.ts +5 -4
- package/src/extension/team-tool/respond.ts +4 -3
- package/src/extension/team-tool/run-not-found.ts +54 -0
- package/src/extension/team-tool/run.ts +26 -4
- package/src/extension/team-tool/status.ts +58 -4
- package/src/extension/team-tool.ts +5 -3
- package/src/runtime/async-runner.ts +7 -0
- package/src/runtime/background-runner.ts +7 -1
- package/src/runtime/chain-parser.ts +13 -5
- package/src/runtime/checkpoint.ts +13 -1
- package/src/runtime/child-pi.ts +9 -1
- package/src/runtime/live-session-runtime.ts +15 -1
- package/src/runtime/parent-guard.ts +2 -2
- package/src/runtime/pipeline-runner.ts +3 -1
- package/src/runtime/stale-reconciler.ts +28 -4
- package/src/runtime/task-runner.ts +50 -20
- package/src/runtime/team-runner.ts +19 -2
- package/src/runtime/verification-gates.ts +21 -1
- package/src/runtime/workspace-tree.ts +28 -2
- package/src/schema/team-tool-schema.ts +9 -0
- package/src/state/blob-store.ts +12 -10
- package/src/state/event-log-rotation.ts +114 -93
- package/src/state/event-log.ts +83 -23
- package/src/state/health-store.ts +6 -1
- package/src/state/locks.ts +66 -16
- package/src/state/state-store.ts +46 -2
- package/src/ui/card-colors.ts +7 -3
- package/src/ui/dashboard-panes/agents-pane.ts +15 -2
- package/src/ui/live-duration.ts +58 -0
- package/src/ui/tool-render.ts +7 -11
- package/src/ui/tool-renderers/index.ts +6 -3
- package/src/ui/widget/widget-formatters.ts +2 -13
- package/src/utils/fs-watch.ts +11 -60
- package/src/utils/run-watcher-registry.ts +164 -0
- package/src/workflows/discover-workflows.ts +2 -1
- package/src/workflows/workflow-config.ts +5 -0
- package/src/runtime/dynamic-script-runner.ts +0 -497
- package/src/runtime/sandbox.ts +0 -335
package/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,84 @@
|
|
|
1
1
|
# Changelog
|
|
2
2
|
|
|
3
|
+
## [0.7.6] — DX, observability, and a critical interactive-session hang fix (2026-06-16)
|
|
4
|
+
|
|
5
|
+
This release bundles Rounds 16–28: a developer-experience pass, an observability pass, and eight correctness/security audits — culminating in the **fix for the pts/2 interactive-session busy-loop hang** (two separate Pi sessions had hung at 71.5% CPU with 339 inotify watches). All 24 commits passed CI on Windows, Ubuntu, and macOS.
|
|
6
|
+
|
|
7
|
+
### 🚨 Critical — interactive-session hang (Round 28 + pts/2 investigation)
|
|
8
|
+
|
|
9
|
+
Report: `/home/bom/pts2-hang-investigation-2026-06-16.md`. Three root causes, all fixed:
|
|
10
|
+
|
|
11
|
+
- **BUG C (CRITICAL): recursive watcher busy-loop** — `watchCrewState` used `fs.watch(<crewRoot>/state, {recursive:true})`. On Linux, Node implements "recursive" as ONE inotify watch PER SUBDIRECTORY, so with many historical runs under `.crew/state/runs/` this ballooned to hundreds of watches (109→339 observed) and caused a permanent busy-loop even with no active work. **Fix**: new `src/utils/run-watcher-registry.ts` (`RunWatcherRegistry`) — one non-recursive watcher on the `runs/` root (for new-run detection, since `crew.run.created` is never emitted) + one non-recursive watcher per **active** run, reconciled each preload tick against `running`/`queued`/`planning` status. Total inotify cost is now O(active runs) — typically 1–5 — not O(total history). Completed runs leave the active set and their watcher closes within one tick. The dead `createRecursiveWatcher` / `watchCrewState` / `runIdFromStateRelativePath` primitives were deleted from `fs-watch.ts`.
|
|
12
|
+
- **BUG A (MEDIUM): health double-join path** — `HEALTH_DIR = ".crew/state/health"` was joined with a `crewRoot` computed only 2 `dirname`s up, writing to `.crew/state/.crew/state/health` — a path **no code ever reads**. It produced a growing ghost subtree that the recursive watcher then walked. **Fix**: `crewRoot` = 3 `dirname`s up; `HEALTH_DIR` = `"state/health"`.
|
|
13
|
+
- **BUG B (MEDIUM): OTLP CRLF injection** — header-value validation left CR (0x0D) and LF (0x0A) unblocked, enabling header-splitting / log-injection via crafted values. **Fix**: regex now `/[\x00-\x08\x0a-\x1f]/`.
|
|
14
|
+
|
|
15
|
+
Cleanup: 246 orphaned health snapshots (~1 MB) across 4 bogus `.crew/state/.crew/state/` subtrees were removed.
|
|
16
|
+
|
|
17
|
+
### Correctness audits (Rounds 22–27)
|
|
18
|
+
|
|
19
|
+
- **Round 27 — resource leaks**: (1) orphaned heartbeat timer in the team-runner catch block (`stopTeamHeartbeat()` never called on the error path; non-unref'd 30s interval kept the event loop alive → foreground pi hung); (2) FD leak in background-runner (`fs.openSync` without `closeSync`); (3) pipe FD leak + potential deadlock in async-runner (piped stdout/stderr never drained → >64 KB blocks forever); (4) AbortSignal listener leak in child-pi + live-session-runtime (anonymous `{once:true}` listeners never removed on normal completion).
|
|
20
|
+
- **Round 26 — cross-process file-locking** (5 bugs): TOCTOU split-read in `acquireLockWithRetry` (single-snapshot read closes the window); racy pre-acquisition target cleanup in `withFileLockSync` (removed); crash-between-mkdir-and-pidFile wedge (mtime-based stale check); PID-recycling wedge (mtime checked first for all holders); non-token-guarded release (PID-guarded removal).
|
|
21
|
+
- **Round 25 — security**: deleted two vulnerable dead modules — `sandbox.ts` (CRITICAL VM sandbox escape) and `dynamic-script-runner.ts` (HIGH `skip-validateScript`) — totalling −1701 LOC across 2 source + 5 test files. Plus closed verification-gate newline + `$VARNAME` injection (DANGEROUS_SHELL_PATTERNS extended).
|
|
22
|
+
- **Round 24 — event-log deadlock**: `appendEventInsideLock` (already inside `withEventLogLockSync`) called the public `compactEventLog`/`rotateEventLog` which re-acquired the same non-reentrant mkdir lock → 5 s timeout → compaction never ran → unbounded log growth → events silently dropped past 50 MB. Fix: extracted `prepareCompaction` / `applyCompactionUnlocked` / `rotateEventLogUnlocked` into `event-log-rotation.ts`.
|
|
23
|
+
- **Round 23 — UI correctness**: negative live duration in `agents-pane.ts` (shared `src/ui/live-duration.ts`); Unicode width/truncation bugs in `card-colors.ts`, `tool-renderers/index.ts`, `tool-render.ts`.
|
|
24
|
+
- **Round 22 — reliability**: checkpoint `.tmp.checkpoint` was reused across concurrent saves (cross-process data corruption → now unique per save); chain-parser had no recursion-depth limit (now `MAX_CHAIN_NESTING=100`).
|
|
25
|
+
|
|
26
|
+
### Developer experience (Round 16)
|
|
27
|
+
|
|
28
|
+
- **F1 "Did you mean?"** suggestions on unknown team actions.
|
|
29
|
+
- **F2 recovery hint** on all "Run not found" errors.
|
|
30
|
+
- **F3 compact status mode** (`details=false`) for low-noise polling.
|
|
31
|
+
- **F4 config errors surfaced** on the run path.
|
|
32
|
+
- **F5 pipeline dead-end redirect** — unsupported `action=pipeline` now points at a working workflow.
|
|
33
|
+
- **F6 troubleshooting guide** added at `docs/troubleshooting.md`; usage.md config path fixed.
|
|
34
|
+
|
|
35
|
+
### Observability (Round 17)
|
|
36
|
+
|
|
37
|
+
- **Progress % + ETA** in `status`; run age in the ambient context note.
|
|
38
|
+
- **Per-agent cost** in the dashboard + status output.
|
|
39
|
+
- **Aggregate failure patterns** in the run summary.
|
|
40
|
+
|
|
41
|
+
### Features
|
|
42
|
+
|
|
43
|
+
- **Round 21 (E4): `preStepOptional`** — advisory pre-step hooks that don't fail the run. Opt-in (`preStepOptional: true` on a `WorkflowStep`); fail-fast remains the default.
|
|
44
|
+
- **Round 18 (defense-in-depth)**: capped `suggestAction` input length.
|
|
45
|
+
|
|
46
|
+
### Tests
|
|
47
|
+
|
|
48
|
+
- +60+ tests across Rounds 16–28 (run-watcher-registry: 12, event-log deadlock: 5, injection guards: 6, file/event-log locks: 8, plus UI, DX, observability, and test-isolation coverage). 4955 pass / 0 fail. Test health pass restored the false-confidence security suite.
|
|
49
|
+
|
|
50
|
+
### Documentation
|
|
51
|
+
|
|
52
|
+
- Round 20 documentation-accuracy audit fixed 8 defects across README, CHANGELOG, and `docs/`.
|
|
53
|
+
|
|
54
|
+
## [0.7.5] — Ambient context status + perf hardening + error taxonomy (2026-06-15)
|
|
55
|
+
|
|
56
|
+
Three workstreams from the Round 11 API-gap and Round 15 perf/error audits: a new `context`-event feature, three performance fixes, and a full error-taxonomy expansion.
|
|
57
|
+
|
|
58
|
+
### Features
|
|
59
|
+
|
|
60
|
+
- **Ambient crew-status injection (GAP-2)** — registers Pi's `context` event handler so the parent agent stays continuously aware of in-flight crew runs on every LLM call, without calling the `team` tool. Injects a compact status note (runId/team/status/goal, capped at 3 inline) before the last message. **Transient and safe**: Pi uses the result only for that call (`agent-loop.ts:283-289`) — it never mutates persistent `state.messages`, so there's no accumulation or history corruption. No-op when zero runs are active. Toggle: `reliability.ambientStatusInjection`.
|
|
61
|
+
|
|
62
|
+
### Performance (Round 15 audit)
|
|
63
|
+
|
|
64
|
+
- **P1 (CRITICAL): throttle `persistSingleTaskUpdate` in `onJsonEvent`** — previously every child JSON event did a full locked read-parse-write of `tasks.json`; a 200-event task produced 200 such cycles. Now throttled to 500ms (in-memory progress stays fresh every event; final state force-flushed on completion).
|
|
65
|
+
- **P4: `buildWorkspaceTree` TTL cache (30s)** — workers in a run share a cwd, so the recursive walk was repeated once per task.
|
|
66
|
+
- **P5: `readKnowledge` mtime+size cache** — fired on every agent start (main + every worker), re-reading the same file N×/run.
|
|
67
|
+
|
|
68
|
+
### Error experience (Round 15 audit)
|
|
69
|
+
|
|
70
|
+
- **E1: extended CrewError taxonomy E007–E012** — the taxonomy previously covered only file I/O and discovery. The most common *runtime* failures (child timeout, model exhaustion, pre-step failure, event-log lock timeout, depth limit, stale run) now throw structured `CrewError`s with a machine-readable code, a default actionable help hint, and context. Wired into all six throw sites (`task-runner.ts`, `event-log.ts`, `pipeline-runner.ts`, `stale-reconciler.ts`).
|
|
71
|
+
- **E2: model fallback exhaustion surfaces the full chain tried** ("All N candidates exhausted (tried: a → b → c). Last failure: …") instead of only the last attempt's raw error.
|
|
72
|
+
- **E3: stale-reconcile error explains the heartbeat mechanism + remediation** instead of the bare "Stale run reconciled: <reason>".
|
|
73
|
+
|
|
74
|
+
### Tests
|
|
75
|
+
|
|
76
|
+
- +20 tests (context-status-injection: 11, errors E007–E012: 9). 4800+ pass / 0 fail.
|
|
77
|
+
|
|
78
|
+
### Research
|
|
79
|
+
|
|
80
|
+
This release was driven by the Round 11 Pi-API gap audit and the Round 15 performance/cost + error-experience audit, documented in `research-findings/`.
|
|
81
|
+
|
|
3
82
|
## [0.7.4] — Editor autocomplete + settings shortcut (2026-06-15)
|
|
4
83
|
|
|
5
84
|
Round 13 UX quick wins round-out: the remaining two Pi extension API integrations plus a hard-won CI reliability fix after the state-store test flake re-emerged on Windows and macOS.
|
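The BUG C entry in the changelog above describes reconciling one watcher per active run on each tick. A minimal sketch of that reconcile step follows. It is not the package's `RunWatcherRegistry`: the class name `ActiveRunWatchers`, the `Closable` interface, and the injected `openWatcher` factory are hypothetical, chosen so the logic can be exercised without touching the file system.

```typescript
// Hypothetical sketch; names are illustrative and not taken from pi-crew.
type RunStatus = "planning" | "queued" | "running" | "completed" | "failed" | "cancelled";

interface Closable {
  close(): void;
}

// Statuses the changelog lists as "active".
const ACTIVE_STATUSES: ReadonlySet<RunStatus> = new Set<RunStatus>(["planning", "queued", "running"]);

class ActiveRunWatchers {
  private readonly watchers = new Map<string, Closable>();
  private readonly openWatcher: (runId: string) => Closable;

  constructor(openWatcher: (runId: string) => Closable) {
    this.openWatcher = openWatcher;
  }

  /** Bring the set of open watchers in line with the latest run statuses. */
  reconcile(runs: ReadonlyMap<string, RunStatus>): void {
    // Close watchers whose run disappeared or is no longer active.
    for (const [runId, watcher] of this.watchers) {
      const status = runs.get(runId);
      if (status === undefined || !ACTIVE_STATUSES.has(status)) {
        watcher.close();
        this.watchers.delete(runId);
      }
    }
    // Open exactly one watcher for each newly active run.
    for (const [runId, status] of runs) {
      if (ACTIVE_STATUSES.has(status) && !this.watchers.has(runId)) {
        this.watchers.set(runId, this.openWatcher(runId));
      }
    }
  }

  /** Number of watchers currently open; bounded by the number of active runs. */
  get size(): number {
    return this.watchers.size;
  }
}
```

In a real process the factory could wrap a non-recursive `fs.watch(runDir, listener)` call, whose returned `FSWatcher` already has a `close()` method. Keeping the factory injectable is a design choice of this sketch only.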
package/README.md
CHANGED
|
@@ -231,20 +231,19 @@ If preconditions are not met, a friendly error message is returned instead of cr
|
|
|
231
231
|
|
|
232
232
|
| Scope | Path |
|
|
233
233
|
|-------|------|
|
|
234
|
-
| User | `~/.pi/agent/
|
|
235
|
-
|
|
|
236
|
-
| Project (
|
|
234
|
+
| User (primary) | `~/.pi/agent/pi-crew.json` |
|
|
235
|
+
| User (legacy, still read for migration) | `~/.pi/agent/extensions/pi-crew/config.json` |
|
|
236
|
+
| Project (crewRoot) | `.crew/config.json` (or `.pi/teams/config.json` legacy) |
|
|
237
|
+
| Project (alt) | `.pi/pi-crew.json` |
|
|
237
238
|
|
|
238
239
|
### Quick Config
|
|
239
240
|
|
|
240
241
|
```text
|
|
241
242
|
/team-config # view all settings
|
|
242
|
-
/team-config
|
|
243
|
-
/team-config
|
|
244
|
-
/team-config
|
|
245
|
-
/team-
|
|
246
|
-
/team-config --project # project scope
|
|
247
|
-
/team-settings path # show config file path
|
|
243
|
+
/team-config runtime.mode=scaffold # set a key (--project for project scope)
|
|
244
|
+
/team-config --unset=runtime.mode # reset a key to default
|
|
245
|
+
/team-config --project runtime.mode # project-scoped view
|
|
246
|
+
/team-settings path # show config file path
|
|
248
247
|
```
|
|
249
248
|
|
|
250
249
|
### Key Settings
|
|
@@ -262,8 +261,8 @@ If preconditions are not met, a friendly error message is returned instead of cr
|
|
|
262
261
|
| **UI** | `widgetPlacement`, `dashboardPlacement` | compact widget |
|
|
263
262
|
| | `showModel`, `showTokens` | display controls |
|
|
264
263
|
| **Reliability** | `autoRetry`, `autoRecover`, `deadletterThreshold` | opt-in |
|
|
265
|
-
| **Observability** | `
|
|
266
|
-
| **Worktree** | `worktree.
|
|
264
|
+
| **Observability** | `observability.enabled`, `observability.pollIntervalMs`, `otlp.enabled`/`otlp.endpoint` | opt-in |
|
|
265
|
+
| **Worktree** | `worktree.setupHook`, `worktree.linkNodeModules`, `worktree.seedPaths` (mode is set via `workspaceMode: "worktree"` at run time) | disabled by default |
|
|
267
266
|
|
|
268
267
|
> ⚠️ **Trust boundary**: project config cannot override sensitive execution controls (workers, runtime mode, autonomy, agent overrides). Set those in **user config** only.
|
|
269
268
|
|
|
@@ -515,6 +514,7 @@ Stats: **366 source files** (70K lines) · **506 test files** (66K lines) · **4
|
|
|
515
514
|
| [docs/commands-reference.md](docs/commands-reference.md) | Slash commands + `/team-api` |
|
|
516
515
|
| [docs/resource-formats.md](docs/resource-formats.md) | Agent/team/workflow file formats |
|
|
517
516
|
| [docs/usage.md](docs/usage.md) | Usage patterns + config examples |
|
|
517
|
+
| [docs/troubleshooting.md](docs/troubleshooting.md) | Common errors, recovery, and error-code reference (E001–E012) |
|
|
518
518
|
| [docs/architecture.md](docs/architecture.md) | Internal architecture + run flow |
|
|
519
519
|
| [docs/runtime-flow.md](docs/runtime-flow.md) | Runtime execution details |
|
|
520
520
|
| [docs/live-mailbox-runtime.md](docs/live-mailbox-runtime.md) | Mailbox + live-session runtime |
|
package/docs/commands-reference.md
CHANGED
|
@@ -188,13 +188,13 @@ Keep the 20 most recent runs, delete the rest.
|
|
|
188
188
|
| `runtime.groupJoinAckTimeoutMs` | number | `300000` | Group join ack timeout (ms) |
|
|
189
189
|
| `runtime.requirePlanApproval` | boolean | `false` | Require plan approval before execution |
|
|
190
190
|
| `runtime.completionMutationGuard` | string | `"warn"` | `off`, `warn`, `fail` |
|
|
191
|
-
| `limits.maxConcurrentWorkers` | number |
|
|
192
|
-
| `limits.maxTaskDepth` | number | `
|
|
193
|
-
| `limits.maxChildrenPerTask` | number |
|
|
194
|
-
| `limits.maxRunMinutes` | number | `
|
|
195
|
-
| `limits.maxRetriesPerTask` | number | `
|
|
196
|
-
| `limits.maxTasksPerRun` | number |
|
|
197
|
-
| `limits.heartbeatStaleMs` | number | `
|
|
191
|
+
| `limits.maxConcurrentWorkers` | number | `1024` | Max workers running in parallel |
|
|
192
|
+
| `limits.maxTaskDepth` | number | `100` | Max task tree depth |
|
|
193
|
+
| `limits.maxChildrenPerTask` | number | — | Max children per task |
|
|
194
|
+
| `limits.maxRunMinutes` | number | `1440` | Max run duration (minutes) |
|
|
195
|
+
| `limits.maxRetriesPerTask` | number | `100` | Max retries per task |
|
|
196
|
+
| `limits.maxTasksPerRun` | number | `10000` | Max tasks per run |
|
|
197
|
+
| `limits.heartbeatStaleMs` | number | `86400000` | Heartbeat stale threshold |
|
|
198
198
|
| `control.enabled` | boolean | — | Enable agent control-plane |
|
|
199
199
|
| `control.needsAttentionAfterMs` | number | — | Attention timeout |
|
|
200
200
|
| `autonomous.profile` | string | `"suggested"` | `manual`, `suggested`, `assisted`, `aggressive` |
|
|
@@ -205,9 +205,13 @@ Keep the 20 most recent runs, delete the rest.
|
|
|
205
205
|
| `tools.enableSteer` | boolean | `true` | Enable steer tool |
|
|
206
206
|
| `tools.terminateOnForeground` | boolean | `false` | Return terminate from the foreground Agent |
|
|
207
207
|
| `agents.disableBuiltins` | boolean | `false` | Disable builtin agents |
|
|
208
|
-
| `observability.
|
|
209
|
-
| `observability.
|
|
210
|
-
| `
|
|
208
|
+
| `observability.enabled` | boolean | `false` | Enable metrics collection |
|
|
209
|
+
| `observability.pollIntervalMs` | number | — | Metrics poll interval |
|
|
210
|
+
| `otlp.enabled` | boolean | `false` | Enable OTLP exporter |
|
|
211
|
+
| `otlp.endpoint` | string | — | OTLP endpoint URL |
|
|
212
|
+
| `worktree.setupHook` | string | — | Worktree setup hook command |
|
|
213
|
+
| `worktree.linkNodeModules` | boolean | — | Symlink node_modules into worktree |
|
|
214
|
+
| `worktree.seedPaths` | array | — | Extra paths to seed into worktree |
|
|
211
215
|
|
|
212
216
|
---
|
|
213
217
|
|
package/docs/troubleshooting.md
ADDED
|
@@ -0,0 +1,131 @@
|
|
|
1
|
+
# Troubleshooting
|
|
2
|
+
|
|
3
|
+
Common problems and their fixes. If you hit an error code (E001–E012), see the
|
|
4
|
+
[Error codes](#error-codes) table below.
|
|
5
|
+
|
|
6
|
+
## Quick health check
|
|
7
|
+
|
|
8
|
+
```text
|
|
9
|
+
team action='doctor'
|
|
10
|
+
team action='health'
|
|
11
|
+
```
|
|
12
|
+
|
|
13
|
+
`doctor` validates your config, runtime, and worker setup. `health` shows live
|
|
14
|
+
run/process status. Start here.
|
|
15
|
+
|
|
16
|
+
## Runs won't start / workers are "blocked"
|
|
17
|
+
|
|
18
|
+
**Symptom:** `team action='run'` returns a `blocked` status with a message like
|
|
19
|
+
"Child worker execution is disabled".
|
|
20
|
+
|
|
21
|
+
**Cause:** Worker execution is off. pi-crew refuses to create no-op scaffold
|
|
22
|
+
subagents by default until you opt in.
|
|
23
|
+
|
|
24
|
+
**Fix — pick one:**
|
|
25
|
+
- Set in your config (`~/.pi/agent/pi-crew.json`): `"executeWorkers": true`
|
|
26
|
+
- Or set the env var: `PI_CREW_EXECUTE_WORKERS=1`
|
|
27
|
+
- Or pass at run time: `team action='run' config={runtime:{mode:'live-session'}}`
|
|
28
|
+
|
|
29
|
+
The blocked-run message lists the exact config + env vars in play — read it.
|
|
30
|
+
|
|
31
|
+
## "Run not found" / I lost my run ID
|
|
32
|
+
|
|
33
|
+
Run IDs are long (`team_20260615180014_a1b2c3d4e5f60718`). To recover:
|
|
34
|
+
|
|
35
|
+
```text
|
|
36
|
+
team action='list' # recent runs + IDs
|
|
37
|
+
team action='status' # status of in-flight runs in this project
|
|
38
|
+
team action='artifacts' runId=… # if you only have a partial ID
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
Every "Run not found" error now appends a `Tip: run action='list'` hint.
|
|
42
|
+
|
|
43
|
+
## "Unknown action: X"
|
|
44
|
+
|
|
45
|
+
You typo'd an action. The error now suggests the closest match
|
|
46
|
+
(`Did you mean 'status'?`). To see all valid actions:
|
|
47
|
+
|
|
48
|
+
```text
|
|
49
|
+
team action='help'
|
|
50
|
+
team action='list' resource='workflow'
|
|
51
|
+
```
|
|
52
|
+
|
|
53
|
+
## Worktree runs fail ("not a git repo" / "tree is dirty")
|
|
54
|
+
|
|
55
|
+
Worktree mode (`workspaceMode: 'worktree'`) requires:
|
|
56
|
+
1. The target directory is a **git repository** (`git rev-parse` succeeds).
|
|
57
|
+
2. The working tree is **clean** (no uncommitted changes) unless you pass
|
|
58
|
+
`force: true`.
|
|
59
|
+
|
|
60
|
+
If you don't need isolation, use single mode instead:
|
|
61
|
+
`team action='run' workspaceMode='single'`.
|
|
62
|
+
|
|
63
|
+
## Stale async process / run stuck in "running"
|
|
64
|
+
|
|
65
|
+
A background run whose process died (crash, Ctrl+C, reboot) can appear stuck
|
|
66
|
+
in `running`. The stale-reconciler eventually marks it `failed`, but you can
|
|
67
|
+
force recovery:
|
|
68
|
+
|
|
69
|
+
```text
|
|
70
|
+
team action='status' runId=… # check the async liveness line
|
|
71
|
+
team action='cleanup' runId=… # repair stuck state
|
|
72
|
+
team action='cancel' runId=… # cancel a truly-dead run
|
|
73
|
+
```
|
|
74
|
+
|
|
75
|
+
The error message explains the heartbeat mechanism + remediation.
|
|
76
|
+
|
|
77
|
+
## Model fallback exhausted
|
|
78
|
+
|
|
79
|
+
**Symptom:** `All N candidates exhausted (tried: a → b → c)`.
|
|
80
|
+
|
|
81
|
+
**Cause:** Every model in your fallback chain failed (rate limit, auth, quota).
|
|
82
|
+
|
|
83
|
+
**Fix:** Check your provider config / API keys. The error now lists the full
|
|
84
|
+
chain tried and the last failure reason.
|
|
85
|
+
|
|
86
|
+
## Config is malformed / ignored
|
|
87
|
+
|
|
88
|
+
If your `pi-crew.json` has a syntax or type error, `team action='run'` emits a
|
|
89
|
+
`config.warning` event (visible via `team action='events'`) and proceeds with
|
|
90
|
+
defaults — it does **not** hard-fail. To validate explicitly:
|
|
91
|
+
|
|
92
|
+
```text
|
|
93
|
+
team action='config' # show loaded config + any warnings
|
|
94
|
+
team action='doctor' # full validation
|
|
95
|
+
```
|
|
96
|
+
|
|
97
|
+
## Compact vs full status
|
|
98
|
+
|
|
99
|
+
`team action='status'` defaults to full output (~40 lines). For a quick check:
|
|
100
|
+
|
|
101
|
+
```text
|
|
102
|
+
team action='status' details=false # compact: status, progress, goal, issues only
|
|
103
|
+
```
|
|
104
|
+
|
|
105
|
+
## Error codes
|
|
106
|
+
|
|
107
|
+
pi-crew uses a structured error taxonomy (E001–E012). Each error renders its
|
|
108
|
+
code + a help hint inline. Common ones:
|
|
109
|
+
|
|
110
|
+
| Code | Name | Meaning | First check |
|
|
111
|
+
|------|------|---------|-------------|
|
|
112
|
+
| E001 | FileReadError | a required file couldn't be read | check the file exists + read perms; may need `cleanup` |
|
|
113
|
+
| E002 | FileWriteError | an atomic write failed | check disk space + dir write perms |
|
|
114
|
+
| E003 | TaskNotFound | a referenced task id doesn't exist | `team status` to verify the run's tasks |
|
|
115
|
+
| E004 | InvalidStatusTransition | illegal run/task status change | verify status via `team status` before retrying |
|
|
116
|
+
| E005 | ConfigError | config has a syntax/type error | `team config` shows the offending field |
|
|
117
|
+
| E006 | ResourceNotFound | agent/team/workflow not found | `team list` to see available resources |
|
|
118
|
+
| E007 | ChildTimeout | a worker Pi didn't finish in time | raise `runtime.responseTimeoutMs` or simplify the task |
|
|
119
|
+
| E008 | ModelExhausted | all fallback models failed | see "Model fallback exhausted" above |
|
|
120
|
+
| E009 | PreStepFailed | a workflow pre-step hook failed | check the hook stderr in events; set `preStepOptional: true` on the step to make it advisory (non-fatal) |
|
|
121
|
+
| E010 | EventLogLockTimeout | event log locked under contention | transient; retry, or lower concurrency |
|
|
122
|
+
| E011 | DepthLimitExceeded | crew nesting too deep | raise `crew.maxDepth` or flatten the call |
|
|
123
|
+
| E012 | RunStale | run reconciled as stale | see "Stale async process" above |
|
|
124
|
+
|
|
125
|
+
## Still stuck
|
|
126
|
+
|
|
127
|
+
- `team action='explain' runId=…` — structured per-task analysis (why, files,
|
|
128
|
+
complexity).
|
|
129
|
+
- `team action='summary' runId=…` — includes common failure-pattern detection
|
|
130
|
+
("4 of 5 failures share 2 root causes").
|
|
131
|
+
- `team action='events' runId=…` — full event timeline for forensics.
|
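The "Unknown action" section of the troubleshooting guide above mentions a closest-match suggestion. One common way to produce such a hint is Levenshtein edit distance with a small threshold, sketched below. This is an assumption about the general technique, not the package's implementation: `editDistance`, `suggestClosestAction`, the threshold of 2, and the 64-character input cap are all hypothetical.

```typescript
// Hypothetical sketch of a "Did you mean?" helper; not pi-crew's code.

/** Classic dynamic-programming Levenshtein distance, keeping two rows. */
function editDistance(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost, // substitution or match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed cap so pathological input cannot make the comparison expensive.
const MAX_INPUT_LENGTH = 64;

/** Returns the closest known action within maxDistance edits, or undefined. */
function suggestClosestAction(
  input: string,
  knownActions: readonly string[],
  maxDistance = 2,
): string | undefined {
  const needle = input.slice(0, MAX_INPUT_LENGTH).toLowerCase();
  let best: string | undefined;
  let bestDistance = maxDistance + 1;
  for (const candidate of knownActions) {
    const distance = editDistance(needle, candidate);
    if (distance < bestDistance) {
      best = candidate;
      bestDistance = distance;
    }
  }
  return best;
}

const actions = ["run", "status", "summary", "list", "cancel"];
console.log(suggestClosestAction("stat", actions)); // "status"
console.log(suggestClosestAction("summery", actions)); // "summary"
```

Returning `undefined` when nothing is close enough lets the caller fall back to the plain "Unknown action" message rather than offering a misleading suggestion.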
package/docs/usage.md
CHANGED
|
@@ -5,9 +5,14 @@
|
|
|
5
5
|
Optional config path:
|
|
6
6
|
|
|
7
7
|
```text
|
|
8
|
-
~/.pi/agent/
|
|
8
|
+
~/.pi/agent/pi-crew.json
|
|
9
9
|
```
|
|
10
10
|
|
|
11
|
+
A **legacy** path `~/.pi/agent/extensions/pi-crew/config.json` is also read for
|
|
12
|
+
backward-compatibility migration — values there are merged but the new path
|
|
13
|
+
above is preferred. A project-local config may also live at `.pi/pi-crew.json`
|
|
14
|
+
in your repo root (project values are merged under the user config).
|
|
15
|
+
|
|
11
16
|
Create a default config:
|
|
12
17
|
|
|
13
18
|
```bash
|
|
@@ -58,9 +63,9 @@ Supported fields:
|
|
|
58
63
|
"deadletterThreshold": 3,
|
|
59
64
|
"retryPolicy": {
|
|
60
65
|
"maxAttempts": 3,
|
|
61
|
-
"
|
|
62
|
-
"
|
|
63
|
-
"
|
|
66
|
+
"backoffMs": 1000,
|
|
67
|
+
"jitterRatio": 0.3,
|
|
68
|
+
"exponentialFactor": 2
|
|
64
69
|
}
|
|
65
70
|
}
|
|
66
71
|
}
|
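The `retryPolicy` fields shown in the usage.md hunk above (`backoffMs`, `jitterRatio`, `exponentialFactor`) follow the usual vocabulary of exponential backoff with jitter. The sketch below shows one conventional way such fields can combine. How pi-crew actually computes its delays is not shown in this diff, so the formula, the `RetryPolicy` interface, and `retryDelayMs` are assumptions for illustration only.

```typescript
// Hypothetical sketch; the real delay formula in pi-crew may differ.
interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  jitterRatio: number;
  exponentialFactor: number;
}

/**
 * Delay before retry number `attempt` (1-based).
 * Assumed formula: base = backoffMs * exponentialFactor^(attempt - 1),
 * then shifted by a symmetric jitter of up to +/- jitterRatio * base.
 */
function retryDelayMs(
  policy: RetryPolicy,
  attempt: number,
  random: () => number = Math.random,
): number {
  const base = policy.backoffMs * Math.pow(policy.exponentialFactor, attempt - 1);
  const jitter = base * policy.jitterRatio * (random() * 2 - 1);
  return Math.max(0, Math.round(base + jitter));
}

// Values copied from the usage.md example.
const examplePolicy: RetryPolicy = {
  maxAttempts: 3,
  backoffMs: 1000,
  jitterRatio: 0.3,
  exponentialFactor: 2,
};

// With the jitter pinned to its midpoint the delays are 1000, 2000, 4000 ms.
for (let attempt = 1; attempt <= examplePolicy.maxAttempts; attempt++) {
  console.log(attempt, retryDelayMs(examplePolicy, attempt, () => 0.5));
}
```

Passing the random source as a parameter keeps the function deterministic under test; under this assumed formula the first retry would land between 700 ms and 1300 ms.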
package/package.json
CHANGED
package/src/config/config.ts
CHANGED
|
@@ -494,8 +494,14 @@ function mergeConfig(
 delete merged.otlp.headers;
 // Validate OTLP headers for injection attacks:
 // - Check top-level keys for dangerous prototype pollution patterns
-// - Block
-//
+// - Block ALL control characters except tab (0x09) to prevent header
+//   injection via CR/LF/zero-byte/etc.
+// BUG (Round 28, CRLF injection): the previous range
+// /[\x00-\x08\x0b\x0c\x0e-\x1f]/ left THREE chars unblocked: tab (0x09,
+// intentionally allowed), LF (0x0A) AND CR (0x0D). The comment claimed to
+// "prevent header injection via CR/LF" but CR was never matched, and LF
+// was explicitly allowed; both are CRLF injection vectors that can split
+// HTTP headers. Fix: block 0x00-0x08 and 0x0A-0x1F, allowing only tab.
 const invalidHeaders: string[] = [];
 for (const [k, v] of Object.entries(merged.otlp.headers ?? {})) {
     // Check top-level key for dangerous names (only top-level keys are checked)
@@ -505,9 +511,10 @@ function mergeConfig(
         return false;
     };
     if (checkKey(k)) { invalidHeaders.push(k); continue; }
-    // Block any control characters except tab (0x09) in values
+    // Block any control characters except tab (0x09) in values.
+    // Round 28 fix: /[\x00-\x08\x0a-\x1f]/ blocks LF (0x0A) and CR (0x0D) too.
     const valStr = String(v);
-    if (/[\x00-\x08\
+    if (/[\x00-\x08\x0a-\x1f]/.test(valStr)) { invalidHeaders.push(k); }
 }
 if (invalidHeaders.length > 0) {
     delete merged.otlp.headers;
|
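The effect of the corrected character class in the config.ts hunk above can be checked in isolation. The sketch below reuses the exact regex from the added line; the helper name `findInvalidHeaderValues` and the sample headers are hypothetical and exist only to show which values are rejected.

```typescript
// Same character class as the added line in config.ts: every control
// character from 0x00 to 0x1F except tab (0x09).
const FORBIDDEN_HEADER_CHARS = /[\x00-\x08\x0a-\x1f]/;

/** Hypothetical helper: names of headers whose values contain a forbidden character. */
function findInvalidHeaderValues(headers: Record<string, unknown>): string[] {
  const invalid: string[] = [];
  for (const [name, value] of Object.entries(headers)) {
    if (FORBIDDEN_HEADER_CHARS.test(String(value))) {
      invalid.push(name);
    }
  }
  return invalid;
}

const sampleHeaders = {
  authorization: "Bearer abc", // plain value, accepted
  "x-tab": "a\tb", // tab is deliberately allowed
  "x-split": "ok\r\nX-Injected: 1", // CRLF header-splitting attempt, rejected
  "x-lf": "line1\nline2", // bare LF, rejected
};

console.log(findInvalidHeaderValues(sampleHeaders)); // [ 'x-split', 'x-lf' ]
```

Note that in the hunk a single invalid header causes the whole `otlp.headers` object to be deleted, whereas this sketch only reports the offending names.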
package/src/config/types.ts
CHANGED
|
@@ -178,6 +178,8 @@ export interface CrewReliabilityConfig {
     autoRepairIntervalMs?: number;
     /** Remove /tmp/pi-crew-* directories after their orphaned runs are reconciled. Default: true. */
     cleanupOrphanedTempDirs?: boolean;
+    /** Inject a compact ambient crew-status note into the agent's context on every LLM call while crew runs are in-flight, so the agent stays continuously aware of active runs without calling the `team` tool. No-op when no runs are active. Default: true. */
+    ambientStatusInjection?: boolean;
 }
 
 export interface CrewOtlpConfig {
|
package/src/errors.ts
CHANGED
|
@@ -30,6 +30,14 @@ export const ErrorCode = {
|
|
|
30
30
|
InvalidStatusTransition: "E004", // Run/task status cannot legally transition
|
|
31
31
|
ConfigError: "E005", // Malformed config or missing required field
|
|
32
32
|
ResourceNotFound: "E006", // Agent/team/workflow not found in discovery paths
|
|
33
|
+
// E1 (Round 15): runtime failure categories that previously threw raw Error
|
|
34
|
+
// with no code, no help hint, and no context. Surfaces actionable guidance.
|
|
35
|
+
ChildTimeout: "E007", // Child Pi worker became unresponsive and was killed
|
|
36
|
+
ModelExhausted: "E008", // All model candidates in the fallback chain failed
|
|
37
|
+
PreStepFailed: "E009", // A pre-step hook script returned a non-zero exit
|
|
38
|
+
EventLogLockTimeout: "E010", // Could not acquire the event-log file lock
|
|
39
|
+
DepthLimitExceeded: "E011", // Pipeline/chain recursion depth limit hit (circular dep)
|
|
40
|
+
RunStale: "E012", // Run reconciled as stale/zombie (heartbeat expired)
|
|
33
41
|
} as const;
|
|
34
42
|
|
|
35
43
|
export type ErrorCode = typeof ErrorCode[keyof typeof ErrorCode];
|
|
@@ -41,6 +49,13 @@ const DEFAULT_HELP: Record<ErrorCode, string | undefined> = {
|
|
|
41
49
|
[ErrorCode.InvalidStatusTransition]: "Verify the run status using `team status` before retrying.",
|
|
42
50
|
[ErrorCode.ConfigError]: "Check the configuration file for syntax errors or missing required fields.",
|
|
43
51
|
[ErrorCode.ResourceNotFound]: "Use `team list` to see available agents, teams, and workflows.",
|
|
52
|
+
// E1 (Round 15): help hints for the new runtime categories.
|
|
53
|
+
[ErrorCode.ChildTimeout]: "The child Pi worker produced no output for too long and was terminated. Re-run the team; if it recurs, raise the response timeout in config or reduce the task scope.",
|
|
54
|
+
[ErrorCode.ModelExhausted]: "Every model in the fallback chain failed. Check your API key/quota and the per-attempt errors, then retry or swap the model in config.",
|
|
55
|
+
[ErrorCode.PreStepFailed]: "The pre-step hook script exited non-zero. Inspect its stderr, or mark it optional in the workflow step (preStepOptional).",
|
|
56
|
+
[ErrorCode.EventLogLockTimeout]: "Another process holds the event-log lock. Check for orphaned `.lock` files or stale pi-crew processes, then retry.",
|
|
57
|
+
[ErrorCode.DepthLimitExceeded]: "A pipeline/chain exceeded the recursion depth limit, which usually indicates a circular stage dependency. Review step `dependsOn` chains.",
|
|
58
|
+
[ErrorCode.RunStale]: "The worker stopped heartbeating and was treated as a zombie. Re-run the team (resume or fresh); if it recurs, check `runtime.executeWorkers` / system load.",
|
|
44
59
|
};
|
|
45
60
|
|
|
46
61
|
/**
|
|
@@ -122,4 +137,55 @@ export const errors = {
|
|
|
122
137
|
`${type} '${name}' not found in any discovery path`,
|
|
123
138
|
);
|
|
124
139
|
},
|
|
140
|
+
|
|
141
|
+
// E1 (Round 15): runtime failure constructors. These wrap the raw-throw
|
|
142
|
+
// sites identified in the Round 15 error-experience audit so failures carry
|
|
143
|
+
// a machine-readable code, a help hint, and structured context.
|
|
144
|
+
childTimeout(detail: { timeoutMs?: number; taskId?: string; stderr?: string }): CrewError {
|
|
145
|
+
const tail = detail.stderr ? ` Stderr tail: ${detail.stderr.slice(-400)}` : "";
|
|
146
|
+
const dur = detail.timeoutMs ? ` after ${detail.timeoutMs}ms of no output` : "";
|
|
147
|
+
return new CrewError(
|
|
148
|
+
ErrorCode.ChildTimeout,
|
|
149
|
+
`Child Pi worker became unresponsive${dur} and was terminated.${tail}`,
|
|
150
|
+
).withContext(`worker execution${detail.taskId ? ` (task ${detail.taskId})` : ""}`);
|
|
151
|
+
},
|
|
152
|
+
|
|
153
|
+
modelExhausted(chain: string[], lastFailure?: string): CrewError {
|
|
154
|
+
const tried = chain.join(" → ");
|
|
155
|
+
const last = lastFailure ? ` Last failure: ${lastFailure}` : "";
|
|
156
|
+
return new CrewError(
|
|
157
|
+
ErrorCode.ModelExhausted,
|
|
158
|
+
`All ${chain.length} model candidates exhausted (tried: ${tried}).${last}`,
|
|
159
|
+
).withContext("model fallback chain");
|
|
160
|
+
},
|
|
161
|
+
|
|
162
|
+
preStepFailed(script: string, exitCode: number | undefined, stderr?: string): CrewError {
|
|
163
|
+
const tail = stderr ? ` Stderr: ${stderr.slice(-400)}` : "";
|
|
164
|
+
return new CrewError(
|
|
165
|
+
ErrorCode.PreStepFailed,
|
|
166
|
+
`preStepScript '${script}' exited ${exitCode ?? "non-zero"}.${tail}`,
|
|
167
|
+
).withContext("pre-step hook execution");
|
|
168
|
+
},
|
|
169
|
+
|
|
170
|
+
eventLogLockTimeout(eventsPath: string, timeoutMs: number): CrewError {
|
|
171
|
+
return new CrewError(
|
|
172
|
+
ErrorCode.EventLogLockTimeout,
|
|
173
|
+
`Event log lock timeout for ${eventsPath}: could not acquire lock within ${timeoutMs}ms`,
|
|
174
|
+
).withContext("event-log append");
|
|
175
|
+
},
|
|
176
|
+
|
|
177
|
+
depthLimitExceeded(depth: number, kind = "pipeline"): CrewError {
|
|
178
|
+
return new CrewError(
|
|
179
|
+
ErrorCode.DepthLimitExceeded,
|
|
180
|
+
`${kind.charAt(0).toUpperCase() + kind.slice(1)} recursion depth limit exceeded (${depth}). Possible circular dependency.`,
|
|
181
|
+
).withContext(`${kind} execution`);
|
|
182
|
+
},
|
|
183
|
+
|
|
184
|
+
runStale(reason: string, heartbeatAgeSeconds?: number): CrewError {
|
|
185
|
+
const age = heartbeatAgeSeconds !== undefined ? ` Last heartbeat was ${heartbeatAgeSeconds}s ago.` : "";
|
|
186
|
+
return new CrewError(
|
|
187
|
+
ErrorCode.RunStale,
|
|
188
|
+
`Stale run reconciled (reason=${reason}).${age} The worker stopped heartbeating and was treated as dead/zombie.`,
|
|
189
|
+
).withContext("stale-run reconciliation");
|
|
190
|
+
},
|
|
125
191
|
} as const;
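The constructors above assemble their messages from optional parts. The following is a minimal sketch of that pattern for `childTimeout`, not the package's actual implementation: `CrewErrorLike`, `ChildTimeoutDetail`, `childTimeoutSketch`, and the `"CHILD_TIMEOUT"` code value are hypothetical stand-ins, since `CrewError` and the `ErrorCode` values are not shown in this hunk.

```typescript
// Hypothetical stand-in for the CrewError shape; the real class is defined
// elsewhere in src/errors.ts and is not part of this hunk.
interface CrewErrorLike {
  code: string;
  message: string;
  context: string;
}

interface ChildTimeoutDetail {
  timeoutMs?: number;
  taskId?: string;
  stderr?: string;
}

// Mirrors the message assembly of errors.childTimeout shown above.
function childTimeoutSketch(detail: ChildTimeoutDetail): CrewErrorLike {
  // Keep only the last 400 characters of stderr so the message stays bounded.
  const tail = detail.stderr ? ` Stderr tail: ${detail.stderr.slice(-400)}` : "";
  // A missing (or zero) timeout simply omits the duration clause.
  const dur = detail.timeoutMs ? ` after ${detail.timeoutMs}ms of no output` : "";
  return {
    code: "CHILD_TIMEOUT", // assumed value; the real ErrorCode member's value is not shown
    message: `Child Pi worker became unresponsive${dur} and was terminated.${tail}`,
    context: `worker execution${detail.taskId ? ` (task ${detail.taskId})` : ""}`,
  };
}

const full = childTimeoutSketch({ timeoutMs: 30000, taskId: "t1", stderr: "boom" });
const bare = childTimeoutSketch({});
console.log(full.message);
console.log(bare.context);
```

Because every optional field degrades to an empty string, a caller can pass whatever detail it has on hand without special-casing missing values.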
|
|
@@ -0,0 +1,71 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* action-suggestions.ts — "Did you mean?" suggestions for team actions (DX: F1).
|
|
3
|
+
*
|
|
4
|
+
* Round 16 DX audit found that a typo'd action (`action: 'stat'`,
|
|
5
|
+
* `action: 'summery'`) hits a dead-end "Unknown action: stat" with no path
|
|
6
|
+
* forward. pi-crew already ships a Levenshtein fuzzy-matcher
|
|
7
|
+
* (`src/config/suggestions.ts → suggestConfigKey`); this module applies it to
|
|
8
|
+
* the known set of team actions.
|
|
9
|
+
*
|
|
10
|
+
* The known-action list mirrors the `action` enum in
|
|
11
|
+
* `src/schema/team-tool-schema.ts`. Kept as a hand-maintained constant (not
|
|
12
|
+
* derived from the TypeBox schema at runtime) so it is trivially testable and
|
|
13
|
+
* avoids pulling the schema into low-level error paths.
|
|
14
|
+
*/
|
|
15
|
+
|
|
16
|
+
import { findClosestKey } from "../config/suggestions.ts";
|
|
17
|
+
|
|
18
|
+
/**
|
|
19
|
+
* The complete set of valid top-level `team` actions (mirrors the action enum
|
|
20
|
+
* in `src/schema/team-tool-schema.ts`). Exported so callers and tests can use
|
|
21
|
+
* this one list; it must be kept in sync with the schema enum by hand.
|
|
22
|
+
*/
|
|
23
|
+
export const KNOWN_TEAM_ACTIONS = [
|
|
24
|
+
"run", "parallel", "plan", "status", "wait", "list", "get",
|
|
25
|
+
"cancel", "retry", "resume", "respond", "create", "update", "delete",
|
|
26
|
+
"doctor", "cleanup", "events", "artifacts", "worktrees", "forget",
|
|
27
|
+
"summary", "prune", "export", "import", "imports", "help", "validate",
|
|
28
|
+
"config", "init", "recommend", "autonomy", "api", "settings", "steer",
|
|
29
|
+
"invalidate", "health", "graph", "onboard", "explain", "cache",
|
|
30
|
+
"checkpoint", "search", "orchestrate", "schedule", "scheduled", "anchor",
|
|
31
|
+
"auto-summarize", "auto_boomerang",
|
|
32
|
+
] as const;
|
|
33
|
+
|
|
34
|
+
/**
|
|
35
|
+
* Suggest the closest known team action for a (likely typo'd) input.
|
|
36
|
+
* Returns `null` when no action is close enough — callers should then omit
|
|
37
|
+
* the "Did you mean …?" hint rather than suggesting a poor match.
|
|
38
|
+
*
|
|
39
|
+
* Uses a tighter edit-distance budget than the generic config-key suggester
|
|
40
|
+
* (2 instead of 3): team actions are short command words, so distance-3
|
|
41
|
+
* matches against a short input (e.g. "" → "run") produce low-quality hints.
|
|
42
|
+
* Empty/whitespace input always returns null.
|
|
43
|
+
*
|
|
44
|
+
* Exported for unit testing.
|
|
45
|
+
*/
|
|
46
|
+
export function suggestAction(input: string): string | null {
|
|
47
|
+
const trimmed = input.trim();
|
|
48
|
+
if (!trimmed) return null;
|
|
49
|
+
// Defense-in-depth (Round 18 security F1): levenshtein is O(n×m). A hostile
|
|
50
|
+
// very-long input would waste cycles. The action is enum-validated upstream
|
|
51
|
+
// so this is unreachable in practice, but cap input length cheaply.
|
|
52
|
+
if (trimmed.length > 64) return null;
|
|
53
|
+
return findClosestKey(trimmed, KNOWN_TEAM_ACTIONS, 2);
|
|
54
|
+
}
|
|
55
|
+
|
|
56
|
+
/**
|
|
57
|
+
* Build a "Did you mean?" suffix for an unknown-action error message.
|
|
58
|
+
* Returns "" when there is no good suggestion (so the caller can just append
|
|
59
|
+
* it unconditionally). Keeps error formatting centralized.
|
|
60
|
+
*
|
|
61
|
+
* Exported for unit testing + use in the dispatch default-case.
|
|
62
|
+
*
|
|
63
|
+
* Example:
|
|
64
|
+
* formatActionSuggestion("stat") // "\n\nDid you mean 'status'? Use action='status'. Run action='help' to see all actions."
|
|
65
|
+
* formatActionSuggestion("xyzzy") // ""
|
|
66
|
+
*/
|
|
67
|
+
export function formatActionSuggestion(input: string): string {
|
|
68
|
+
const suggestion = suggestAction(input);
|
|
69
|
+
if (!suggestion || suggestion === input) return "";
|
|
70
|
+
return `\n\nDid you mean '${suggestion}'? Use action='${suggestion}'. Run action='help' to see all actions.`;
|
|
71
|
+
}
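`findClosestKey` is imported from `src/config/suggestions.ts` and its body is not part of this diff. The sketch below shows one way a bounded edit-distance lookup of this kind could behave, assuming it returns the nearest candidate within the given budget and `null` otherwise. `levenshtein`, `closestWithin`, `suggestActionSketch`, and the shortened `SAMPLE_ACTIONS` list are illustrative names, not the package's API.

```typescript
// Hypothetical subset of the known team actions, for illustration only.
const SAMPLE_ACTIONS = ["run", "status", "summary", "steer", "list"] as const;

// Classic dynamic-programming Levenshtein distance using two rolling rows.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed behavior: nearest candidate within maxDistance, first one wins ties.
function closestWithin(
  input: string,
  candidates: readonly string[],
  maxDistance: number,
): string | null {
  let best: string | null = null;
  let bestDistance = maxDistance + 1;
  for (const candidate of candidates) {
    const distance = levenshtein(input, candidate);
    if (distance < bestDistance) {
      best = candidate;
      bestDistance = distance;
    }
  }
  return best;
}

// Same guards as suggestAction above: trim, reject empty input, cap the length.
function suggestActionSketch(input: string): string | null {
  const trimmed = input.trim();
  if (!trimmed || trimmed.length > 64) return null;
  return closestWithin(trimmed, SAMPLE_ACTIONS, 2);
}

console.log(suggestActionSketch("stat"));
console.log(suggestActionSketch("xyzzy"));
```

With a budget of 2, "stat" is two insertions away from "status" and resolves to it, while "xyzzy" is further than 2 edits from every sample action and yields `null`, so no hint would be appended.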
|