pi-crew 0.7.5 → 0.7.6

Files changed (51)
  1. package/CHANGELOG.md +51 -0
  2. package/README.md +11 -11
  3. package/docs/commands-reference.md +14 -10
  4. package/docs/troubleshooting.md +131 -0
  5. package/docs/usage.md +9 -4
  6. package/package.json +1 -1
  7. package/src/config/config.ts +11 -4
  8. package/src/extension/action-suggestions.ts +71 -0
  9. package/src/extension/context-status-injection.ts +32 -1
  10. package/src/extension/register.ts +71 -65
  11. package/src/extension/team-tool/api.ts +3 -2
  12. package/src/extension/team-tool/cancel.ts +5 -4
  13. package/src/extension/team-tool/explain.ts +2 -1
  14. package/src/extension/team-tool/failure-patterns.ts +124 -0
  15. package/src/extension/team-tool/inspect.ts +10 -6
  16. package/src/extension/team-tool/lifecycle-actions.ts +5 -4
  17. package/src/extension/team-tool/respond.ts +4 -3
  18. package/src/extension/team-tool/run-not-found.ts +54 -0
  19. package/src/extension/team-tool/run.ts +26 -4
  20. package/src/extension/team-tool/status.ts +58 -4
  21. package/src/extension/team-tool.ts +5 -3
  22. package/src/runtime/async-runner.ts +7 -0
  23. package/src/runtime/background-runner.ts +7 -1
  24. package/src/runtime/chain-parser.ts +13 -5
  25. package/src/runtime/checkpoint.ts +13 -1
  26. package/src/runtime/child-pi.ts +9 -1
  27. package/src/runtime/live-session-runtime.ts +15 -1
  28. package/src/runtime/parent-guard.ts +2 -2
  29. package/src/runtime/stale-reconciler.ts +8 -3
  30. package/src/runtime/task-runner.ts +10 -1
  31. package/src/runtime/team-runner.ts +19 -2
  32. package/src/runtime/verification-gates.ts +21 -1
  33. package/src/schema/team-tool-schema.ts +9 -0
  34. package/src/state/blob-store.ts +12 -10
  35. package/src/state/event-log-rotation.ts +114 -93
  36. package/src/state/event-log.ts +79 -20
  37. package/src/state/health-store.ts +6 -1
  38. package/src/state/locks.ts +66 -16
  39. package/src/state/state-store.ts +14 -1
  40. package/src/ui/card-colors.ts +7 -3
  41. package/src/ui/dashboard-panes/agents-pane.ts +15 -2
  42. package/src/ui/live-duration.ts +58 -0
  43. package/src/ui/tool-render.ts +7 -11
  44. package/src/ui/tool-renderers/index.ts +6 -3
  45. package/src/ui/widget/widget-formatters.ts +2 -13
  46. package/src/utils/fs-watch.ts +11 -60
  47. package/src/utils/run-watcher-registry.ts +164 -0
  48. package/src/workflows/discover-workflows.ts +2 -1
  49. package/src/workflows/workflow-config.ts +5 -0
  50. package/src/runtime/dynamic-script-runner.ts +0 -497
  51. package/src/runtime/sandbox.ts +0 -335
package/CHANGELOG.md CHANGED
@@ -1,5 +1,56 @@
  # Changelog
 
+ ## [0.7.6] — DX, observability, and a critical interactive-session hang fix (2026-06-16)
+
+ This release bundles Rounds 16–28: a developer-experience pass, an observability pass, and eight correctness/security audits — culminating in the **fix for the pts/2 interactive-session busy-loop hang** (two separate Pi sessions had hung at 71.5% CPU with 339 inotify watches). All 24 commits passed CI on Windows, Ubuntu, and macOS.
+
+ ### 🚨 Critical — interactive-session hang (Round 28 + pts/2 investigation)
+
+ Report: `/home/bom/pts2-hang-investigation-2026-06-16.md`. Three root causes, all fixed:
+
+ - **BUG C (CRITICAL): recursive watcher busy-loop** — `watchCrewState` used `fs.watch(<crewRoot>/state, {recursive:true})`. On Linux, Node implements "recursive" as ONE inotify watch PER SUBDIRECTORY, so with many historical runs under `.crew/state/runs/` this ballooned to hundreds of watches (109→339 observed) and caused a permanent busy-loop even with no active work. **Fix**: new `src/utils/run-watcher-registry.ts` (`RunWatcherRegistry`) — one non-recursive watcher on the `runs/` root (for new-run detection, since `crew.run.created` is never emitted) + one non-recursive watcher per **active** run, reconciled each preload tick against `running`/`queued`/`planning` status. Total inotify cost is now O(active runs) — typically 1–5 — not O(total history). Completed runs leave the active set and their watcher closes within one tick. The dead `createRecursiveWatcher` / `watchCrewState` / `runIdFromStateRelativePath` primitives were deleted from `fs-watch.ts`.
+ - **BUG A (MEDIUM): health double-join path** — `HEALTH_DIR = ".crew/state/health"` was joined with a `crewRoot` computed only 2 `dirname`s up, writing to `.crew/state/.crew/state/health` — a path **no code ever reads**. It produced a growing ghost subtree that the recursive watcher then walked. **Fix**: `crewRoot` = 3 `dirname`s up; `HEALTH_DIR` = `"state/health"`.
+ - **BUG B (MEDIUM): OTLP CRLF injection** — header-value validation left CR (0x0D) and LF (0x0A) unblocked, enabling header-splitting / log-injection via crafted values. **Fix**: regex now `/[\x00-\x08\x0a-\x1f]/`.
+
+ Cleanup: 246 orphaned health snapshots (~1 MB) across 4 bogus `.crew/state/.crew/state/` subtrees were removed.
+
+ ### Correctness audits (Rounds 22–27)
+
+ - **Round 27 — resource leaks**: (1) orphaned heartbeat timer in the team-runner catch block (`stopTeamHeartbeat()` never called on the error path; non-unref'd 30s interval kept the event loop alive → foreground pi hung); (2) FD leak in background-runner (`fs.openSync` without `closeSync`); (3) pipe FD leak + potential deadlock in async-runner (piped stdout/stderr never drained → >64 KB blocks forever); (4) AbortSignal listener leak in child-pi + live-session-runtime (anonymous `{once:true}` listeners never removed on normal completion).
+ - **Round 26 — cross-process file-locking** (5 bugs): TOCTOU split-read in `acquireLockWithRetry` (single-snapshot read closes the window); racy pre-acquisition target cleanup in `withFileLockSync` (removed); crash-between-mkdir-and-pidFile wedge (mtime-based stale check); PID-recycling wedge (mtime checked first for all holders); non-token-guarded release (PID-guarded removal).
+ - **Round 25 — security**: deleted two vulnerable dead modules — `sandbox.ts` (CRITICAL VM sandbox escape) and `dynamic-script-runner.ts` (HIGH `skip-validateScript`) — totalling −1701 LOC across 2 source + 5 test files. Plus closed verification-gate newline + `$VARNAME` injection (DANGEROUS_SHELL_PATTERNS extended).
+ - **Round 24 — event-log deadlock**: `appendEventInsideLock` (already inside `withEventLogLockSync`) called the public `compactEventLog`/`rotateEventLog` which re-acquired the same non-reentrant mkdir lock → 5 s timeout → compaction never ran → unbounded log growth → events silently dropped past 50 MB. Fix: extracted `prepareCompaction` / `applyCompactionUnlocked` / `rotateEventLogUnlocked` into `event-log-rotation.ts`.
+ - **Round 23 — UI correctness**: negative live duration in `agents-pane.ts` (shared `src/ui/live-duration.ts`); Unicode width/truncation bugs in `card-colors.ts`, `tool-renderers/index.ts`, `tool-render.ts`.
+ - **Round 22 — reliability**: checkpoint `.tmp.checkpoint` was reused across concurrent saves (cross-process data corruption → now unique per save); chain-parser had no recursion-depth limit (now `MAX_CHAIN_NESTING=100`).
+
+ ### Developer experience (Round 16)
+
+ - **F1 "Did you mean?"** suggestions on unknown team actions.
+ - **F2 recovery hint** on all "Run not found" errors.
+ - **F3 compact status mode** (`details=false`) for low-noise polling.
+ - **F4 config errors surfaced** on the run path.
+ - **F5 pipeline dead-end redirect** — unsupported `action=pipeline` now points at a working workflow.
+ - **F6 troubleshooting guide** added at `docs/troubleshooting.md`; usage.md config path fixed.
+
+ ### Observability (Round 17)
+
+ - **Progress % + ETA** in `status`; run age in the ambient context note.
+ - **Per-agent cost** in the dashboard + status output.
+ - **Aggregate failure patterns** in the run summary.
+
+ ### Features
+
+ - **Round 21 (E4): `preStepOptional`** — advisory pre-step hooks that don't fail the run. Opt-in (`preStepOptional: true` on a `WorkflowStep`); fail-fast remains the default.
+ - **Round 18 (defense-in-depth)**: capped `suggestAction` input length.
+
+ ### Tests
+
+ - +60+ tests across Rounds 16–28 (run-watcher-registry: 12, event-log deadlock: 5, injection guards: 6, file/event-log locks: 8, plus UI, DX, observability, and test-isolation coverage). 4955 pass / 0 fail. Test health pass restored the false-confidence security suite.
+
+ ### Documentation
+
+ - Round 20 documentation-accuracy audit fixed 8 defects across README, CHANGELOG, and `docs/`.
+
  ## [0.7.5] — Ambient context status + perf hardening + error taxonomy (2026-06-15)
 
  Three workstreams from the Round 11 API-gap and Round 15 perf/error audits: a new `context`-event feature, three performance fixes, and a full error-taxonomy expansion.
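The Round 24 entry describes a common failure shape: code that already holds a non-reentrant lock calls a public helper that tries to take the same lock again. The sketch below reproduces that shape in memory, purely as an illustration. `NonReentrantLock`, `appendBuggy`, `appendFixed`, `compact`, and `compactUnlocked` are hypothetical names, and the lock throws immediately where the mkdir lock described in the changelog would wait out its timeout.

```typescript
// Hypothetical in-memory stand-in for a non-reentrant lock. The changelog
// describes an mkdir-based lock that times out; this one throws at once so
// the failure is easy to observe.
class NonReentrantLock {
  private held = false;

  runExclusive<T>(fn: () => T): T {
    if (this.held) {
      throw new Error("lock already held by this caller");
    }
    this.held = true;
    try {
      return fn();
    } finally {
      this.held = false;
    }
  }
}

const lock = new NonReentrantLock();
const MAX_ENTRIES = 3; // illustrative cap, not pi-crew's real threshold

// Unlocked variant: the caller must already hold the lock.
function compactUnlocked(log: string[]): void {
  if (log.length > MAX_ENTRIES) {
    log.splice(0, log.length - MAX_ENTRIES);
  }
}

// Public variant: acquires the lock, then delegates.
function compact(log: string[]): void {
  lock.runExclusive(() => compactUnlocked(log));
}

// Buggy shape: re-enters the lock through the public helper.
function appendBuggy(log: string[], entry: string): void {
  lock.runExclusive(() => {
    log.push(entry);
    compact(log); // fails: the lock is already held by this call chain
  });
}

// Fixed shape: stays inside the lock and calls the unlocked variant.
function appendFixed(log: string[], entry: string): void {
  lock.runExclusive(() => {
    log.push(entry);
    compactUnlocked(log);
  });
}
```

Splitting each operation into a locking wrapper and an `...Unlocked` body keeps the lock simple while making the "caller already holds the lock" contract visible in the function name.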
package/README.md CHANGED
@@ -231,20 +231,19 @@ If preconditions are not met, a friendly error message is returned instead of cr
 
  | Scope | Path |
  |-------|------|
- | User | `~/.pi/agent/extensions/pi-crew/config.json` |
- | Project (new) | `.crew/config.json` |
- | Project (legacy) | `.pi/teams/config.json` |
+ | User (primary) | `~/.pi/agent/pi-crew.json` |
+ | User (legacy, still read for migration) | `~/.pi/agent/extensions/pi-crew/config.json` |
+ | Project (crewRoot) | `.crew/config.json` (or `.pi/teams/config.json` legacy) |
+ | Project (alt) | `.pi/pi-crew.json` |
 
  ### Quick Config
 
  ```text
  /team-config # view all settings
- /team-config get runtime.mode # read one key
- /team-config set runtime.mode=scaffold # scaffold mode
- /team-config set asyncByDefault=true # async by default
- /team-config unset runtime.mode # reset to default
- /team-config --project # project scope
- /team-settings path # show config file path
+ /team-config runtime.mode=scaffold # set a key (--project for project scope)
+ /team-config --unset=runtime.mode # reset a key to default
+ /team-config --project runtime.mode # project-scoped view
+ /team-settings path # show config file path
  ```
 
  ### Key Settings
@@ -262,8 +261,8 @@ If preconditions are not met, a friendly error message is returned instead of cr
  | **UI** | `widgetPlacement`, `dashboardPlacement` | compact widget |
  | | `showModel`, `showTokens` | display controls |
  | **Reliability** | `autoRetry`, `autoRecover`, `deadletterThreshold` | opt-in |
- | **Observability** | `prometheus.enabled`, `otlp.enabled`, `heartbeatStaleMs` | opt-in |
- | **Worktree** | `worktree.enabled` | disabled by default |
+ | **Observability** | `observability.enabled`, `observability.pollIntervalMs`, `otlp.enabled`/`otlp.endpoint` | opt-in |
+ | **Worktree** | `worktree.setupHook`, `worktree.linkNodeModules`, `worktree.seedPaths` (mode is set via `workspaceMode: "worktree"` at run time) | disabled by default |
 
  > ⚠️ **Trust boundary**: project config cannot override sensitive execution controls (workers, runtime mode, autonomy, agent overrides). Set those in **user config** only.
 
@@ -515,6 +514,7 @@ Stats: **366 source files** (70K lines) · **506 test files** (66K lines) · **4
  | [docs/commands-reference.md](docs/commands-reference.md) | Slash commands + `/team-api` |
  | [docs/resource-formats.md](docs/resource-formats.md) | Agent/team/workflow file formats |
  | [docs/usage.md](docs/usage.md) | Usage patterns + config examples |
+ | [docs/troubleshooting.md](docs/troubleshooting.md) | Common errors, recovery, and error-code reference (E001–E012) |
  | [docs/architecture.md](docs/architecture.md) | Internal architecture + run flow |
  | [docs/runtime-flow.md](docs/runtime-flow.md) | Runtime execution details |
  | [docs/live-mailbox-runtime.md](docs/live-mailbox-runtime.md) | Mailbox + live-session runtime |
package/docs/commands-reference.md CHANGED
@@ -188,13 +188,13 @@ Keep the 20 most recent runs, delete the rest.
  | `runtime.groupJoinAckTimeoutMs` | number | `300000` | Group join ack timeout (ms) |
  | `runtime.requirePlanApproval` | boolean | `false` | Require plan approval before execution |
  | `runtime.completionMutationGuard` | string | `"warn"` | `off`, `warn`, `fail` |
- | `limits.maxConcurrentWorkers` | number | | Max workers running in parallel |
- | `limits.maxTaskDepth` | number | `2` | Max task tree depth |
- | `limits.maxChildrenPerTask` | number | `5` | Max children per task |
- | `limits.maxRunMinutes` | number | `60` | Max run duration (minutes) |
- | `limits.maxRetriesPerTask` | number | `1` | Max retries per task |
- | `limits.maxTasksPerRun` | number | | Max tasks per run |
- | `limits.heartbeatStaleMs` | number | `60000` | Heartbeat stale threshold |
+ | `limits.maxConcurrentWorkers` | number | `1024` | Max workers running in parallel |
+ | `limits.maxTaskDepth` | number | `100` | Max task tree depth |
+ | `limits.maxChildrenPerTask` | number | | Max children per task |
+ | `limits.maxRunMinutes` | number | `1440` | Max run duration (minutes) |
+ | `limits.maxRetriesPerTask` | number | `100` | Max retries per task |
+ | `limits.maxTasksPerRun` | number | `10000` | Max tasks per run |
+ | `limits.heartbeatStaleMs` | number | `86400000` | Heartbeat stale threshold |
  | `control.enabled` | boolean | — | Enable agent control-plane |
  | `control.needsAttentionAfterMs` | number | — | Attention timeout |
  | `autonomous.profile` | string | `"suggested"` | `manual`, `suggested`, `assisted`, `aggressive` |
@@ -205,9 +205,13 @@ Keep the 20 most recent runs, delete the rest.
  | `tools.enableSteer` | boolean | `true` | Enable steer tool |
  | `tools.terminateOnForeground` | boolean | `false` | Return terminate from the foreground Agent |
  | `agents.disableBuiltins` | boolean | `false` | Disable builtin agents |
- | `observability.prometheus.enabled` | boolean | `false` | Enable Prometheus exporter |
- | `observability.otlp.enabled` | boolean | `false` | Enable OTLP exporter |
- | `worktree.enabled` | boolean | | Enable worktree isolation |
+ | `observability.enabled` | boolean | `false` | Enable metrics collection |
+ | `observability.pollIntervalMs` | number | | Metrics poll interval |
+ | `otlp.enabled` | boolean | `false` | Enable OTLP exporter |
+ | `otlp.endpoint` | string | — | OTLP endpoint URL |
+ | `worktree.setupHook` | string | — | Worktree setup hook command |
+ | `worktree.linkNodeModules` | boolean | — | Symlink node_modules into worktree |
+ | `worktree.seedPaths` | array | — | Extra paths to seed into worktree |
 
  ---
 
package/docs/troubleshooting.md ADDED
@@ -0,0 +1,131 @@
+ # Troubleshooting
+
+ Common problems and their fixes. If you hit an error code (E001–E012), see the
+ [Error codes](#error-codes) table below.
+
+ ## Quick health check
+
+ ```text
+ team action='doctor'
+ team action='health'
+ ```
+
+ `doctor` validates your config, runtime, and worker setup. `health` shows live
+ run/process status. Start here.
+
+ ## Runs won't start / workers are "blocked"
+
+ **Symptom:** `team action='run'` returns a `blocked` status with a message like
+ "Child worker execution is disabled".
+
+ **Cause:** Worker execution is off. pi-crew refuses to create no-op scaffold
+ subagents by default until you opt in.
+
+ **Fix — pick one:**
+ - Set in your config (`~/.pi/agent/pi-crew.json`): `"executeWorkers": true`
+ - Or set the env var: `PI_CREW_EXECUTE_WORKERS=1`
+ - Or pass at run time: `team action='run' config={runtime:{mode:'live-session'}}`
+
+ The blocked-run message lists the exact config + env vars in play — read it.
+
+ ## "Run not found" / I lost my run ID
+
+ Run IDs are long (`team_20260615180014_a1b2c3d4e5f60718`). To recover:
+
+ ```text
+ team action='list' # recent runs + IDs
+ team action='status' # status of in-flight runs in this project
+ team action='artifacts' runId=… # if you only have a partial ID
+ ```
+
+ Every "Run not found" error now appends a `Tip: run action='list'` hint.
+
+ ## "Unknown action: X"
+
+ You typo'd an action. The error now suggests the closest match
+ (`Did you mean 'status'?`). To see all valid actions:
+
+ ```text
+ team action='help'
+ team action='list' resource='workflow'
+ ```
+
+ ## Worktree runs fail ("not a git repo" / "tree is dirty")
+
+ Worktree mode (`workspaceMode: 'worktree'`) requires:
+ 1. The target directory is a **git repository** (`git rev-parse` succeeds).
+ 2. The working tree is **clean** (no uncommitted changes) unless you pass
+ `force: true`.
+
+ If you don't need isolation, use single mode instead:
+ `team action='run' workspaceMode='single'`.
+
+ ## Stale async process / run stuck in "running"
+
+ A background run whose process died (crash, Ctrl+C, reboot) can appear stuck
+ in `running`. The stale-reconciler eventually marks it `failed`, but you can
+ force recovery:
+
+ ```text
+ team action='status' runId=… # check the async liveness line
+ team action='cleanup' runId=… # repair stuck state
+ team action='cancel' runId=… # cancel a truly-dead run
+ ```
+
+ The error message explains the heartbeat mechanism + remediation.
+
+ ## Model fallback exhausted
+
+ **Symptom:** `All N candidates exhausted (tried: a → b → c)`.
+
+ **Cause:** Every model in your fallback chain failed (rate limit, auth, quota).
+
+ **Fix:** Check your provider config / API keys. The error now lists the full
+ chain tried and the last failure reason.
+
+ ## Config is malformed / ignored
+
+ If your `pi-crew.json` has a syntax or type error, `team action='run'` emits a
+ `config.warning` event (visible via `team action='events'`) and proceeds with
+ defaults — it does **not** hard-fail. To validate explicitly:
+
+ ```text
+ team action='config' # show loaded config + any warnings
+ team action='doctor' # full validation
+ ```
+
+ ## Compact vs full status
+
+ `team action='status'` defaults to full output (~40 lines). For a quick check:
+
+ ```text
+ team action='status' details=false # compact: status, progress, goal, issues only
+ ```
+
+ ## Error codes
+
+ pi-crew uses a structured error taxonomy (E001–E012). Each error renders its
+ code + a help hint inline. Common ones:
+
+ | Code | Name | Meaning | First check |
+ |------|------|---------|-------------|
+ | E001 | FileReadError | a required file couldn't be read | check the file exists + read perms; may need `cleanup` |
+ | E002 | FileWriteError | an atomic write failed | check disk space + dir write perms |
+ | E003 | TaskNotFound | a referenced task id doesn't exist | `team status` to verify the run's tasks |
+ | E004 | InvalidStatusTransition | illegal run/task status change | verify status via `team status` before retrying |
+ | E005 | ConfigError | config has a syntax/type error | `team config` shows the offending field |
+ | E006 | ResourceNotFound | agent/team/workflow not found | `team list` to see available resources |
+ | E007 | ChildTimeout | a worker Pi didn't finish in time | raise `runtime.responseTimeoutMs` or simplify the task |
+ | E008 | ModelExhausted | all fallback models failed | see "Model fallback exhausted" above |
+ | E009 | PreStepFailed | a workflow pre-step hook failed | check the hook stderr in events; set `preStepOptional: true` on the step to make it advisory (non-fatal) |
+ | E010 | EventLogLockTimeout | event log locked under contention | transient; retry, or lower concurrency |
+ | E011 | DepthLimitExceeded | crew nesting too deep | raise `crew.maxDepth` or flatten the call |
+ | E012 | RunStale | run reconciled as stale | see "Stale async process" above |
+
+ ## Still stuck
+
+ - `team action='explain' runId=…` — structured per-task analysis (why, files,
+ complexity).
+ - `team action='summary' runId=…` — includes common failure-pattern detection
+ ("4 of 5 failures share 2 root causes").
+ - `team action='events' runId=…` — full event timeline for forensics.
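For scripts that post-process pi-crew output, the error-code table in the new troubleshooting guide can be mirrored as a small lookup. This is a sketch only: the code-to-name pairs are copied from the table, but `describeErrorCode` is a hypothetical helper, and the assumption that a code appears in the message as a standalone `E` plus three digits is not confirmed by this diff.

```typescript
// Code-to-name pairs copied from the troubleshooting table.
const ERROR_NAMES: Record<string, string> = {
  E001: "FileReadError",
  E002: "FileWriteError",
  E003: "TaskNotFound",
  E004: "InvalidStatusTransition",
  E005: "ConfigError",
  E006: "ResourceNotFound",
  E007: "ChildTimeout",
  E008: "ModelExhausted",
  E009: "PreStepFailed",
  E010: "EventLogLockTimeout",
  E011: "DepthLimitExceeded",
  E012: "RunStale",
};

// Hypothetical helper. Assumption: the code shows up in the message as a
// standalone token such as "E007"; the real rendering may differ.
function describeErrorCode(message: string): string | null {
  const match = /\bE\d{3}\b/.exec(message);
  if (!match) return null;
  const code = match[0];
  const name = ERROR_NAMES[code];
  return name ? `${code} ${name}` : null;
}
```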
package/docs/usage.md CHANGED
@@ -5,9 +5,14 @@
  Optional config path:
 
  ```text
- ~/.pi/agent/extensions/pi-crew/config.json
+ ~/.pi/agent/pi-crew.json
  ```
 
+ A **legacy** path `~/.pi/agent/extensions/pi-crew/config.json` is also read for
+ backward-compatibility migration — values there are merged but the new path
+ above is preferred. A project-local config may also live at `.pi/pi-crew.json`
+ in your repo root (project values are merged under the user config).
+
  Create a default config:
 
  ```bash
@@ -58,9 +63,9 @@ Supported fields:
      "deadletterThreshold": 3,
      "retryPolicy": {
        "maxAttempts": 3,
-       "initialDelayMs": 1000,
-       "backoffMultiplier": 2,
-       "maxDelayMs": 30000
+       "backoffMs": 1000,
+       "jitterRatio": 0.3,
+       "exponentialFactor": 2
      }
    }
  }
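The renamed `retryPolicy` fields suggest exponential backoff with jitter, but this diff does not show how pi-crew combines them. The sketch below is therefore an assumption, not the package's implementation: it takes the base delay as `backoffMs * exponentialFactor^(attempt - 1)` and spreads it by plus or minus `jitterRatio`. `retryDelayMs` and its injectable `random` parameter are hypothetical.

```typescript
interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  jitterRatio: number;
  exponentialFactor: number;
}

// Assumed formula (not confirmed by the diff):
//   base  = backoffMs * exponentialFactor^(attempt - 1)
//   delay = base * (1 + jitterRatio * u), with u uniform in [-1, 1)
// `attempt` is 1-based; `random` is injectable so the result can be tested.
function retryDelayMs(
  policy: RetryPolicy,
  attempt: number,
  random: () => number = Math.random,
): number {
  const base = policy.backoffMs * Math.pow(policy.exponentialFactor, attempt - 1);
  const u = random() * 2 - 1;
  return Math.round(base * (1 + policy.jitterRatio * u));
}

// Values taken from the example config in the hunk above.
const examplePolicy: RetryPolicy = {
  maxAttempts: 3,
  backoffMs: 1000,
  jitterRatio: 0.3,
  exponentialFactor: 2,
};
```

Under this assumed formula, the example config would give delays centred on 1000 ms, 2000 ms, and 4000 ms for attempts 1 to 3, each varying by up to 30%.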
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "pi-crew",
-   "version": "0.7.5",
+   "version": "0.7.6",
    "description": "Pi extension for coordinated AI teams, workflows, worktrees, and async task orchestration",
    "author": "baphuongna",
    "license": "MIT",
package/src/config/config.ts CHANGED
@@ -494,8 +494,14 @@ function mergeConfig(
  delete merged.otlp.headers;
  // Validate OTLP headers for injection attacks:
  // - Check top-level keys for dangerous prototype pollution patterns
- // - Block all control characters (except tab=0x09, newline=0x0A) to prevent
- // header injection via CR/LF/zero-byte/etc.
+ // - Block ALL control characters except tab (0x09) to prevent header
+ // injection via CR/LF/zero-byte/etc.
+ // BUG (Round 28, CRLF injection): the previous range
+ // /[\x00-\x08\x0b\x0c\x0e-\x1f]/ left THREE chars unblocked: tab (0x09,
+ // intentionally allowed), LF (0x0A) AND CR (0x0D). The comment claimed to
+ // "prevent header injection via CR/LF" but CR was never matched, and LF
+ // was explicitly allowed — both are CRLF injection vectors that can split
+ // HTTP headers. Fix: block 0x00-0x08 and 0x0A-0x1F, allowing only tab.
  const invalidHeaders: string[] = [];
  for (const [k, v] of Object.entries(merged.otlp.headers ?? {})) {
    // Check top-level key for dangerous names (only top-level keys are checked)
@@ -505,9 +511,10 @@ function mergeConfig(
      return false;
    };
    if (checkKey(k)) { invalidHeaders.push(k); continue; }
-   // Block any control characters except tab (0x09) in values
+   // Block any control characters except tab (0x09) in values.
+   // Round 28 fix: /[\x00-\x08\x0a-\x1f]/ blocks LF (0x0A) and CR (0x0D) too.
    const valStr = String(v);
-   if (/[\x00-\x08\x0b\x0c\x0e-\x1f]/.test(valStr)) { invalidHeaders.push(k); }
+   if (/[\x00-\x08\x0a-\x1f]/.test(valStr)) { invalidHeaders.push(k); }
  }
  if (invalidHeaders.length > 0) {
    delete merged.otlp.headers;
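The difference between the two character classes is easiest to see by running both against a crafted value. The two patterns below are copied from the hunk above; `hasForbiddenControlChar` and the sample header value are hypothetical and exist only for this illustration.

```typescript
// Both patterns are copied verbatim from the config.ts hunk.
const OLD_PATTERN = /[\x00-\x08\x0b\x0c\x0e-\x1f]/; // skips tab, LF (0x0A), CR (0x0D)
const NEW_PATTERN = /[\x00-\x08\x0a-\x1f]/; // skips only tab (0x09)

// Hypothetical helper: true when a header value contains a character the
// pattern treats as forbidden.
function hasForbiddenControlChar(value: string, pattern: RegExp): boolean {
  return pattern.test(value);
}

// A value that tries to smuggle a second header line via CR LF.
const crafted = "token\r\nX-Injected: 1";

const missedByOld = !hasForbiddenControlChar(crafted, OLD_PATTERN);
const caughtByNew = hasForbiddenControlChar(crafted, NEW_PATTERN);
const tabStillAllowed = !hasForbiddenControlChar("a\tb", NEW_PATTERN);
```

Neither pattern uses the `g` flag, so `test` carries no state between calls and the patterns can be shared safely.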
package/src/extension/action-suggestions.ts ADDED
@@ -0,0 +1,71 @@
+ /**
+  * action-suggestions.ts — "Did you mean?" suggestions for team actions (DX: F1).
+  *
+  * Round 16 DX audit found that a typo'd action (`action: 'stat'`,
+  * `action: 'summery'`) hits a dead-end "Unknown action: stat" with no path
+  * forward. pi-crew already ships a Levenshtein fuzzy-matcher
+  * (`src/config/suggestions.ts → suggestConfigKey`); this module applies it to
+  * the known set of team actions.
+  *
+  * The known-action list mirrors the `action` enum in
+  * `src/schema/team-tool-schema.ts`. Kept as a hand-maintained constant (not
+  * derived from the TypeBox schema at runtime) so it is trivially testable and
+  * avoids pulling the schema into low-level error paths.
+  */
+
+ import { findClosestKey } from "../config/suggestions.ts";
+
+ /**
+  * The complete set of valid top-level `team` actions (mirrors the action enum
+  * in `src/schema/team-tool-schema.ts`). Exported so callers and tests can use
+  * the single source of truth.
+  */
+ export const KNOWN_TEAM_ACTIONS = [
+   "run", "parallel", "plan", "status", "wait", "list", "get",
+   "cancel", "retry", "resume", "respond", "create", "update", "delete",
+   "doctor", "cleanup", "events", "artifacts", "worktrees", "forget",
+   "summary", "prune", "export", "import", "imports", "help", "validate",
+   "config", "init", "recommend", "autonomy", "api", "settings", "steer",
+   "invalidate", "health", "graph", "onboard", "explain", "cache",
+   "checkpoint", "search", "orchestrate", "schedule", "scheduled", "anchor",
+   "auto-summarize", "auto_boomerang",
+ ] as const;
+
+ /**
+  * Suggest the closest known team action for a (likely typo'd) input.
+  * Returns `null` when no action is close enough — callers should then omit
+  * the "Did you mean …?" hint rather than suggesting a poor match.
+  *
+  * Uses a tighter edit-distance budget than the generic config-key suggester
+  * (2 instead of 3): team actions are short command words, so distance-3
+  * matches against a short input (e.g. "" → "run") produce low-quality hints.
+  * Empty/whitespace input always returns null.
+  *
+  * Exported for unit testing.
+  */
+ export function suggestAction(input: string): string | null {
+   const trimmed = input.trim();
+   if (!trimmed) return null;
+   // Defense-in-depth (Round 18 security F1): levenshtein is O(n×m). A hostile
+   // very-long input would waste cycles. The action is enum-validated upstream
+   // so this is unreachable in practice, but cap input length cheaply.
+   if (trimmed.length > 64) return null;
+   return findClosestKey(trimmed, KNOWN_TEAM_ACTIONS, 2);
+ }
+
+ /**
+  * Build a "Did you mean?" suffix for an unknown-action error message.
+  * Returns "" when there is no good suggestion (so the caller can just append
+  * it unconditionally). Keeps error formatting centralized.
+  *
+  * Exported for unit testing + use in the dispatch default-case.
+  *
+  * Example:
+  * formatActionSuggestion("stat") // "\n\nDid you mean 'status'? Use action='status'."
+  * formatActionSuggestion("xyzzy") // ""
+  */
+ export function formatActionSuggestion(input: string): string {
+   const suggestion = suggestAction(input);
+   if (!suggestion || suggestion === input) return "";
+   return `\n\nDid you mean '${suggestion}'? Use action='${suggestion}'. Run action='help' to see all actions.`;
+ }
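`suggestAction` delegates to `findClosestKey`, whose body is not part of this diff. The sketch below shows one plausible way such a matcher could work, under stated assumptions: `levenshtein` and `closestWithin` are hypothetical stand-ins written for this example, and the four-entry action list is a small subset chosen to keep the distances easy to check by hand.

```typescript
// Classic dynamic-programming Levenshtein distance using two rolling rows.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical stand-in for findClosestKey: returns the candidate with the
// smallest distance, or null when nothing is within maxDistance.
function closestWithin(
  input: string,
  candidates: readonly string[],
  maxDistance: number,
): string | null {
  let best: string | null = null;
  let bestDistance = maxDistance + 1;
  for (const candidate of candidates) {
    const distance = levenshtein(input, candidate);
    if (distance < bestDistance) {
      best = candidate;
      bestDistance = distance;
    }
  }
  return best;
}

// Small illustrative subset of KNOWN_TEAM_ACTIONS.
const SAMPLE_ACTIONS = ["run", "status", "summary", "list"] as const;

const fromStat = closestWithin("stat", SAMPLE_ACTIONS, 2); // two insertions reach "status"
const fromSummery = closestWithin("summery", SAMPLE_ACTIONS, 2); // one substitution reaches "summary"
const fromXyzzy = closestWithin("xyzzy", SAMPLE_ACTIONS, 2); // nothing within distance 2
```

With a budget of 2, a four-letter typo can still reach a six-letter action, while unrelated input falls through to `null` and the caller can skip the hint.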
package/src/extension/context-status-injection.ts CHANGED
@@ -45,6 +45,36 @@ const MAX_INLINE_RUNS = 3;
  /** Truncate long goals so one run can't dominate the context window. */
  const MAX_GOAL_LEN = 80;
 
+ /**
+  * Cheap human-readable run age from manifest timestamps (no extra I/O).
+  * Returns "running 12m" / "updated 3m ago" style, or "" if timestamps are
+  * missing/invalid. Keeps the ambient note informative without reading
+  * tasks.json on every LLM call.
+  */
+ function runAge(createdAt?: string, updatedAt?: string): string {
+   try {
+     const updated = updatedAt ? Date.parse(updatedAt) : NaN;
+     const created = createdAt ? Date.parse(createdAt) : NaN;
+     if (Number.isFinite(updated)) {
+       const sinceUpdate = Date.now() - updated;
+       if (sinceUpdate < 60_000) return `, updated just now`;
+       return `, updated ${humanizeMs(sinceUpdate)} ago`;
+     }
+     if (Number.isFinite(created)) {
+       return `, running ${humanizeMs(Date.now() - created)}`;
+     }
+   } catch { /* ignore malformed timestamps */ }
+   return "";
+ }
+
+ function humanizeMs(ms: number): string {
+   if (ms < 60_000) return `${Math.round(ms / 1000)}s`;
+   const m = Math.floor(ms / 60_000);
+   if (m < 60) return `${m}m`;
+   const h = Math.floor(m / 60);
+   return h < 24 ? `${h}h${m % 60}m` : `${Math.floor(h / 24)}d`;
+ }
+
  /**
   * Build a compact, human+LLM-readable ambient status string for the given
   * in-flight runs. Returns "" for an empty list (caller treats as no-op).
@@ -62,7 +92,8 @@ export function formatAmbientStatus(runs: TeamRunManifest[]): string {
  const shown = runs.slice(0, MAX_INLINE_RUNS);
  for (const run of shown) {
    const wf = run.workflow ? `, ${run.workflow}` : "";
-   lines.push(`• ${run.runId} (${run.status}, ${run.team}${wf}): ${truncate(run.goal ?? "(no goal)", MAX_GOAL_LEN)}`);
+   const age = runAge(run.createdAt, run.updatedAt);
+   lines.push(`• ${run.runId} (${run.status}, ${run.team}${wf})${age}: ${truncate(run.goal ?? "(no goal)", MAX_GOAL_LEN)}`);
  }
  if (runs.length > MAX_INLINE_RUNS) {
    lines.push(`• …and ${runs.length - MAX_INLINE_RUNS} more`);
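To see what the new age suffix looks like in practice, the `humanizeMs` helper added to this file can be exercised on its own. The function body below is copied from the hunk; the sample durations and the `samples` constant are illustrative additions.

```typescript
// Copied from the context-status-injection.ts hunk.
function humanizeMs(ms: number): string {
  if (ms < 60_000) return `${Math.round(ms / 1000)}s`;
  const m = Math.floor(ms / 60_000);
  if (m < 60) return `${m}m`;
  const h = Math.floor(m / 60);
  return h < 24 ? `${h}h${m % 60}m` : `${Math.floor(h / 24)}d`;
}

const MINUTE = 60_000;
const HOUR = 60 * MINUTE;

// Illustrative inputs, one per output branch.
const samples = [
  humanizeMs(45_000), // seconds branch: "45s"
  humanizeMs(12 * MINUTE), // minutes branch: "12m"
  humanizeMs(3 * HOUR + 5 * MINUTE), // hours branch: "3h5m"
  humanizeMs(48 * HOUR), // days branch: "2d"
];
```

Two properties follow directly from the code: minutes in the hours branch are not zero-padded, and once a duration reaches a full day the hours and minutes are dropped entirely.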
package/src/extension/register.ts CHANGED
@@ -82,7 +82,8 @@ import {
  import { RenderScheduler } from "../ui/render-scheduler.ts";
  import { runEventBus } from "../ui/run-event-bus.ts";
  import { createRunSnapshotCache } from "../ui/run-snapshot-cache.ts";
- import { closeWatcher, watchCrewState } from "../utils/fs-watch.ts";
+ import { closeWatcher } from "../utils/fs-watch.ts";
+ import { RunWatcherRegistry } from "../utils/run-watcher-registry.ts";
  import { logInternalError } from "../utils/internal-error.ts";
  import {
    clearProjectRootCache,
@@ -725,8 +726,13 @@ export function registerPiTeams(pi: ExtensionAPI): void {
  // Linux), file changes (manifest/tasks/events/agents) trigger an
  // immediate cache invalidate via renderScheduler.schedule. Falls back to
  // poll-only behavior on systems where fs.watch errors.
- let crewWatcher: import("node:fs").FSWatcher | undefined;
- let userCrewWatcher: import("node:fs").FSWatcher | undefined;
+ // pts/2 hang fix (2026-06-16): the previous RECURSIVE fs.watch(<state>, {recursive:true})
+ // exploded to O(total run history) inotify watches on Linux (109→339 observed) and
+ // caused a permanent busy-loop. Replaced with bounded per-active-run watchers via
+ // RunWatcherRegistry (root watcher on runs/ for new-run detection + one non-recursive
+ // watcher per active run, reconciled each preload tick in buildFrame).
+ let crewRunWatchers: RunWatcherRegistry | undefined;
+ let userCrewWatchers: RunWatcherRegistry | undefined;
  // Separate map for foreground team-run AbortControllers (distinct from subagent controllers).
  // P0 fix: stopSessionBoundSubagents must NOT abort foreground team runs on session switch.
  // Foreground team runs run in the same process as the session; they naturally clean up
@@ -1116,10 +1122,10 @@ export function registerPiTeams(pi: ExtensionAPI): void {
    clearTimeout(preloadTimer);
    preloadTimer = undefined;
  }
- closeWatcher(crewWatcher);
- crewWatcher = undefined;
- closeWatcher(userCrewWatcher);
- userCrewWatcher = undefined;
+ crewRunWatchers?.closeAll();
+ crewRunWatchers = undefined;
+ userCrewWatchers?.closeAll();
+ userCrewWatchers = undefined;
  stopSessionBoundSubagents();
  // P0 fix: also abort foreground team runs on session shutdown (not on session switch).
  // This is the only place where foreground team run controllers should be aborted.
@@ -1590,6 +1596,25 @@ export function registerPiTeams(pi: ExtensionAPI): void {
1590
1596
  lastFrameSnapshotCache = getRunSnapshotCache(currentCtx.cwd);
1591
1597
  const manifests = lastFrameManifestCache.list(20);
1592
1598
  lastPreloadedManifests = manifests;
1599
+ // pts/2 hang fix: reconcile per-run watchers against the ACTIVE set only.
1600
+ // This bounds inotify cost to O(active runs) — completed runs stop being
1601
+ // watched as soon as they leave running/queued/planning status, instead of
1602
+ // the recursive watcher watching the entire run history forever.
1603
+ {
1604
+ const onRunChange = (runId: string): void => {
1605
+ if (cleanedUp || sessionGeneration !== ownerGeneration) return;
1606
+ getRunSnapshotCache(currentCtx?.cwd ?? process.cwd()).invalidate(runId);
1607
+ renderScheduler?.schedule({ runId });
1608
+ };
1609
+ const onWatchErr = (error: unknown): void => {
1610
+ logInternalError("register.runWatcher.change", error);
1611
+ };
1612
+ const active = manifests
1613
+ .filter((r) => r.status === "running" || r.status === "queued" || r.status === "planning")
1614
+ .map((r) => ({ runId: r.runId, runDir: r.stateRoot }));
1615
+ crewRunWatchers?.reconcile(active, onRunChange, onWatchErr);
1616
+ userCrewWatchers?.reconcile(active, onRunChange, onWatchErr);
1617
+ }
1593
1618
  const runIds = manifests.map((r) => r.runId);
1594
1619
  await lastFrameSnapshotCache.preloadAllStale(runIds);
1595
1620
  return true;
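Reviewer note: the hunk above calls `reconcile(active, onRunChange, onWatchErr)` on a `RunWatcherRegistry`, but the class itself is not part of this diff. The following is a minimal sketch of how such a reconcile step could work, assuming one non-recursive watcher per active run directory. The class name `RunWatcherRegistrySketch`, the `WatchFn` opener, and the `size` getter are hypothetical and exist only for illustration; the real implementation may differ.

```typescript
import * as fs from "node:fs";

// Hypothetical shapes: the real RunWatcherRegistry is not shown in this diff.
interface ActiveRun { runId: string; runDir: string }
interface Closable { close(): void }
type WatchFn = (
  dir: string,
  onEvent: () => void,
  onError: (error: unknown) => void,
) => Closable;

// Default opener: one NON-recursive fs.watch per directory.
const nodeWatch: WatchFn = (dir, onEvent, onError) => {
  const watcher = fs.watch(dir, { persistent: false }, () => onEvent());
  watcher.on("error", onError);
  return watcher;
};

class RunWatcherRegistrySketch {
  private readonly perRun = new Map<string, Closable>();
  private readonly open: WatchFn;

  constructor(open: WatchFn = nodeWatch) {
    this.open = open;
  }

  // Close watchers for runs that are no longer active, then open watchers
  // for newly active runs. Watch count stays O(active runs).
  reconcile(
    active: ActiveRun[],
    onChange: (runId: string) => void,
    onError: (error: unknown) => void,
  ): void {
    const wanted = new Set(active.map((r) => r.runId));
    for (const [runId, watcher] of this.perRun) {
      if (!wanted.has(runId)) {
        watcher.close();
        this.perRun.delete(runId);
      }
    }
    for (const { runId, runDir } of active) {
      if (this.perRun.has(runId)) continue;
      try {
        this.perRun.set(runId, this.open(runDir, () => onChange(runId), onError));
      } catch (error) {
        // A missing or unwatchable dir falls back to poll-only for that run.
        onError(error);
      }
    }
  }

  closeAll(): void {
    for (const watcher of this.perRun.values()) watcher.close();
    this.perRun.clear();
  }

  get size(): number {
    return this.perRun.size;
  }
}
```

Injecting the opener keeps the set-difference logic testable without touching the filesystem. Because reconcile is idempotent, calling it on every preload tick with an unchanged active set opens and closes nothing.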
@@ -1815,72 +1840,53 @@ export function registerPiTeams(pi: ExtensionAPI): void {
1815
1840
  renderSchedulerUnsubscribers.push(unsubscribeRunEvents);
1816
1841
  // Start async preload loop — refreshes snapshot cache in background
1817
1842
  startPreloadLoop(fallbackMs, effectiveRefreshMs);
1818
- // 1.3: native FS watcher on `<crewRoot>/state`. Triggers an immediate
1819
- // renderScheduler.schedule({runId}) when files inside any run change so
1820
- // the snapshot cache invalidates well before the 1s preload tick. Falls
1821
- // back silently to poll-only behavior on systems where recursive
1822
- // fs.watch is not supported.
1843
+ // 1.3: BOUNDED run watcher (pts/2 hang fix 2026-06-16). Previously this was
1844
+ // a RECURSIVE fs.watch(<state>, {recursive:true}) which on Linux expands to
1845
+ // ONE inotify watch PER SUBDIR. With many historical runs under
1846
+ // .crew/state/runs/ this ballooned to hundreds of watches (109→339 observed)
1847
+ // and the event volume caused a permanent busy-loop (71% CPU, 400KB/s read).
1848
+ // Now: a single non-recursive watcher on the runs/ ROOT (to detect new run
1849
+ // dirs appearing — crew.run.created is never emitted) plus per-active-run
1850
+ // watchers reconciled each preload tick in buildFrame. Total inotify cost is
1851
+ // O(active runs), not O(total history). Falls back to poll-only (the preload
1852
+ // loop already polls every effectiveRefreshMs) on systems where fs.watch
1853
+ // errors or the runs dir is absent.
1854
+ const crewRunWatcherOnChange = (runId: string): void => {
1855
+ if (cleanedUp || sessionGeneration !== ownerGeneration) return;
1856
+ getRunSnapshotCache(currentCtx?.cwd ?? process.cwd()).invalidate(runId);
1857
+ renderScheduler?.schedule({ runId });
1858
+ };
1859
+ const crewRunWatcherOnError = (error: unknown): void => {
1860
+ logInternalError("register.crewRunWatchers.error", error);
1861
+ };
1823
1862
  try {
1824
- closeWatcher(crewWatcher);
1825
- crewWatcher = undefined;
1826
- const stateDir = path.join(projectCrewRoot(ctx.cwd), "state");
1827
- const watcher = watchCrewState(
1828
- stateDir,
1829
- (runId) => {
1830
- if (cleanedUp || sessionGeneration !== ownerGeneration)
1831
- return;
1832
- // Invalidate snapshot cache so the next renderTick reads fresh state from disk.
1833
- // Without this, renderTick re-renders from stale lastPreloadedManifests and
1834
- // shows ghost "running" entries for runs that already completed on disk.
1835
- const sc = getRunSnapshotCache(
1836
- currentCtx?.cwd ?? process.cwd(),
1837
- );
1838
- sc.invalidate(runId);
1839
- renderScheduler?.schedule({ runId });
1840
- },
1841
- (error) => {
1842
- logInternalError("register.crewWatcher.error", error);
1843
- closeWatcher(crewWatcher);
1844
- crewWatcher = undefined;
1845
- },
1846
- );
1847
- if (watcher) crewWatcher = watcher;
1863
+ crewRunWatchers?.closeAll();
1864
+ crewRunWatchers = undefined;
1865
+ const crewRunsDir = path.join(projectCrewRoot(ctx.cwd), "state", "runs");
1866
+ if (fs.existsSync(crewRunsDir)) {
1867
+ crewRunWatchers = new RunWatcherRegistry();
1868
+ crewRunWatchers.setRootWatcher(crewRunsDir, crewRunWatcherOnChange, crewRunWatcherOnError);
1869
+ }
1848
1870
  } catch (error) {
1849
- logInternalError("register.crewWatcher.start", error);
1871
+ logInternalError("register.crewRunWatchers.start", error);
1850
1872
  }
1851
- // Also watch user-level state dir — fast-fix and other user-scoped runs
1852
- // write manifests there. Without this watcher, runs completing in user-level
1873
+ // Also watch user-level runs dir — fast-fix and other user-scoped runs
1874
+ // write manifests there. Without this, runs completing in user-level
1853
1875
  // state never trigger cache invalidation, causing ghost "running" entries.
1854
1876
  try {
1855
- closeWatcher(userCrewWatcher);
1856
- userCrewWatcher = undefined;
1857
- const userStateDir = path.join(userCrewRoot(), "state");
1858
- if (fs.existsSync(userStateDir)) {
1859
- const userWatcher = watchCrewState(
1860
- userStateDir,
1861
- (runId) => {
1862
- if (cleanedUp || sessionGeneration !== ownerGeneration)
1863
- return;
1864
- const sc = getRunSnapshotCache(
1865
- currentCtx?.cwd ?? process.cwd(),
1866
- );
1867
- sc.invalidate(runId);
1868
- renderScheduler?.schedule({ runId });
1869
- },
1870
- (error) => {
1871
- logInternalError(
1872
- "register.userCrewWatcher.error",
1873
- error,
1874
- );
1875
- closeWatcher(userCrewWatcher);
1876
- userCrewWatcher = undefined;
1877
- },
1878
- );
1879
- if (userWatcher) userCrewWatcher = userWatcher;
1877
+ userCrewWatchers?.closeAll();
1878
+ userCrewWatchers = undefined;
1879
+ const userRunsDir = path.join(userCrewRoot(), "state", "runs");
1880
+ if (fs.existsSync(userRunsDir)) {
1881
+ userCrewWatchers = new RunWatcherRegistry();
1882
+ userCrewWatchers.setRootWatcher(userRunsDir, crewRunWatcherOnChange, crewRunWatcherOnError);
1880
1883
  }
1881
1884
  } catch (error) {
1882
- logInternalError("register.userCrewWatcher.start", error);
1885
+ logInternalError("register.userCrewWatchers.start", error);
1883
1886
  }
1887
+ // Kick an immediate preload so the first buildFrame reconciles per-run
1888
+ // watchers for any runs that are already active on session start.
1889
+ backgroundPreload();
1884
1890
  });
1885
1891
  pi.on("session_before_switch", () => {
1886
1892
  sessionGeneration++;