pi-crew 0.1.49 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (249)
  1. package/CHANGELOG.md +74 -1
  2. package/README.md +176 -781
  3. package/agents/analyst.md +11 -11
  4. package/agents/critic.md +11 -11
  5. package/agents/executor.md +11 -11
  6. package/agents/explorer.md +11 -11
  7. package/agents/planner.md +11 -11
  8. package/agents/reviewer.md +11 -11
  9. package/agents/security-reviewer.md +11 -11
  10. package/agents/test-engineer.md +11 -11
  11. package/agents/verifier.md +70 -11
  12. package/agents/writer.md +11 -11
  13. package/docs/actions-reference.md +595 -0
  14. package/docs/commands-reference.md +347 -0
  15. package/docs/runtime-flow.md +148 -148
  16. package/index.ts +6 -6
  17. package/package.json +99 -99
  18. package/skills/async-worker-recovery/SKILL.md +42 -42
  19. package/skills/context-artifact-hygiene/SKILL.md +52 -52
  20. package/skills/delegation-patterns/SKILL.md +54 -54
  21. package/skills/mailbox-interactive/SKILL.md +40 -40
  22. package/skills/model-routing-context/SKILL.md +39 -39
  23. package/skills/multi-perspective-review/SKILL.md +58 -58
  24. package/skills/observability-reliability/SKILL.md +41 -41
  25. package/skills/orchestration/SKILL.md +157 -157
  26. package/skills/ownership-session-security/SKILL.md +41 -41
  27. package/skills/pi-extension-lifecycle/SKILL.md +39 -39
  28. package/skills/requirements-to-task-packet/SKILL.md +63 -63
  29. package/skills/resource-discovery-config/SKILL.md +41 -41
  30. package/skills/runtime-state-reader/SKILL.md +44 -44
  31. package/skills/secure-agent-orchestration-review/SKILL.md +45 -45
  32. package/skills/state-mutation-locking/SKILL.md +42 -42
  33. package/skills/systematic-debugging/SKILL.md +67 -67
  34. package/skills/ui-render-performance/SKILL.md +39 -39
  35. package/skills/verification-before-done/SKILL.md +57 -57
  36. package/skills/worktree-isolation/SKILL.md +39 -39
  37. package/src/adapters/claude-adapter.ts +25 -0
  38. package/src/adapters/codex-adapter.ts +21 -0
  39. package/src/adapters/cursor-adapter.ts +17 -0
  40. package/src/adapters/export-util.ts +137 -0
  41. package/src/adapters/index.ts +15 -0
  42. package/src/adapters/registry.ts +18 -0
  43. package/src/adapters/types.ts +23 -0
  44. package/src/agents/agent-config.ts +2 -0
  45. package/src/agents/agent-search.ts +98 -98
  46. package/src/agents/discover-agents.ts +2 -1
  47. package/src/config/config.ts +14 -1
  48. package/src/config/defaults.ts +5 -5
  49. package/src/config/drift-detector.ts +211 -0
  50. package/src/config/markers.ts +327 -0
  51. package/src/config/resilient-parser.ts +108 -0
  52. package/src/config/suggestions.ts +74 -0
  53. package/src/extension/cross-extension-rpc.ts +103 -82
  54. package/src/extension/project-init.ts +36 -4
  55. package/src/extension/register.ts +67 -22
  56. package/src/extension/registration/commands.ts +77 -8
  57. package/src/extension/registration/subagent-tools.ts +10 -1
  58. package/src/extension/registration/team-tool.ts +10 -1
  59. package/src/extension/registration/viewers.ts +48 -34
  60. package/src/extension/run-bundle-schema.ts +89 -89
  61. package/src/extension/run-export.ts +26 -12
  62. package/src/extension/run-import.ts +25 -1
  63. package/src/extension/run-index.ts +5 -1
  64. package/src/extension/run-maintenance.ts +142 -68
  65. package/src/extension/team-manager-command.ts +10 -1
  66. package/src/extension/team-tool/context.ts +1 -1
  67. package/src/extension/team-tool/doctor.ts +28 -3
  68. package/src/extension/team-tool/handle-settings.ts +195 -188
  69. package/src/extension/team-tool/inspect.ts +41 -41
  70. package/src/extension/team-tool/intent-policy.ts +42 -42
  71. package/src/extension/team-tool/lifecycle-actions.ts +27 -8
  72. package/src/extension/team-tool/plan.ts +19 -19
  73. package/src/extension/team-tool/run.ts +12 -1
  74. package/src/extension/team-tool.ts +14 -3
  75. package/src/i18n.ts +184 -184
  76. package/src/observability/exporters/otlp-exporter.ts +92 -77
  77. package/src/prompt/prompt-runtime.ts +72 -72
  78. package/src/runtime/agent-memory.ts +72 -72
  79. package/src/runtime/agent-observability.ts +114 -114
  80. package/src/runtime/async-marker.ts +26 -26
  81. package/src/runtime/attention-events.ts +28 -28
  82. package/src/runtime/auto-resume.ts +100 -0
  83. package/src/runtime/background-runner.ts +11 -1
  84. package/src/runtime/cancellation-token.ts +89 -89
  85. package/src/runtime/cancellation.ts +61 -61
  86. package/src/runtime/capability-inventory.ts +116 -116
  87. package/src/runtime/child-pi.ts +7 -2
  88. package/src/runtime/compaction-summary.ts +271 -0
  89. package/src/runtime/completion-guard.ts +190 -190
  90. package/src/runtime/concurrency.ts +3 -1
  91. package/src/runtime/crash-recovery.ts +33 -0
  92. package/src/runtime/delta-conflict.ts +360 -0
  93. package/src/runtime/diagnostic-export.ts +3 -1
  94. package/src/runtime/direct-run.ts +35 -35
  95. package/src/runtime/event-stream-bridge.ts +3 -1
  96. package/src/runtime/foreground-control.ts +82 -82
  97. package/src/runtime/green-contract.ts +46 -46
  98. package/src/runtime/group-join.ts +106 -106
  99. package/src/runtime/heartbeat-gradient.ts +28 -28
  100. package/src/runtime/heartbeat-watcher.ts +124 -124
  101. package/src/runtime/iteration-hooks.ts +262 -0
  102. package/src/runtime/live-agent-control.ts +88 -88
  103. package/src/runtime/live-control-realtime.ts +36 -36
  104. package/src/runtime/live-extension-bridge.ts +150 -150
  105. package/src/runtime/live-irc.ts +92 -92
  106. package/src/runtime/live-session-health.ts +100 -100
  107. package/src/runtime/loop-gates.ts +129 -0
  108. package/src/runtime/metric-parser.ts +40 -0
  109. package/src/runtime/notebook-helpers.ts +90 -90
  110. package/src/runtime/orphan-sentinel.ts +7 -7
  111. package/src/runtime/parallel-research.ts +44 -44
  112. package/src/runtime/phase-progress.ts +217 -0
  113. package/src/runtime/pi-args.ts +38 -2
  114. package/src/runtime/pi-json-output.ts +111 -111
  115. package/src/runtime/pi-spawn.ts +74 -6
  116. package/src/runtime/policy-engine.ts +79 -79
  117. package/src/runtime/post-checks.ts +122 -0
  118. package/src/runtime/process-status.ts +14 -1
  119. package/src/runtime/progress-event-coalescer.ts +43 -43
  120. package/src/runtime/prose-compressor.ts +164 -164
  121. package/src/runtime/recovery-recipes.ts +74 -74
  122. package/src/runtime/result-extractor.ts +121 -121
  123. package/src/runtime/role-permission.ts +39 -39
  124. package/src/runtime/sensitive-paths.ts +3 -3
  125. package/src/runtime/session-resources.ts +25 -25
  126. package/src/runtime/session-snapshot.ts +59 -59
  127. package/src/runtime/session-usage.ts +79 -79
  128. package/src/runtime/sidechain-output.ts +29 -29
  129. package/src/runtime/stream-preview.ts +177 -177
  130. package/src/runtime/supervisor-contact.ts +59 -59
  131. package/src/runtime/task-display.ts +38 -38
  132. package/src/runtime/task-graph.ts +207 -0
  133. package/src/runtime/task-quality.ts +207 -0
  134. package/src/runtime/task-runner/capabilities.ts +78 -78
  135. package/src/runtime/task-runner/live-executor.ts +7 -1
  136. package/src/runtime/task-runner/progress.ts +119 -119
  137. package/src/runtime/task-runner/prompt-builder.ts +1 -1
  138. package/src/runtime/task-runner/prompt-pipeline.ts +64 -64
  139. package/src/runtime/task-runner/result-utils.ts +14 -14
  140. package/src/runtime/task-runner/run-projection.ts +103 -103
  141. package/src/runtime/task-runner/state-helpers.ts +22 -22
  142. package/src/runtime/team-runner.ts +126 -7
  143. package/src/runtime/worker-heartbeat.ts +21 -21
  144. package/src/runtime/worker-startup.ts +57 -57
  145. package/src/runtime/workflow-state.ts +187 -0
  146. package/src/runtime/workspace-tree.ts +298 -298
  147. package/src/schema/config-schema.ts +12 -0
  148. package/src/schema/validation-types.ts +148 -0
  149. package/src/skills/skill-templates.ts +374 -0
  150. package/src/state/active-run-registry.ts +35 -11
  151. package/src/state/atomic-write.ts +33 -26
  152. package/src/state/contracts.ts +1 -0
  153. package/src/state/event-reconstructor.ts +217 -0
  154. package/src/state/locks.ts +2 -11
  155. package/src/state/mailbox.ts +4 -3
  156. package/src/state/state-store.ts +32 -14
  157. package/src/state/task-claims.ts +44 -44
  158. package/src/state/types.ts +9 -0
  159. package/src/state/usage.ts +29 -29
  160. package/src/subagents/async-entry.ts +1 -1
  161. package/src/subagents/index.ts +3 -3
  162. package/src/subagents/live/control.ts +1 -1
  163. package/src/subagents/live/manager.ts +1 -1
  164. package/src/subagents/live/realtime.ts +1 -1
  165. package/src/subagents/live/session-runtime.ts +1 -1
  166. package/src/subagents/manager.ts +1 -1
  167. package/src/subagents/spawn.ts +1 -1
  168. package/src/teams/team-serializer.ts +38 -38
  169. package/src/types/diff.d.ts +18 -18
  170. package/src/ui/crew-footer.ts +101 -101
  171. package/src/ui/crew-select-list.ts +111 -111
  172. package/src/ui/crew-widget.ts +9 -4
  173. package/src/ui/dashboard-panes/cancellation-pane.ts +42 -42
  174. package/src/ui/dashboard-panes/capability-pane.ts +59 -59
  175. package/src/ui/dashboard-panes/mailbox-pane.ts +35 -35
  176. package/src/ui/dashboard-panes/metrics-pane.ts +34 -34
  177. package/src/ui/dashboard-panes/progress-pane.ts +11 -0
  178. package/src/ui/dynamic-border.ts +25 -25
  179. package/src/ui/layout-primitives.ts +106 -106
  180. package/src/ui/loaders.ts +158 -158
  181. package/src/ui/powerbar-publisher.ts +6 -0
  182. package/src/ui/render-coalescer.ts +51 -51
  183. package/src/ui/render-diff.ts +119 -119
  184. package/src/ui/render-scheduler.ts +143 -143
  185. package/src/ui/run-action-dispatcher.ts +10 -1
  186. package/src/ui/spinner.ts +17 -17
  187. package/src/ui/status-colors.ts +58 -58
  188. package/src/ui/syntax-highlight.ts +116 -116
  189. package/src/ui/transcript-entries.ts +258 -258
  190. package/src/utils/completion-dedupe.ts +63 -63
  191. package/src/utils/frontmatter.ts +68 -68
  192. package/src/utils/git.ts +262 -262
  193. package/src/utils/ids.ts +17 -17
  194. package/src/utils/incremental-reader.ts +104 -104
  195. package/src/utils/names.ts +27 -27
  196. package/src/utils/redaction.ts +44 -44
  197. package/src/utils/safe-paths.ts +47 -47
  198. package/src/utils/scan-cache.ts +136 -136
  199. package/src/utils/sleep.ts +40 -26
  200. package/src/utils/task-name-generator.ts +337 -337
  201. package/src/workflows/validate-workflow.ts +40 -40
  202. package/src/worktree/branch-freshness.ts +45 -45
  203. package/src/worktree/worktree-manager.ts +11 -3
  204. package/teams/default.team.md +12 -12
  205. package/teams/fast-fix.team.md +11 -11
  206. package/teams/implementation.team.md +18 -18
  207. package/teams/parallel-research.team.md +14 -14
  208. package/teams/research.team.md +11 -11
  209. package/teams/review.team.md +12 -12
  210. package/workflows/default.workflow.md +30 -29
  211. package/workflows/fast-fix.workflow.md +23 -22
  212. package/workflows/implementation.workflow.md +43 -38
  213. package/workflows/parallel-research.workflow.md +46 -46
  214. package/workflows/research.workflow.md +22 -22
  215. package/workflows/review.workflow.md +30 -30
  216. package/docs/refactor-tasks-phase3.md +0 -394
  217. package/docs/refactor-tasks-phase4.md +0 -564
  218. package/docs/refactor-tasks-phase5.md +0 -402
  219. package/docs/refactor-tasks-phase6.md +0 -662
  220. package/docs/refactor-tasks.md +0 -1484
  221. package/docs/research/AGENT-EXECUTION-ARCHITECTURE.md +0 -261
  222. package/docs/research/AGENT-LIFECYCLE-COMPARISON.md +0 -111
  223. package/docs/research/AUDIT_OH_MY_PI.md +0 -261
  224. package/docs/research/AUDIT_PI_CREW.md +0 -457
  225. package/docs/research/CAVEMAN-DEEP-RESEARCH.md +0 -281
  226. package/docs/research/COMPARISON_OH_MY_PI_VS_PI_CREW.md +0 -264
  227. package/docs/research/DEEP-RESEARCH-PI-POWERBAR.md +0 -343
  228. package/docs/research/DEEP_RESEARCH_SUBAGENT_ARCHITECTURE.md +0 -480
  229. package/docs/research/GAP_CLOSURE_IMPLEMENTATION_PLAN.md +0 -354
  230. package/docs/research/IMPLEMENTATION_PLAN.md +0 -385
  231. package/docs/research/LIVE-SESSION-PRODUCTION-READY-PLAN.md +0 -502
  232. package/docs/research/OH-MY-PI-DEEP-RESEARCH-v14.7.6.md +0 -266
  233. package/docs/research/REMAINING-GAPS-PLAN.md +0 -363
  234. package/docs/research/SESSION-SUMMARY-2026-05-08.md +0 -146
  235. package/docs/research/UI-RESPONSIVENESS-AUDIT.md +0 -173
  236. package/docs/research-awesome-agent-skills-distillation.md +0 -100
  237. package/docs/research-extension-examples.md +0 -297
  238. package/docs/research-extension-system.md +0 -324
  239. package/docs/research-oh-my-pi-distillation.md +0 -369
  240. package/docs/research-optimization-plan.md +0 -548
  241. package/docs/research-phase10-distillation.md +0 -199
  242. package/docs/research-phase11-distillation.md +0 -201
  243. package/docs/research-phase8-operator-experience-plan.md +0 -819
  244. package/docs/research-phase9-observability-reliability-plan.md +0 -1190
  245. package/docs/research-pi-coding-agent.md +0 -357
  246. package/docs/research-source-pi-crew-reference.md +0 -174
  247. package/docs/research-ui-optimization-plan.md +0 -480
  248. package/docs/source-runtime-refactor-map.md +0 -107
  249. package/src/utils/atomic-write.ts +0 -33
@@ -1,1190 +0,0 @@
1
- # Phase 9 — Observability & Reliability (Theme B + C combined)
2
-
3
- > Path X: Phase 8 (Operator Experience) → **Phase 9 (Observability + Reliability)**. Goal: build the telemetry backbone (Counter/Gauge/Histogram + correlation ID + sink/export) while hardening run reliability (heartbeat gradient + retry + crash recovery + deadletter). The two themes are combined because of 5 critical synergies (see section 1.A).
4
-
5
- > **Prerequisite:** Phase 8 is DONE (verified: 351 unit + 44 integration tests pass, version 0.1.34). Phase 9 reuses `NotificationRouter`, `ConfirmOverlay`, `MailboxDetailOverlay/Compose/Preview/AgentPicker`, `heartbeat-aggregator.ts`, `health-pane.ts`, `diagnostic-export.ts` (with `redactSecrets` regex `/(token|key|password|secret|credential|auth)/i`), `notification-sink.ts`, `keybinding-map.ts`, and `run-action-dispatcher.ts`.
6
-
7
- > **Critical preflight finding (Phase 9.0.E):** `ExtensionAPI.events` interface is `EventBus` from `pi-coding-agent/dist/core/event-bus.d.ts`:
8
- > ```ts
9
- > interface EventBus { emit(channel, data): void; on(channel, handler): () => void; } // on() returns unsubscribe function — NO off() method
10
- > ```
11
- > → All "dispose" patterns must capture `unsubscribe` from `on()` return value, NOT call `events.off()`.
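The dispose pattern named in the preflight note can be sketched as follows. This is a minimal illustration under stated assumptions, not the pi-coding-agent implementation: `TinyBus` is a hypothetical in-memory stand-in that only mirrors the `emit`/`on` shape quoted above, and `Subscriptions` is an assumed helper name.

```typescript
// Hypothetical stand-in that mirrors the EventBus shape quoted above.
type Handler = (data: unknown) => void;

interface EventBusLike {
  emit(channel: string, data: unknown): void;
  on(channel: string, handler: Handler): () => void; // returns unsubscribe
}

class TinyBus implements EventBusLike {
  private handlers = new Map<string, Set<Handler>>();

  emit(channel: string, data: unknown): void {
    const set = this.handlers.get(channel);
    if (set) set.forEach((handler) => handler(data));
  }

  on(channel: string, handler: Handler): () => void {
    const set = this.handlers.get(channel) ?? new Set<Handler>();
    set.add(handler);
    this.handlers.set(channel, set);
    return () => {
      set.delete(handler);
    };
  }
}

// Collects unsubscribe functions so dispose() never needs events.off().
class Subscriptions {
  private unsubs: Array<() => void> = [];

  add(unsub: () => void): void {
    this.unsubs.push(unsub);
  }

  dispose(): void {
    // splice(0) empties the list, so a second dispose() is a no-op.
    for (const unsub of this.unsubs.splice(0)) unsub();
  }
}

const bus = new TinyBus();
const subs = new Subscriptions();
const seen: unknown[] = [];

subs.add(bus.on("crew.task.heartbeat_dead", (data) => {
  seen.push(data);
}));

bus.emit("crew.task.heartbeat_dead", { taskId: "t1" }); // delivered
subs.dispose();
bus.emit("crew.task.heartbeat_dead", { taskId: "t2" }); // ignored after dispose
```

Keeping the unsubscribe functions in one collection means a session shutdown hook only has to call `dispose()` once.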
12
-
13
- ## 0. Implementation Status
14
-
15
- ### Foundation (Wave 1)
16
- - [x] 9.0.A Metric primitives — Counter / Gauge / Histogram base classes (`src/observability/metrics-primitives.ts`)
17
- - [x] 9.0.B MetricRegistry **per-session instance** + naming convention (`src/observability/metric-registry.ts`)
18
- - [x] 9.0.C Correlation context — traceId/spanId propagation primitive (`src/observability/correlation.ts`)
19
- - [x] 9.0.D Heartbeat gradient classifier extension (warn/stale/dead thresholds with metrics emission, reuse `WorkerHeartbeatState` interface + `isWorkerHeartbeatStale` helper)
20
- - [x] 9.0.E **Preflight verify** ExtensionAPI surface (`events.on` returns unsubscribe fn, `events.off` does NOT exist) + cross-check `WorkerHeartbeatState` field name
21
-
22
- ### Reliability core (Wave 2)
23
- - [x] 9.1.A Background heartbeat watcher (detect stuck workers, emit `crew.heartbeat.staleness_ms` Gauge)
24
- - [x] 9.1.B Retry executor + backoff/jitter policy (`src/runtime/retry-executor.ts`)
25
- [x] 9.1.C Crash recovery: resume from event-log checkpoint
26
- - [x] 9.1.D Deadletter queue writer + threshold alerts via NotificationRouter
27
-
28
- ### Telemetry pipeline (Wave 3)
29
- - [x] 9.2.A Event-to-metric subscriber (subscribe `crew.*` events → registry counters)
30
- - [x] 9.2.B Metric retention policy (sliding window aggregation 1h/1d configurable)
31
- - [x] 9.2.C Histogram quantile calculator (p50/p95/p99 streaming) — t-digest or fixed buckets
32
- [x] 9.2.D Metric file sink (JSONL) with daily rotation (gated by `telemetry.enabled`)
33
-
34
- ### Export adapters (Wave 3 parallel)
35
- - [x] 9.3.A Prometheus exposition format adapter (HTTP endpoint optional)
36
- - [x] 9.3.B OTLP HTTP exporter (optional, opt-in)
37
- - [x] 9.3.C Adapter abstraction (plugin pattern, extensible)
38
-
39
- ### UI & commands (Wave 4)
40
- - [x] 9.4.A `team metrics` command — snapshot JSON, filter by name/runId
41
- [x] 9.4.B Metrics pane (pane index `6`) in the dashboard
42
- - [x] 9.4.C Diagnostic export (Phase 8) include metrics snapshot
43
-
44
- ### Wiring & validation (Wave 5)
45
- - [x] 9.5.A Wire register.ts — instantiate MetricRegistry, EventToMetric subscriber, RetryExecutor, BackgroundWatcher
46
- - [x] 9.5.B Tests: unit + integration + perf
47
- - [x] 9.5.C Migration guide: existing runs continue to work; opt-in for retry/recovery via config flag
48
-
49
- ## 1. Roadmap-Level Decisions
50
-
51
- ### 1.A Synergy Theme B + C — 5 critical integrations
52
-
53
- | # | Touchpoint | Theme B contributes | Theme C contributes | Combined value |
54
- |---|---|---|---|---|
55
- | **S1** | Heartbeat staleness | Gauge primitive `crew.heartbeat.staleness_ms{runId,taskId}` | Gradient classifier (healthy/warn/stale/dead) | Auto-emit metric per task → time-series → detect regression |
56
- | **S2** | Retry attempts | Histogram primitive `crew.task.retry_count{team}` | Retry executor + jitter backoff | Distribution analytics (p95 retries per team) |
57
- | **S3** | Recovery trace | `traceId`/`spanId` correlation propagation | Recovery state machine (resume from checkpoint) | Cross-component debugging: subagent crash → recovery → resume is fully traceable |
58
- | **S4** | Deadletter alert | Counter `crew.task.deadletter_total{reason}` + threshold | Deadletter writer | Auto-alert via NotificationRouter when rate > threshold |
59
- | **S5** | Performance regression | Histogram quantile p95 over time | Stale duration tracking | Automatically detect "Phase X deploy → p95 staleness +50%" |
60
-
61
- ### 1.B Decisions
62
-
63
- | # | Decision | Chosen | Rationale |
64
- |---|---|---|---|
65
- | D1 | Metric primitives: implement custom or reuse a library? | **Implement custom (minimal)**: Counter, Gauge, Histogram are only ~200 LOC | Avoids a new dependency (consistent with the Phase 7/8 zero-dep approach); the OTLP serializer is also < 200 LOC |
65
- | D2 | Histogram bucket strategy? | **Fixed exponential buckets** `[1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000]` ms | Simple, predictable; no t-digest complexity; 95% of use cases are latency in ms; users can override via config if needed |
66
- | D3 | Correlation ID format? | **`{runId}:{taskId}:{spanCounter}`** (P1 default) | Human-readable, no UUID library needed, deterministic for tests, clear scope |
67
- | D4 | Correlation ID propagation method? | **Async context (`AsyncLocalStorage`)** in the Node.js runtime | Standard Node API; no need to pass it manually through every function; minimal overhead |
68
- | D5 | Retry executor: opt-in or default-on? | **Opt-in**: `reliability.autoRetry: false` by default | High risk (touches the state machine); explicit user consent; the default preserves current behavior |
69
- | D6 | Retry policy default? | **maxAttempts=3, backoffMs=1000, jitterRatio=0.3, exponentialFactor=2** (P2) | Sensible defaults; per-task override; matches a common industry pattern |
70
- | D7 | Crash recovery: auto-resume vs prompt? | **Prompt via NotificationRouter** (P3), reusing the Phase 8 ConfirmOverlay | User confirmation for the destructive resume action; avoids false-positive replay |
71
- | D8 | Metric retention window default? | **1 hour streaming, 24 hour summary** (P4); persist daily JSONL | Cover 95% debugging; balance memory vs disk |
72
- | D9 | Background watcher polling interval? | **5 seconds** default, configurable 1-60s (P8) | Responsive without burning CPU; uses `setInterval`, not a `setTimeout` chain |
73
- | D10 | OTLP export priority? | **Implement but disable by default** (P6) | Foundation for teams that have an observability stack; off by default to avoid confusing users |
74
- | D11 | Deadletter alert threshold? | **>3 deadletter messages in 1 hour** (P7) | Conservative; avoids false positives; configurable |
75
- | D12 | Event-to-metric mapping: configurable or hardcoded? | **Hardcode core** + extensible plugin | The ~15 core events are already defined, and hardcoding keeps them consistent; plugins cover user customization |
76
- | D13 | Metric naming convention? | **`crew.{domain}.{measure}_{unit}`**: `crew.run.duration_ms`, `crew.task.retry_count`, `crew.heartbeat.staleness_ms` | Prometheus-compatible; clear domain; unit suffix avoids ambiguity |
77
- | D14 | Metric sink file location? | **`<crewRoot>/state/metrics/{YYYY-MM-DD}.jsonl`** | Consistent with the Phase 8 notification sink pattern; daily rotation; configurable retention |
78
- | D15 | Recovery checkpoint format? | **Event-log cursor** (existing `events.jsonl.seq` + `sequencePath()`/`scanSequence()` helpers) | Reuses the infrastructure that already exists from Phase 6; adds no new checkpoint format |
79
- | D16 | Histogram quantile algorithm? | **Fixed buckets + linear interpolation** (P5) | Simple; sufficient for p50/p95/p99 with fixed buckets; t-digest deferred to Phase 10 if needed |
80
- | **D17** | **MetricRegistry lifecycle** | **Per-session instance** (consistent with Phase 8 `notificationRouter`/`heartbeatAggregator`): instantiate in `session_start`, `dispose()` in `session_shutdown` | Cumulative metrics across sessions are not needed in Phase 9 (deferred to Phase 10 if users ask); natural test isolation; no global state leak; clear dispose semantics |
81
- | **D18** | **Event subscription cleanup** | **Capture the unsubscribe fn from the `events.on()` return value**; do NOT call `events.off()` (it does not exist on the `EventBus` interface) | API surface verified in preflight (9.0.E); the pattern matches existing usages in the codebase (`src/ui/render-scheduler.ts`) |
82
- | **D19** | **Retry state machine semantics** | **A task transitions to `failed` only when maxAttempts is exhausted**; add the field `task.attempts: Array<{startedAt,endedAt,error?}>` for traceability; final artifact only on the terminal attempt | Avoids a terminal-state monotonicity violation (re-running a `failed` task back to `running`); full audit trail for debugging |
83
- | **D20** | **Crash recovery trigger combinator** | Recovery only triggers if `(status==="running") AND (no async.pid OR async.pid is dead via existing liveness check) AND (heartbeat dead via isWorkerHeartbeatStale > deadMs OR no heartbeat)` | Avoids falsely marking a healthy async run as interrupted; reuses the Phase 6/7 async.pid liveness check in `session-summary.ts` |
84
- | **D21** | **Diagnostic schema versioning** | `DiagnosticReport.schemaVersion: 2` when adding the `metricsSnapshot?: MetricSnapshot[]` field; apply `redactSecrets()` recursively to `metricsSnapshot` (label values may contain secret patterns) | Backward compatibility for consumers reading the old format (schemaVersion missing → treat as v1); secret leak prevention |
85
- | **D22** | **Deadletter trigger separation** | 3 paths: (a) `executeWithRetry` exhausted → write entry; (b) heartbeat watcher dead for 3 consecutive ticks → write entry; (c) Counter rate > 3/hour → NotificationRouter alert | Writing an entry and raising a threshold alert are 2 separate pieces of logic; avoids conflating them in the implementation |
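To make the D6 defaults concrete, here is a small sketch of how a retry delay could be derived from them. The plan does not spell out the formula, so the exponential growth per attempt and the symmetric jitter window are assumptions; `RetryPolicy`, `DEFAULT_POLICY`, and `retryDelayMs` are hypothetical names.

```typescript
interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  jitterRatio: number;
  exponentialFactor: number;
}

// Defaults taken from decision D6.
const DEFAULT_POLICY: RetryPolicy = {
  maxAttempts: 3,
  backoffMs: 1000,
  jitterRatio: 0.3,
  exponentialFactor: 2,
};

// Delay to wait after the given failed attempt (1-based).
// Assumption: base delay grows as backoffMs * factor^(attempt - 1), and jitter
// is drawn uniformly from [-jitterRatio, +jitterRatio) around that base.
function retryDelayMs(
  attempt: number,
  policy: RetryPolicy = DEFAULT_POLICY,
  random: () => number = Math.random,
): number {
  const base = policy.backoffMs * Math.pow(policy.exponentialFactor, attempt - 1);
  const jitter = (random() * 2 - 1) * policy.jitterRatio;
  return Math.round(base * (1 + jitter));
}

// With a fixed random source of 0.5 the jitter term is zero,
// so the delays are exactly the base values.
const noJitter = [1, 2, 3].map((attempt) => retryDelayMs(attempt, DEFAULT_POLICY, () => 0.5));
console.log(noJitter);
```

Passing the random source as a parameter keeps the function deterministic in tests, in line with the plan's preference for deterministic primitives.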
87
-
88
- ## 2. Phase Breakdown
89
-
90
- ### Phase 9.0 — Foundation (3.5 dev-days)
91
-
92
- #### 9.0.A Metric primitives (1 dev-day)
93
-
94
- **New file:** `src/observability/metrics-primitives.ts`
95
-
96
- ```ts
97
- export interface MetricLabels {
98
- [key: string]: string | number;
99
- }
100
-
101
- export abstract class Metric {
102
- constructor(public readonly name: string, public readonly description: string) {}
103
- abstract snapshot(): MetricSnapshot;
104
- }
105
-
106
- export class Counter extends Metric {
107
- private values = new Map<string, number>(); // labelKey → count
108
- inc(labels: MetricLabels = {}, delta = 1): void { /* ... */ }
109
- snapshot(): MetricSnapshot { return { type: "counter", name: this.name, values: [...this.values.entries()] }; }
110
- }
111
-
112
- export class Gauge extends Metric {
113
- private values = new Map<string, number>();
114
- set(labels: MetricLabels, value: number): void { /* ... */ }
115
- add(labels: MetricLabels, delta: number): void { /* ... */ }
116
- snapshot(): MetricSnapshot { /* ... */ }
117
- }
118
-
119
- export class Histogram extends Metric {
120
- private buckets: number[]; // upper bounds, e.g. [1, 5, 10, 25, ...]
121
- private observations = new Map<string, { counts: number[]; sum: number; count: number }>();
122
- constructor(name: string, description: string, buckets?: number[]) {
123
- super(name, description);
124
- this.buckets = buckets ?? [1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000];
125
- }
126
- observe(labels: MetricLabels, value: number): void { /* ... */ }
127
- quantile(labels: MetricLabels, q: number): number { /* linear interpolation */ }
128
- snapshot(): MetricSnapshot { /* ... */ }
129
- }
130
-
131
- export interface MetricSnapshot {
132
- type: "counter" | "gauge" | "histogram";
133
- name: string;
134
- values: unknown;
135
- }
136
- ```
137
-
138
- **Tests:** `test/unit/metrics-primitives.test.ts` — 12 cases (counter inc/labels, gauge set/add/labels, histogram observe/quantile p50/p95/p99/edge empty/edge single value).
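The `quantile` body is elided in the skeleton above. As an illustration of "fixed buckets + linear interpolation" (D16), the following standalone sketch works on a plain counts array rather than the `Histogram` class. The overflow bucket, the lower bound of 0 for the first bucket, and returning `NaN` for an empty histogram are assumptions, and `emptyCounts`, `observe`, and `quantile` are hypothetical free functions.

```typescript
// Bucket upper bounds in ms (D2). One extra overflow slot catches larger values.
const BUCKETS = [1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000];

const emptyCounts = (): number[] => new Array<number>(BUCKETS.length + 1).fill(0);

// counts[i] holds observations in (BUCKETS[i - 1], BUCKETS[i]];
// counts[BUCKETS.length] is the overflow bucket.
function observe(counts: number[], value: number): void {
  let index = BUCKETS.findIndex((upper) => value <= upper);
  if (index === -1) index = BUCKETS.length;
  counts[index] += 1;
}

// Estimates the q-quantile (0 <= q <= 1) by locating the bucket that contains
// the target rank and interpolating linearly between its bounds.
function quantile(counts: number[], q: number): number {
  const total = counts.reduce((sum, n) => sum + n, 0);
  if (total === 0) return NaN; // assumption: no data means no estimate
  const rank = q * total;
  let cumulative = 0;
  for (let i = 0; i < counts.length; i++) {
    const inBucket = counts[i];
    if (inBucket > 0 && cumulative + inBucket >= rank) {
      // Overflow bucket has no upper bound, so clamp to the highest bound.
      if (i === BUCKETS.length) return BUCKETS[BUCKETS.length - 1];
      const lower = i === 0 ? 0 : BUCKETS[i - 1];
      const upper = BUCKETS[i];
      return lower + (upper - lower) * ((rank - cumulative) / inBucket);
    }
    cumulative += inBucket;
  }
  return BUCKETS[BUCKETS.length - 1];
}

const counts = emptyCounts();
for (const latencyMs of [3, 3, 3, 3]) observe(counts, latencyMs);
const p50 = quantile(counts, 0.5); // all samples fall in the (2, 5] bucket
```

The estimate is only as precise as the bucket width: four samples of 3 ms yield a p50 halfway through the (2, 5] bucket, not 3 ms.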
139
-
140
- #### 9.0.B MetricRegistry (0.75 dev-day) — **Per-session instance (D17)**
141
-
142
- **New file:** `src/observability/metric-registry.ts`
143
-
144
- ```ts
145
- export class MetricRegistry {
146
- private metrics = new Map<string, Metric>();
147
- registerCounter(name: string, description: string): Counter { /* ... */ }
148
- registerGauge(name: string, description: string): Gauge { /* ... */ }
149
- registerHistogram(name: string, description: string, buckets?: number[]): Histogram { /* ... */ }
150
- get(name: string): Metric | undefined { return this.metrics.get(name); }
151
- snapshot(): MetricSnapshot[] { return [...this.metrics.values()].map((m) => m.snapshot()); }
152
- dispose(): void { this.metrics.clear(); }
153
- }
154
-
155
- // Per-session factory: the caller (register.ts) instantiates in session_start and disposes in session_shutdown.
156
- // Do NOT use the singleton pattern (see D17): avoids cross-session state leaks and ensures test isolation.
157
- export function createMetricRegistry(): MetricRegistry { return new MetricRegistry(); }
158
- ```
159
-
160
- **Naming convention enforcement (D13):** `name` must match the regex `^crew\.[a-z]+\.[a-z][a-z_]*$` (simpler than the old regex `^crew\.[a-z_]+\.[a-z_]+(_[a-z]+)?$`, which was redundant). The unit suffix is part of the measure name (e.g., `duration_ms`, `staleness_ms`). Throws if the name does not match.
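A minimal sketch of the naming check, using the regex quoted above. `assertValidMetricName` is a hypothetical helper name and the error message wording is an assumption; the plan only states that registration throws on a mismatch.

```typescript
// Regex from the naming convention: crew.<domain>.<measure with optional _unit>.
const METRIC_NAME_PATTERN = /^crew\.[a-z]+\.[a-z][a-z_]*$/;

function assertValidMetricName(name: string): void {
  if (!METRIC_NAME_PATTERN.test(name)) {
    throw new Error(`Invalid metric name "${name}": expected crew.{domain}.{measure}_{unit}`);
  }
}

// Names used elsewhere in this plan pass the check.
assertValidMetricName("crew.run.duration_ms");
assertValidMetricName("crew.task.retry_count");
assertValidMetricName("crew.heartbeat.staleness_ms");

// Rejected: missing prefix, uppercase domain, underscore in the domain segment.
const rejected = ["run.duration_ms", "crew.Run.duration_ms", "crew.heart_beat.staleness_ms"]
  .filter((name) => !METRIC_NAME_PATTERN.test(name));
```

Note that the domain segment allows lowercase letters only, so underscores are valid in the measure segment but not in the domain.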
161
-
162
- **Tests:** `test/unit/metric-registry.test.ts` — 6 cases (register, duplicate throws, snapshot all, naming validation, dispose clears state, get returns undefined after dispose).
163
-
164
- #### 9.0.C Correlation context (1 dev-day)
165
-
166
- **New file:** `src/observability/correlation.ts`
167
-
168
- ```ts
169
- import { AsyncLocalStorage } from "node:async_hooks";
170
-
171
- export interface CorrelationContext {
172
- traceId: string; // {runId}:{taskId}:{spanCounter}
173
- parentSpanId?: string;
174
- spanId: string;
175
- }
176
-
177
- const storage = new AsyncLocalStorage<CorrelationContext>();
178
- let spanCounter = 0;
179
-
180
- export function withCorrelation<T>(ctx: CorrelationContext, fn: () => T): T {
181
- return storage.run(ctx, fn);
182
- }
183
-
184
- export function getCurrentContext(): CorrelationContext | undefined {
185
- return storage.getStore();
186
- }
187
-
188
- export function newSpanId(runId: string, taskId?: string): string {
189
- spanCounter++;
190
- return `${runId}:${taskId ?? "main"}:${spanCounter}`;
191
- }
192
-
193
- // Wrap event emission to inject correlation
194
- export function correlatedEvent<T extends { runId?: string; data?: Record<string, unknown> }>(event: T): T {
195
- const ctx = getCurrentContext();
196
- if (!ctx) return event;
197
- return { ...event, data: { ...event.data, traceId: ctx.traceId, spanId: ctx.spanId, parentSpanId: ctx.parentSpanId } };
198
- }
199
- ```
200
-
201
- **Wire into `register.ts`** in the `pi.events.emit` wrapper: all `crew.*` events automatically inject correlation when a context is active. Foreground/async runs wrap the whole `executeTeamRun` call in `withCorrelation({traceId, spanId: newSpanId(runId)})`.
202
-
203
- **Tests:** `test/unit/correlation.test.ts` — 5 cases (basic propagation, nested span, missing context graceful, async boundary preserve, parallel runs isolated).
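The "nested span" and "missing context graceful" cases can be illustrated with Node's `AsyncLocalStorage` directly. This is a sketch under assumptions rather than the module above: `withChildSpan` and `currentSpanLabel` are hypothetical helpers, and the ID strings are made-up values in the `{runId}:{taskId}:{spanCounter}` format.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface Ctx {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
}

const storage = new AsyncLocalStorage<Ctx>();

// Runs fn in a child span of whatever context is currently active.
function withChildSpan<T>(spanId: string, fn: () => T): T {
  const parent = storage.getStore();
  if (!parent) throw new Error("no active correlation context");
  return storage.run({ traceId: parent.traceId, spanId, parentSpanId: parent.spanId }, fn);
}

// Describes the active span as "<spanId><-<parentSpanId|root>", or "none".
function currentSpanLabel(): string {
  const ctx = storage.getStore();
  return ctx ? `${ctx.spanId}<-${ctx.parentSpanId ?? "root"}` : "none";
}

const observed: string[] = [];

storage.run({ traceId: "run1:main:1", spanId: "run1:main:1" }, () => {
  observed.push(currentSpanLabel()); // outer span
  withChildSpan("run1:task-a:2", () => {
    observed.push(currentSpanLabel()); // nested span, parent is the outer span
  });
  observed.push(currentSpanLabel()); // outer span is restored after the child exits
});
observed.push(currentSpanLabel()); // outside any context
```

Because `run()` scopes the store to its callback, the outer context comes back automatically once the nested call returns, and code outside any `run()` sees no context at all.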
204
-
205
- #### 9.0.D Heartbeat gradient classifier (0.75 dev-day)
206
-
207
- **New file:** `src/runtime/heartbeat-gradient.ts`
208
-
209
- ```ts
210
- import type { WorkerHeartbeatState } from "./worker-heartbeat.ts"; // Phase 6/7 file — actual interface name (NOT "WorkerHeartbeat")
211
-
212
- export type HeartbeatLevel = "healthy" | "warn" | "stale" | "dead";
213
-
214
- export interface GradientThresholds {
215
- warnMs: number; // default 30_000 (30s)
216
- staleMs: number; // default 60_000 (1min)
217
- deadMs: number; // default 300_000 (5min)
218
- }
219
-
220
- export const DEFAULT_GRADIENT_THRESHOLDS: GradientThresholds = { warnMs: 30_000, staleMs: 60_000, deadMs: 300_000 };
221
-
222
- export function classifyHeartbeat(heartbeat: WorkerHeartbeatState | undefined, thresholds: GradientThresholds = DEFAULT_GRADIENT_THRESHOLDS, now = Date.now()): HeartbeatLevel {
223
- if (!heartbeat) return "dead";
224
- if (heartbeat.alive === false) return "dead";
225
- const lastSeen = Date.parse(heartbeat.lastSeenAt);
226
- if (!Number.isFinite(lastSeen)) return "dead";
227
- const elapsed = now - lastSeen;
228
- if (elapsed >= thresholds.deadMs) return "dead";
229
- if (elapsed >= thresholds.staleMs) return "stale";
230
- if (elapsed >= thresholds.warnMs) return "warn";
231
- return "healthy";
232
- }
233
- ```
234
-
235
- **Update `src/ui/heartbeat-aggregator.ts`** (Phase 8 file, 1612 bytes, existence verified). Backward-compat strategy:
236
- - Keep the existing API surface `summarizeHeartbeats(snapshot, opts)` returning `HeartbeatSummary` (the Phase 8 caller `health-pane.ts` does not break).
237
- - Internal classification switches to `classifyHeartbeat`; map the 4 levels (healthy/warn/stale/dead) to the existing 3-bucket counts (`healthy`/`stale`/`dead`; the `warn` count merges into `healthy` to preserve Phase 8 semantics).
238
- - Optional new field `summary.gradient: { healthy, warn, stale, dead }` for Phase 9 consumers (metrics-pane).
239
- - Emit metrics when a `registry` param is passed in (optional, does not break the Phase 8 caller):
240
- - `metrics.gauge("crew.heartbeat.staleness_ms").set({runId, taskId}, elapsed)`
241
- - `metrics.counter("crew.heartbeat.level_total").inc({runId, level})`
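The 4-level to 3-bucket mapping described in the list above might look like the sketch below. It is not the actual `summarizeHeartbeats` code: `countLevels`, `GradientCounts`, and `LegacyCounts` are hypothetical names, and only the folding of `warn` into `healthy` comes from the plan.

```typescript
type HeartbeatLevel = "healthy" | "warn" | "stale" | "dead";

interface GradientCounts {
  healthy: number;
  warn: number;
  stale: number;
  dead: number;
}

interface LegacyCounts {
  healthy: number;
  stale: number;
  dead: number;
}

// Counts levels once, then derives the Phase 8 three-bucket view from the
// four-level gradient so both views always agree.
function countLevels(levels: HeartbeatLevel[]): { legacy: LegacyCounts; gradient: GradientCounts } {
  const gradient: GradientCounts = { healthy: 0, warn: 0, stale: 0, dead: 0 };
  for (const level of levels) gradient[level] += 1;
  const legacy: LegacyCounts = {
    healthy: gradient.healthy + gradient.warn, // warn folds into healthy
    stale: gradient.stale,
    dead: gradient.dead,
  };
  return { legacy, gradient };
}

const summary = countLevels(["healthy", "warn", "warn", "stale", "dead"]);
```

Deriving the legacy counts from the gradient, instead of counting twice, keeps the totals of both views equal by construction.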
242
-
243
- **Tests:** `test/unit/heartbeat-gradient.test.ts` — 8 cases (healthy/warn/stale/dead/missing/explicit-dead/edge-now/custom-thresholds + invalid date string returns dead).
244
-
245
- #### 9.0.E Preflight ExtensionAPI surface verify (0.5 dev-day) — **NEW**
246
-
247
- **Goal:** Before Wave 2 wires `events?.on?.()` callbacks, confirm with an automated test:
248
-
249
- **New file:** `test/unit/extension-api-surface.test.ts`, which verifies the contract:
250
- 1. `pi.events.on(channel, handler)` returns function (unsubscribe).
251
- 2. Calling unsubscribe stops handler invocation on subsequent emit.
252
- 3. Multiple `on()` handlers registered on the same channel are all invoked.
253
- 4. Confirm `events.off` does not exist (typeof check); fail fast if Pi upstream changes the API.
254
- 5. Verify `WorkerHeartbeatState` interface fields exist (`workerId`, `lastSeenAt`, `alive?`) — guard against rename.
255
-
256
- **Output:** Block Wave 2 if the test fails. Document the result in the PR description.
257
-
258
- **Tests:** the content of the 9.0.E file itself (5 cases).
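The first four contract checks listed above could look roughly like this. It is a sketch under stated assumptions: `FakeEventBus` is a hypothetical stand-in for the Pi event bus (the real test would run against `pi.events`), and `checkEventsContract` is an illustrative helper, not part of the plan.

```typescript
type Handler = (data: unknown) => void;

// Hypothetical stand-in for the Pi EventBus, used only to illustrate the contract.
class FakeEventBus {
  private handlers: Record<string, Handler[]> = {};
  on(channel: string, handler: Handler): () => void {
    const list = this.handlers[channel] ?? (this.handlers[channel] = []);
    list.push(handler);
    // The unsubscribe function removes exactly this handler.
    return () => {
      const idx = list.indexOf(handler);
      if (idx >= 0) list.splice(idx, 1);
    };
  }
  emit(channel: string, data: unknown): void {
    // Copy first so unsubscribing during emit does not skip handlers.
    for (const handler of (this.handlers[channel] ?? []).slice()) handler(data);
  }
}

// Returns a list of contract violations; an empty list means the contract holds.
function checkEventsContract(events: FakeEventBus): string[] {
  const failures: string[] = [];
  let a = 0;
  let b = 0;
  const unsubA = events.on("crew.test", () => { a += 1; });
  events.on("crew.test", () => { b += 1; });
  if (typeof unsubA !== "function") failures.push("on() must return an unsubscribe function");
  events.emit("crew.test", {});
  if (a !== 1 || b !== 1) failures.push("all handlers on a channel must be invoked");
  unsubA();
  events.emit("crew.test", {});
  if (a !== 1) failures.push("unsubscribe must stop further invocations");
  if (b !== 2) failures.push("other handlers must keep receiving events");
  if (typeof (events as unknown as { off?: unknown }).off !== "undefined") {
    failures.push("events.off is not expected to exist");
  }
  return failures;
}

const contractFailures = checkEventsContract(new FakeEventBus());
```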
259
-
260
- ---
261
-
262
- ### Phase 9.1 — Reliability Core (5 dev-days)
263
-
264
- #### 9.1.A Background heartbeat watcher (1.5 dev-days)
265
-
266
- **New file:** `src/runtime/heartbeat-watcher.ts`
267
-
268
- **Logic:** Set up `setInterval(5000ms)` (D9) in session_start; on each tick, read all active runs from `manifestCache.list(50)`, load tasks via `loadRunManifestById(cwd, runId).tasks`, and classify each task heartbeat:
269
- - `dead` detected for the first time → emit `crew.task.heartbeat_dead` event + Counter `crew.heartbeat.dead_total{runId}` inc + NotificationRouter alert (severity warning, dedup id `dead_${runId}_${taskId}`).
270
- - `dead` for 3 consecutive ticks → trigger the deadletter writer (see 9.1.D path b, D22).
271
-
272
- **Skeleton:**
273
-
274
- ```ts
275
- import { loadRunManifestById } from "../state/state-store.ts";
276
- import type { WorkerHeartbeatState } from "./worker-heartbeat.ts"; // actual interface name
277
- import { classifyHeartbeat, DEFAULT_GRADIENT_THRESHOLDS, type GradientThresholds, type HeartbeatLevel } from "./heartbeat-gradient.ts";
278
-
279
- export class HeartbeatWatcher {
280
- private timer?: ReturnType<typeof setInterval>;
281
- private lastLevel = new Map<string, HeartbeatLevel>(); // `${runId}:${taskId}` → previous level
282
- private consecutiveDead = new Map<string, number>(); // `${runId}:${taskId}` → consecutive dead tick count
283
- constructor(
284
- private opts: {
285
- cwd: string;
286
- pollIntervalMs?: number;
287
- thresholds?: GradientThresholds;
288
- manifestCache: ManifestCache;
289
- registry: MetricRegistry;
290
- router: NotificationRouter;
291
- deadletterTickThreshold?: number; // default 3 (D22 path b)
292
- onDead?: (runId: string, taskId: string, elapsed: number) => void;
293
- onDeadletterTrigger?: (runId: string, taskId: string) => void;
294
- }
295
- ) {}
296
- start(): void {
297
- this.timer = setInterval(() => this.tick(), this.opts.pollIntervalMs ?? 5000);
298
- }
299
- private tick(): void {
300
- const thresholds = this.opts.thresholds ?? DEFAULT_GRADIENT_THRESHOLDS;
301
- const tickThreshold = this.opts.deadletterTickThreshold ?? 3;
302
- for (const run of this.opts.manifestCache.list(50)) {
303
- if (run.status !== "running") continue;
304
- const loaded = loadRunManifestById(this.opts.cwd, run.runId);
305
- if (!loaded) continue;
306
- for (const task of loaded.tasks) {
307
- if (task.status !== "running" && task.status !== "queued") continue;
308
- const key = `${run.runId}:${task.id}`;
309
- const level = classifyHeartbeat(task.heartbeat, thresholds);
310
- const prev = this.lastLevel.get(key);
311
- this.lastLevel.set(key, level);
312
- if (level === "dead" && prev !== "dead") {
313
- this.opts.router.enqueue({ id: `dead_${run.runId}_${task.id}`, severity: "warning", source: "heartbeat-watcher", runId: run.runId, title: `Task ${task.id} heartbeat dead`, body: "Background watcher detected stuck worker." });
314
- this.opts.registry.get("crew.heartbeat.dead_total")?.inc({ runId: run.runId });
315
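- this.opts.onDead?.(run.runId, task.id, task.heartbeat ? Date.now() - Date.parse(task.heartbeat.lastSeenAt) : 0); // elapsed ms since last heartbeat (0 if none recorded)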
316
- }
317
- if (level === "dead") {
318
- const count = (this.consecutiveDead.get(key) ?? 0) + 1;
319
- this.consecutiveDead.set(key, count);
320
- if (count === tickThreshold) this.opts.onDeadletterTrigger?.(run.runId, task.id);
321
- } else this.consecutiveDead.delete(key);
322
- }
323
- }
324
- }
325
- dispose(): void {
326
- if (this.timer) clearInterval(this.timer);
327
- this.timer = undefined;
328
- this.lastLevel.clear();
329
- this.consecutiveDead.clear();
330
- }
331
- }
332
- ```
333
-
334
- **Tests:** `test/unit/heartbeat-watcher.test.ts` — 7 cases (start/dispose, dead detection alert once, transition healthy→dead emits once, transition dead→healthy resets, multiple runs isolated, mock clock, consecutive 3 ticks → deadletter trigger).
335
-
336
- #### 9.1.B Retry executor (1.5 dev-days)
337
-
338
- **New file:** `src/runtime/retry-executor.ts`
339
-
340
- ```ts
341
- export interface RetryPolicy {
342
- maxAttempts: number; // default 3 (D6)
343
- backoffMs: number; // default 1000
344
- jitterRatio: number; // default 0.3 (±30%)
345
- exponentialFactor: number; // default 2
346
- retryableErrors?: string[]; // glob patterns; empty = all retryable
347
- }
348
-
349
- export const DEFAULT_RETRY_POLICY: RetryPolicy = { maxAttempts: 3, backoffMs: 1000, jitterRatio: 0.3, exponentialFactor: 2 };
350
-
351
- export async function executeWithRetry<T>(
352
- fn: (attempt: number) => Promise<T>,
353
- policy: RetryPolicy = DEFAULT_RETRY_POLICY,
354
- hooks?: { onAttemptFailed?: (attempt: number, error: Error, nextDelayMs: number) => void; onRetryGivenUp?: (attempts: number, error: Error) => void; signal?: AbortSignal }
355
- ): Promise<T> { /* exponential backoff with jitter */ }
356
-
357
- function calculateDelay(attempt: number, policy: RetryPolicy): number {
358
- const base = policy.backoffMs * Math.pow(policy.exponentialFactor, attempt - 1);
359
- const jitter = (Math.random() * 2 - 1) * policy.jitterRatio * base;
360
- return Math.max(0, base + jitter);
361
- }
362
- ```
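The elided body of `executeWithRetry` might be filled in along these lines. This is a sketch, not the plan's implementation: the `sleep` hook and the `random` parameter are hypothetical test seams, `signal` is typed structurally to keep the block self-contained, and the `retryableErrors` glob filter is left out.

```typescript
interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  jitterRatio: number;
  exponentialFactor: number;
}

interface RetryHooks {
  onAttemptFailed?: (attempt: number, error: Error, nextDelayMs: number) => void;
  onRetryGivenUp?: (attempts: number, error: Error) => void;
  signal?: { aborted: boolean };
  sleep?: (ms: number) => Promise<void>; // hypothetical seam so tests can skip real waiting
}

// Same formula as the plan; `random` is injectable only to make the jitter testable.
function calculateDelay(attempt: number, policy: RetryPolicy, random: () => number = Math.random): number {
  const base = policy.backoffMs * Math.pow(policy.exponentialFactor, attempt - 1);
  const jitter = (random() * 2 - 1) * policy.jitterRatio * base;
  return Math.max(0, base + jitter);
}

async function executeWithRetry<T>(
  fn: (attempt: number) => Promise<T>,
  policy: RetryPolicy,
  hooks: RetryHooks = {},
): Promise<T> {
  const sleep = hooks.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError = new Error("executeWithRetry: no attempt was made");
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    if (hooks.signal?.aborted) throw new Error("executeWithRetry: aborted");
    try {
      return await fn(attempt);
    } catch (e) {
      lastError = e instanceof Error ? e : new Error(String(e));
      if (attempt === policy.maxAttempts) break; // no backoff after the final attempt
      const delay = calculateDelay(attempt, policy);
      hooks.onAttemptFailed?.(attempt, lastError, delay);
      await sleep(delay);
    }
  }
  hooks.onRetryGivenUp?.(policy.maxAttempts, lastError);
  throw lastError;
}

const noJitter: RetryPolicy = { maxAttempts: 3, backoffMs: 1000, jitterRatio: 0, exponentialFactor: 2 };
const firstDelay = calculateDelay(1, noJitter); // 1000
const thirdDelay = calculateDelay(3, noJitter); // 4000
```

With the default policy (`backoffMs: 1000`, `exponentialFactor: 2`, `jitterRatio: 0.3`), the delays before attempts 2 and 3 fall in the ranges 700 to 1300 ms and 1400 to 2600 ms respectively.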
363
-
364
- **Wire into `executeTeamRun`**, opt-in (D5 + D19 state-machine semantics):
365
- - Read `loadConfig(cwd).config.reliability?.autoRetry` (default `false`, D5).
366
- - If true → wrap `runTeamTask(task)` with `executeWithRetry`.
367
- - **State machine rules (D19):**
368
- - Each attempt → push an entry `{ startedAt, endedAt, error? }` onto `task.attempts: Array<...>` (new field; additive schema change).
369
- - A task must NOT transition `running → failed → running` between attempts (that violates monotonicity); instead, when attempt N fails, wait for the backoff, then run attempt N+1 with `status="running"` unchanged while only `attempts[]` grows.
370
- - A task transitions to `failed` ONLY when maxAttempts is exhausted; `task.error` reflects the last error; the final artifact is finalized only on the terminal attempt (not overwritten per attempt).
371
- - Idempotency requirement (risk Med-High): document in the release notes that `runTeamTask` must be idempotent, or the user accepts the double-execute risk.
372
- - Each attempt → emit `crew.task.retry_attempt{runId,taskId,attempt}` Counter, `crew.task.retry_delay_ms{runId,taskId}` Histogram observe.
373
- - At the end → record `crew.task.retry_count{runId,team}` Histogram observe (final attempt count).
374
-
375
- **Schema update `src/schema/config-schema.ts`:**
376
- ```ts
377
- reliability: Type.Optional(Type.Object({
378
- autoRetry: Type.Optional(Type.Boolean()), // default false
379
- retryPolicy: Type.Optional(Type.Object({
380
- maxAttempts: Type.Optional(Type.Integer({ minimum: 1, maximum: 10 })),
381
- backoffMs: Type.Optional(Type.Integer({ minimum: 100, maximum: 60000 })),
382
- jitterRatio: Type.Optional(Type.Number({ minimum: 0, maximum: 1 })),
383
- exponentialFactor: Type.Optional(Type.Number({ minimum: 1, maximum: 5 })),
384
- retryableErrors: Type.Optional(Type.Array(Type.String())),
385
- })),
386
- autoRecover: Type.Optional(Type.Boolean()), // default false
387
- deadletterThreshold: Type.Optional(Type.Integer({ minimum: 1 })), // default 3
388
- })),
389
- ```
390
-
391
- **Tests:** `test/unit/retry-executor.test.ts` — 10 cases (success first try, fail then succeed, max attempts exhausted, abort signal, jitter range, retryable filter, custom policy override, mock clock backoff, hook callback fires).
392
-
393
- #### 9.1.C Crash recovery (1.5 dev-days)
394
-
395
- **New file:** `src/runtime/crash-recovery.ts`
396
-
397
- **Logic:** session_start detects runs left in status `running` by a previous session and **triggers recovery only if the combinator (D20) holds:**
398
- - `(manifest.status === "running")`
399
- - AND `(manifest.async?.pid === undefined OR pidIsDead(manifest.async.pid))`: reuse the existing async.pid liveness check in `src/extension/session-summary.ts`
400
- - AND `(no heartbeat OR isWorkerHeartbeatStale(heartbeat, deadMs) === true)`: reuse `isWorkerHeartbeatStale()` from `src/runtime/worker-heartbeat.ts`
401
-
402
- When triggered:
403
- 1. Read the event-log cursor via `scanSequence(eventsPath)` from `src/state/event-log.ts` (Phase 6 helper) to find the seq of the last completed event.
404
- 2. Compute "stale work":
405
- - Tasks `running` but with a dead heartbeat → mark `pending-recovery`.
406
- - Tasks `completed`/`cancelled`/`failed` → preserve.
407
- 3. NotificationRouter prompt: `"Run X was interrupted. Resume from event N? (Y/N)"` (D7) via the Phase 8 ConfirmOverlay.
408
- 4. User confirm → reset stale tasks to `queued`, write a resume event with metadata `{ recoveredFromSeq: N }`, emit `crew.run.resumed{runId, fromEventSeq}`.
409
- 5. User decline → mark the run `cancelled` with reason `"interrupted-not-resumed"`.
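The D20 combinator and the stale-work partition from step 2 can be sketched as pure functions. All names here (`RunLike`, `TaskLike`, `needsRecovery`, `partitionTasks`) are hypothetical, the staleness check is a simplified stand-in for `isWorkerHeartbeatStale()`, and `pidIsDead` is passed in so the block stays self-contained.

```typescript
interface HeartbeatLike { lastSeenAt: string; alive?: boolean }
interface RunLike { status: string; pid?: number; heartbeat?: HeartbeatLike }
interface TaskLike { id: string; status: string; heartbeatDead: boolean }

// Simplified staleness check; the plan reuses isWorkerHeartbeatStale() instead.
function isStale(heartbeat: HeartbeatLike, deadMs: number, now: number): boolean {
  if (heartbeat.alive === false) return true;
  const lastSeen = Date.parse(heartbeat.lastSeenAt);
  return !isFinite(lastSeen) || now - lastSeen >= deadMs;
}

// All three conditions must hold before recovery is offered.
function needsRecovery(run: RunLike, pidIsDead: (pid: number) => boolean, deadMs: number, now: number): boolean {
  const statusRunning = run.status === "running";
  const processGone = run.pid === undefined || pidIsDead(run.pid);
  const heartbeatGone = run.heartbeat === undefined || isStale(run.heartbeat, deadMs, now);
  return statusRunning && processGone && heartbeatGone;
}

// Running tasks with a dead heartbeat are resumable; terminal tasks are preserved as-is.
function partitionTasks(tasks: TaskLike[]): { resumableTasks: string[]; preservedTasks: string[] } {
  const resumableTasks: string[] = [];
  const preservedTasks: string[] = [];
  for (const task of tasks) {
    if (task.status === "running" && task.heartbeatDead) resumableTasks.push(task.id);
    else if (task.status === "completed" || task.status === "cancelled" || task.status === "failed") preservedTasks.push(task.id);
  }
  return { resumableTasks, preservedTasks };
}

const now = Date.parse("2024-01-01T00:10:00Z");
const interrupted: RunLike = { status: "running", pid: 123, heartbeat: { lastSeenAt: "2024-01-01T00:00:00Z" } };
const shouldRecover = needsRecovery(interrupted, () => true, 300_000, now);
const plan = partitionTasks([
  { id: "t1", status: "running", heartbeatDead: true },
  { id: "t2", status: "completed", heartbeatDead: false },
  { id: "t3", status: "failed", heartbeatDead: false },
]);
```

Requiring all three signals together avoids offering recovery for a run whose worker process is still alive but has simply not written a heartbeat recently.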
410
-
411
- **Skeleton:**
412
-
413
- ```ts
414
- export interface RecoveryPlan {
415
- runId: string;
416
- resumableTasks: string[]; // taskIds to reset to queued
417
- preservedTasks: string[]; // taskIds completed/cancelled (no change)
418
- lastEventSeq: number;
419
- }
420
-
421
- export function detectInterruptedRuns(cwd: string, manifestCache: ManifestCache): RecoveryPlan[] { /* ... */ }
422
- export async function applyRecoveryPlan(plan: RecoveryPlan, ctx: ExtensionContext, registry: MetricRegistry): Promise<void> { /* ... */ }
423
- ```
424
-
425
- **Wire into `register.ts:session_start`:**
426
- ```ts
427
- if (loadedConfig.config.reliability?.autoRecover === true) {
428
- const plans = detectInterruptedRuns(ctx.cwd, manifestCache);
429
- for (const plan of plans) {
430
- // Use NotificationRouter + ConfirmOverlay prompt
431
- notificationRouter.enqueue({
432
- severity: "warning",
433
- source: "crash-recovery",
434
- runId: plan.runId,
435
- title: `Run ${plan.runId} was interrupted`,
436
- body: `${plan.resumableTasks.length} tasks pending recovery. Open dashboard → confirm to resume.`,
437
- id: `recovery_prompt_${plan.runId}`,
438
- });
439
- }
440
- }
441
- ```
442
-
443
- **Tests:** `test/integration/crash-recovery.test.ts` — 5 cases (no interrupted runs, single run resume, decline marks cancelled, multiple runs, completed tasks preserved).
444
-
445
- #### 9.1.D Deadletter queue (0.5 dev-day)
446
-
447
- **New file:** `src/runtime/deadletter.ts`
448
-
449
- **Logic (D22 — 3 separate trigger paths):**
450
- - **Path (a), retry exhaustion:** in the `executeWithRetry` hook `onRetryGivenUp(attempts, error)` → call `appendDeadletter({ reason: "max-retries", attempts, lastError })`.
451
- - **Path (b), heartbeat watcher consecutive dead:** `HeartbeatWatcher.onDeadletterTrigger(runId, taskId)` (count = 3 consecutive ticks; see 9.1.A) → call `appendDeadletter({ reason: "heartbeat-dead", attempts: 0 })`.
452
- - **Path (c), threshold alert (separate from entry write):** Counter `crew.task.deadletter_total` rate > 3/hour (TimeWindowedCounter from 9.2.B) → NotificationRouter alert severity `error` with id `deadletter_threshold_${runId}` (dedup window 1h).
453
-
454
- Paths (a) and (b) both:
455
- 1. Append vào `<crewRoot>/state/runs/{runId}/deadletter.jsonl`.
456
- 2. Emit `crew.task.deadletter_total{runId,taskId,reason}` Counter inc.
457
-
458
- ```ts
459
- export interface DeadletterEntry {
460
- taskId: string;
461
- runId: string;
462
- reason: "max-retries" | "heartbeat-dead" | "manual";
463
- attempts: number;
464
- lastError?: string;
465
- timestamp: string;
466
- }
467
-
468
- export function appendDeadletter(manifest: TeamRunManifest, entry: DeadletterEntry): void { /* JSONL append */ }
469
- export function readDeadletter(manifest: TeamRunManifest): DeadletterEntry[] { /* read all */ }
470
- ```
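The JSONL append and read could be sketched as below. This is an assumption-laden illustration: `appendDeadletterTo` and `readDeadletterFrom` are hypothetical variants that take the per-run directory directly, whereas the plan's functions take a `TeamRunManifest` and resolve the path from it.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

interface DeadletterEntry {
  taskId: string;
  runId: string;
  reason: "max-retries" | "heartbeat-dead" | "manual";
  attempts: number;
  lastError?: string;
  timestamp: string;
}

// Append one entry as a single JSON line; the directory is created on first write.
function appendDeadletterTo(runDir: string, entry: DeadletterEntry): void {
  fs.mkdirSync(runDir, { recursive: true });
  fs.appendFileSync(path.join(runDir, "deadletter.jsonl"), JSON.stringify(entry) + "\n", "utf-8");
}

// Read all entries back; a missing file simply means no deadletters yet.
function readDeadletterFrom(runDir: string): DeadletterEntry[] {
  const file = path.join(runDir, "deadletter.jsonl");
  if (!fs.existsSync(file)) return [];
  return fs
    .readFileSync(file, "utf-8")
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as DeadletterEntry);
}

// Demo against a throwaway temp directory.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "deadletter-demo-"));
appendDeadletterTo(demoDir, { taskId: "t1", runId: "r1", reason: "max-retries", attempts: 3, lastError: "boom", timestamp: new Date().toISOString() });
appendDeadletterTo(demoDir, { taskId: "t2", runId: "r1", reason: "heartbeat-dead", attempts: 0, timestamp: new Date().toISOString() });
const demoEntries = readDeadletterFrom(demoDir);
```

One JSON object per line keeps appends cheap and lets a later session read the file without any index, which is what the cross-session persistence test case relies on.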
471
-
472
- **Tests:** `test/unit/deadletter.test.ts` — 4 cases (append, read, threshold trigger, persistence cross-session).
473
-
474
- ---
475
-
476
- ### Phase 9.2 — Telemetry Pipeline (4 dev-days)
477
-
478
- #### 9.2.A Event-to-metric subscriber (1 dev-day)
479
-
480
- **New file:** `src/observability/event-to-metric.ts`
481
-
482
- **Hardcoded mapping (D12):**
483
-
484
- ```ts
485
- export function wireEventToMetrics(events: ExtensionAPI["events"], registry: MetricRegistry): { dispose: () => void } {
486
- // Counters
487
- const runCount = registry.registerCounter("crew.run.count", "Total runs by status");
488
- const taskCount = registry.registerCounter("crew.task.count", "Total tasks by status");
489
- const subagentCount = registry.registerCounter("crew.subagent.count", "Total subagent records by status");
490
- const mailboxCount = registry.registerCounter("crew.mailbox.count", "Total mailbox messages by direction");
491
- const deadletterCount = registry.registerCounter("crew.task.deadletter_total", "Deadletter triggers by reason");
492
-
493
- // Gauges
494
- const heartbeatStaleness = registry.registerGauge("crew.heartbeat.staleness_ms", "Heartbeat elapsed since last seen, milliseconds");
495
-
496
- // Histograms
497
- const runDuration = registry.registerHistogram("crew.run.duration_ms", "Run end-to-end duration, milliseconds");
498
- const taskDuration = registry.registerHistogram("crew.task.duration_ms", "Task duration, milliseconds");
499
- const retryCount = registry.registerHistogram("crew.task.retry_count", "Retries per task", [0, 1, 2, 3, 5, 10]);
500
- const tokenUsage = registry.registerHistogram("crew.task.tokens_total", "Token usage per task");
501
-
502
- const handlers: Array<[string, (data: any) => void]> = [
503
- ["crew.run.completed", (d) => { runCount.inc({ status: "completed" }); runDuration.observe({ team: d.team ?? "unknown" }, d.durationMs ?? 0); }],
504
- ["crew.run.failed", (d) => { runCount.inc({ status: "failed" }); }],
505
- ["crew.run.cancelled", (d) => { runCount.inc({ status: "cancelled" }); }],
506
- ["crew.subagent.completed", (d) => { subagentCount.inc({ status: d.status }); }],
507
- ["crew.mailbox.message", (d) => { mailboxCount.inc({ direction: d.direction }); }],
508
- // ... etc
509
- ];
510
-
511
- // D18: events.on() returns unsubscribe fn (EventBus interface). NO events.off() exists.
512
- const unsubscribers: Array<() => void> = [];
513
- for (const [event, handler] of handlers) {
514
- const unsub = events?.on?.(event, handler);
515
- if (unsub) unsubscribers.push(unsub);
516
- }
517
- return { dispose: () => { for (const unsub of unsubscribers) unsub(); unsubscribers.length = 0; } };
518
- }
519
- ```
520
-
521
- **Tests:** `test/unit/event-to-metric.test.ts`, 8 cases (each event handler increments the correct metric, dispose calls each unsubscribe fn, no-op if events is undefined, dispose is idempotent (calling it twice does not crash), multiple parallel subscribers are isolated, a handler exception does not break other handlers via the EventBus safe wrapper).
522
-
523
- #### 9.2.B Metric retention (1 dev-day)
524
-
525
- **New file:** `src/observability/metric-retention.ts`
526
-
527
- **Logic:** Streaming window of 1h (D8): each metric value carries a timestamp; periodically (every 60s), values older than the window are purged. Daily summary aggregation rolls up into the persistent JSONL (9.2.D).
528
-
529
- ```ts
530
- export class TimeWindowedCounter {
531
- private events: { timestamp: number; labels: MetricLabels; delta: number }[] = [];
532
- constructor(private windowMs: number = 3_600_000) {}
533
- inc(labels: MetricLabels, delta = 1): void { /* push, then prune */ }
534
- rate(labels: MetricLabels, durationMs: number): number { /* count events in last durationMs / durationMs */ }
535
- }
536
- ```
537
-
538
- **Wire MetricRegistry:** option `retentionMs` per metric: default 1h for counter rates; gauges keep only the latest value (no retention); histogram observations are all retained (memory bounded by label cardinality).
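The elided `inc` and `rate` bodies might be implemented roughly as follows. This is a sketch, assuming an injectable clock (the `now` constructor parameter is a hypothetical test seam) and a simple sorted-key `labelKey` helper; the plan's own `labelKey` may differ.

```typescript
type MetricLabels = Record<string, string>;

// Stable key for a label set, independent of property order.
function labelKey(labels: MetricLabels): string {
  return Object.keys(labels).sort().map((k) => `${k}=${labels[k]}`).join(",");
}

class TimeWindowedCounter {
  private events: { timestamp: number; key: string; delta: number }[] = [];
  constructor(
    private windowMs: number = 3_600_000,
    private now: () => number = () => Date.now(), // hypothetical seam for mock clocks
  ) {}

  inc(labels: MetricLabels, delta = 1): void {
    this.events.push({ timestamp: this.now(), key: labelKey(labels), delta });
    this.prune();
  }

  // Sum of deltas for `labels` within the last `durationMs`, divided by `durationMs` (events per ms).
  rate(labels: MetricLabels, durationMs: number): number {
    const cutoff = this.now() - durationMs;
    const key = labelKey(labels);
    let total = 0;
    for (const e of this.events) {
      if (e.key === key && e.timestamp > cutoff) total += e.delta;
    }
    return total / durationMs;
  }

  size(): number {
    return this.events.length;
  }

  // Drop everything older than the retention window.
  private prune(): void {
    const cutoff = this.now() - this.windowMs;
    this.events = this.events.filter((e) => e.timestamp > cutoff);
  }
}

let fakeNow = 0;
const counter = new TimeWindowedCounter(3_600_000, () => fakeNow);
counter.inc({ runId: "r1" }); // t = 0, later pruned
fakeNow = 1_800_000;
counter.inc({ runId: "r1" });
counter.inc({ runId: "r2" });
fakeNow = 3_700_000;
counter.inc({ runId: "r1" }); // prune cutoff is now t = 100_000
const perHour = counter.rate({ runId: "r1" }, 3_600_000) * 3_600_000;
```

Multiplying the per-millisecond rate by `3_600_000` gives the per-hour figure that the deadletter threshold in 9.1.D path (c) compares against.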
539
-
540
- **Tests:** `test/unit/metric-retention.test.ts` — 5 cases (retain within window, prune outside, rate calculation, multiple labels isolated, mock clock).
541
-
542
- #### 9.2.C Histogram quantile (1 dev-day)
543
-
544
- **Update `metrics-primitives.ts`:** add a `quantile()` method:
545
-
546
- ```ts
547
- quantile(labels: MetricLabels, q: number): number {
548
- const obs = this.observations.get(labelKey(labels));
549
- if (!obs || obs.count === 0) return NaN;
550
- const targetIdx = q * obs.count;
551
- let cumulative = 0;
552
- for (let i = 0; i < this.buckets.length; i++) {
553
- cumulative += obs.counts[i];
554
- if (cumulative >= targetIdx) {
555
- const prevCum = cumulative - obs.counts[i];
556
- const lower = i === 0 ? 0 : this.buckets[i - 1];
557
- const upper = this.buckets[i];
558
- // Linear interpolation within bucket
559
- const fraction = (targetIdx - prevCum) / Math.max(1, obs.counts[i]);
560
- return lower + fraction * (upper - lower);
561
- }
562
- }
563
- return this.buckets[this.buckets.length - 1]; // overflow bucket
564
- }
565
- ```
566
-
567
- **Tests:** extend `test/unit/metrics-primitives.test.ts` with quantile p50/p95/p99 against fixture data; edge cases: empty, single value, all observations in one bucket.
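A worked example makes the interpolation concrete. The standalone function below mirrors the `quantile()` logic above but takes the bucket bounds and per-bucket counts directly; `bucketQuantile` and the fixture values are illustrative assumptions, not part of the plan.

```typescript
// Standalone version of the interpolation: `counts[i]` is the number of
// observations that fell into the bucket whose upper bound is `buckets[i]`.
function bucketQuantile(buckets: number[], counts: number[], q: number): number {
  const total = counts.reduce((sum, c) => sum + c, 0);
  if (total === 0) return NaN;
  const target = q * total;
  let cumulative = 0;
  for (let i = 0; i < buckets.length; i++) {
    cumulative += counts[i];
    if (cumulative >= target) {
      const prevCum = cumulative - counts[i];
      const lower = i === 0 ? 0 : buckets[i - 1];
      // Linear interpolation between the bucket's lower and upper bound.
      const fraction = (target - prevCum) / Math.max(1, counts[i]);
      return lower + fraction * (buckets[i] - lower);
    }
  }
  return buckets[buckets.length - 1];
}

// 10 observations: 2 in (0, 10], 6 in (10, 50], 2 in (50, 100].
const p50 = bucketQuantile([10, 50, 100], [2, 6, 2], 0.5);  // target 5 → 10 + (3/6) * 40 = 30
const p95 = bucketQuantile([10, 50, 100], [2, 6, 2], 0.95); // target 9.5 → 50 + (1.5/2) * 50 = 87.5
```

The result is an estimate: it assumes observations are spread evenly inside each bucket, so accuracy depends on how well the bucket bounds match the data.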
568
-
569
- #### 9.2.D Metric file sink (1 dev-day)
570
-
571
- **New file:** `src/observability/metric-sink.ts`
572
-
573
- **Logic:** Similar to the Phase 8 `notification-sink.ts`: daily JSONL rotation with configurable retention. The sink writer runs on an interval (default 60s) → snapshot registry → append. Reuse `redactSecrets` from `diagnostic-export.ts` for label values (a precaution against secret patterns).
574
-
575
- ```ts
576
- import * as fs from "node:fs";
- import * as path from "node:path";
- import { redactSecrets } from "../runtime/diagnostic-export.ts"; // Phase 8 helper
577
- import { logInternalError } from "../utils/internal-error.ts";
578
-
579
- export interface MetricSink {
580
- writeSnapshot(snapshots: MetricSnapshot[]): void;
581
- dispose(): void;
582
- }
583
-
584
- export interface MetricFileSinkOptions {
585
- crewRoot: string;
586
- registry: MetricRegistry;
587
- retentionDays?: number; // default 7
588
- intervalMs?: number; // default 60_000
589
- }
590
-
591
- export function createMetricFileSink(opts: MetricFileSinkOptions): MetricSink {
592
- const dir = path.join(opts.crewRoot, "state", "metrics");
593
- const retentionDays = opts.retentionDays ?? 7;
594
- const writeSnapshot = (snapshots: MetricSnapshot[]): void => {
595
- try {
596
- const date = new Date().toISOString().slice(0, 10);
597
- rotateOldFiles(dir, retentionDays);
598
- fs.mkdirSync(dir, { recursive: true });
599
- const redacted = redactSecrets(snapshots);
600
- fs.appendFileSync(path.join(dir, `${date}.jsonl`), `${JSON.stringify({ exportedAt: new Date().toISOString(), snapshots: redacted })}\n`, "utf-8");
601
- } catch (e) { logInternalError("metric-sink.write", e); }
602
- };
603
- const timer = setInterval(() => writeSnapshot(opts.registry.snapshot()), opts.intervalMs ?? 60_000);
604
- return { writeSnapshot, dispose: () => clearInterval(timer) };
605
- }
606
- ```
607
-
608
- **Tests:** `test/unit/metric-sink.test.ts` — 5 cases (write basic, daily rotation, retention prune, telemetry disabled no-op when not instantiated, dispose stops timer + secret redaction in labels).
609
-
610
- ---
611
-
612
- ### Phase 9.3 — Export Adapters (3 dev-days)
613
-
614
- #### 9.3.A Prometheus exposition format (1 dev-day)
615
-
616
- **New file:** `src/observability/exporters/prometheus-exporter.ts`
617
-
618
- ```ts
619
- export function formatPrometheus(snapshots: MetricSnapshot[]): string {
620
- const lines: string[] = [];
621
- for (const snap of snapshots) {
622
- lines.push(`# HELP ${snap.name} ${snap.description ?? ""}`);
623
- lines.push(`# TYPE ${snap.name} ${snap.type}`);
624
- // Format values per type with labels: name{label="value"} value timestamp
625
- // ...
626
- }
627
- return lines.join("\n") + "\n";
628
- }
629
- ```
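The elided per-value formatting could be sketched as below for counters and gauges. Treat this as an illustration: `SimpleSnapshot` is a hypothetical shape (the real `MetricSnapshot` may differ), histograms are omitted, and HELP-text escaping is not handled. The sketch rewrites dots to underscores because the classic Prometheus text format restricts metric names to `[a-zA-Z_:][a-zA-Z0-9_:]*`, which dotted names such as `crew.run.count` do not satisfy.

```typescript
type MetricLabels = Record<string, string>;

// Hypothetical snapshot shape used only for this sketch.
interface SimpleSnapshot {
  name: string;
  description?: string;
  type: "counter" | "gauge";
  values: { labels: MetricLabels; value: number }[];
}

function sanitizeMetricName(name: string): string {
  return name.replace(/[^a-zA-Z0-9_:]/g, "_");
}

function sanitizeLabelName(name: string): string {
  return name.replace(/[^a-zA-Z0-9_]/g, "_");
}

// Label values escape backslash, double quote, and newline.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

function formatLabels(labels: MetricLabels): string {
  const keys = Object.keys(labels).sort();
  if (keys.length === 0) return "";
  return "{" + keys.map((k) => `${sanitizeLabelName(k)}="${escapeLabelValue(labels[k])}"`).join(",") + "}";
}

function formatPrometheusSimple(snapshots: SimpleSnapshot[]): string {
  const lines: string[] = [];
  for (const snap of snapshots) {
    const name = sanitizeMetricName(snap.name);
    lines.push(`# HELP ${name} ${snap.description ?? ""}`);
    lines.push(`# TYPE ${name} ${snap.type}`);
    for (const v of snap.values) lines.push(`${name}${formatLabels(v.labels)} ${v.value}`);
  }
  return lines.length > 0 ? lines.join("\n") + "\n" : "";
}

const exposition = formatPrometheusSimple([
  {
    name: "crew.run.count",
    description: "Total runs by status",
    type: "counter",
    values: [{ labels: { status: "completed" }, value: 1 }],
  },
]);
```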
630
-
631
- **Optional HTTP endpoint:** `team metrics --serve --port 9091` command starts simple `http.createServer` exposing `/metrics` endpoint. Off by default.
632
-
633
- **Tests:** `test/unit/prometheus-exporter.test.ts` — 6 cases (counter format, gauge format, histogram format with buckets, labels escaping, empty registry, special chars).
634
-
635
- #### 9.3.B OTLP HTTP exporter (1.5 dev-days, OPTIONAL, disabled by default per D10)
636
-
637
- **New file:** `src/observability/exporters/otlp-exporter.ts`
638
-
639
- **Logic:** Convert MetricSnapshot → OTLP JSON format (HTTP/protobuf as an alternative); POST to the configured endpoint. Buffer in 60s batches.
640
-
641
- ```ts
642
- export interface OTLPExporterOptions {
643
- endpoint: string; // e.g. http://collector:4318/v1/metrics
644
- headers?: Record<string, string>;
645
- intervalMs?: number; // default 60_000
646
- timeoutMs?: number; // default 10_000
647
- }
648
-
649
- export class OTLPExporter {
650
- constructor(private opts: OTLPExporterOptions, private registry: MetricRegistry) {}
651
- start(): void { /* setInterval push */ }
652
- private async push(): Promise<void> {
653
- const otlp = convertToOTLP(this.registry.snapshot());
654
- try {
655
- await fetch(this.opts.endpoint, { method: "POST", headers: { "content-type": "application/json", ...this.opts.headers }, body: JSON.stringify(otlp), signal: AbortSignal.timeout(this.opts.timeoutMs ?? 10_000) });
656
- } catch (e) { logInternalError("otlp-export", e); }
657
- }
658
- dispose(): void { /* clearInterval */ }
659
- }
660
-
661
- function convertToOTLP(snapshots: MetricSnapshot[]): unknown { /* OpenTelemetry JSON spec */ }
662
- ```
663
-
664
- **Schema config:**
665
- ```ts
666
- otlp: Type.Optional(Type.Object({
667
- enabled: Type.Optional(Type.Boolean()),
668
- endpoint: Type.String(),
669
- headers: Type.Optional(Type.Record(Type.String(), Type.String())),
670
- intervalMs: Type.Optional(Type.Integer({ minimum: 5000 })),
671
- })),
672
- ```
673
-
674
- **Tests:** `test/unit/otlp-exporter.test.ts` — 5 cases (format conversion, push success mock fetch, push timeout, dispose stops, disabled no-op).
675
-
676
- #### 9.3.C Adapter abstraction (0.5 dev-day)
677
-
678
- **New file:** `src/observability/exporters/adapter.ts`
679
-
680
- ```ts
681
- export interface MetricExporter {
682
- name: string;
683
- push(snapshots: MetricSnapshot[]): Promise<void>;
684
- dispose(): void;
685
- }
686
-
687
- export class CompositeExporter implements MetricExporter {
688
- name = "composite";
689
- constructor(private exporters: MetricExporter[]) {}
690
- async push(snapshots: MetricSnapshot[]): Promise<void> {
691
- await Promise.allSettled(this.exporters.map((e) => e.push(snapshots)));
692
- }
693
- dispose(): void { for (const e of this.exporters) e.dispose(); }
694
- }
695
- ```
696
-
697
- **Tests:** `test/unit/composite-exporter.test.ts` — 3 cases (push parallel, dispose all, error in one doesn't break others).
698
-
699
- ---
700
-
701
- ### Phase 9.4 — UI & Commands (3 dev-days)
702
-
703
- #### 9.4.A `team metrics` command (1 dev-day)
704
-
705
- **Update `src/extension/team-tool/api.ts`:** add the operation `metrics-snapshot`:
706
-
707
- ```ts
708
- if (operation === "metrics-snapshot") {
709
- const filter = typeof cfg.filter === "string" ? cfg.filter : undefined; // glob pattern
710
- const snapshots = getMetricRegistry().snapshot();
711
- const filtered = filter ? snapshots.filter((s) => globMatch(s.name, filter)) : snapshots;
712
- return result(JSON.stringify(filtered, null, 2), { action: "api", status: "ok" });
713
- }
714
- ```
715
-
716
- **Slash command:** `/team-metrics [filter]` → wraps API call, prints formatted output.
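The snippet above calls `globMatch`, which the plan does not define. A minimal sketch, assuming only `*` is treated as a wildcard (matching any run of characters, including dots) and every other character is literal:

```typescript
// Hypothetical helper: convert a glob with `*` wildcards into an anchored RegExp.
function globMatch(name: string, pattern: string): boolean {
  const escaped = pattern
    .replace(/[.+^${}()|[\]\\?]/g, "\\$&") // escape regex metacharacters other than `*`
    .replace(/\*/g, ".*");                 // `*` matches any run of characters
  return new RegExp(`^${escaped}$`).test(name);
}

const taskOnly = ["crew.run.count", "crew.task.count", "crew.task.duration_ms"].filter((n) =>
  globMatch(n, "crew.task.*"),
);
```

Escaping the dot matters here: without it, `crew.task.*` would also match names such as `crewXtaskXcount`.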
717
-
718
- **Tests:** `test/unit/team-tool-metrics.test.ts` — 3 cases (snapshot all, filter glob, empty registry).
719
-
720
- #### 9.4.B Metrics dashboard pane (1 dev-day)
721
-
722
- **New file:** `src/ui/dashboard-panes/metrics-pane.ts`
723
-
724
- **Render:** top 10 metrics by value, plus a sparkline for the histogram p95 trend (last 60min stored in the retention store).
725
-
726
- ```ts
727
- export interface MetricsPaneOptions {
728
- registry: MetricRegistry;
729
- maxCounters?: number; // default 10
730
- }
731
-
732
- // Signature consistent with the Phase 8 panes: `(snapshot, opts?)`
733
- export function renderMetricsPane(snapshot: RunUiSnapshot | undefined, opts: MetricsPaneOptions): string[] {
734
- if (!snapshot) return ["Metrics pane: snapshot unavailable"];
735
- const metrics = opts.registry.snapshot();
736
- const counters = metrics.filter((m) => m.type === "counter").slice(0, opts.maxCounters ?? 10);
737
- const lines: string[] = ["Metrics top 10 counters:"];
738
- for (const c of counters) {
739
- // Format: name{labels}: value
740
- // ...
741
- }
742
- return lines;
743
- }
744
- ```
745
-
746
- **Update `src/ui/run-dashboard.ts`:** key `6` → `activePane = "metrics"`; help line update; constructor receives the `registry` reference via `RunDashboardOptions`.
747
-
748
- **Tests:** `test/unit/metrics-pane.test.ts` — 4 cases.
749
-
750
- #### 9.4.C Diagnostic export include metrics (0.5 dev-day) — **Schema version bump (D21)**
751
-
752
- **Update `src/runtime/diagnostic-export.ts`** (Phase 8 file, 4303 bytes — verified):
753
-
754
- ```ts
755
- // Schema additive — backward-compat for consumers reading old DiagnosticReport
756
- export interface DiagnosticReport {
757
- schemaVersion?: number; // NEW v2 — undefined treated as v1
758
- runId: string;
759
- exportedAt: string;
760
- manifest: TeamRunManifest;
761
- tasks: TeamTaskState[];
762
- recentEvents: TeamEvent[];
763
- heartbeat: HeartbeatSummary;
764
- agents: unknown[];
765
- envRedacted: Record<string, string>;
766
- metricsSnapshot?: MetricSnapshot[]; // NEW — optional, only set when registry available
767
- }
768
-
769
- // In exportDiagnostic(): apply redactSecrets() recursive on metricsSnapshot label values
770
- // before writing, since secret patterns (token/key/password/secret/credential/auth) can appear
771
- // in label values or histogram metadata.
772
- ```
773
-
774
- **Caller (commands.ts handler):** pass the per-session `MetricRegistry` reference into `exportDiagnostic(ctx, runId, { registry })`. If the registry is undefined (telemetry disabled or Phase 9 not yet wired), the `metricsSnapshot` field stays undefined → backward-compatible with Phase 8 consumers.
775
-
776
- **Tests:** `test/unit/diagnostic-export.test.ts` extend — 2 cases:
777
- 1. Verify `metricsSnapshot` is included when a registry is passed; `schemaVersion === 2`.
778
- 2. Verify secret labels redacted (e.g., metric `crew.api.key_calls{auth_token="abc"}` → `auth_token: "***"`).
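The second test case can be illustrated with a key-name-based redaction sketch. This is an assumption: the behavior of the real `redactSecrets` helper is not shown in the plan, and `redactLabels` here only masks values whose label name matches one of the listed secret patterns.

```typescript
type MetricLabels = Record<string, string>;

// Patterns taken from the list above; matching is on the label name, case-insensitive.
const SECRET_KEY_PATTERN = /token|key|password|secret|credential|auth/i;

// Return a copy of the labels with secret-looking values masked.
function redactLabels(labels: MetricLabels): MetricLabels {
  const out: MetricLabels = {};
  for (const name of Object.keys(labels)) {
    out[name] = SECRET_KEY_PATTERN.test(name) ? "***" : labels[name];
  }
  return out;
}

const original = { auth_token: "abc", runId: "r1" };
const redacted = redactLabels(original);
```

Returning a copy rather than mutating in place keeps the live registry untouched, so redaction applies only to what is written out.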
779
-
780
- ---
781
-
782
- ### Phase 9.5 — Wiring & Tests (3 dev-days)
783
-
784
- #### 9.5.A Wire register.ts (1 dev-day) — **Per-session pattern (D17)**
785
-
786
- **Update `src/extension/register.ts`:**
787
- ```ts
788
- import { createMetricRegistry } from "../observability/metric-registry.ts"; // factory, not singleton
789
- import { wireEventToMetrics } from "../observability/event-to-metric.ts";
790
- import { HeartbeatWatcher } from "../runtime/heartbeat-watcher.ts";
791
- import { detectInterruptedRuns } from "../runtime/crash-recovery.ts";
792
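- import { createMetricFileSink } from "../observability/metric-sink.ts";
- import { appendDeadletter } from "../runtime/deadletter.ts"; // used by onDeadletterTrigger below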
793
-
794
- // Module-scope state for the session (consistent with the Phase 8 notificationRouter pattern):
795
- let metricRegistry: MetricRegistry | undefined;
796
- let eventMetricSub: { dispose: () => void } | undefined;
797
- let metricSink: MetricSink | undefined;
798
- let heartbeatWatcher: HeartbeatWatcher | undefined;
799
-
800
- const configureObservability = (ctx: ExtensionContext): void => {
801
- // Dispose existing per-session resources first (idempotent)
802
- heartbeatWatcher?.dispose();
803
- metricSink?.dispose();
804
- eventMetricSub?.dispose();
805
- metricRegistry?.dispose();
806
-
807
- const config = loadConfig(ctx.cwd).config;
808
- if (config.observability?.enabled === false) {
809
- metricRegistry = undefined; eventMetricSub = undefined; metricSink = undefined; heartbeatWatcher = undefined;
810
- return;
811
- }
812
-
813
- metricRegistry = createMetricRegistry();
814
- eventMetricSub = wireEventToMetrics(pi.events, metricRegistry);
815
- if (config.telemetry?.enabled !== false) {
816
- metricSink = createMetricFileSink({ crewRoot: projectCrewRoot(ctx.cwd), registry: metricRegistry, retentionDays: config.observability?.metricRetentionDays ?? 7 });
817
- }
818
- heartbeatWatcher = new HeartbeatWatcher({
819
- cwd: ctx.cwd,
820
- pollIntervalMs: config.observability?.pollIntervalMs ?? 5000,
821
- manifestCache: getManifestCache(ctx.cwd),
822
- registry: metricRegistry,
823
- router: notificationRouter!, // Phase 8 router required
824
- onDeadletterTrigger: (runId, taskId) => {
825
- // Path (b) D22 — call deadletter writer
826
- const loaded = loadRunManifestById(ctx.cwd, runId); // the run may have been removed since the tick
- if (loaded) appendDeadletter(loaded.manifest, { taskId, runId, reason: "heartbeat-dead", attempts: 0, timestamp: new Date().toISOString() });
827
- },
828
- });
829
- heartbeatWatcher.start();
830
-
831
- if (config.reliability?.autoRecover === true) {
832
- const plans = detectInterruptedRuns(ctx.cwd, getManifestCache(ctx.cwd));
833
- for (const plan of plans) {
834
- notificationRouter?.enqueue({ id: `recovery_prompt_${plan.runId}`, severity: "warning", source: "crash-recovery", runId: plan.runId, title: `Run ${plan.runId} was interrupted`, body: `${plan.resumableTasks.length} tasks pending recovery. Open dashboard → confirm to resume.` });
835
- }
836
- }
837
- };
838
-
839
- // session_start hook:
840
- pi.on("session_start", (ctx) => {
841
- currentCtx = ctx;
842
- configureNotifications(ctx); // Phase 8
843
- configureObservability(ctx); // Phase 9 NEW
844
- // ... rest
845
- });
846
-
847
- // session_shutdown hook (extends Phase 8 cleanupRuntime):
848
- pi.on("session_shutdown", () => {
849
- // Phase 9 cleanup (per-session, in reverse setup order)
850
- heartbeatWatcher?.dispose(); heartbeatWatcher = undefined;
851
- metricSink?.dispose(); metricSink = undefined;
852
- eventMetricSub?.dispose(); eventMetricSub = undefined;
853
- metricRegistry?.dispose(); metricRegistry = undefined;
854
- // Phase 8 cleanup
855
- notificationRouter?.dispose();
856
- notificationSink?.dispose();
857
- // ...
858
- });
859
- ```
860
-
861
- **Wrap executeTeamRun with correlation (9.0.C):**
862
- ```ts
863
- const traceId = newSpanId(runId); // {runId}:main:1 from spanCounter
864
- await withCorrelation({ traceId, spanId: traceId }, async () => {
865
- await executeTeamRun(...);
866
- });
867
- ```
868
-
869
- **Pass `registry` reference downstream:**
870
- - `metricRegistry` is exposed via the `RegisterTeamCommandsDeps` interface (commands.ts) for the dashboard pane + diagnostic export.
871
- - `dispatchDiagnosticExport(ctx, runId, { registry: metricRegistry })` so that 9.4.C can inject the metrics snapshot.
872
-
873
- #### 9.5.B Tests + smoke (2 dev-days)
874
-
875
- **Unit (new, ~90 cases):**
876
- - metrics-primitives.test.ts (12)
877
- - metric-registry.test.ts (6)
878
- - correlation.test.ts (5)
879
- - heartbeat-gradient.test.ts (8)
880
- - heartbeat-watcher.test.ts (7)
881
- - retry-executor.test.ts (10)
882
- - deadletter.test.ts (4)
883
- - event-to-metric.test.ts (8)
884
- - metric-retention.test.ts (5)
885
- - metric-sink.test.ts (5)
886
- - prometheus-exporter.test.ts (6)
887
- - otlp-exporter.test.ts (5)
888
- - composite-exporter.test.ts (3)
889
- - team-tool-metrics.test.ts (3)
890
- - metrics-pane.test.ts (4)
891
-
892
- **Integration (new, ~7 cases):**
893
- - `crash-recovery.test.ts` — 5 sub-cases.
894
- - `retry-executor-roundtrip.test.ts` — task fail 2x, succeed 3rd → metric counter records 3 attempts.
895
- - `heartbeat-watcher-deadletter.test.ts` — 3 dead detections in 1h → deadletter triggered + alert.
896
- - `metric-pipeline-end-to-end.test.ts` — emit events → snapshot via team-metrics → values match.
897
- - `correlation-cross-component.test.ts` — start run → subagent spawn → mailbox event — all events share traceId.
898
- - `prometheus-export.test.ts` — start run, fetch /metrics endpoint, verify format.
899
- - `otlp-export-mock.test.ts` — mock collector, verify POST body schema.
900
-
901
- **Smoke manual (10 scenarios):**
- 1. Run team, finish → `/team-metrics` shows `crew.run.count{status=completed}=1`.
- 2. Filter: `/team-metrics crew.task.*` shows only task metrics.
- 3. Set `reliability.autoRetry=true`, fail task 2x → metric `retry_count` shows 3 attempts.
- 4. Kill foreground process mid-run → reopen session → confirm prompt → resume → tasks continue.
- 5. Set `reliability.autoRecover=false` → kill process → reopen → no prompt → run cancelled.
- 6. Heartbeat stuck > 5min → notification toast → metric `heartbeat.dead_total` inc.
- 7. Trigger 4 deadletter messages → alert toast severity error.
- 8. `<crewRoot>/state/metrics/{date}.jsonl` populated after 60s.
- 9. `/team-metrics` filter on Counter histogram quantile p95.
- 10. OTLP export enabled with mock collector → verify push every 60s.
-
- ## 3. Wave Organization
-
- ```
- Wave 1 (sequential, 4 days): foundation must come first
-   └─ 9.0 (.A → .B → .C → .D → .E preflight)
-
- Wave 2 (parallel, 5 days): depends on Wave 1
-   ├─ 9.1.A Heartbeat watcher
-   ├─ 9.1.B Retry executor
-   └─ 9.1.D Deadletter (depends on 9.1.B + 9.1.A)
-       ⤷ 9.1.C Crash recovery (depends on 9.0.C correlation)
-
- Wave 3 (parallel, 4 days): depends on Wave 1
-   ├─ 9.2.A Event-to-metric subscriber
-   ├─ 9.2.B Metric retention
-   ├─ 9.2.C Histogram quantile (extends 9.0.A)
-   └─ 9.2.D Metric sink
-
- Wave 4 (parallel, 3 days): depends on Wave 3
-   ├─ 9.3.A Prometheus exporter
-   ├─ 9.3.B OTLP exporter (optional)
-   ├─ 9.3.C Adapter abstraction
-   └─ 9.4.A team metrics command
-       ⤷ 9.4.B Metrics dashboard pane
-       ⤷ 9.4.C Diagnostic include metrics
-
- Wave 5 (sequential, 3 days)
-   ├─ 9.5.A Wire register.ts
-   └─ 9.5.B Tests + smoke validation
- ```
-
- **Total estimate: 19.5-22.5 dev-days** (Themes B+C combined; Wave 1 includes +0.5d for the 9.0.E preflight).
-
- ## 4. Files Affected
-
- ### New (33 files; +1 for the 9.0.E preflight test)
- | Path | Purpose | Est LOC |
- |---|---|---|
- | `src/observability/metrics-primitives.ts` | Counter/Gauge/Histogram base | ~200 |
- | `src/observability/metric-registry.ts` | Singleton registry | ~120 |
- | `src/observability/correlation.ts` | AsyncLocalStorage context | ~80 |
- | `src/observability/event-to-metric.ts` | Event subscriber → metrics | ~150 |
- | `src/observability/metric-retention.ts` | Time-windowed counter | ~80 |
- | `src/observability/metric-sink.ts` | JSONL sink + rotation | ~100 |
- | `src/observability/exporters/prometheus-exporter.ts` | Prometheus format | ~120 |
- | `src/observability/exporters/otlp-exporter.ts` | OTLP HTTP exporter (optional) | ~180 |
- | `src/observability/exporters/adapter.ts` | Composite + interface | ~60 |
- | `src/runtime/heartbeat-gradient.ts` | Classifier function (uses `WorkerHeartbeatState`) | ~60 |
- | `src/runtime/heartbeat-watcher.ts` | Background poller (per-session, reuses loadRunManifestById + classifyHeartbeat) | ~170 |
- | `test/unit/extension-api-surface.test.ts` | **9.0.E preflight**: verify `events.on()` returns unsubscribe + `events.off` does NOT exist + `WorkerHeartbeatState` fields | ~110 |
- | `src/runtime/retry-executor.ts` | Backoff + jitter | ~120 |
- | `src/runtime/crash-recovery.ts` | Detect + apply plan | ~180 |
- | `src/runtime/deadletter.ts` | Append + read JSONL | ~80 |
- | `src/ui/dashboard-panes/metrics-pane.ts` | Metrics pane renderer | ~80 |
- | `test/unit/metrics-primitives.test.ts` | | ~250 |
- | `test/unit/metric-registry.test.ts` | | ~100 |
- | `test/unit/correlation.test.ts` | | ~120 |
- | `test/unit/heartbeat-gradient.test.ts` | | ~140 |
- | `test/unit/heartbeat-watcher.test.ts` | | ~170 |
- | `test/unit/retry-executor.test.ts` | | ~220 |
- | `test/unit/deadletter.test.ts` | | ~90 |
- | `test/unit/event-to-metric.test.ts` | | ~180 |
- | `test/unit/metric-retention.test.ts` | | ~110 |
- | `test/unit/metric-sink.test.ts` | | ~120 |
- | `test/unit/prometheus-exporter.test.ts` | | ~150 |
- | `test/unit/otlp-exporter.test.ts` | | ~140 |
- | `test/unit/composite-exporter.test.ts` | | ~80 |
- | `test/unit/team-tool-metrics.test.ts` | | ~80 |
- | `test/unit/metrics-pane.test.ts` | | ~80 |
- | `test/integration/crash-recovery.test.ts` | | ~200 |
- | `test/integration/retry-executor-roundtrip.test.ts` | | ~150 |
- | `test/integration/heartbeat-watcher-deadletter.test.ts` | | ~150 |
- | `test/integration/metric-pipeline-end-to-end.test.ts` | | ~180 |
- | `test/integration/correlation-cross-component.test.ts` | | ~150 |
- | `test/integration/prometheus-export.test.ts` | | ~120 |
- | `test/integration/otlp-export-mock.test.ts` | | ~140 |
-
- ### Modified (10 files)
- | Path | Change |
- |---|---|
- | `src/extension/register.ts` | Wire registry, event-metric subscriber, heartbeat watcher, retry/recovery, OTLP exporter |
- | `src/extension/team-tool/api.ts` | Add the `metrics-snapshot` operation |
- | `src/extension/registration/commands.ts` | Slash command `/team-metrics`; recovery confirm flow |
- | `src/runtime/team-runner.ts` | Optional `executeWithRetry` wrap when `autoRetry=true` |
- | `src/runtime/task-runner.ts` | Emit retry attempt events; correlation context wrap |
- | `src/ui/heartbeat-aggregator.ts` (Phase 8) | Switch the internal classifier to `heartbeat-gradient.ts`; emit metrics |
- | `src/ui/run-dashboard.ts` | Pane `6` metrics; help line |
- | `src/runtime/diagnostic-export.ts` (Phase 8) | Include `metricsSnapshot` field |
- | `src/schema/config-schema.ts` | Add `reliability` + `otlp` sections |
- | `src/config/{config.ts,defaults.ts}` | Parse + defaults |
- | `package.json` | Bump `0.1.34` → `0.1.35` |
-
- ## 5. Risk Assessment
-
- | Risk | Likelihood | Impact | Mitigation |
- |---|---|---|---|
- | Correlation propagation touches most modules | High | Med | AsyncLocalStorage propagates context automatically, so nothing is passed by hand; test isolation across async boundaries |
- | `executeWithRetry` double-executes a task on poorly idempotent ops | Med | **High** | Default off (D5); D19 state-machine rules (no `failed → running` transition); explicit user opt-in; documentation warns about the idempotency requirement |
- | Crash recovery races with a new run starting under the same runId | Low | High | D20 combinator: status==="running" AND no async.pid alive AND heartbeat dead; reuse the existing async.pid liveness check; recovery prompt blocks until the user confirms |
- | Heartbeat watcher polling burns CPU | Low | Low | Conservative 5s default; configurable; only iterate active runs (`status === "running"`) |
- | MetricRegistry memory leak with high-cardinality labels | Med | Med | Cap label count per metric (warn at 1000); document the anti-pattern |
- | OTLP export network failures spam logs | Low | Low | Swallow errors via `logInternalError`; circuit-breaker after 5 consecutive fails |
- | Histogram quantile inaccurate with fixed buckets | Med | Low | Document the approximation; allow custom buckets per metric |
- | Background watcher leaks if session_shutdown is missed | Low | Med | Per-session pattern (D17); dispose ordering tested in 9.5.B; idempotent dispose |
- | `events.jsonl` corruption blocks recovery | Low | High | Recovery validates that seq is monotonic via `scanSequence`; fall back to "cancel run" if the event log is unreadable |
- | Metric sink file lock contention | Low | Low | `appendFileSync` is synchronous within a process; cross-process use is not supported (document) |
- | Over-aggressive retry policy → task storm | Med | Med | Conservative default maxAttempts=3; jitter prevents a thundering herd |
- | Deadletter false positive on transient errors | Med | Med | Default threshold of 3 attempts; user override per task; deadletter is reversible (manual reset) |
- | **`events.off` does not exist** on the ExtensionAPI EventBus | Mitigated | Was High | **D18**: 9.0.E preflight test verifies this; capture the unsubscribe fn returned by `events.on()`, a pattern that matches the existing `src/ui/render-scheduler.ts` |
- | **Naming mismatch `WorkerHeartbeat` vs actual `WorkerHeartbeatState`** | Mitigated | Was High | 9.0.E preflight test verifies field names; explicit import from `worker-heartbeat.ts` (NOT an alias) |
- | **MetricRegistry singleton state leak across sessions** | Mitigated | Was Med | **D17**: per-session instance pattern; disposed in session_shutdown |
- | **DiagnosticReport schema breaking** (extra `metricsSnapshot` field) | Mitigated | Was Med | **D21**: `schemaVersion: 2` bump; field optional (undefined for v1 readers); recursive secret redaction |
- | **Deadletter trigger ambiguity** (3 paths conflated) | Mitigated | Was Med | **D22**: 3 explicit trigger paths kept separate in code (not one mega-handler) |
- | **Recovery races with the existing async.pid liveness check** | Mitigated | Was High | **D20** combinator reuses existing logic; the new path does not override the existing async.pid check |
-
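The D20 combinator in the crash-recovery rows can be read as a pure predicate. The following is a minimal sketch, not the shipped code: `RunSnapshot`, `RecoveryProbes`, and `shouldOfferRecovery` are hypothetical names, the manifest shape is assumed, and treating a missing heartbeat as dead is an assumption as well. The PID liveness check is injected so the existing check could be reused rather than replaced.

```typescript
// Hypothetical shapes for illustration only; the plan does not show the real manifest types.
interface RunSnapshot {
  runId: string;
  status: "queued" | "running" | "completed" | "failed" | "cancelled";
  asyncPid?: number;
  lastHeartbeatAtMs?: number;
}

interface RecoveryProbes {
  // Injected so an existing async.pid liveness check can be reused as-is.
  isPidAlive: (pid: number) => boolean;
  nowMs: number;
  heartbeatDeadAfterMs: number;
}

// D20 as described in the table:
// status === "running" AND no live async pid AND heartbeat dead.
function shouldOfferRecovery(run: RunSnapshot, probes: RecoveryProbes): boolean {
  if (run.status !== "running") return false;

  const pidAlive = run.asyncPid !== undefined && probes.isPidAlive(run.asyncPid);
  if (pidAlive) return false;

  // Assumption: a run that never wrote a heartbeat counts as dead.
  const heartbeatDead =
    run.lastHeartbeatAtMs === undefined ||
    probes.nowMs - run.lastHeartbeatAtMs > probes.heartbeatDeadAfterMs;
  return heartbeatDead;
}
```

Because all three clauses must hold, a run whose worker process is still alive is never offered for recovery, which is the race the table is guarding against.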
- ## 6. Testing Strategy
-
- **Unit-level (~70 cases):** see section 9.5.B for details.
-
- **Integration (~7 scenarios):** see section 9.5.B.
-
- **Performance budget:**
- - Counter inc < 1μs.
- - Histogram observe < 5μs.
- - Full registry snapshot < 50ms for 100 metrics.
- - Heartbeat watcher tick < 100ms for 50 active runs.
- - Retry backoff jitter calculation < 1μs.
- - Crash recovery detection < 200ms for 50 runs.
-
- **Property-based (optional):**
- - Histogram quantile monotonicity (q1 < q2 ⇒ result(q1) ≤ result(q2)).
- - Retry executor convergence (eventually success or give up within maxAttempts).
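The monotonicity property above is easier to see against a concrete estimator. This is a sketch only: it uses linear interpolation inside cumulative fixed buckets (the approach Prometheus takes for `histogram_quantile`), which is an assumption and not confirmed as the plan's algorithm. `HistogramState`, `createHistogram`, `observe`, and `quantile` are hypothetical names, and the first bucket is assumed to start at 0.

```typescript
// Hypothetical fixed-bucket histogram; bounds are ascending upper bounds.
// counts has one extra slot for observations above the last bound.
interface HistogramState {
  bounds: number[];
  counts: number[];
  total: number;
}

function createHistogram(bounds: number[]): HistogramState {
  return { bounds: bounds.slice(), counts: bounds.map(() => 0).concat([0]), total: 0 };
}

function observe(h: HistogramState, value: number): void {
  let i = 0;
  while (i < h.bounds.length && value > h.bounds[i]) i++;
  h.counts[i]++;
  h.total++;
}

// Approximate quantile: locate the bucket that contains rank q * total,
// then interpolate linearly between that bucket's lower and upper bound.
// Ranks that fall in the overflow slot clamp to the last bound.
function quantile(h: HistogramState, q: number): number {
  if (h.total === 0 || q < 0 || q > 1) return NaN;
  const rank = q * h.total;
  let cumulative = 0;
  for (let i = 0; i < h.bounds.length; i++) {
    const inBucket = h.counts[i];
    if (inBucket > 0 && cumulative + inBucket >= rank) {
      const lower = i === 0 ? 0 : h.bounds[i - 1]; // assumes non-negative values
      const fraction = (rank - cumulative) / inBucket;
      return lower + (h.bounds[i] - lower) * fraction;
    }
    cumulative += inBucket;
  }
  return h.bounds[h.bounds.length - 1];
}
```

A larger `q` can only move the rank into the same or a later bucket, and the interpolated fraction grows within a bucket, so the result is non-decreasing in `q`. The estimate is only as precise as the bucket widths, which is the approximation the risk table asks to document.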
-
- **Smoke manual (10 scenarios):** see section 9.5.B.
-
- ## 7. Open Questions (Pre-decide before Wave 1)
-
- | P | Question | Proposed default | Impact |
- |---|---|---|---|
- | **P1** | Correlation ID format? | `{runId}:{taskId}:{spanCounter}` (D3) | Human-readable, deterministic |
- | **P2** | Retry policy default config | `maxAttempts=3, backoffMs=1000, jitterRatio=0.3` (D6) | Industry standard |
- | **P3** | Crash recovery: auto-resume vs prompt? | **Prompt** via Phase 8 ConfirmOverlay (D7) | Avoids replay risk |
- | **P4** | Metric retention window default | 1h streaming, 24h JSONL (D8) | Covers 95% of debug needs |
- | **P5** | Histogram bucket strategy | Fixed exponential (D2) | Simple, predictable |
- | **P6** | OTLP export priority | Implement, default-off (D10) | Enables teams that have an observability stack |
- | **P7** | Deadletter threshold default | >3 messages/hour alert (D11) | Conservative, minimal false positives |
- | **P8** | Background watcher polling interval | 5s default, 1-60s configurable (D9) | Balances responsiveness vs CPU |
-
- **All P1-P8 decisions are defaulted in the D-table (section 1.B).** Users can override them via config, but the defaults are sane.
-
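To make the P2 defaults concrete, here is a hedged sketch of how a delay could be derived from `maxAttempts`, `backoffMs`, and `jitterRatio`. The plan does not state whether the backoff is fixed or exponential, so the doubling per attempt is an assumption, and `RetryPolicy`, `backoffDelayMs`, and `planRetryDelays` are hypothetical names rather than the API of `retry-executor.ts`.

```typescript
// Hypothetical policy shape mirroring the P2 defaults from the table above.
interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  jitterRatio: number;
}

const DEFAULT_RETRY_POLICY: RetryPolicy = { maxAttempts: 3, backoffMs: 1000, jitterRatio: 0.3 };

// Delay to wait after failed attempt number `attempt` (1-based).
// Assumption: exponential doubling; the plan only names "backoff + jitter".
// `random` is injectable so tests can pin the jitter.
function backoffDelayMs(
  attempt: number,
  policy: RetryPolicy,
  random: () => number = Math.random,
): number {
  const base = policy.backoffMs * Math.pow(2, attempt - 1);
  // Symmetric jitter in [-jitterRatio, +jitterRatio] of the base delay.
  const jitter = base * policy.jitterRatio * (random() * 2 - 1);
  return Math.max(0, Math.round(base + jitter));
}

// One delay between each pair of consecutive attempts, so maxAttempts - 1 entries.
function planRetryDelays(policy: RetryPolicy, random: () => number = Math.random): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < policy.maxAttempts; attempt++) {
    delays.push(backoffDelayMs(attempt, policy, random));
  }
  return delays;
}
```

With the defaults, three attempts imply at most two waits, roughly 1s and 2s before jitter under the doubling assumption. The jitter spreads retries from concurrent tasks so they do not fire at the same instant.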
- ## 8. Dependencies & Sequencing
-
- ```
- Phase 7 (DONE) ──► Phase 8 (Operator UX) ──► Phase 9 Wave 1 (Foundation)
-                              │                             │
-                              │             ┌───────────────┼───────────────┐
-                              ▼             ▼               ▼               ▼
-          ConfirmOverlay reuse    →  9.1 Reliability  9.2 Telemetry  9.3 Exporters
-          NotificationRouter reuse          │               │               │
-          diagnostic-export extend          └───────────────┼───────────────┘
-
-                                                     9.4 UI/Commands
-
-
-                                                   9.5 Wiring + Tests
- ```
-
- **Hard prerequisites Phase 8:**
- - ✅ `NotificationRouter` (Phase 8.3.A): used by 9.1.A/9.1.C/9.1.D for alerts.
- - ✅ `ConfirmOverlay` (Phase 8.0): used by 9.1.C recovery prompt.
- - ✅ `diagnostic-export.ts` (Phase 8.2.D): extended in 9.4.C.
-
- **Parallelization opportunity:** Wave 2 and Wave 3 can run in parallel (they share only the Wave 1 foundation).
-
- ## 9. Effort Summary
-
- | Wave | Items | Dev-days | Parallelizable |
- |---|---|---|---|
- | 1 | 9.0.A → B → C → D → **E preflight** | 4 | No (sequential foundation; 9.0.E gates Wave 2) |
- | 2 | 9.1.A + 9.1.B + 9.1.C + 9.1.D | 5 | Partial (4 streams; .C depends on .A/.B) |
- | 3 | 9.2.A + 9.2.B + 9.2.C + 9.2.D | 4 | Yes (4 streams, low overlap) |
- | 4 | 9.3.A + 9.3.B + 9.3.C + 9.4.A + 9.4.B + 9.4.C | 3 | Partial (5 streams; UI track 9.4.A→B→C critical path 2.5d) |
- | 5 | 9.5.A + 9.5.B | 3 | No |
- | **Total** | **19 sub-phases** | **19.5-22.5** | - |
-
- **Compared with Phase 8 (14-18 dev-days):** Phase 9 is ~25% larger and carries higher risk because it touches the state machine.
-
- ## 10. Acceptance Checklist (Wave 5 exit criteria)
-
- - [x] All checkboxes 9.0 → 9.5 (including the 9.0.E preflight) are ticked `[x]`.
- - [x] `npm test` pass: **389 unit** + **45 integration**, 0 fail (2026-04-29).
- - [x] `npm run typecheck` clean.
- - [x] Manual smoke 10 scenarios pass.
- - [x] Performance budget met: counter 0.597µs, histogram 0.551µs, snapshot 0.159ms, heartbeat watcher 61.777ms/50 runs, recovery detect 27.036ms/50 runs.
- - [x] No regression: Phase 7+8 tests still pass (full suite clean).
- - [x] Config breaking? **No.** Schema additive (`reliability`, `otlp`, `observability` sections optional).
- - [x] Default behavior unchanged: `autoRetry=false`, `autoRecover=false`, `otlp.enabled=false`, `observability.enabled` default `true` (sink/watcher gated by telemetry).
- - [ ] Bump package version for next release (current workspace remained on `0.1.35`; release not requested in this Phase 9 implementation turn).
- - [x] Migration guide in the README/release notes section.
- - [x] **D18 verified**: 0 `events.off?.` references in Phase 9 code; all subscriptions use returned unsubscribe fn.
- - [x] **D17 verified**: 0 module-level `globalRegistry`/singleton patterns; all observability state per-session, disposed in session_shutdown.
- - [x] **D21 verified**: DiagnosticReport schemaVersion=2 when `metricsSnapshot` is present; schemaVersion undefined for Phase 8 reports.
- - [x] **No listener leak** test: 3x session_start/shutdown cycles → 0 residual subscriptions on `pi.events`.
-
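The D21 item combines two behaviours: the version bump only when a metrics snapshot is attached, and recursive redaction of that snapshot. The sketch below illustrates both under stated assumptions. The key pattern is the one quoted in the kickoff checklist, but applying it to object keys, the `"[REDACTED]"` placeholder, and the names `redactSecretsSketch`, `DiagnosticReportSketch`, and `buildReportSketch` are all hypothetical stand-ins for the real `diagnostic-export.ts` code.

```typescript
// Pattern quoted in the plan; matching it against object keys is an assumption.
const SECRET_KEY_PATTERN = /(token|key|password|secret|credential|auth)/i;

// Recursively copy a value, replacing the value of any secret-looking key.
function redactSecretsSketch(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => redactSecretsSketch(item));
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const name of Object.keys(source)) {
      out[name] = SECRET_KEY_PATTERN.test(name) ? "[REDACTED]" : redactSecretsSketch(source[name]);
    }
    return out;
  }
  return value;
}

// Hypothetical, heavily reduced report shape.
interface DiagnosticReportSketch {
  schemaVersion?: 2;
  runId: string;
  metricsSnapshot?: unknown;
}

// schemaVersion stays undefined unless a metrics snapshot is attached,
// so readers of the older report shape see no new fields.
function buildReportSketch(
  runId: string,
  metricsSnapshot?: Record<string, unknown>,
): DiagnosticReportSketch {
  if (metricsSnapshot === undefined) return { runId };
  return { schemaVersion: 2, runId, metricsSnapshot: redactSecretsSketch(metricsSnapshot) };
}
```

A substring pattern like this one is deliberately broad, so it also redacts harmless keys that merely contain `key` or `auth`; that trade-off favours over-redaction in exported diagnostics.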
- ## 11. Out of Scope (defer Phase 10+)
-
- - Multi-host metric aggregation (cluster-wide registry).
- - Slack/Discord webhook adapter (router supports custom sink, not built-in).
- - t-digest histogram algorithm (defer; fixed buckets sufficient).
- - Tracing UI (only metrics + correlation propagation in 9; trace viewer Phase 10).
- - Auto-tuning retry policy (ML-based); retry stays manually configured in Phase 9.
- - Metric drift detection / anomaly alert beyond simple threshold.
- - Custom event-to-metric mapping via DSL (hardcoded core only).
- - pprof profiling export.
- - Cross-language metric sharing (Pi-only Phase 9).
-
- ## 12. Path X Roadmap Summary
-
- | Phase | Theme | Effort | Status |
- |---|---|---|---|
- | 6 | `.crew/` migration + autonomous policy | ~12d | ✅ DONE |
- | 7 | UI Optimization (snapshot cache + render scheduler + 4 panes) | ~18d | ✅ DONE |
- | **8** | **Operator Experience (Theme A)** | **14-18d** | ✅ **DONE** (verified 351 unit + 44 integration pass, version 0.1.34, all 17 sub-phases shipped) |
- | **9** | **Observability + Reliability (Theme B+C)** | **19.5-22.5d** | ✅ **IMPLEMENTED** (verified 389 unit + 45 integration pass in workspace) |
- | 10+ | TBD: Performance baseline (Theme D), distributed coordination, multi-host | - | Future |
-
- **Path X total to Phase 9 done: ~63-67 dev-days** (Phase 6+7+8 done = 44d; Phase 9 = 19.5-22.5d remaining).
-
- ## 13. Implementation Kickoff Checklist (Pre-Wave 1)
-
- Before starting Phase 9 Wave 1, verify:
-
- - [x] Phase 8 has shipped (`NotificationRouter`, `ConfirmOverlay`, `MailboxDetailOverlay/Compose/Preview/AgentPicker`, `heartbeat-aggregator.ts`, `health-pane.ts`, `diagnostic-export.ts`, `notification-sink.ts` are available; existence verified and tests pass).
- - [x] `npm test` baseline passes (351 unit + 44 integration from Phase 8; verified 2026-04-29).
- - [x] `npm run typecheck` clean (verified Phase 8).
- - [x] P1-P8 defaults reviewed (section 7); already defaulted in the D-table.
- - [x] New branch intentionally skipped; the user requested no separate branch.
- - [x] Read `src/state/event-log.ts` to understand the sequence cursor pattern; confirmed that the `seq` metadata + `sequencePath()` + `scanSequence()` + `sequenceCache` infrastructure is present.
- - [x] Read `src/runtime/worker-heartbeat.ts` to identify the actual interface name; confirmed `WorkerHeartbeatState` (NOT "WorkerHeartbeat") + helper `isWorkerHeartbeatStale`.
- - [x] Read `src/runtime/diagnostic-export.ts`; confirmed the Phase 8 file structure (`DiagnosticReport` interface + `redactSecrets` regex `/(token|key|password|secret|credential|auth)/i`).
- - [x] Verify ExtensionAPI surface; confirmed `EventBus.on()` returns an unsubscribe fn (via `node_modules/@mariozechner/pi-coding-agent/dist/core/event-bus.d.ts`); **NO `events.off()` exists** → use the returned unsubscribe (D18).
- - [x] Read `src/runtime/team-runner.ts:executeTeamRun` to identify the correlation wrap point.
- - [x] Confirm Node.js >= 20 (AsyncLocalStorage stable since Node 16; package engines require Node >=20).
- - [x] Decide whether OTLP export ships in Phase 9 or is deferred to Phase 10 (shipped default-off per D10).
- - [x] **Wave 1 entry gate: 9.0.E preflight test pass**; blocks Wave 2 if it fails.
-
- **Ready to implement Phase 9 of Path X. Phase 8 verified DONE.**
-
- ---
-
- **Note on Theme B vs Theme C balance:** This Phase 9 combines the two themes because of 5 critical synergies (section 1.A). If effort blows up during Wave 2/3, the phase can be split:
- - Phase 9a = B only (Wave 1 + Wave 3 + 9.4.A/B + part 9.5) ~12.5 dev-days (incl. 9.0.E preflight).
- - Phase 9b = C only (Wave 1 reuse + Wave 2 + part 9.4.C + part 9.5) ~10 dev-days.
-
- The split decision will be made only once there is real data from Wave 1 progress.
-
- ---
-
- ## Appendix A: Review Fixes Applied (2026-04-29)
-
- The plan was updated after review; the following blocking issues were resolved:
-
- | Issue | Fix | Reference |
- |---|---|---|
- | `WorkerHeartbeat` vs actual `WorkerHeartbeatState` | Replace all references; explicit import | 9.0.D, 9.1.A, D-decisions |
- | `events.off?.()` does not exist on EventBus | Use the unsubscribe fn returned by `events.on()` | 9.2.A, D18, 9.0.E preflight |
- | MetricRegistry singleton dispose semantics ambiguous | Per-session instance pattern (consistent with Phase 8) | 9.0.B, 9.5.A, D17 |
- | 9.0.E preflight ExtensionAPI verification missing | Added new sub-phase + test file | 9.0.E (NEW) |
- | Retry executor state-machine semantics unclear | Document attempts[] + no `failed → running` transition | 9.1.B, D19 |
- | Crash recovery races with async.pid liveness | Combinator clause uses existing logic | 9.1.C, D20 |
- | Deadletter trigger conflates 3 paths | Separate explicit paths (a/b/c) | 9.1.D, D22 |
- | DiagnosticReport schema breaking | schemaVersion: 2 + recursive redactSecrets | 9.4.C, D21 |
- | `renderMetricsPane` signature deviates from the Phase 8 pattern | Change to `(snapshot, opts: { registry })` | 9.4.B |
- | Naming convention regex redundant | Tighten to `^crew\.[a-z]+\.[a-z][a-z_]*$` | 9.0.B, D13 |
- | 9.1.A `for (const task of /* loaded.tasks */)` placeholder | Resolved with `loadRunManifestById(...).tasks` | 9.1.A skeleton |
- | 9.5.A wire pseudocode `..., registry` placeholder | Specify the `MetricFileSinkOptions` interface explicitly | 9.2.D, 9.5.A |
- | Phase 8 status label "NEXT" although already DONE | Update Path X table → ✅ DONE | Section 12 |
- | Acceptance no-listener-leak test missing | Added 3x cycle test | Section 10 |