pi-crew 0.1.51 → 0.2.0
- package/CHANGELOG.md +56 -1
- package/README.md +176 -781
- package/agents/analyst.md +11 -11
- package/agents/critic.md +11 -11
- package/agents/executor.md +11 -11
- package/agents/explorer.md +11 -11
- package/agents/planner.md +11 -11
- package/agents/reviewer.md +11 -11
- package/agents/security-reviewer.md +11 -11
- package/agents/test-engineer.md +11 -11
- package/agents/verifier.md +70 -11
- package/agents/writer.md +11 -11
- package/docs/actions-reference.md +595 -0
- package/docs/commands-reference.md +347 -0
- package/docs/runtime-flow.md +148 -148
- package/index.ts +6 -6
- package/package.json +99 -99
- package/skills/async-worker-recovery/SKILL.md +42 -42
- package/skills/context-artifact-hygiene/SKILL.md +52 -52
- package/skills/delegation-patterns/SKILL.md +54 -54
- package/skills/mailbox-interactive/SKILL.md +40 -40
- package/skills/model-routing-context/SKILL.md +39 -39
- package/skills/multi-perspective-review/SKILL.md +58 -58
- package/skills/observability-reliability/SKILL.md +41 -41
- package/skills/orchestration/SKILL.md +157 -157
- package/skills/ownership-session-security/SKILL.md +41 -41
- package/skills/pi-extension-lifecycle/SKILL.md +39 -39
- package/skills/requirements-to-task-packet/SKILL.md +63 -63
- package/skills/resource-discovery-config/SKILL.md +41 -41
- package/skills/runtime-state-reader/SKILL.md +44 -44
- package/skills/secure-agent-orchestration-review/SKILL.md +45 -45
- package/skills/state-mutation-locking/SKILL.md +42 -42
- package/skills/systematic-debugging/SKILL.md +67 -67
- package/skills/ui-render-performance/SKILL.md +39 -39
- package/skills/verification-before-done/SKILL.md +57 -57
- package/skills/worktree-isolation/SKILL.md +39 -39
- package/src/adapters/claude-adapter.ts +25 -0
- package/src/adapters/codex-adapter.ts +21 -0
- package/src/adapters/cursor-adapter.ts +17 -0
- package/src/adapters/export-util.ts +137 -0
- package/src/adapters/index.ts +15 -0
- package/src/adapters/registry.ts +18 -0
- package/src/adapters/types.ts +23 -0
- package/src/agents/agent-config.ts +2 -0
- package/src/agents/agent-search.ts +98 -98
- package/src/agents/discover-agents.ts +2 -1
- package/src/config/config.ts +13 -1
- package/src/config/drift-detector.ts +211 -0
- package/src/config/markers.ts +327 -0
- package/src/config/resilient-parser.ts +108 -0
- package/src/config/suggestions.ts +74 -0
- package/src/extension/cross-extension-rpc.ts +103 -94
- package/src/extension/project-init.ts +21 -1
- package/src/extension/register.ts +45 -14
- package/src/extension/registration/commands.ts +77 -8
- package/src/extension/registration/subagent-tools.ts +10 -1
- package/src/extension/registration/team-tool.ts +10 -1
- package/src/extension/registration/viewers.ts +48 -34
- package/src/extension/run-bundle-schema.ts +89 -89
- package/src/extension/run-import.ts +25 -1
- package/src/extension/run-index.ts +5 -1
- package/src/extension/run-maintenance.ts +142 -68
- package/src/extension/team-manager-command.ts +10 -1
- package/src/extension/team-tool/doctor.ts +28 -3
- package/src/extension/team-tool/handle-settings.ts +195 -188
- package/src/extension/team-tool/inspect.ts +41 -41
- package/src/extension/team-tool/intent-policy.ts +42 -42
- package/src/extension/team-tool/lifecycle-actions.ts +27 -8
- package/src/extension/team-tool/plan.ts +19 -19
- package/src/extension/team-tool/run.ts +12 -1
- package/src/extension/team-tool.ts +11 -1
- package/src/i18n.ts +184 -184
- package/src/observability/exporters/otlp-exporter.ts +92 -77
- package/src/prompt/prompt-runtime.ts +72 -72
- package/src/runtime/agent-memory.ts +72 -72
- package/src/runtime/agent-observability.ts +114 -114
- package/src/runtime/async-marker.ts +26 -26
- package/src/runtime/attention-events.ts +28 -28
- package/src/runtime/auto-resume.ts +100 -0
- package/src/runtime/background-runner.ts +11 -1
- package/src/runtime/cancellation-token.ts +89 -89
- package/src/runtime/cancellation.ts +61 -61
- package/src/runtime/capability-inventory.ts +116 -116
- package/src/runtime/child-pi.ts +7 -2
- package/src/runtime/compaction-summary.ts +271 -0
- package/src/runtime/completion-guard.ts +190 -190
- package/src/runtime/crash-recovery.ts +33 -0
- package/src/runtime/delta-conflict.ts +360 -0
- package/src/runtime/direct-run.ts +35 -35
- package/src/runtime/foreground-control.ts +82 -82
- package/src/runtime/green-contract.ts +46 -46
- package/src/runtime/group-join.ts +106 -106
- package/src/runtime/heartbeat-gradient.ts +28 -28
- package/src/runtime/heartbeat-watcher.ts +124 -124
- package/src/runtime/iteration-hooks.ts +262 -0
- package/src/runtime/live-agent-control.ts +88 -88
- package/src/runtime/live-control-realtime.ts +36 -36
- package/src/runtime/live-extension-bridge.ts +150 -150
- package/src/runtime/live-irc.ts +92 -92
- package/src/runtime/live-session-health.ts +100 -100
- package/src/runtime/loop-gates.ts +129 -0
- package/src/runtime/metric-parser.ts +40 -0
- package/src/runtime/notebook-helpers.ts +90 -90
- package/src/runtime/orphan-sentinel.ts +7 -7
- package/src/runtime/parallel-research.ts +44 -44
- package/src/runtime/phase-progress.ts +217 -0
- package/src/runtime/pi-args.ts +38 -11
- package/src/runtime/pi-json-output.ts +111 -111
- package/src/runtime/pi-spawn.ts +57 -7
- package/src/runtime/policy-engine.ts +79 -79
- package/src/runtime/post-checks.ts +122 -0
- package/src/runtime/progress-event-coalescer.ts +43 -43
- package/src/runtime/prose-compressor.ts +164 -164
- package/src/runtime/recovery-recipes.ts +74 -74
- package/src/runtime/result-extractor.ts +121 -121
- package/src/runtime/role-permission.ts +39 -39
- package/src/runtime/sensitive-paths.ts +2 -2
- package/src/runtime/session-resources.ts +25 -25
- package/src/runtime/session-snapshot.ts +59 -59
- package/src/runtime/session-usage.ts +79 -79
- package/src/runtime/sidechain-output.ts +29 -29
- package/src/runtime/stream-preview.ts +177 -177
- package/src/runtime/supervisor-contact.ts +59 -59
- package/src/runtime/task-display.ts +38 -38
- package/src/runtime/task-graph.ts +207 -0
- package/src/runtime/task-quality.ts +207 -0
- package/src/runtime/task-runner/capabilities.ts +78 -78
- package/src/runtime/task-runner/live-executor.ts +7 -1
- package/src/runtime/task-runner/progress.ts +119 -119
- package/src/runtime/task-runner/prompt-pipeline.ts +64 -64
- package/src/runtime/task-runner/result-utils.ts +14 -14
- package/src/runtime/task-runner/run-projection.ts +103 -103
- package/src/runtime/task-runner/state-helpers.ts +22 -22
- package/src/runtime/team-runner.ts +117 -7
- package/src/runtime/worker-heartbeat.ts +21 -21
- package/src/runtime/worker-startup.ts +57 -57
- package/src/runtime/workflow-state.ts +187 -0
- package/src/runtime/workspace-tree.ts +298 -298
- package/src/schema/config-schema.ts +11 -0
- package/src/schema/validation-types.ts +148 -0
- package/src/skills/skill-templates.ts +374 -0
- package/src/state/active-run-registry.ts +35 -11
- package/src/state/atomic-write.ts +33 -26
- package/src/state/contracts.ts +1 -0
- package/src/state/event-reconstructor.ts +217 -0
- package/src/state/locks.ts +2 -13
- package/src/state/mailbox.ts +4 -3
- package/src/state/state-store.ts +32 -14
- package/src/state/task-claims.ts +44 -44
- package/src/state/types.ts +9 -0
- package/src/state/usage.ts +29 -29
- package/src/subagents/async-entry.ts +1 -1
- package/src/subagents/index.ts +3 -3
- package/src/subagents/live/control.ts +1 -1
- package/src/subagents/live/manager.ts +1 -1
- package/src/subagents/live/realtime.ts +1 -1
- package/src/subagents/live/session-runtime.ts +1 -1
- package/src/subagents/manager.ts +1 -1
- package/src/subagents/spawn.ts +1 -1
- package/src/teams/team-serializer.ts +38 -38
- package/src/types/diff.d.ts +18 -18
- package/src/ui/crew-footer.ts +101 -101
- package/src/ui/crew-select-list.ts +111 -111
- package/src/ui/crew-widget.ts +5 -2
- package/src/ui/dashboard-panes/cancellation-pane.ts +42 -42
- package/src/ui/dashboard-panes/capability-pane.ts +59 -59
- package/src/ui/dashboard-panes/mailbox-pane.ts +35 -35
- package/src/ui/dashboard-panes/metrics-pane.ts +34 -34
- package/src/ui/dashboard-panes/progress-pane.ts +11 -0
- package/src/ui/dynamic-border.ts +25 -25
- package/src/ui/layout-primitives.ts +106 -106
- package/src/ui/loaders.ts +158 -158
- package/src/ui/render-coalescer.ts +51 -51
- package/src/ui/render-diff.ts +119 -119
- package/src/ui/render-scheduler.ts +143 -143
- package/src/ui/run-action-dispatcher.ts +10 -1
- package/src/ui/spinner.ts +17 -17
- package/src/ui/status-colors.ts +58 -58
- package/src/ui/syntax-highlight.ts +116 -116
- package/src/ui/transcript-entries.ts +258 -258
- package/src/utils/completion-dedupe.ts +63 -63
- package/src/utils/frontmatter.ts +68 -68
- package/src/utils/git.ts +262 -262
- package/src/utils/ids.ts +17 -17
- package/src/utils/incremental-reader.ts +104 -104
- package/src/utils/names.ts +27 -27
- package/src/utils/redaction.ts +44 -44
- package/src/utils/safe-paths.ts +47 -47
- package/src/utils/scan-cache.ts +136 -136
- package/src/utils/sleep.ts +40 -26
- package/src/utils/task-name-generator.ts +337 -337
- package/src/workflows/validate-workflow.ts +40 -40
- package/src/worktree/branch-freshness.ts +45 -45
- package/teams/default.team.md +12 -12
- package/teams/fast-fix.team.md +11 -11
- package/teams/implementation.team.md +18 -18
- package/teams/parallel-research.team.md +14 -14
- package/teams/research.team.md +11 -11
- package/teams/review.team.md +12 -12
- package/workflows/default.workflow.md +30 -29
- package/workflows/fast-fix.workflow.md +23 -22
- package/workflows/implementation.workflow.md +43 -43
- package/workflows/parallel-research.workflow.md +46 -46
- package/workflows/research.workflow.md +22 -22
- package/workflows/review.workflow.md +30 -30
- package/docs/refactor-tasks-phase3.md +0 -394
- package/docs/refactor-tasks-phase4.md +0 -564
- package/docs/refactor-tasks-phase5.md +0 -402
- package/docs/refactor-tasks-phase6.md +0 -662
- package/docs/refactor-tasks.md +0 -1484
- package/docs/research/AGENT-EXECUTION-ARCHITECTURE.md +0 -261
- package/docs/research/AGENT-LIFECYCLE-COMPARISON.md +0 -111
- package/docs/research/AUDIT_OH_MY_PI.md +0 -261
- package/docs/research/AUDIT_PI_CREW.md +0 -457
- package/docs/research/CAVEMAN-DEEP-RESEARCH.md +0 -281
- package/docs/research/COMPARISON_OH_MY_PI_VS_PI_CREW.md +0 -264
- package/docs/research/DEEP-RESEARCH-PI-POWERBAR.md +0 -343
- package/docs/research/DEEP_RESEARCH_SUBAGENT_ARCHITECTURE.md +0 -480
- package/docs/research/GAP_CLOSURE_IMPLEMENTATION_PLAN.md +0 -354
- package/docs/research/IMPLEMENTATION_PLAN.md +0 -385
- package/docs/research/LIVE-SESSION-PRODUCTION-READY-PLAN.md +0 -502
- package/docs/research/OH-MY-PI-DEEP-RESEARCH-v14.7.6.md +0 -266
- package/docs/research/REMAINING-GAPS-PLAN.md +0 -363
- package/docs/research/SESSION-SUMMARY-2026-05-08.md +0 -146
- package/docs/research/UI-RESPONSIVENESS-AUDIT.md +0 -173
- package/docs/research-awesome-agent-skills-distillation.md +0 -100
- package/docs/research-extension-examples.md +0 -297
- package/docs/research-extension-system.md +0 -324
- package/docs/research-oh-my-pi-distillation.md +0 -369
- package/docs/research-optimization-plan.md +0 -548
- package/docs/research-phase10-distillation.md +0 -199
- package/docs/research-phase11-distillation.md +0 -201
- package/docs/research-phase8-operator-experience-plan.md +0 -819
- package/docs/research-phase9-observability-reliability-plan.md +0 -1190
- package/docs/research-pi-coding-agent.md +0 -357
- package/docs/research-source-pi-crew-reference.md +0 -174
- package/docs/research-ui-optimization-plan.md +0 -480
- package/docs/source-runtime-refactor-map.md +0 -107
- package/src/utils/atomic-write.ts +0 -33
|
@@ -1,1190 +0,0 @@
|
|
|
1
|
-
# Phase 9 — Observability & Reliability (Theme B + C combined)
|
|
2
|
-
|
|
3
|
-
> Path X: Phase 8 (Operator Experience) → **Phase 9 (Observability + Reliability)**. Goal: build the telemetry backbone (Counter/Gauge/Histogram + correlation ID + sink/export) while hardening run reliability (heartbeat gradient + retry + crash recovery + deadletter). The two themes are combined because of 5 critical synergies (see section 1.A).
|
|
4
|
-
|
|
5
|
-
> **Prerequisite:** Phase 8 is DONE (verified: 351 unit + 44 integration tests pass, version 0.1.34). Phase 9 reuses `NotificationRouter`, `ConfirmOverlay`, `MailboxDetailOverlay/Compose/Preview/AgentPicker`, `heartbeat-aggregator.ts`, `health-pane.ts`, `diagnostic-export.ts` (with `redactSecrets` regex `/(token|key|password|secret|credential|auth)/i`), `notification-sink.ts`, `keybinding-map.ts`, and `run-action-dispatcher.ts`.
|
|
6
|
-
|
|
7
|
-
> **Critical preflight finding (Phase 9.0.E):** `ExtensionAPI.events` interface is `EventBus` from `pi-coding-agent/dist/core/event-bus.d.ts`:
|
|
8
|
-
> ```ts
|
|
9
|
-
> interface EventBus { emit(channel, data): void; on(channel, handler): () => void; } // on() returns unsubscribe function — NO off() method
|
|
10
|
-
> ```
|
|
11
|
-
> → All "dispose" patterns must capture `unsubscribe` from `on()` return value, NOT call `events.off()`.
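The dispose pattern implied by this finding can be sketched as follows. This is a minimal illustration, not the project's code: `createFakeBus` and `Subscriptions` are hypothetical names, and the in-memory bus only mimics the `EventBus` shape quoted above so the sketch can run on its own.

```ts
type Handler = (data: unknown) => void;

// Same shape as the EventBus quoted above: on() returns an unsubscribe function.
interface EventBus {
  emit(channel: string, data: unknown): void;
  on(channel: string, handler: Handler): () => void;
}

// Hypothetical in-memory bus, only here to make the sketch runnable.
function createFakeBus(): EventBus {
  const handlers = new Map<string, Set<Handler>>();
  return {
    emit(channel, data) {
      for (const h of handlers.get(channel) ?? []) h(data);
    },
    on(channel, handler) {
      const set = handlers.get(channel) ?? new Set<Handler>();
      set.add(handler);
      handlers.set(channel, set);
      return () => { set.delete(handler); };
    },
  };
}

// Hypothetical helper: collect every unsubscribe function and call them on dispose().
class Subscriptions {
  private unsubscribers: Array<() => void> = [];
  constructor(private bus: EventBus) {}
  listen(channel: string, handler: Handler): void {
    this.unsubscribers.push(this.bus.on(channel, handler));
  }
  dispose(): void {
    for (const unsubscribe of this.unsubscribers) unsubscribe();
    this.unsubscribers = [];
  }
}

const bus = createFakeBus();
const subs = new Subscriptions(bus);
let seen = 0;
subs.listen("crew.task.heartbeat_dead", () => { seen++; });
bus.emit("crew.task.heartbeat_dead", {});
subs.dispose();
bus.emit("crew.task.heartbeat_dead", {}); // handler was unsubscribed, so it does not run
```

Keeping the unsubscribe functions in one place means a component never needs an `off()` method to clean up at shutdown.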
|
|
12
|
-
|
|
13
|
-
## 0. Implementation Status
|
|
14
|
-
|
|
15
|
-
### Foundation (Wave 1)
|
|
16
|
-
- [x] 9.0.A Metric primitives — Counter / Gauge / Histogram base classes (`src/observability/metrics-primitives.ts`)
|
|
17
|
-
- [x] 9.0.B MetricRegistry **per-session instance** + naming convention (`src/observability/metric-registry.ts`)
|
|
18
|
-
- [x] 9.0.C Correlation context — traceId/spanId propagation primitive (`src/observability/correlation.ts`)
|
|
19
|
-
- [x] 9.0.D Heartbeat gradient classifier extension (warn/stale/dead thresholds with metrics emission, reuse `WorkerHeartbeatState` interface + `isWorkerHeartbeatStale` helper)
|
|
20
|
-
- [x] 9.0.E **Preflight verify** ExtensionAPI surface (`events.on` returns unsubscribe fn, `events.off` does NOT exist) + cross-check `WorkerHeartbeatState` field name
|
|
21
|
-
|
|
22
|
-
### Reliability core (Wave 2)
|
|
23
|
-
- [x] 9.1.A Background heartbeat watcher (detect stuck workers, emit `crew.heartbeat.staleness_ms` Gauge)
|
|
24
|
-
- [x] 9.1.B Retry executor + backoff/jitter policy (`src/runtime/retry-executor.ts`)
|
|
25
|
-
- [x] 9.1.C Crash recovery: resume from event-log checkpoint
|
|
26
|
-
- [x] 9.1.D Deadletter queue writer + threshold alerts via NotificationRouter
|
|
27
|
-
|
|
28
|
-
### Telemetry pipeline (Wave 3)
|
|
29
|
-
- [x] 9.2.A Event-to-metric subscriber (subscribe `crew.*` events → registry counters)
|
|
30
|
-
- [x] 9.2.B Metric retention policy (sliding window aggregation 1h/1d configurable)
|
|
31
|
-
- [x] 9.2.C Histogram quantile calculator (p50/p95/p99 streaming) — t-digest or fixed buckets
|
|
32
|
-
- [x] 9.2.D Metric file sink (JSONL) with daily rotation (gated by `telemetry.enabled`)
|
|
33
|
-
|
|
34
|
-
### Export adapters (Wave 3 parallel)
|
|
35
|
-
- [x] 9.3.A Prometheus exposition format adapter (HTTP endpoint optional)
|
|
36
|
-
- [x] 9.3.B OTLP HTTP exporter (optional, opt-in)
|
|
37
|
-
- [x] 9.3.C Adapter abstraction (plugin pattern, extensible)
|
|
38
|
-
|
|
39
|
-
### UI & commands (Wave 4)
|
|
40
|
-
- [x] 9.4.A `team metrics` command — snapshot JSON, filter by name/runId
|
|
41
|
-
- [x] 9.4.B Metrics pane (pane index `6`) in the dashboard
|
|
42
|
-
- [x] 9.4.C Diagnostic export (Phase 8) include metrics snapshot
|
|
43
|
-
|
|
44
|
-
### Wiring & validation (Wave 5)
|
|
45
|
-
- [x] 9.5.A Wire register.ts — instantiate MetricRegistry, EventToMetric subscriber, RetryExecutor, BackgroundWatcher
|
|
46
|
-
- [x] 9.5.B Tests: unit + integration + perf
|
|
47
|
-
- [x] 9.5.C Migration guide: existing runs continue to work; opt-in for retry/recovery via config flag
|
|
48
|
-
|
|
49
|
-
## 1. Roadmap-Level Decisions
|
|
50
|
-
|
|
51
|
-
### 1.A Synergy Theme B + C — 5 critical integrations
|
|
52
|
-
|
|
53
|
-
| # | Touchpoint | Theme B contributes | Theme C contributes | Combined value |
|
|
54
|
-
|---|---|---|---|---|
|
|
55
|
-
| **S1** | Heartbeat staleness | Gauge primitive `crew.heartbeat.staleness_ms{runId,taskId}` | Gradient classifier (healthy/warn/stale/dead) | Auto-emit metric per task → time-series → detect regression |
|
|
56
|
-
| **S2** | Retry attempts | Histogram primitive `crew.task.retry_count{team}` | Retry executor + jitter backoff | Distribution analytics (p95 retries per team) |
|
|
57
|
-
| **S3** | Recovery trace | `traceId`/`spanId` correlation propagation | Recovery state machine (resume from checkpoint) | Cross-component debugging: subagent crash → recovery → resume is fully traceable |
|
|
58
|
-
| **S4** | Deadletter alert | Counter `crew.task.deadletter_total{reason}` + threshold | Deadletter writer | Auto-alert via NotificationRouter when rate > threshold |
|
|
59
|
-
| **S5** | Performance regression | Histogram quantile p95 over time | Stale duration tracking | Automatically detect "Phase X deploy → p95 staleness +50%" |
|
|
60
|
-
|
|
61
|
-
### 1.B Decisions
|
|
62
|
-
|
|
63
|
-
| # | Decision | Chosen | Rationale |
|
|
64
|
-
|---|---|---|---|
|
|
65
|
-
| D1 | Metric primitives: implement custom or reuse a library? | **Implement custom (minimal)**: Counter, Gauge, Histogram are only ~200 LOC | Avoids a new dependency (consistent with the Phase 7/8 zero-dep approach); the OTLP serializer is also < 200 LOC |
|
|
66
|
-
| D2 | Histogram bucket strategy? | **Fixed exponential buckets** `[1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000]` ms | Simple, predictable; no t-digest complexity; 95% of use cases are latency in ms; users can override via config if needed |
|
|
67
|
-
| D3 | Correlation ID format? | **`{runId}:{taskId}:{spanCounter}`** (P1 default) | Human-readable, needs no UUID library, deterministic for tests, clearly scoped |
|
|
68
|
-
| D4 | Correlation ID propagation method? | **Async context (`AsyncLocalStorage`)** in the Node.js runtime | Standard Node API; no need to pass the context manually through every function; minimal overhead |
|
|
69
|
-
| D5 | Retry executor: opt-in or default-on? | **Opt-in** via `reliability.autoRetry` (default `false`) | High risk (touches the state machine); requires explicit user consent; the default preserves current behavior |
|
|
70
|
-
| D6 | Retry policy default? | **maxAttempts=3, backoffMs=1000, jitterRatio=0.3, exponentialFactor=2** (P2) | Sensible defaults; per-task override; matches a common industry pattern |
|
|
71
|
-
| D7 | Crash recovery: auto-resume vs prompt? | **Prompt via NotificationRouter** (P3), reusing the Phase 8 ConfirmOverlay | User confirmation for the destructive resume action; avoids false-positive replay |
|
|
72
|
-
| D8 | Metric retention window default? | **1 hour streaming, 24 hour summary** (P4); persist daily JSONL | Cover 95% debugging; balance memory vs disk |
|
|
73
|
-
| D9 | Background watcher polling interval? | **5 seconds** default, configurable 1-60s (P8) | Responsive without burning CPU; uses `setInterval`, not a `setTimeout` chain |
|
|
74
|
-
| D10 | OTLP export priority? | **Implement but disable by default** (P6) | Foundation for teams that already have an observability stack; off by default to avoid confusing users |
|
|
75
|
-
| D11 | Deadletter alert threshold? | **>3 deadletter messages within 1 hour** (P7) | Conservative; avoids false positives; configurable |
|
|
76
|
-
| D12 | Event-to-metric mapping: configurable or hardcoded? | **Hardcoded core** + extensible plugin | The ~15 core events are already defined, and hardcoding keeps them consistent; plugins cover user customization |
|
|
77
|
-
| D13 | Metric naming convention? | **`crew.{domain}.{measure}_{unit}`**, e.g. `crew.run.duration_ms`, `crew.task.retry_count`, `crew.heartbeat.staleness_ms` | Prometheus-compatible; clear domain; unit suffix avoids ambiguity |
|
|
78
|
-
| D14 | Metric sink file location? | **`<crewRoot>/state/metrics/{YYYY-MM-DD}.jsonl`** | Consistent with the Phase 8 notification sink pattern; daily rotation; configurable retention |
|
|
79
|
-
| D15 | Recovery checkpoint format? | **Event-log cursor** (existing `events.jsonl.seq` + `sequencePath()`/`scanSequence()` helpers) | Reuses the infrastructure already in place from Phase 6; adds no new checkpoint format |
|
|
80
|
-
| D16 | Histogram quantile algorithm? | **Fixed buckets + linear interpolation** (P5) | Simple; sufficient for p50/p95/p99 with fixed buckets; t-digest deferred to Phase 10 if needed |
|
|
81
|
-
| **D17** | **MetricRegistry lifecycle** | **Per-session instance** (consistent with Phase 8 `notificationRouter`/`heartbeatAggregator`): instantiate in `session_start`, `dispose()` in `session_shutdown` | Cumulative metrics across sessions are not needed in Phase 9 (deferred to Phase 10 if users request them); natural test isolation; no global state leak; clear dispose semantics |
|
|
82
|
-
| **D18** | **Event subscription cleanup** | **Capture the unsubscribe fn from the `events.on()` return value**; do NOT call `events.off()` (it does not exist on the `EventBus` interface) | API surface verified in preflight (9.0.E); pattern matches existing usages in the codebase (`src/ui/render-scheduler.ts`) |
|
|
83
|
-
| **D19** | **Retry state machine semantics** | **A task transitions to `failed` only when maxAttempts is exhausted**; add field `task.attempts: Array<{startedAt,endedAt,error?}>` for traceability; the final artifact is written only on the terminal attempt | Avoids a terminal-state monotonicity violation (re-running a `failed` task back to `running`); full audit trail for debugging |
|
|
84
|
-
| **D20** | **Crash recovery trigger combinator** | Recovery only triggers if `(status==="running") AND (no async.pid OR async.pid is dead via existing liveness check) AND (heartbeat dead via isWorkerHeartbeatStale > deadMs OR no heartbeat)` | Avoids falsely marking a healthy async run as interrupted; reuses the Phase 6/7 async.pid liveness check in `session-summary.ts` |
|
|
85
|
-
| **D21** | **Diagnostic schema versioning** | `DiagnosticReport.schemaVersion: 2` khi thêm `metricsSnapshot?: MetricSnapshot[]` field; apply `redactSecrets()` recursive trên `metricsSnapshot` (label values có thể chứa secret patterns) | Backward-compat consumer reading old format (schemaVersion missing → treat as v1); secret leak prevention |
|
|
86
|
-
| **D22** | **Deadletter trigger separation** | 3 paths: (a) `executeWithRetry` exhausted → write entry; (b) heartbeat watcher sees dead for 3 consecutive ticks → write entry; (c) Counter rate > 3/hour → NotificationRouter alert | Writing an entry and raising a threshold alert are two separate pieces of logic; avoids conflating them in the implementation |
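The retry defaults in D6 could translate into a delay calculation along these lines. This is a sketch under assumptions: `RetryPolicy`, `DEFAULT_RETRY_POLICY`, and `retryDelayMs` are hypothetical names, and the jitter is assumed to be symmetric around the base delay (the plan does not say how `jitterRatio` is applied).

```ts
interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  jitterRatio: number;
  exponentialFactor: number;
}

// Values taken from decision D6.
const DEFAULT_RETRY_POLICY: RetryPolicy = {
  maxAttempts: 3,
  backoffMs: 1000,
  jitterRatio: 0.3,
  exponentialFactor: 2,
};

// Delay to wait after failed attempt `attempt` (1-based) before retrying.
// `random` is injectable so tests can be deterministic.
function retryDelayMs(
  attempt: number,
  policy: RetryPolicy = DEFAULT_RETRY_POLICY,
  random: () => number = Math.random,
): number {
  const base = policy.backoffMs * Math.pow(policy.exponentialFactor, attempt - 1);
  // Assumption: symmetric jitter, scaling the base by a factor in [1 - jitterRatio, 1 + jitterRatio).
  const jitterFactor = 1 + policy.jitterRatio * (2 * random() - 1);
  return Math.round(base * jitterFactor);
}

// Whether another attempt is allowed after `attempt` attempts have failed.
function shouldRetry(attempt: number, policy: RetryPolicy = DEFAULT_RETRY_POLICY): boolean {
  return attempt < policy.maxAttempts;
}
```

With these defaults the un-jittered delays are 1000 ms after the first failure and 2000 ms after the second; the third failure exhausts `maxAttempts`, which under D19 is the only point at which the task becomes `failed`.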
|
|
87
|
-
|
|
88
|
-
## 2. Phase Breakdown
|
|
89
|
-
|
|
90
|
-
### Phase 9.0 — Foundation (3.5 dev-days)
|
|
91
|
-
|
|
92
|
-
#### 9.0.A Metric primitives (1 dev-day)
|
|
93
|
-
|
|
94
|
-
**New file:** `src/observability/metrics-primitives.ts`
|
|
95
|
-
|
|
96
|
-
```ts
|
|
97
|
-
export interface MetricLabels {
|
|
98
|
-
[key: string]: string | number;
|
|
99
|
-
}
|
|
100
|
-
|
|
101
|
-
export abstract class Metric {
|
|
102
|
-
constructor(public readonly name: string, public readonly description: string) {}
|
|
103
|
-
abstract snapshot(): MetricSnapshot;
|
|
104
|
-
}
|
|
105
|
-
|
|
106
|
-
export class Counter extends Metric {
|
|
107
|
-
private values = new Map<string, number>(); // labelKey → count
|
|
108
|
-
inc(labels: MetricLabels = {}, delta = 1): void { /* ... */ }
|
|
109
|
-
snapshot(): MetricSnapshot { return { type: "counter", name: this.name, values: [...this.values.entries()] }; }
|
|
110
|
-
}
|
|
111
|
-
|
|
112
|
-
export class Gauge extends Metric {
|
|
113
|
-
private values = new Map<string, number>();
|
|
114
|
-
set(labels: MetricLabels, value: number): void { /* ... */ }
|
|
115
|
-
add(labels: MetricLabels, delta: number): void { /* ... */ }
|
|
116
|
-
snapshot(): MetricSnapshot { /* ... */ }
|
|
117
|
-
}
|
|
118
|
-
|
|
119
|
-
export class Histogram extends Metric {
|
|
120
|
-
private buckets: number[]; // upper bounds, e.g. [1, 5, 10, 25, ...]
|
|
121
|
-
private observations = new Map<string, { counts: number[]; sum: number; count: number }>();
|
|
122
|
-
constructor(name: string, description: string, buckets?: number[]) {
|
|
123
|
-
super(name, description);
|
|
124
|
-
this.buckets = buckets ?? [1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000];
|
|
125
|
-
}
|
|
126
|
-
observe(labels: MetricLabels, value: number): void { /* ... */ }
|
|
127
|
-
quantile(labels: MetricLabels, q: number): number { /* linear interpolation */ }
|
|
128
|
-
snapshot(): MetricSnapshot { /* ... */ }
|
|
129
|
-
}
|
|
130
|
-
|
|
131
|
-
export interface MetricSnapshot {
|
|
132
|
-
type: "counter" | "gauge" | "histogram";
|
|
133
|
-
name: string;
|
|
134
|
-
values: unknown;
|
|
135
|
-
}
|
|
136
|
-
```
|
|
137
|
-
|
|
138
|
-
**Tests:** `test/unit/metrics-primitives.test.ts` — 12 cases (counter inc/labels, gauge set/add/labels, histogram observe/quantile p50/p95/p99/edge empty/edge single value).
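The body of `quantile` is elided in the skeleton above. One way to do fixed buckets with linear interpolation (D16) is sketched below. The standalone functions `emptyCounts`, `observe`, and `quantile`, the extra overflow slot, and the choice to clamp overflow to the last bound are all assumptions made for illustration, not the package's implementation.

```ts
const BUCKETS = [1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000];

// One slot per bucket upper bound, plus a final overflow slot for values above the last bound.
function emptyCounts(): number[] {
  return new Array<number>(BUCKETS.length + 1).fill(0);
}

// Increment the first bucket whose upper bound is >= value.
function observe(counts: number[], value: number): void {
  let i = BUCKETS.findIndex((upper) => value <= upper);
  if (i === -1) i = BUCKETS.length; // overflow slot
  counts[i]++;
}

// Estimate quantile q (0..1) by interpolating linearly inside the bucket containing the rank.
function quantile(counts: number[], q: number): number {
  const total = counts.reduce((a, b) => a + b, 0);
  if (total === 0) return NaN; // no observations
  const rank = q * total;
  let cumulative = 0;
  for (let i = 0; i < counts.length; i++) {
    const next = cumulative + counts[i];
    if (counts[i] > 0 && rank <= next) {
      if (i === BUCKETS.length) return BUCKETS[BUCKETS.length - 1]; // overflow: clamp to last bound
      const lower = i === 0 ? 0 : BUCKETS[i - 1];
      const upper = BUCKETS[i];
      return lower + (upper - lower) * ((rank - cumulative) / counts[i]);
    }
    cumulative = next;
  }
  return BUCKETS[BUCKETS.length - 1];
}
```

The estimate is only as precise as the bucket width: ten observations of 8 ms all land in the (5, 10] bucket, so p50 is reported as 7.5 ms rather than 8 ms.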
|
|
139
|
-
|
|
140
|
-
#### 9.0.B MetricRegistry (0.75 dev-day) — **Per-session instance (D17)**
|
|
141
|
-
|
|
142
|
-
**New file:** `src/observability/metric-registry.ts`
|
|
143
|
-
|
|
144
|
-
```ts
|
|
145
|
-
export class MetricRegistry {
|
|
146
|
-
private metrics = new Map<string, Metric>();
|
|
147
|
-
registerCounter(name: string, description: string): Counter { /* ... */ }
|
|
148
|
-
registerGauge(name: string, description: string): Gauge { /* ... */ }
|
|
149
|
-
registerHistogram(name: string, description: string, buckets?: number[]): Histogram { /* ... */ }
|
|
150
|
-
get(name: string): Metric | undefined { return this.metrics.get(name); }
|
|
151
|
-
snapshot(): MetricSnapshot[] { return [...this.metrics.values()].map((m) => m.snapshot()); }
|
|
152
|
-
dispose(): void { this.metrics.clear(); }
|
|
153
|
-
}
|
|
154
|
-
|
|
155
|
-
// Per-session factory: the caller (register.ts) instantiates in session_start and disposes in session_shutdown.
|
|
156
|
-
// Do NOT use a singleton pattern (see D17): this avoids cross-session state leaks and guarantees test isolation.
|
|
157
|
-
export function createMetricRegistry(): MetricRegistry { return new MetricRegistry(); }
|
|
158
|
-
```
|
|
159
|
-
|
|
160
|
-
**Naming convention enforcement (D13):** `name` must match the regex `^crew\.[a-z]+\.[a-z][a-z_]*$` (simpler than the old regex `^crew\.[a-z_]+\.[a-z_]+(_[a-z]+)?$`, which was redundant). The unit suffix is part of the measure name (e.g., `duration_ms`, `staleness_ms`). Throws if the name does not match.
|
|
161
|
-
|
|
162
|
-
**Tests:** `test/unit/metric-registry.test.ts` — 6 cases (register, duplicate throws, snapshot all, naming validation, dispose clears state, get returns undefined after dispose).
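The naming check described in 9.0.B can be sketched as a small guard. The regex is the one quoted above; `METRIC_NAME_PATTERN` and `assertValidMetricName` are hypothetical names, and where the registry actually calls such a check is not shown in the plan.

```ts
// Regex from 9.0.B: crew.<domain>.<measure>, lowercase letters only, underscores allowed in the measure.
const METRIC_NAME_PATTERN = /^crew\.[a-z]+\.[a-z][a-z_]*$/;

function isValidMetricName(name: string): boolean {
  return METRIC_NAME_PATTERN.test(name);
}

// Throwing variant, as a register* method might use before storing the metric.
function assertValidMetricName(name: string): void {
  if (!isValidMetricName(name)) {
    throw new Error(`Invalid metric name "${name}": expected crew.{domain}.{measure}_{unit}`);
  }
}
```

Note that the pattern admits no digits or uppercase letters, so a name such as `crew.run.p95_ms` would be rejected under this rule.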
|
|
163
|
-
|
|
164
|
-
#### 9.0.C Correlation context (1 dev-day)
|
|
165
|
-
|
|
166
|
-
**New file:** `src/observability/correlation.ts`
|
|
167
|
-
|
|
168
|
-
```ts
|
|
169
|
-
import { AsyncLocalStorage } from "node:async_hooks";
|
|
170
|
-
|
|
171
|
-
export interface CorrelationContext {
|
|
172
|
-
traceId: string; // {runId}:{taskId}:{spanCounter}
|
|
173
|
-
parentSpanId?: string;
|
|
174
|
-
spanId: string;
|
|
175
|
-
}
|
|
176
|
-
|
|
177
|
-
const storage = new AsyncLocalStorage<CorrelationContext>();
|
|
178
|
-
let spanCounter = 0;
|
|
179
|
-
|
|
180
|
-
export function withCorrelation<T>(ctx: CorrelationContext, fn: () => T): T {
|
|
181
|
-
return storage.run(ctx, fn);
|
|
182
|
-
}
|
|
183
|
-
|
|
184
|
-
export function getCurrentContext(): CorrelationContext | undefined {
|
|
185
|
-
return storage.getStore();
|
|
186
|
-
}
|
|
187
|
-
|
|
188
|
-
export function newSpanId(runId: string, taskId?: string): string {
|
|
189
|
-
spanCounter++;
|
|
190
|
-
return `${runId}:${taskId ?? "main"}:${spanCounter}`;
|
|
191
|
-
}
|
|
192
|
-
|
|
193
|
-
// Wrap event emission to inject correlation
|
|
194
|
-
export function correlatedEvent<T extends { runId?: string; data?: Record<string, unknown> }>(event: T): T {
|
|
195
|
-
const ctx = getCurrentContext();
|
|
196
|
-
if (!ctx) return event;
|
|
197
|
-
return { ...event, data: { ...event.data, traceId: ctx.traceId, spanId: ctx.spanId, parentSpanId: ctx.parentSpanId } };
|
|
198
|
-
}
|
|
199
|
-
```
|
|
200
|
-
|
|
201
|
-
**Wire into `register.ts`** in the `pi.events.emit` wrapper: all `crew.*` events automatically get correlation fields injected when a context is active. Foreground/async runs wrap the whole of executeTeamRun in `withCorrelation({traceId, spanId: newSpanId(runId)})`.
|
|
202
|
-
|
|
203
|
-
**Tests:** `test/unit/correlation.test.ts` — 5 cases (basic propagation, nested span, missing context graceful, async boundary preserve, parallel runs isolated).
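The "nested span" and "missing context graceful" cases listed above can be illustrated with a short, self-contained sketch. `withCorrelation` mirrors the function in the skeleton; `withChildSpan` and the `observed` array are hypothetical additions for demonstration, and the IDs are made-up values in the `{runId}:{taskId}:{spanCounter}` format.

```ts
import { AsyncLocalStorage } from "node:async_hooks";

interface CorrelationContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
}

const storage = new AsyncLocalStorage<CorrelationContext>();

function withCorrelation<T>(ctx: CorrelationContext, fn: () => T): T {
  return storage.run(ctx, fn);
}

// Hypothetical helper: open a child span under whichever context is currently active.
function withChildSpan<T>(spanId: string, fn: () => T): T {
  const parent = storage.getStore();
  if (!parent) throw new Error("no active correlation context");
  return withCorrelation({ traceId: parent.traceId, spanId, parentSpanId: parent.spanId }, fn);
}

const observed: Array<CorrelationContext | undefined> = [];

withCorrelation({ traceId: "run1:main:1", spanId: "run1:main:1" }, () => {
  observed.push(storage.getStore()); // root span
  withChildSpan("run1:t1:2", () => {
    observed.push(storage.getStore()); // child span, parentSpanId points at the root
  });
  observed.push(storage.getStore()); // root span again once the nested run() returns
});
observed.push(storage.getStore()); // undefined: no context outside withCorrelation
```

Because each `run()` call scopes its store to the callback and anything it awaits, parallel runs started from separate `withCorrelation` calls do not see each other's context.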
|
|
204
|
-
|
|
205
|
-
#### 9.0.D Heartbeat gradient classifier (0.75 dev-day)
|
|
206
|
-
|
|
207
|
-
**New file:** `src/runtime/heartbeat-gradient.ts`
|
|
208
|
-
|
|
209
|
-
```ts
|
|
210
|
-
import type { WorkerHeartbeatState } from "./worker-heartbeat.ts"; // Phase 6/7 file — actual interface name (NOT "WorkerHeartbeat")
|
|
211
|
-
|
|
212
|
-
export type HeartbeatLevel = "healthy" | "warn" | "stale" | "dead";
|
|
213
|
-
|
|
214
|
-
export interface GradientThresholds {
|
|
215
|
-
warnMs: number; // default 30_000 (30s)
|
|
216
|
-
staleMs: number; // default 60_000 (1min)
|
|
217
|
-
deadMs: number; // default 300_000 (5min)
|
|
218
|
-
}
|
|
219
|
-
|
|
220
|
-
export const DEFAULT_GRADIENT_THRESHOLDS: GradientThresholds = { warnMs: 30_000, staleMs: 60_000, deadMs: 300_000 };
|
|
221
|
-
|
|
222
|
-
export function classifyHeartbeat(heartbeat: WorkerHeartbeatState | undefined, thresholds: GradientThresholds = DEFAULT_GRADIENT_THRESHOLDS, now = Date.now()): HeartbeatLevel {
|
|
223
|
-
if (!heartbeat) return "dead";
|
|
224
|
-
if (heartbeat.alive === false) return "dead";
|
|
225
|
-
const lastSeen = Date.parse(heartbeat.lastSeenAt);
|
|
226
|
-
if (!Number.isFinite(lastSeen)) return "dead";
|
|
227
|
-
const elapsed = now - lastSeen;
|
|
228
|
-
if (elapsed >= thresholds.deadMs) return "dead";
|
|
229
|
-
if (elapsed >= thresholds.staleMs) return "stale";
|
|
230
|
-
if (elapsed >= thresholds.warnMs) return "warn";
|
|
231
|
-
return "healthy";
|
|
232
|
-
}
|
|
233
|
-
```
|
|
234
|
-
|
|
235
|
-
**Update `src/ui/heartbeat-aggregator.ts`** (Phase 8 file, 1612 bytes — verified existence) — backward-compat strategy:
|
|
236
|
-
- Keep the existing API surface `summarizeHeartbeats(snapshot, opts)` returning `HeartbeatSummary` (so the Phase 8 caller `health-pane.ts` does not break).
|
|
237
|
-
- Internal classification SWITCHES to `classifyHeartbeat`; map the 4 levels (healthy/warn/stale/dead) → the existing 3-bucket count (`healthy`/`stale`/`dead`; the `warn` count merges into `healthy` to keep Phase 8 semantics).
|
|
238
|
-
- Optional new field `summary.gradient: { healthy, warn, stale, dead }` for Phase 9 consumers (metrics-pane).
|
|
239
|
-
- Emit metrics when a `registry` param is passed in (optional, does not break the Phase 8 caller):
|
|
240
|
-
- `metrics.gauge("crew.heartbeat.staleness_ms").set({runId, taskId}, elapsed)`
|
|
241
|
-
- `metrics.counter("crew.heartbeat.level_total").inc({runId, level})`
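The backward-compatible mapping in the list above (four gradient levels folded into the three Phase 8 buckets) might look like this. `GradientCounts`, `LegacyCounts`, `countGradient`, and `toLegacyCounts` are hypothetical names; the actual `HeartbeatSummary` shape is not shown in the plan.

```ts
type HeartbeatLevel = "healthy" | "warn" | "stale" | "dead";

// Hypothetical shapes: the 4-level gradient and the 3-bucket summary kept for Phase 8 callers.
interface GradientCounts { healthy: number; warn: number; stale: number; dead: number; }
interface LegacyCounts { healthy: number; stale: number; dead: number; }

// Tally how many heartbeats fall into each gradient level.
function countGradient(levels: HeartbeatLevel[]): GradientCounts {
  const counts: GradientCounts = { healthy: 0, warn: 0, stale: 0, dead: 0 };
  for (const level of levels) counts[level]++;
  return counts;
}

// Fold `warn` into `healthy` so existing consumers see unchanged semantics.
function toLegacyCounts(gradient: GradientCounts): LegacyCounts {
  return {
    healthy: gradient.healthy + gradient.warn,
    stale: gradient.stale,
    dead: gradient.dead,
  };
}
```

Deriving the legacy buckets from the gradient, rather than classifying twice, keeps the two views consistent by construction.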
|
|
242
|
-
|
|
243
|
-
**Tests:** `test/unit/heartbeat-gradient.test.ts` — 8 cases (healthy/warn/stale/dead/missing/explicit-dead/edge-now/custom-thresholds + invalid date string returns dead).
|
|
244
|
-
|
|
245
|
-
#### 9.0.E Preflight ExtensionAPI surface verify (0.5 dev-day) — **NEW**
|
|
246
|
-
|
|
247
|
-
**Goal:** Before Wave 2 wires `events?.on?.()` callbacks, confirm with an automated test:
|
|
248
|
-
|
|
249
|
-
**New file:** `test/unit/extension-api-surface.test.ts`, which verifies the contract:
|
|
250
|
-
1. `pi.events.on(channel, handler)` returns function (unsubscribe).
|
|
251
|
-
2. Calling unsubscribe stops handler invocation on subsequent emit.
|
|
252
|
-
3. Multiple `on()` calls for the same channel are all invoked.
|
|
253
|
-
4. Confirm `events.off` does not exist (typeof check); fail fast if Pi upstream changes the API.
|
|
254
|
-
5. Verify `WorkerHeartbeatState` interface fields exist (`workerId`, `lastSeenAt`, `alive?`) — guard against rename.
|
|
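Checks 1 to 4 above can be written against any object shaped like the event bus. The sketch below uses a hypothetical in-memory `createBus()` as a stand-in for `pi.events`, since the real bus comes from the Pi ExtensionAPI; `checkEventsContract` is likewise an illustrative name.

```ts
type Handler = (data: unknown) => void;

// Hypothetical stand-in for pi.events: on() returns an unsubscribe function, there is no off().
function createBus() {
  const handlers = new Map<string, Set<Handler>>();
  return {
    on(channel: string, handler: Handler): () => void {
      const set = handlers.get(channel) ?? new Set<Handler>();
      handlers.set(channel, set);
      set.add(handler);
      return () => { set.delete(handler); };
    },
    emit(channel: string, data: unknown): void {
      // Copy first so a handler that unsubscribes mid-emit does not disturb iteration.
      for (const h of [...(handlers.get(channel) ?? [])]) h(data);
    },
  };
}

// Returns a list of violated expectations; an empty list means the contract holds.
function checkEventsContract(events: ReturnType<typeof createBus>): string[] {
  const failures: string[] = [];
  let a = 0;
  let b = 0;
  const unsubA = events.on("probe", () => { a += 1; });
  events.on("probe", () => { b += 1; });
  if (typeof unsubA !== "function") failures.push("on() must return an unsubscribe function");
  events.emit("probe", null);
  if (a !== 1 || b !== 1) failures.push("every handler on a channel must be invoked");
  unsubA();
  events.emit("probe", null);
  if (a !== 1) failures.push("an unsubscribed handler must not be invoked");
  if (b !== 2) failures.push("remaining handlers must keep receiving events");
  if ("off" in events) failures.push("events.off is not expected to exist");
  return failures;
}
```

Returning a failure list instead of throwing on the first mismatch makes it easier to paste the full result into the PR description when the preflight blocks Wave 2.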
255
|
-
|
|
256
|
-
**Output:** Block Wave 2 if the test fails. Document in the PR description.
|
|
257
|
-
|
|
258
|
-
**Tests:** the content of the 9.0.E file itself (5 cases).
|
|
259
|
-
|
|
260
|
-
---
|
|
261
|
-
|
|
262
|
-
### Phase 9.1 — Reliability Core (5 dev-days)
|
|
263
|
-
|
|
264
|
-
#### 9.1.A Background heartbeat watcher (1.5 dev-days)
|
|
265
|
-
|
|
266
|
-
**New file:** `src/runtime/heartbeat-watcher.ts`
|
|
267
|
-
|
|
268
|
-
**Logic:** Set up `setInterval` at 5000 ms (D9) in session_start; on each tick, read all active runs from `manifestCache.list(50)`, load tasks via `loadRunManifestById(cwd, runId).tasks`, and classify each task's heartbeat:
|
|
269
|
-
- `dead` detected for the first time → emit the `crew.task.heartbeat_dead` event, increment Counter `crew.heartbeat.dead_total{runId}`, and raise a NotificationRouter alert (severity warning, dedup id `dead_${runId}_${taskId}`).
|
|
270
|
-
- `dead` for 3 consecutive ticks → trigger the deadletter writer (see 9.1.D path b, D22).
|
|
271
|
-
|
|
272
|
-
**Skeleton:**
|
|
273
|
-
|
|
274
|
-
```ts
|
|
275
|
-
import { loadRunManifestById } from "../state/state-store.ts";
|
|
276
|
-
import type { WorkerHeartbeatState } from "./worker-heartbeat.ts"; // actual interface name
|
|
277
|
-
import { classifyHeartbeat, DEFAULT_GRADIENT_THRESHOLDS, type GradientThresholds, type HeartbeatLevel } from "./heartbeat-gradient.ts";
|
|
278
|
-
|
|
279
|
-
export class HeartbeatWatcher {
|
|
280
|
-
private timer?: ReturnType<typeof setInterval>;
|
|
281
|
-
private lastLevel = new Map<string, HeartbeatLevel>(); // `${runId}:${taskId}` → previous level
|
|
282
|
-
private consecutiveDead = new Map<string, number>(); // `${runId}:${taskId}` → consecutive dead tick count
|
|
283
|
-
constructor(
|
|
284
|
-
private opts: {
|
|
285
|
-
cwd: string;
|
|
286
|
-
pollIntervalMs?: number;
|
|
287
|
-
thresholds?: GradientThresholds;
|
|
288
|
-
manifestCache: ManifestCache;
|
|
289
|
-
registry: MetricRegistry;
|
|
290
|
-
router: NotificationRouter;
|
|
291
|
-
deadletterTickThreshold?: number; // default 3 (D22 path b)
|
|
292
|
-
onDead?: (runId: string, taskId: string, elapsed: number) => void;
|
|
293
|
-
onDeadletterTrigger?: (runId: string, taskId: string) => void;
|
|
294
|
-
}
|
|
295
|
-
) {}
|
|
296
|
-
start(): void {
|
|
297
|
-
this.timer = setInterval(() => this.tick(), this.opts.pollIntervalMs ?? 5000);
|
|
298
|
-
}
|
|
299
|
-
private tick(): void {
|
|
300
|
-
const thresholds = this.opts.thresholds ?? DEFAULT_GRADIENT_THRESHOLDS;
|
|
301
|
-
const tickThreshold = this.opts.deadletterTickThreshold ?? 3;
|
|
302
|
-
for (const run of this.opts.manifestCache.list(50)) {
|
|
303
|
-
if (run.status !== "running") continue;
|
|
304
|
-
const loaded = loadRunManifestById(this.opts.cwd, run.runId);
|
|
305
|
-
if (!loaded) continue;
|
|
306
|
-
for (const task of loaded.tasks) {
|
|
307
|
-
if (task.status !== "running" && task.status !== "queued") continue;
|
|
308
|
-
const key = `${run.runId}:${task.id}`;
|
|
309
|
-
const level = classifyHeartbeat(task.heartbeat, thresholds);
|
|
310
|
-
const prev = this.lastLevel.get(key);
|
|
311
|
-
this.lastLevel.set(key, level);
|
|
312
|
-
if (level === "dead" && prev !== "dead") {
|
|
313
|
-
this.opts.router.enqueue({ id: `dead_${run.runId}_${task.id}`, severity: "warning", source: "heartbeat-watcher", runId: run.runId, title: `Task ${task.id} heartbeat dead`, body: "Background watcher detected stuck worker." });
|
|
314
|
-
this.opts.registry.get("crew.heartbeat.dead_total")?.inc({ runId: run.runId });
|
|
315
|
-
this.opts.onDead?.(run.runId, task.id, 0);
|
|
316
|
-
}
|
|
317
|
-
if (level === "dead") {
|
|
318
|
-
const count = (this.consecutiveDead.get(key) ?? 0) + 1;
|
|
319
|
-
this.consecutiveDead.set(key, count);
|
|
320
|
-
if (count === tickThreshold) this.opts.onDeadletterTrigger?.(run.runId, task.id);
|
|
321
|
-
} else this.consecutiveDead.delete(key);
|
|
322
|
-
}
|
|
323
|
-
}
|
|
324
|
-
}
|
|
325
|
-
dispose(): void {
|
|
326
|
-
if (this.timer) clearInterval(this.timer);
|
|
327
|
-
this.timer = undefined;
|
|
328
|
-
this.lastLevel.clear();
|
|
329
|
-
this.consecutiveDead.clear();
|
|
330
|
-
}
|
|
331
|
-
}
|
|
332
|
-
```
|
|
333
|
-
|
|
334
|
-
**Tests:** `test/unit/heartbeat-watcher.test.ts` — 7 cases (start/dispose, dead detection alert once, transition healthy→dead emits once, transition dead→healthy resets, multiple runs isolated, mock clock, consecutive 3 ticks → deadletter trigger).
|
|
335
|
-
|
|
336
|
-
#### 9.1.B Retry executor (1.5 dev-days)
|
|
337
|
-
|
|
338
|
-
**New file:** `src/runtime/retry-executor.ts`
|
|
339
|
-
|
|
340
|
-
```ts
|
|
341
|
-
export interface RetryPolicy {
|
|
342
|
-
maxAttempts: number; // default 3 (D6)
|
|
343
|
-
backoffMs: number; // default 1000
|
|
344
|
-
jitterRatio: number; // default 0.3 (±30%)
|
|
345
|
-
exponentialFactor: number; // default 2
|
|
346
|
-
retryableErrors?: string[]; // glob patterns; empty = all retryable
|
|
347
|
-
}
|
|
348
|
-
|
|
349
|
-
export const DEFAULT_RETRY_POLICY: RetryPolicy = { maxAttempts: 3, backoffMs: 1000, jitterRatio: 0.3, exponentialFactor: 2 };
|
|
350
|
-
|
|
351
|
-
export async function executeWithRetry<T>(
|
|
352
|
-
fn: (attempt: number) => Promise<T>,
|
|
353
|
-
policy: RetryPolicy = DEFAULT_RETRY_POLICY,
|
|
354
|
-
hooks?: { onAttemptFailed?: (attempt: number, error: Error, nextDelayMs: number) => void; onRetryGivenUp?: (attempts: number, error: Error) => void; signal?: AbortSignal }
|
|
355
|
-
): Promise<T> { /* exponential backoff with jitter */ }
|
|
356
|
-
|
|
357
|
-
function calculateDelay(attempt: number, policy: RetryPolicy): number {
|
|
358
|
-
const base = policy.backoffMs * Math.pow(policy.exponentialFactor, attempt - 1);
|
|
359
|
-
const jitter = (Math.random() * 2 - 1) * policy.jitterRatio * base;
|
|
360
|
-
return Math.max(0, base + jitter);
|
|
361
|
-
}
|
|
362
|
-
```
|
|
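The body of `executeWithRetry` is elided above. One possible shape, under stated assumptions, is sketched here: `retrySketch`, `delayFor`, `sleep`, `PolicySketch`, and `HooksSketch` are hypothetical names, the delay formula mirrors `calculateDelay`, and the `retryableErrors` filter is left out for brevity.

```ts
interface PolicySketch { maxAttempts: number; backoffMs: number; jitterRatio: number; exponentialFactor: number }

interface HooksSketch {
  onAttemptFailed?: (attempt: number, error: Error, nextDelayMs: number) => void;
  onRetryGivenUp?: (attempts: number, error: Error) => void;
  signal?: AbortSignal;
}

// Same formula as calculateDelay; `random` is injectable so tests can be deterministic.
function delayFor(attempt: number, policy: PolicySketch, random: () => number = Math.random): number {
  const base = policy.backoffMs * Math.pow(policy.exponentialFactor, attempt - 1);
  const jitter = (random() * 2 - 1) * policy.jitterRatio * base;
  return Math.max(0, base + jitter);
}

// Sleep that rejects as soon as the signal aborts, so a cancelled run does not wait out the backoff.
function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) { reject(new Error("aborted")); return; }
    const onAbort = () => { clearTimeout(timer); reject(new Error("aborted")); };
    const timer = setTimeout(() => { signal?.removeEventListener("abort", onAbort); resolve(); }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

async function retrySketch<T>(fn: (attempt: number) => Promise<T>, policy: PolicySketch, hooks: HooksSketch = {}): Promise<T> {
  let lastError = new Error("no attempts made");
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    if (hooks.signal?.aborted) throw new Error("aborted");
    try {
      return await fn(attempt);
    } catch (e) {
      lastError = e instanceof Error ? e : new Error(String(e));
      if (attempt === policy.maxAttempts) break; // no delay after the final attempt
      const delay = delayFor(attempt, policy);
      hooks.onAttemptFailed?.(attempt, lastError, delay);
      await sleep(delay, hooks.signal);
    }
  }
  hooks.onRetryGivenUp?.(policy.maxAttempts, lastError);
  throw lastError;
}
```

With the default policy and the jitter term at zero, the delays after attempts 1 and 2 are 1000 ms and 2000 ms; jitter spreads each of these by up to 30% in either direction.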
363
|
-
|
|
364
|
-
**Wire into `executeTeamRun`**, opt-in (D5 + D19 state-machine semantics):
|
|
365
|
-
- Read `loadConfig(cwd).config.reliability?.autoRetry` (default `false`, D5).
|
|
366
|
-
- If true → wrap `runTeamTask(task)` with `executeWithRetry`.
|
|
367
|
-
- **State machine rules (D19):**
|
|
368
|
-
- Each attempt → push an entry `{ startedAt, endedAt, error? }` onto `task.attempts: Array<...>` (new field, additive schema change).
|
|
369
|
-
- A task does NOT transition `running → failed → running` between attempts (that would violate monotonicity); instead, when attempt N fails it waits out the backoff and attempt N+1 runs with `status="running"` unchanged, while only `attempts[]` grows.
|
|
370
|
-
- A task transitions to `failed` ONLY when maxAttempts is exhausted; `task.error` reflects the last error; the final artifact is finalized only on the terminal attempt (no overwrite per attempt).
|
|
371
|
-
- Idempotency requirement (risk Med-High): document in the release notes that `runTeamTask` must be idempotent, or the user accepts the double-execution risk.
|
|
372
|
-
- Each attempt → increment the `crew.task.retry_attempt{runId,taskId,attempt}` Counter and observe the `crew.task.retry_delay_ms{runId,taskId}` Histogram.
|
|
373
|
-
- At the end → observe the `crew.task.retry_count{runId,team}` Histogram (final attempt count).
|
|
374
|
-
|
|
375
|
-
**Schema update `src/schema/config-schema.ts`:**
|
|
376
|
-
```ts
|
|
377
|
-
reliability: Type.Optional(Type.Object({
|
|
378
|
-
autoRetry: Type.Optional(Type.Boolean()), // default false
|
|
379
|
-
retryPolicy: Type.Optional(Type.Object({
|
|
380
|
-
maxAttempts: Type.Optional(Type.Integer({ minimum: 1, maximum: 10 })),
|
|
381
|
-
backoffMs: Type.Optional(Type.Integer({ minimum: 100, maximum: 60000 })),
|
|
382
|
-
jitterRatio: Type.Optional(Type.Number({ minimum: 0, maximum: 1 })),
|
|
383
|
-
exponentialFactor: Type.Optional(Type.Number({ minimum: 1, maximum: 5 })),
|
|
384
|
-
retryableErrors: Type.Optional(Type.Array(Type.String())),
|
|
385
|
-
})),
|
|
386
|
-
autoRecover: Type.Optional(Type.Boolean()), // default false
|
|
387
|
-
deadletterThreshold: Type.Optional(Type.Integer({ minimum: 1 })), // default 3
|
|
388
|
-
})),
|
|
389
|
-
```
|
|
390
|
-
|
|
391
|
-
**Tests:** `test/unit/retry-executor.test.ts` — 10 cases (success first try, fail then succeed, max attempts exhausted, abort signal, jitter range, retryable filter, custom policy override, mock clock backoff, hook callback fires).
|
|
392
|
-
|
|
393
|
-
#### 9.1.C Crash recovery (1.5 dev-days)
|
|
394
|
-
|
|
395
|
-
**New file:** `src/runtime/crash-recovery.ts`
|
|
396
|
-
|
|
397
|
-
**Logic:** session_start detects a run left in status `running` by a previous session and **triggers recovery only if all of the following hold (D20):**
|
|
398
|
-
- `(manifest.status === "running")`
|
|
399
|
-
- AND `(manifest.async?.pid === undefined OR pidIsDead(manifest.async.pid))`; reuse the existing async.pid liveness check in `src/extension/session-summary.ts`
|
|
400
|
-
- AND `(no heartbeat OR isWorkerHeartbeatStale(heartbeat, deadMs) === true)`; reuse `isWorkerHeartbeatStale()` from `src/runtime/worker-heartbeat.ts`
|
|
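The three-part condition above reduces to a small pure predicate. The sketch below is illustrative only: `RunProbe`, `shouldRecover`, and this `pidIsDead` are hypothetical, and the plan itself reuses the existing liveness check and `isWorkerHeartbeatStale()` rather than reimplementing them.

```ts
interface RunProbe {
  status: string;
  asyncPid?: number;        // manifest.async?.pid
  heartbeatStale?: boolean; // undefined means no heartbeat was recorded
}

// process.kill(pid, 0) delivers no signal; it only tests whether the pid can be signalled.
function pidIsDead(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return false;
  } catch (e) {
    // EPERM means the process exists but is owned by another user, so it is still alive.
    return (e as { code?: string }).code !== "EPERM";
  }
}

// All three clauses must hold before recovery is offered.
function shouldRecover(run: RunProbe, isDead: (pid: number) => boolean = pidIsDead): boolean {
  if (run.status !== "running") return false;
  const ownerGone = run.asyncPid === undefined || isDead(run.asyncPid);
  const heartbeatGone = run.heartbeatStale === undefined || run.heartbeatStale === true;
  return ownerGone && heartbeatGone;
}
```

Injecting `isDead` keeps the predicate testable without spawning real processes.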
401
|
-
|
|
402
|
-
When triggered:
|
|
403
|
-
1. Read the event-log cursor via `scanSequence(eventsPath)` from `src/state/event-log.ts` (Phase 6 helper) to find the last completed event seq.
|
|
404
|
-
2. Compute "stale work":
|
|
405
|
-
- Tasks that are `running` but whose heartbeat is dead → mark `pending-recovery`.
|
|
406
|
-
- Tasks `completed`/`cancelled`/`failed` → preserve.
|
|
407
|
-
3. NotificationRouter prompt: `"Run X was interrupted. Resume from event N? (Y/N)"` (D7) via the Phase 8 ConfirmOverlay.
|
|
408
|
-
4. User confirms → reset stale tasks to `queued`, write a resume event with metadata `{ recoveredFromSeq: N }`, emit `crew.run.resumed{runId, fromEventSeq}`.
|
|
409
|
-
5. User declines → mark the run `cancelled` with reason `"interrupted-not-resumed"`.
|
|
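Steps 2 and 4 above amount to partitioning tasks and then resetting one partition. A minimal sketch, assuming a simplified task shape: `TaskProbe`, `PlanSketch`, `planRecovery`, and `applyConfirm` are hypothetical names, not the module's API.

```ts
interface TaskProbe { id: string; status: string; heartbeatDead: boolean }
interface PlanSketch { runId: string; resumableTasks: string[]; preservedTasks: string[]; lastEventSeq: number }

const TERMINAL = new Set(["completed", "cancelled", "failed"]);

// Step 2: terminal tasks are preserved; running tasks with a dead heartbeat become resumable.
function planRecovery(runId: string, tasks: TaskProbe[], lastEventSeq: number): PlanSketch {
  const resumableTasks: string[] = [];
  const preservedTasks: string[] = [];
  for (const t of tasks) {
    if (TERMINAL.has(t.status)) preservedTasks.push(t.id);
    else if (t.status === "running" && t.heartbeatDead) resumableTasks.push(t.id);
    // Anything else (for example a task that is already queued) needs no change.
  }
  return { runId, resumableTasks, preservedTasks, lastEventSeq };
}

// Step 4 (user confirms): only the resumable tasks go back to "queued"; inputs are not mutated.
function applyConfirm(tasks: TaskProbe[], plan: PlanSketch): TaskProbe[] {
  const resumable = new Set(plan.resumableTasks);
  return tasks.map((t) => (resumable.has(t.id) ? { ...t, status: "queued", heartbeatDead: false } : t));
}
```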
410
|
-
|
|
411
|
-
**Skeleton:**
|
|
412
|
-
|
|
413
|
-
```ts
|
|
414
|
-
export interface RecoveryPlan {
|
|
415
|
-
runId: string;
|
|
416
|
-
resumableTasks: string[]; // taskIds to reset to queued
|
|
417
|
-
preservedTasks: string[]; // taskIds completed/cancelled/failed (no change)
|
|
418
|
-
lastEventSeq: number;
|
|
419
|
-
}
|
|
420
|
-
|
|
421
|
-
export function detectInterruptedRuns(cwd: string, manifestCache: ManifestCache): RecoveryPlan[] { /* ... */ }
|
|
422
|
-
export async function applyRecoveryPlan(plan: RecoveryPlan, ctx: ExtensionContext, registry: MetricRegistry): Promise<void> { /* ... */ }
|
|
423
|
-
```
|
|
424
|
-
|
|
425
|
-
**Wire into `register.ts:session_start`:**
|
|
426
|
-
```ts
|
|
427
|
-
if (loadedConfig.config.reliability?.autoRecover === true) {
|
|
428
|
-
const plans = detectInterruptedRuns(ctx.cwd, manifestCache);
|
|
429
|
-
for (const plan of plans) {
|
|
430
|
-
// Use NotificationRouter + ConfirmOverlay prompt
|
|
431
|
-
notificationRouter.enqueue({
|
|
432
|
-
severity: "warning",
|
|
433
|
-
source: "crash-recovery",
|
|
434
|
-
runId: plan.runId,
|
|
435
|
-
title: `Run ${plan.runId} was interrupted`,
|
|
436
|
-
body: `${plan.resumableTasks.length} tasks pending recovery. Open dashboard → confirm to resume.`,
|
|
437
|
-
id: `recovery_prompt_${plan.runId}`,
|
|
438
|
-
});
|
|
439
|
-
}
|
|
440
|
-
}
|
|
441
|
-
```
|
|
442
|
-
|
|
443
|
-
**Tests:** `test/integration/crash-recovery.test.ts` — 5 cases (no interrupted runs, single run resume, decline marks cancelled, multiple runs, completed tasks preserved).
|
|
444
|
-
|
|
445
|
-
#### 9.1.D Deadletter queue (0.5 dev-day)
|
|
446
|
-
|
|
447
|
-
**New file:** `src/runtime/deadletter.ts`
|
|
448
|
-
|
|
449
|
-
**Logic (D22 — 3 separate trigger paths):**
|
|
450
|
-
- **Path (a), retries exhausted:** in the `executeWithRetry` hook `onRetryGivenUp(attempts, error)` → call `appendDeadletter({ reason: "max-retries", attempts, lastError })`.
|
|
451
|
-
- **Path (b), heartbeat watcher consecutive dead:** `HeartbeatWatcher.onDeadletterTrigger(runId, taskId)` (count = 3 consecutive ticks, see 9.1.A) → call `appendDeadletter({ reason: "heartbeat-dead", attempts: 0 })`.
|
|
452
|
-
- **Path (c), threshold alert (separate from entry write):** Counter `crew.task.deadletter_total` rate > 3/hour (TimeWindowedCounter from 9.2.B) → NotificationRouter alert with severity `error` and id `deadletter_threshold_${runId}` (dedup window 1h).
|
|
453
|
-
|
|
454
|
-
Paths (a) and (b) both do the following (path (c) only raises the alert):
|
|
455
|
-
1. Append to `<crewRoot>/state/runs/{runId}/deadletter.jsonl`.
|
|
456
|
-
2. Increment the `crew.task.deadletter_total{runId,taskId,reason}` Counter.
|
|
457
|
-
|
|
458
|
-
```ts
|
|
459
|
-
export interface DeadletterEntry {
|
|
460
|
-
taskId: string;
|
|
461
|
-
runId: string;
|
|
462
|
-
reason: "max-retries" | "heartbeat-dead" | "manual";
|
|
463
|
-
attempts: number;
|
|
464
|
-
lastError?: string;
|
|
465
|
-
timestamp: string;
|
|
466
|
-
}
|
|
467
|
-
|
|
468
|
-
export function appendDeadletter(manifest: TeamRunManifest, entry: DeadletterEntry): void { /* JSONL append */ }
|
|
469
|
-
export function readDeadletter(manifest: TeamRunManifest): DeadletterEntry[] { /* read all */ }
|
|
470
|
-
```
|
|
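The JSONL append and read bodies are elided in the skeleton. The sketch below shows one way they could look, under the assumption that the file path is derived from a crew root and run id rather than from a `TeamRunManifest`; `deadletterPath`, `appendRecord`, `readRecords`, and `DeadletterRecord` are hypothetical names.

```ts
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

interface DeadletterRecord {
  taskId: string;
  runId: string;
  reason: "max-retries" | "heartbeat-dead" | "manual";
  attempts: number;
  lastError?: string;
  timestamp: string;
}

// Hypothetical path helper following <crewRoot>/state/runs/{runId}/deadletter.jsonl.
function deadletterPath(crewRoot: string, runId: string): string {
  return path.join(crewRoot, "state", "runs", runId, "deadletter.jsonl");
}

function appendRecord(crewRoot: string, entry: DeadletterRecord): void {
  const file = deadletterPath(crewRoot, entry.runId);
  fs.mkdirSync(path.dirname(file), { recursive: true });
  // One JSON object per line, so a reader can parse the file line by line.
  fs.appendFileSync(file, `${JSON.stringify(entry)}\n`, "utf-8");
}

function readRecords(crewRoot: string, runId: string): DeadletterRecord[] {
  const file = deadletterPath(crewRoot, runId);
  if (!fs.existsSync(file)) return [];
  const out: DeadletterRecord[] = [];
  for (const line of fs.readFileSync(file, "utf-8").split("\n")) {
    if (!line.trim()) continue;
    try { out.push(JSON.parse(line) as DeadletterRecord); } catch { /* skip an unparseable line */ }
  }
  return out;
}

// Usage: write to a throwaway directory and read the entries back.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), "crew-deadletter-"));
appendRecord(demoRoot, { taskId: "t1", runId: "run-1", reason: "heartbeat-dead", attempts: 0, timestamp: new Date().toISOString() });
appendRecord(demoRoot, { taskId: "t2", runId: "run-1", reason: "max-retries", attempts: 3, lastError: "boom", timestamp: new Date().toISOString() });
const demoEntries = readRecords(demoRoot, "run-1");
```

Because entries live on disk per run, they survive across sessions, which is what the "persistence cross-session" test case relies on.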
471
|
-
|
|
472
|
-
**Tests:** `test/unit/deadletter.test.ts` — 4 cases (append, read, threshold trigger, persistence cross-session).
|
|
473
|
-
|
|
474
|
-
---
|
|
475
|
-
|
|
476
|
-
### Phase 9.2 — Telemetry Pipeline (4 dev-days)
|
|
477
|
-
|
|
478
|
-
#### 9.2.A Event-to-metric subscriber (1 dev-day)
|
|
479
|
-
|
|
480
|
-
**New file:** `src/observability/event-to-metric.ts`
|
|
481
|
-
|
|
482
|
-
**Hardcoded mapping (D12):**
|
|
483
|
-
|
|
484
|
-
```ts
|
|
485
|
-
export function wireEventToMetrics(events: ExtensionAPI["events"], registry: MetricRegistry): { dispose: () => void } {
|
|
486
|
-
// Counters
|
|
487
|
-
const runCount = registry.registerCounter("crew.run.count", "Total runs by status");
|
|
488
|
-
const taskCount = registry.registerCounter("crew.task.count", "Total tasks by status");
|
|
489
|
-
const subagentCount = registry.registerCounter("crew.subagent.count", "Total subagent records by status");
|
|
490
|
-
const mailboxCount = registry.registerCounter("crew.mailbox.count", "Total mailbox messages by direction");
|
|
491
|
-
const deadletterCount = registry.registerCounter("crew.task.deadletter_total", "Deadletter triggers by reason");
|
|
492
|
-
|
|
493
|
-
// Gauges
|
|
494
|
-
const heartbeatStaleness = registry.registerGauge("crew.heartbeat.staleness_ms", "Heartbeat elapsed since last seen, milliseconds");
|
|
495
|
-
|
|
496
|
-
// Histograms
|
|
497
|
-
const runDuration = registry.registerHistogram("crew.run.duration_ms", "Run end-to-end duration, milliseconds");
|
|
498
|
-
const taskDuration = registry.registerHistogram("crew.task.duration_ms", "Task duration, milliseconds");
|
|
499
|
-
const retryCount = registry.registerHistogram("crew.task.retry_count", "Retries per task", [0, 1, 2, 3, 5, 10]);
|
|
500
|
-
const tokenUsage = registry.registerHistogram("crew.task.tokens_total", "Token usage per task");
|
|
501
|
-
|
|
502
|
-
const handlers: Array<[string, (data: any) => void]> = [
|
|
503
|
-
["crew.run.completed", (d) => { runCount.inc({ status: "completed" }); runDuration.observe({ team: d.team ?? "unknown" }, d.durationMs ?? 0); }],
|
|
504
|
-
["crew.run.failed", (d) => { runCount.inc({ status: "failed" }); }],
|
|
505
|
-
["crew.run.cancelled", (d) => { runCount.inc({ status: "cancelled" }); }],
|
|
506
|
-
["crew.subagent.completed", (d) => { subagentCount.inc({ status: d.status }); }],
|
|
507
|
-
["crew.mailbox.message", (d) => { mailboxCount.inc({ direction: d.direction }); }],
|
|
508
|
-
// ... etc
|
|
509
|
-
];
|
|
510
|
-
|
|
511
|
-
// D18: events.on() returns unsubscribe fn (EventBus interface). NO events.off() exists.
|
|
512
|
-
const unsubscribers: Array<() => void> = [];
|
|
513
|
-
for (const [event, handler] of handlers) {
|
|
514
|
-
const unsub = events?.on?.(event, handler);
|
|
515
|
-
if (unsub) unsubscribers.push(unsub);
|
|
516
|
-
}
|
|
517
|
-
return { dispose: () => { for (const unsub of unsubscribers) unsub(); unsubscribers.length = 0; } };
|
|
518
|
-
}
|
|
519
|
-
```
|
|
520
|
-
|
|
521
|
-
**Tests:** `test/unit/event-to-metric.test.ts`, 8 cases (each event handler increments the correct metric, dispose calls each unsubscribe fn, no-op if events is undefined, dispose is idempotent (calling it twice does not crash), multiple parallel subscribers stay isolated, a handler exception does not break other handlers via the EventBus safe wrapper).
|
|
522
|
-
|
|
523
|
-
#### 9.2.B Metric retention (1 dev-day)
|
|
524
|
-
|
|
525
|
-
**New file:** `src/observability/metric-retention.ts`
|
|
526
|
-
|
|
527
|
-
**Logic:** Streaming window of 1h (D8): each metric value carries a timestamp; periodically (every 60s) values older than the window are purged. Daily summary aggregation rolls up into the persistent JSONL (9.2.D).
|
|
528
|
-
|
|
529
|
-
```ts
|
|
530
|
-
export class TimeWindowedCounter {
|
|
531
|
-
private events: { timestamp: number; labels: MetricLabels; delta: number }[] = [];
|
|
532
|
-
constructor(private windowMs: number = 3_600_000) {}
|
|
533
|
-
inc(labels: MetricLabels, delta = 1): void { /* push, then prune */ }
|
|
534
|
-
rate(labels: MetricLabels, durationMs: number): number { /* count events in last durationMs / durationMs */ }
|
|
535
|
-
}
|
|
536
|
-
```
|
|
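The `inc` and `rate` bodies are left as comments in the skeleton. A minimal sketch of how they might be filled in follows, assuming an injectable clock for tests and a rate expressed in events per millisecond; `WindowedCounterSketch`, `Labels`, and this `labelKey` are hypothetical and may differ from the real primitives.

```ts
type Labels = Record<string, string>;

// Stable key regardless of property order.
function labelKey(labels: Labels): string {
  return Object.keys(labels).sort().map((k) => `${k}=${labels[k]}`).join(",");
}

class WindowedCounterSketch {
  private events: { timestamp: number; key: string; delta: number }[] = [];
  private windowMs: number;
  private now: () => number;

  constructor(windowMs = 3_600_000, now: () => number = Date.now) {
    this.windowMs = windowMs;
    this.now = now;
  }

  inc(labels: Labels, delta = 1): void {
    this.events.push({ timestamp: this.now(), key: labelKey(labels), delta });
    this.prune();
  }

  // Events per millisecond over the last durationMs; multiply by 3_600_000 for a per-hour rate.
  rate(labels: Labels, durationMs: number): number {
    const since = this.now() - durationMs;
    const key = labelKey(labels);
    let total = 0;
    for (const e of this.events) if (e.key === key && e.timestamp > since) total += e.delta;
    return total / durationMs;
  }

  size(): number {
    return this.events.length;
  }

  private prune(): void {
    const cutoff = this.now() - this.windowMs;
    // Events are appended in time order, so expired ones are always at the front.
    let drop = 0;
    while (drop < this.events.length && this.events[drop].timestamp <= cutoff) drop++;
    if (drop > 0) this.events.splice(0, drop);
  }
}
```

For the path (c) alert in 9.1.D, a check such as `counter.rate(labels, 3_600_000) * 3_600_000 > 3` would express "more than 3 per hour" with this shape.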
537
|
-
|
|
538
|
-
**Wire MetricRegistry:** per-metric `retentionMs` option, defaulting to 1h for counter rates; gauges keep only the latest value (no retention); histogram observations are all retained (memory bounded by label cardinality).
|
|
539
|
-
|
|
540
|
-
**Tests:** `test/unit/metric-retention.test.ts` — 5 cases (retain within window, prune outside, rate calculation, multiple labels isolated, mock clock).
|
|
541
|
-
|
|
542
|
-
#### 9.2.C Histogram quantile (1 dev-day)
|
|
543
|
-
|
|
544
|
-
**Update `metrics-primitives.ts`:** add a `quantile()` method:
|
|
545
|
-
|
|
546
|
-
```ts
|
|
547
|
-
quantile(labels: MetricLabels, q: number): number {
|
|
548
|
-
const obs = this.observations.get(labelKey(labels));
|
|
549
|
-
if (!obs || obs.count === 0) return NaN;
|
|
550
|
-
const targetIdx = q * obs.count;
|
|
551
|
-
let cumulative = 0;
|
|
552
|
-
for (let i = 0; i < this.buckets.length; i++) {
|
|
553
|
-
cumulative += obs.counts[i];
|
|
554
|
-
if (cumulative >= targetIdx) {
|
|
555
|
-
const prevCum = cumulative - obs.counts[i];
|
|
556
|
-
const lower = i === 0 ? 0 : this.buckets[i - 1];
|
|
557
|
-
const upper = this.buckets[i];
|
|
558
|
-
// Linear interpolation within bucket
|
|
559
|
-
const fraction = (targetIdx - prevCum) / Math.max(1, obs.counts[i]);
|
|
560
|
-
return lower + fraction * (upper - lower);
|
|
561
|
-
}
|
|
562
|
-
}
|
|
563
|
-
return this.buckets[this.buckets.length - 1]; // overflow bucket
|
|
564
|
-
}
|
|
565
|
-
```
|
|
566
|
-
|
|
567
|
-
**Tests:** extend `test/unit/metrics-primitives.test.ts` with quantile p50/p95/p99 on fixture data; edge cases: empty, single value, all in one bucket.
|
|
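A worked example may help when writing the quantile fixtures. The standalone function below restates the interpolation from the `quantile()` method, assuming per-bucket (non-cumulative) counts and upper bucket bounds; `bucketQuantile` and the sample data are hypothetical.

```ts
// `bounds` are upper bucket bounds; `counts` are per-bucket observation counts.
function bucketQuantile(bounds: number[], counts: number[], q: number): number {
  if (q < 0 || q > 1) throw new RangeError("q must be within [0, 1]");
  const total = counts.reduce((a, b) => a + b, 0);
  if (total === 0) return NaN;
  const target = q * total;
  let cumulative = 0;
  for (let i = 0; i < bounds.length; i++) {
    const inBucket = counts[i] ?? 0;
    cumulative += inBucket;
    if (cumulative >= target) {
      const lower = i === 0 ? 0 : bounds[i - 1];
      // Linear interpolation: how far into this bucket the target rank falls.
      const fraction = (target - (cumulative - inBucket)) / Math.max(1, inBucket);
      return lower + fraction * (bounds[i] - lower);
    }
  }
  return bounds[bounds.length - 1]; // target lies beyond the last finite bound
}

const exampleBounds = [10, 50, 100];
const exampleCounts = [2, 6, 2]; // 10 observations in total
const p50 = bucketQuantile(exampleBounds, exampleCounts, 0.5);  // rank 5 falls halfway into (10, 50]
const p95 = bucketQuantile(exampleBounds, exampleCounts, 0.95); // rank 9.5 falls 3/4 into (50, 100]
```

For this fixture the p50 rank of 5 sits 3 observations into a bucket of 6, giving 10 + 0.5 × 40 = 30, and the p95 rank of 9.5 sits 1.5 observations into a bucket of 2, giving 50 + 0.75 × 50 = 87.5.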
568
|
-
|
|
569
|
-
#### 9.2.D Metric file sink (1 dev-day)
|
|
570
|
-
|
|
571
|
-
**New file:** `src/observability/metric-sink.ts`
|
|
572
|
-
|
|
573
|
-
**Logic:** Similar to the Phase 8 `notification-sink.ts`: daily JSONL rotation, configurable retention. The sink writer runs on an interval (default 60s) → snapshot registry → append. Reuse `redactSecrets` from `diagnostic-export.ts` for label values (a precaution against secret patterns).
|
|
574
|
-
|
|
575
|
-
```ts
|
|
576
|
-
import * as fs from "node:fs";
import * as path from "node:path";
import { redactSecrets } from "../runtime/diagnostic-export.ts"; // Phase 8 helper
|
|
577
|
-
import { logInternalError } from "../utils/internal-error.ts";
|
|
578
|
-
|
|
579
|
-
export interface MetricSink {
|
|
580
|
-
writeSnapshot(snapshots: MetricSnapshot[]): void;
|
|
581
|
-
dispose(): void;
|
|
582
|
-
}
|
|
583
|
-
|
|
584
|
-
export interface MetricFileSinkOptions {
|
|
585
|
-
crewRoot: string;
|
|
586
|
-
registry: MetricRegistry;
|
|
587
|
-
retentionDays?: number; // default 7
|
|
588
|
-
intervalMs?: number; // default 60_000
|
|
589
|
-
}
|
|
590
|
-
|
|
591
|
-
export function createMetricFileSink(opts: MetricFileSinkOptions): MetricSink {
|
|
592
|
-
const dir = path.join(opts.crewRoot, "state", "metrics");
|
|
593
|
-
const retentionDays = opts.retentionDays ?? 7;
|
|
594
|
-
const writeSnapshot = (snapshots: MetricSnapshot[]): void => {
|
|
595
|
-
try {
|
|
596
|
-
const date = new Date().toISOString().slice(0, 10);
|
|
597
|
-
fs.mkdirSync(dir, { recursive: true }); // create the directory before scanning it for old files
|
|
598
|
-
rotateOldFiles(dir, retentionDays);
|
|
599
|
-
const redacted = redactSecrets(snapshots);
|
|
600
|
-
fs.appendFileSync(path.join(dir, `${date}.jsonl`), `${JSON.stringify({ exportedAt: new Date().toISOString(), snapshots: redacted })}\n`, "utf-8");
|
|
601
|
-
} catch (e) { logInternalError("metric-sink.write", e); }
|
|
602
|
-
};
|
|
603
|
-
const timer = setInterval(() => writeSnapshot(opts.registry.snapshot()), opts.intervalMs ?? 60_000);
|
|
604
|
-
return { writeSnapshot, dispose: () => clearInterval(timer) };
|
|
605
|
-
}
|
|
606
|
-
```
|
|
607
|
-
|
|
608
|
-
**Tests:** `test/unit/metric-sink.test.ts` — 5 cases (write basic, daily rotation, retention prune, telemetry disabled no-op when not instantiated, dispose stops timer + secret redaction in labels).
|
|
609
|
-
|
|
610
|
-
---
|
|
611
|
-
|
|
612
|
-
### Phase 9.3 — Export Adapters (3 dev-days)
|
|
613
|
-
|
|
614
|
-
#### 9.3.A Prometheus exposition format (1 dev-day)
|
|
615
|
-
|
|
616
|
-
**New file:** `src/observability/exporters/prometheus-exporter.ts`
|
|
617
|
-
|
|
618
|
-
```ts
|
|
619
|
-
export function formatPrometheus(snapshots: MetricSnapshot[]): string {
|
|
620
|
-
const lines: string[] = [];
|
|
621
|
-
for (const snap of snapshots) {
|
|
622
|
-
lines.push(`# HELP ${snap.name} ${snap.description ?? ""}`);
|
|
623
|
-
lines.push(`# TYPE ${snap.name} ${snap.type}`);
|
|
624
|
-
// Format values per type with labels: name{label="value"} value timestamp
|
|
625
|
-
// ...
|
|
626
|
-
}
|
|
627
|
-
return lines.join("\n") + "\n";
|
|
628
|
-
}
|
|
629
|
-
```
|
|
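The per-type formatting is elided in the skeleton. The sketch below covers counters and gauges only and assumes a simplified snapshot shape (`SnapshotSketch` with a flat `samples` array), which may not match the real `MetricSnapshot`. It also sanitizes names, because the Prometheus text format restricts metric names to `[a-zA-Z_:][a-zA-Z0-9_:]*`, so dotted names like `crew.run.count` need rewriting. All helper names are hypothetical.

```ts
interface SampleSketch { labels: Record<string, string>; value: number }
interface SnapshotSketch { name: string; description?: string; type: "counter" | "gauge"; samples: SampleSketch[] }

// Metric names: replace anything outside [a-zA-Z0-9_:] and avoid a leading digit.
function promMetricName(name: string): string {
  const cleaned = name.replace(/[^a-zA-Z0-9_:]/g, "_");
  return /^[0-9]/.test(cleaned) ? `_${cleaned}` : cleaned;
}

// Label names are stricter than metric names: no colon allowed.
function promLabelName(name: string): string {
  const cleaned = name.replace(/[^a-zA-Z0-9_]/g, "_");
  return /^[0-9]/.test(cleaned) ? `_${cleaned}` : cleaned;
}

// Label values escape backslash, double quote, and newline.
function promLabelValue(v: string): string {
  return v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

function formatSketch(snapshots: SnapshotSketch[]): string {
  const lines: string[] = [];
  for (const snap of snapshots) {
    const name = promMetricName(snap.name);
    // HELP text escapes backslash and newline only.
    const help = (snap.description ?? "").replace(/\\/g, "\\\\").replace(/\n/g, "\\n");
    lines.push(`# HELP ${name} ${help}`);
    lines.push(`# TYPE ${name} ${snap.type}`);
    for (const s of snap.samples) {
      const pairs = Object.entries(s.labels).map(([k, v]) => `${promLabelName(k)}="${promLabelValue(v)}"`);
      lines.push(pairs.length > 0 ? `${name}{${pairs.join(",")}} ${s.value}` : `${name} ${s.value}`);
    }
  }
  return lines.length > 0 ? lines.join("\n") + "\n" : "";
}
```

Histograms would additionally need `_bucket{le="..."}`, `_sum`, and `_count` series, which this sketch leaves out.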
630
|
-
|
|
631
|
-
**Optional HTTP endpoint:** `team metrics --serve --port 9091` command starts simple `http.createServer` exposing `/metrics` endpoint. Off by default.
|
|
632
|
-
|
|
633
|
-
**Tests:** `test/unit/prometheus-exporter.test.ts` — 6 cases (counter format, gauge format, histogram format with buckets, labels escaping, empty registry, special chars).
|
|
634
|
-
|
|
635
|
-
#### 9.3.B OTLP HTTP exporter (1.5 dev-days, OPTIONAL, disabled by default per D10)
|
|
636
|
-
|
|
637
|
-
**New file:** `src/observability/exporters/otlp-exporter.ts`
|
|
638
|
-
|
|
639
|
-
**Logic:** Convert MetricSnapshot → OTLP JSON format (HTTP/protobuf as an alternative); POST to the configured endpoint. Buffer into 60s batches.
|
|
640
|
-
|
|
641
|
-
```ts
|
|
642
|
-
export interface OTLPExporterOptions {
|
|
643
|
-
endpoint: string; // e.g. http://collector:4318/v1/metrics
|
|
644
|
-
headers?: Record<string, string>;
|
|
645
|
-
intervalMs?: number; // default 60_000
|
|
646
|
-
timeoutMs?: number; // default 10_000
|
|
647
|
-
}
|
|
648
|
-
|
|
649
|
-
export class OTLPExporter {
|
|
650
|
-
constructor(private opts: OTLPExporterOptions, private registry: MetricRegistry) {}
|
|
651
|
-
start(): void { /* setInterval push */ }
|
|
652
|
-
private async push(): Promise<void> {
|
|
653
|
-
const otlp = convertToOTLP(this.registry.snapshot());
|
|
654
|
-
try {
|
|
655
|
-
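const res = await fetch(this.opts.endpoint, { method: "POST", headers: { "content-type": "application/json", ...this.opts.headers }, body: JSON.stringify(otlp), signal: AbortSignal.timeout(this.opts.timeoutMs ?? 10_000) });
if (!res.ok) logInternalError("otlp-export", new Error(`OTLP push failed: HTTP ${res.status}`)); // fetch does not reject on 4xx/5xx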
|
|
656
|
-
} catch (e) { logInternalError("otlp-export", e); }
|
|
657
|
-
}
|
|
658
|
-
dispose(): void { /* clearInterval */ }
|
|
659
|
-
}
|
|
660
|
-
|
|
661
|
-
function convertToOTLP(snapshots: MetricSnapshot[]): unknown { /* OpenTelemetry JSON spec */ }
|
|
662
|
-
```
|
|
663
|
-
|
|
664
|
-
**Schema config:**
|
|
665
|
-
```ts
|
|
666
|
-
otlp: Type.Optional(Type.Object({
|
|
667
|
-
enabled: Type.Optional(Type.Boolean()),
|
|
668
|
-
endpoint: Type.String(),
|
|
669
|
-
headers: Type.Optional(Type.Record(Type.String(), Type.String())),
|
|
670
|
-
intervalMs: Type.Optional(Type.Integer({ minimum: 5000 })),
|
|
671
|
-
})),
|
|
672
|
-
```
|
|
673
|
-
|
|
674
|
-
**Tests:** `test/unit/otlp-exporter.test.ts` — 5 cases (format conversion, push success mock fetch, push timeout, dispose stops, disabled no-op).
|
|
675
|
-
|
|
676
|
-
#### 9.3.C Adapter abstraction (0.5 dev-day)
|
|
677
|
-
|
|
678
|
-
**New file:** `src/observability/exporters/adapter.ts`
|
|
679
|
-
|
|
680
|
-
```ts
|
|
681
|
-
export interface MetricExporter {
|
|
682
|
-
name: string;
|
|
683
|
-
push(snapshots: MetricSnapshot[]): Promise<void>;
|
|
684
|
-
dispose(): void;
|
|
685
|
-
}
|
|
686
|
-
|
|
687
|
-
export class CompositeExporter implements MetricExporter {
|
|
688
|
-
name = "composite";
|
|
689
|
-
constructor(private exporters: MetricExporter[]) {}
|
|
690
|
-
async push(snapshots: MetricSnapshot[]): Promise<void> {
|
|
691
|
-
await Promise.allSettled(this.exporters.map((e) => e.push(snapshots)));
|
|
692
|
-
}
|
|
693
|
-
dispose(): void { for (const e of this.exporters) e.dispose(); }
|
|
694
|
-
}
|
|
695
|
-
```
|
|
696
|
-
|
|
697
|
-
**Tests:** `test/unit/composite-exporter.test.ts` — 3 cases (push parallel, dispose all, error in one doesn't break others).
|
|
698
|
-
|
|
699
|
-
---
|
|
700
|
-
|
|
701
|
-
### Phase 9.4 — UI & Commands (3 dev-days)
|
|
702
|
-
|
|
703
|
-
#### 9.4.A `team metrics` command (1 dev-day)
|
|
704
|
-
|
|
705
|
-
**Update `src/extension/team-tool/api.ts`:** add the `metrics-snapshot` operation:
|
|
706
|
-
|
|
707
|
-
```ts
|
|
708
|
-
if (operation === "metrics-snapshot") {
|
|
709
|
-
const filter = typeof cfg.filter === "string" ? cfg.filter : undefined; // glob pattern
|
|
710
|
-
const snapshots = getMetricRegistry().snapshot();
|
|
711
|
-
const filtered = filter ? snapshots.filter((s) => globMatch(s.name, filter)) : snapshots;
|
|
712
|
-
return result(JSON.stringify(filtered, null, 2), { action: "api", status: "ok" });
|
|
713
|
-
}
|
|
714
|
-
```
|
|
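The snippet above relies on a `globMatch` helper that is not shown. A minimal sketch of such a matcher follows, assuming only `*` and `?` wildcards are needed for metric names; `globToRegExp` and `matchName` are hypothetical names, and the project's real helper may behave differently.

```ts
// "*" matches any run of characters, "?" matches exactly one; everything else is literal.
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters except "*" and "?", which are translated afterwards.
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`^${escaped.replace(/\*/g, ".*").replace(/\?/g, ".")}$`);
}

function matchName(name: string, glob: string): boolean {
  return globToRegExp(glob).test(name);
}
```

Escaping the dot matters here because metric names are dot-separated: `crew.task.*` should not match a name such as `crewXtask.count`.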
715
|
-
|
|
716
|
-
**Slash command:** `/team-metrics [filter]` → wraps API call, prints formatted output.
|
|
717
|
-
|
|
718
|
-
**Tests:** `test/unit/team-tool-metrics.test.ts` — 3 cases (snapshot all, filter glob, empty registry).
|
|
719
|
-
|
|
720
|
-
#### 9.4.B Metrics dashboard pane (1 dev-day)
|
|
721
|
-
|
|
722
|
-
**New file:** `src/ui/dashboard-panes/metrics-pane.ts`
|
|
723
|
-
|
|
724
|
-
**Render:** top 10 metrics by value, plus a sparkline for the histogram p95 trend (last 60 min, kept in the retention store).
|
|
725
|
-
|
|
726
|
-
```ts
|
|
727
|
-
export interface MetricsPaneOptions {
|
|
728
|
-
registry: MetricRegistry;
|
|
729
|
-
maxCounters?: number; // default 10
|
|
730
|
-
}
|
|
731
|
-
|
|
732
|
-
// Signature consistent with Phase 8 panes: `(snapshot, opts?)`
|
|
733
|
-
export function renderMetricsPane(snapshot: RunUiSnapshot | undefined, opts: MetricsPaneOptions): string[] {
|
|
734
|
-
if (!snapshot) return ["Metrics pane: snapshot unavailable"];
|
|
735
|
-
const metrics = opts.registry.snapshot();
|
|
736
|
-
const counters = metrics.filter((m) => m.type === "counter").slice(0, opts.maxCounters ?? 10);
|
|
737
|
-
const lines: string[] = ["Metrics top 10 counters:"];
|
|
738
|
-
for (const c of counters) {
|
|
739
|
-
// Format: name{labels}: value
|
|
740
|
-
// ...
|
|
741
|
-
}
|
|
742
|
-
return lines;
|
|
743
|
-
}
|
|
744
|
-
```
|
|
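The render note mentions a sparkline for the p95 trend, which the pane skeleton does not show. One way to draw it is sketched below, assuming the trend arrives as a plain array of numbers with `NaN` for intervals without data; `sparkline` and `BLOCKS` are hypothetical names.

```ts
const BLOCKS = ["▁", "▂", "▃", "▄", "▅", "▆", "▇", "█"];

// Scale each value into one of eight block heights relative to the series min and max.
function sparkline(values: number[]): string {
  const finite = values.filter((v) => Number.isFinite(v));
  if (finite.length === 0) return "";
  const min = Math.min(...finite);
  const max = Math.max(...finite);
  const span = max - min;
  return values
    .map((v) => {
      if (!Number.isFinite(v)) return " "; // gap for missing samples, e.g. NaN from an empty histogram
      if (span === 0) return BLOCKS[0];    // flat series renders at the lowest height
      const idx = Math.round(((v - min) / span) * (BLOCKS.length - 1));
      return BLOCKS[idx];
    })
    .join("");
}
```

Scaling relative to the series itself keeps the line readable whatever the unit, at the cost of hiding absolute magnitude, so the pane would still want to print the latest p95 value next to it.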
745
|
-
|
|
746
|
-
**Update `src/ui/run-dashboard.ts`:** key `6` → `activePane = "metrics"`; update the help line; the constructor receives the `registry` reference via `RunDashboardOptions`.
|
|
747
|
-
|
|
748
|
-
**Tests:** `test/unit/metrics-pane.test.ts` — 4 cases.
|
|
749
|
-
|
|
750
|
-
#### 9.4.C Diagnostic export include metrics (0.5 dev-day) — **Schema version bump (D21)**
|
|
751
|
-
|
|
752
|
-
**Update `src/runtime/diagnostic-export.ts`** (Phase 8 file, 4303 bytes — verified):
|
|
753
|
-
|
|
754
|
-
```ts
|
|
755
|
-
// Schema additive — backward-compat for consumers reading old DiagnosticReport
|
|
756
|
-
export interface DiagnosticReport {
|
|
757
|
-
schemaVersion?: number; // NEW v2 — undefined treated as v1
|
|
758
|
-
runId: string;
|
|
759
|
-
exportedAt: string;
|
|
760
|
-
manifest: TeamRunManifest;
|
|
761
|
-
tasks: TeamTaskState[];
|
|
762
|
-
recentEvents: TeamEvent[];
|
|
763
|
-
heartbeat: HeartbeatSummary;
|
|
764
|
-
agents: unknown[];
|
|
765
|
-
envRedacted: Record<string, string>;
|
|
766
|
-
metricsSnapshot?: MetricSnapshot[]; // NEW — optional, only set when registry available
|
|
767
|
-
}
|
|
768
|
-
|
|
769
|
-
// In exportDiagnostic(): apply redactSecrets() recursive on metricsSnapshot label values
|
|
770
|
-
// before writing; secret patterns (token/key/password/secret/credential/auth) can appear
|
|
771
|
-
// in label values or histogram metadata.
|
|
772
|
-
```
|
|
773
|
-
|
|
774
|
-
**Caller (commands.ts handler):** pass the per-session `MetricRegistry` reference into `exportDiagnostic(ctx, runId, { registry })`. If the registry is undefined (telemetry disabled or Phase 9 not yet wired), the `metricsSnapshot` field stays undefined, which keeps backward compatibility with Phase 8 consumers.
|
|
775
|
-
|
|
776
|
-
**Tests:** `test/unit/diagnostic-export.test.ts` extend — 2 cases:
|
|
777
|
-
1. Verify `metricsSnapshot` is included when a registry is passed; `schemaVersion === 2`.
|
|
778
|
-
2. Verify secret labels redacted (e.g., metric `crew.api.key_calls{auth_token="abc"}` → `auth_token: "***"`).
|
|
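Test case 2 above expects secret-looking labels to come out masked. The sketch below shows the idea at label level only; it is not the real `redactSecrets()` from `diagnostic-export.ts`, and `redactLabels`, `redactSamples`, and `SECRET_KEY_PATTERN` are hypothetical names built from the pattern list in the code comment.

```ts
// Label keys matching any of these fragments are treated as secret-bearing.
const SECRET_KEY_PATTERN = /token|key|password|secret|credential|auth/i;

interface LabelledSample { labels: Record<string, string>; value: number }

// Mask the value of any label whose key looks like it carries a secret.
function redactLabels(labels: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [k, v] of Object.entries(labels)) out[k] = SECRET_KEY_PATTERN.test(k) ? "***" : v;
  return out;
}

// Returns copies so the live registry data is never modified by an export.
function redactSamples(samples: LabelledSample[]): LabelledSample[] {
  return samples.map((s) => ({ ...s, labels: redactLabels(s.labels) }));
}
```

Matching on the label key rather than the value is a deliberate simplification here; a value-based scan would catch secrets stored under innocuous keys but is more prone to false positives.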
779
|
-
|
|
780
|
-
---
|
|
781
|
-
|
|
782
|
-
### Phase 9.5 — Wiring & Tests (3 dev-days)
|
|
783
|
-
|
|
784
|
-
#### 9.5.A Wire register.ts (1 dev-day) — **Per-session pattern (D17)**
|
|
785
|
-
|
|
786
|
-
**Update `src/extension/register.ts`:**
|
|
787
|
-
```ts
|
|
788
|
-
import { createMetricRegistry } from "../observability/metric-registry.ts"; // factory, not singleton
|
|
789
|
-
import { wireEventToMetrics } from "../observability/event-to-metric.ts";
|
|
790
|
-
import { HeartbeatWatcher } from "../runtime/heartbeat-watcher.ts";
|
|
791
|
-
import { detectInterruptedRuns } from "../runtime/crash-recovery.ts";
|
|
792
|
-
import { createMetricFileSink } from "../observability/metric-sink.ts";
|
|
793
|
-
|
|
794
|
-
// Module-scope state for the session (consistent with the Phase 8 notificationRouter pattern):
|
|
795
|
-
let metricRegistry: MetricRegistry | undefined;
|
|
796
|
-
let eventMetricSub: { dispose: () => void } | undefined;
|
|
797
|
-
let metricSink: MetricSink | undefined;
|
|
798
|
-
let heartbeatWatcher: HeartbeatWatcher | undefined;
|
|
799
|
-
|
|
800
|
-
const configureObservability = (ctx: ExtensionContext): void => {
|
|
801
|
-
// Dispose existing per-session resources first (idempotent)
|
|
802
|
-
heartbeatWatcher?.dispose();
|
|
803
|
-
metricSink?.dispose();
|
|
804
|
-
eventMetricSub?.dispose();
|
|
805
|
-
metricRegistry?.dispose();
|
|
806
|
-
|
|
807
|
-
const config = loadConfig(ctx.cwd).config;
|
|
808
|
-
if (config.observability?.enabled === false) {
|
|
809
|
-
metricRegistry = undefined; eventMetricSub = undefined; metricSink = undefined; heartbeatWatcher = undefined;
|
|
810
|
-
return;
|
|
811
|
-
}
|
|
812
|
-
|
|
813
|
-
metricRegistry = createMetricRegistry();
|
|
814
|
-
eventMetricSub = wireEventToMetrics(pi.events, metricRegistry);
|
|
815
|
-
if (config.telemetry?.enabled !== false) {
|
|
816
|
-
metricSink = createMetricFileSink({ crewRoot: projectCrewRoot(ctx.cwd), registry: metricRegistry, retentionDays: config.observability?.metricRetentionDays ?? 7 });
|
|
817
|
-
}
|
|
818
|
-
heartbeatWatcher = new HeartbeatWatcher({
|
|
819
|
-
cwd: ctx.cwd,
|
|
820
|
-
pollIntervalMs: config.observability?.pollIntervalMs ?? 5000,
|
|
821
|
-
manifestCache: getManifestCache(ctx.cwd),
|
|
822
|
-
registry: metricRegistry,
|
|
823
|
-
router: notificationRouter!, // Phase 8 router required
|
|
824
|
-
onDeadletterTrigger: (runId, taskId) => {
|
|
825
|
-
// Path (b) D22 — call deadletter writer
|
|
826
|
-
const loaded = loadRunManifestById(ctx.cwd, runId); // the run may have been removed since the tick
if (loaded) appendDeadletter(loaded.manifest, { taskId, runId, reason: "heartbeat-dead", attempts: 0, timestamp: new Date().toISOString() });
|
|
827
|
-
},
|
|
828
|
-
});
|
|
829
|
-
heartbeatWatcher.start();
|
|
830
|
-
|
|
831
|
-
if (config.reliability?.autoRecover === true) {
|
|
832
|
-
const plans = detectInterruptedRuns(ctx.cwd, getManifestCache(ctx.cwd));
|
|
833
|
-
for (const plan of plans) {
|
|
834
|
-
notificationRouter?.enqueue({ id: `recovery_prompt_${plan.runId}`, severity: "warning", source: "crash-recovery", runId: plan.runId, title: `Run ${plan.runId} was interrupted`, body: `${plan.resumableTasks.length} tasks pending recovery. Open dashboard → confirm to resume.` });
|
|
835
|
-
}
|
|
836
|
-
}
|
|
837
|
-
};
|
|
838
|
-
|
|
839
|
-
// session_start hook:
|
|
840
|
-
pi.on("session_start", (ctx) => {
|
|
841
|
-
currentCtx = ctx;
|
|
842
|
-
configureNotifications(ctx); // Phase 8
|
|
843
|
-
configureObservability(ctx); // Phase 9 NEW
|
|
844
|
-
// ... rest
|
|
845
|
-
});
|
|
846
|
-
|
|
847
|
-
// session_shutdown hook (extends Phase 8 cleanupRuntime):
|
|
848
|
-
pi.on("session_shutdown", () => {
|
|
849
|
-
// Phase 9 cleanup (per-session, in reverse setup order)
|
|
850
|
-
heartbeatWatcher?.dispose(); heartbeatWatcher = undefined;
|
|
851
|
-
metricSink?.dispose(); metricSink = undefined;
|
|
852
|
-
eventMetricSub?.dispose(); eventMetricSub = undefined;
|
|
853
|
-
metricRegistry?.dispose(); metricRegistry = undefined;
|
|
854
|
-
// Phase 8 cleanup
|
|
855
|
-
notificationRouter?.dispose();
|
|
856
|
-
notificationSink?.dispose();
|
|
857
|
-
// ...
|
|
858
|
-
});
|
|
859
|
-
```
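The reverse-order, idempotent cleanup used in the hooks above can also be expressed with a small stack. This is only a sketch: `SessionDisposable` and `SessionResources` are hypothetical names for illustration and are not part of pi-crew.

```ts
// Hypothetical helper (assumption, not pi-crew API): tracks per-session resources.
interface SessionDisposable {
  dispose(): void;
}

class SessionResources {
  private stack: SessionDisposable[] = [];

  // Register a resource in setup order and hand it back to the caller.
  add<T extends SessionDisposable>(resource: T): T {
    this.stack.push(resource);
    return resource;
  }

  // Dispose in reverse setup order. Calling it again is a no-op
  // because the stack is drained on the first call.
  disposeAll(): void {
    while (this.stack.length > 0) {
      const resource = this.stack.pop();
      try {
        resource?.dispose();
      } catch {
        // One failing dispose must not block the remaining resources.
      }
    }
  }
}
```

With such a helper, `session_start` would call `disposeAll()` before re-creating resources, and `session_shutdown` would call it once more.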
|
|
860
|
-
|
|
861
|
-
**Wrap executeTeamRun with correlation (9.0.C):**
|
|
862
|
-
```ts
|
|
863
|
-
const traceId = newSpanId(runId); // {runId}:main:1 from spanCounter
|
|
864
|
-
withCorrelation({ traceId, spanId: traceId }, async () => {
|
|
865
|
-
await executeTeamRun(...);
|
|
866
|
-
});
|
|
867
|
-
```
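For readers unfamiliar with `AsyncLocalStorage`, a minimal sketch of what a `withCorrelation` helper could look like follows. The `CorrelationContext` shape and the `currentCorrelation` accessor are assumptions for illustration; the plan itself only names `traceId` and `spanId`.

```ts
import { AsyncLocalStorage } from "node:async_hooks";

// Assumed context shape; only traceId and spanId appear in the plan.
interface CorrelationContext {
  traceId: string;
  spanId: string;
}

const correlationStore = new AsyncLocalStorage<CorrelationContext>();

// Run fn with ctx visible to everything it calls, including awaited work.
function withCorrelation<T>(ctx: CorrelationContext, fn: () => T): T {
  return correlationStore.run(ctx, fn);
}

// Hypothetical accessor: returns undefined outside any withCorrelation scope.
function currentCorrelation(): CorrelationContext | undefined {
  return correlationStore.getStore();
}
```

Because the store follows the async call chain, code deep inside the run can read the trace ID without it being passed as a parameter.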
|
|
868
|
-
|
|
869
|
-
**Pass `registry` reference downstream:**
|
|
870
|
-
- `metricRegistry` is exposed through the `RegisterTeamCommandsDeps` interface (commands.ts) for the dashboard pane and diagnostic export.
|
|
871
|
-
- `dispatchDiagnosticExport(ctx, runId, { registry: metricRegistry })` so that 9.4.C can inject the metrics snapshot.
|
|
872
|
-
|
|
873
|
-
#### 9.5.B Tests + smoke (2 dev-days)
|
|
874
|
-
|
|
875
|
-
**Unit (~70 new cases):**
|
|
876
|
-
- metrics-primitives.test.ts (12)
|
|
877
|
-
- metric-registry.test.ts (6)
|
|
878
|
-
- correlation.test.ts (5)
|
|
879
|
-
- heartbeat-gradient.test.ts (8)
|
|
880
|
-
- heartbeat-watcher.test.ts (6)
|
|
881
|
-
- retry-executor.test.ts (10)
|
|
882
|
-
- deadletter.test.ts (4)
|
|
883
|
-
- event-to-metric.test.ts (8)
|
|
884
|
-
- metric-retention.test.ts (5)
|
|
885
|
-
- metric-sink.test.ts (5)
|
|
886
|
-
- prometheus-exporter.test.ts (6)
|
|
887
|
-
- otlp-exporter.test.ts (5)
|
|
888
|
-
- composite-exporter.test.ts (3)
|
|
889
|
-
- team-tool-metrics.test.ts (3)
|
|
890
|
-
- metrics-pane.test.ts (4)
|
|
891
|
-
|
|
892
|
-
**Integration (~7 new cases):**
|
|
893
|
-
- `crash-recovery.test.ts` — 5 sub-cases.
|
|
894
|
-
- `retry-executor-roundtrip.test.ts` — task fail 2x, succeed 3rd → metric counter records 3 attempts.
|
|
895
|
-
- `heartbeat-watcher-deadletter.test.ts` — 3 dead detections in 1h → deadletter triggered + alert.
|
|
896
|
-
- `metric-pipeline-end-to-end.test.ts` — emit events → snapshot via team-metrics → values match.
|
|
897
|
-
- `correlation-cross-component.test.ts` — start run → subagent spawn → mailbox event — all events share traceId.
|
|
898
|
-
- `prometheus-export.test.ts` — start run, fetch /metrics endpoint, verify format.
|
|
899
|
-
- `otlp-export-mock.test.ts` — mock collector, verify POST body schema.
|
|
900
|
-
|
|
901
|
-
**Smoke manual (10 scenarios):**
|
|
902
|
-
1. Run team, finish → `/team-metrics` shows `crew.run.count{status=completed}=1`.
|
|
903
|
-
2. Filter: `/team-metrics crew.task.*` shows only task metrics.
|
|
904
|
-
3. Set `reliability.autoRetry=true`, fail task 2x → metric `retry_count` shows 3 attempts.
|
|
905
|
-
4. Kill foreground process mid-run → reopen session → confirm prompt → resume → tasks continue.
|
|
906
|
-
5. Set `reliability.autoRecover=false` → kill process → reopen → no prompt → run cancelled.
|
|
907
|
-
6. Heartbeat stuck > 5min → notification toast → metric `heartbeat.dead_total` inc.
|
|
908
|
-
7. Trigger 4 deadletter messages → alert toast severity error.
|
|
909
|
-
8. `<crewRoot>/state/metrics/{date}.jsonl` populated after 60s.
|
|
910
|
-
9. `/team-metrics` filter on Counter histogram quantile p95.
|
|
911
|
-
10. OTLP export enabled with mock collector → verify push every 60s.
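Scenario 7 depends on counting deadletter messages inside a time window. A minimal sketch of such a windowed counter is shown here; `WindowedCounter` is a hypothetical name, and the one-hour window matches the ">3 messages/hour" default only by assumption.

```ts
// Hypothetical sliding-window counter (assumption, not the pi-crew implementation).
class WindowedCounter {
  private timestamps: number[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Record one event at nowMs and return the count inside the window.
  record(nowMs: number): number {
    this.timestamps.push(nowMs);
    return this.countAt(nowMs);
  }

  // Drop events older than the window, then count what is left.
  countAt(nowMs: number): number {
    const cutoff = nowMs - this.windowMs;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length;
  }
}
```

An alert would fire when `record(...)` returns a value above the threshold, which is why the scenario needs a fourth message rather than a third.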
|
|
912
|
-
|
|
913
|
-
## 3. Wave Organization
|
|
914
|
-
|
|
915
|
-
```
|
|
916
|
-
Wave 1 (sequential, 4 days) — Foundation must come first
|
|
917
|
-
└─ 9.0 (.A → .B → .C → .D → .E preflight)
|
|
918
|
-
|
|
919
|
-
Wave 2 (parallel, 5 days) — depends on Wave 1
|
|
920
|
-
├─ 9.1.A Heartbeat watcher
|
|
921
|
-
├─ 9.1.B Retry executor
|
|
922
|
-
└─ 9.1.D Deadletter (depends on 9.1.B + 9.1.A)
|
|
923
|
-
⤷ 9.1.C Crash recovery (depends on 9.0.C correlation)
|
|
924
|
-
|
|
925
|
-
Wave 3 (parallel, 4 days) — depends on Wave 1
|
|
926
|
-
├─ 9.2.A Event-to-metric subscriber
|
|
927
|
-
├─ 9.2.B Metric retention
|
|
928
|
-
├─ 9.2.C Histogram quantile (extends 9.0.A)
|
|
929
|
-
└─ 9.2.D Metric sink
|
|
930
|
-
|
|
931
|
-
Wave 4 (parallel, 3 days) — depends on Wave 3
|
|
932
|
-
├─ 9.3.A Prometheus exporter
|
|
933
|
-
├─ 9.3.B OTLP exporter (optional)
|
|
934
|
-
├─ 9.3.C Adapter abstraction
|
|
935
|
-
└─ 9.4.A team metrics command
|
|
936
|
-
⤷ 9.4.B Metrics dashboard pane
|
|
937
|
-
⤷ 9.4.C Diagnostic include metrics
|
|
938
|
-
|
|
939
|
-
Wave 5 (sequential, 3 days)
|
|
940
|
-
├─ 9.5.A Wire register.ts
|
|
941
|
-
└─ 9.5.B Tests + smoke validation
|
|
942
|
-
```
|
|
943
|
-
|
|
944
|
-
**Total estimate: 19.5-22.5 dev-days** (Theme B+C combined; Wave 1 +0.5d for 9.0.E preflight).
|
|
945
|
-
|
|
946
|
-
## 4. Files Affected
|
|
947
|
-
|
|
948
|
-
### New (33 files, including +1 for the 9.0.E preflight test)
|
|
949
|
-
| Path | Purpose | Est LOC |
|
|
950
|
-
|---|---|---|
|
|
951
|
-
| `src/observability/metrics-primitives.ts` | Counter/Gauge/Histogram base | ~200 |
|
|
952
|
-
| `src/observability/metric-registry.ts` | Singleton registry | ~120 |
|
|
953
|
-
| `src/observability/correlation.ts` | AsyncLocalStorage context | ~80 |
|
|
954
|
-
| `src/observability/event-to-metric.ts` | Event subscriber → metrics | ~150 |
|
|
955
|
-
| `src/observability/metric-retention.ts` | Time-windowed counter | ~80 |
|
|
956
|
-
| `src/observability/metric-sink.ts` | JSONL sink + rotation | ~100 |
|
|
957
|
-
| `src/observability/exporters/prometheus-exporter.ts` | Prometheus format | ~120 |
|
|
958
|
-
| `src/observability/exporters/otlp-exporter.ts` | OTLP HTTP exporter (optional) | ~180 |
|
|
959
|
-
| `src/observability/exporters/adapter.ts` | Composite + interface | ~60 |
|
|
960
|
-
| `src/runtime/heartbeat-gradient.ts` | Classifier function (uses `WorkerHeartbeatState`) | ~60 |
|
|
961
|
-
| `src/runtime/heartbeat-watcher.ts` | Background poller (per-session, reuse loadRunManifestById + classifyHeartbeat) | ~170 |
|
|
962
|
-
| `test/unit/extension-api-surface.test.ts` | **9.0.E preflight** — verify `events.on()` returns unsubscribe + `events.off` does NOT exist + `WorkerHeartbeatState` fields | ~110 |
|
|
963
|
-
| `src/runtime/retry-executor.ts` | Backoff + jitter | ~120 |
|
|
964
|
-
| `src/runtime/crash-recovery.ts` | Detect + apply plan | ~180 |
|
|
965
|
-
| `src/runtime/deadletter.ts` | Append + read JSONL | ~80 |
|
|
966
|
-
| `src/ui/dashboard-panes/metrics-pane.ts` | Metrics pane renderer | ~80 |
|
|
967
|
-
| `test/unit/metrics-primitives.test.ts` | | ~250 |
|
|
968
|
-
| `test/unit/metric-registry.test.ts` | | ~100 |
|
|
969
|
-
| `test/unit/correlation.test.ts` | | ~120 |
|
|
970
|
-
| `test/unit/heartbeat-gradient.test.ts` | | ~140 |
|
|
971
|
-
| `test/unit/heartbeat-watcher.test.ts` | | ~170 |
|
|
972
|
-
| `test/unit/retry-executor.test.ts` | | ~220 |
|
|
973
|
-
| `test/unit/deadletter.test.ts` | | ~90 |
|
|
974
|
-
| `test/unit/event-to-metric.test.ts` | | ~180 |
|
|
975
|
-
| `test/unit/metric-retention.test.ts` | | ~110 |
|
|
976
|
-
| `test/unit/metric-sink.test.ts` | | ~120 |
|
|
977
|
-
| `test/unit/prometheus-exporter.test.ts` | | ~150 |
|
|
978
|
-
| `test/unit/otlp-exporter.test.ts` | | ~140 |
|
|
979
|
-
| `test/unit/composite-exporter.test.ts` | | ~80 |
|
|
980
|
-
| `test/unit/team-tool-metrics.test.ts` | | ~80 |
|
|
981
|
-
| `test/unit/metrics-pane.test.ts` | | ~80 |
|
|
982
|
-
| `test/integration/crash-recovery.test.ts` | | ~200 |
|
|
983
|
-
| `test/integration/retry-executor-roundtrip.test.ts` | | ~150 |
|
|
984
|
-
| `test/integration/heartbeat-watcher-deadletter.test.ts` | | ~150 |
|
|
985
|
-
| `test/integration/metric-pipeline-end-to-end.test.ts` | | ~180 |
|
|
986
|
-
| `test/integration/correlation-cross-component.test.ts` | | ~150 |
|
|
987
|
-
| `test/integration/prometheus-export.test.ts` | | ~120 |
|
|
988
|
-
| `test/integration/otlp-export-mock.test.ts` | | ~140 |
|
|
989
|
-
|
|
990
|
-
### Modified (10 files)
|
|
991
|
-
| Path | Change |
|
|
992
|
-
|---|---|
|
|
993
|
-
| `src/extension/register.ts` | Wire registry, event-metric subscriber, heartbeat watcher, retry/recovery, OTLP exporter |
|
|
994
|
-
| `src/extension/team-tool/api.ts` | Add the `metrics-snapshot` operation |
|
|
995
|
-
| `src/extension/registration/commands.ts` | Slash command `/team-metrics`; recovery confirm flow |
|
|
996
|
-
| `src/runtime/team-runner.ts` | Optional `executeWithRetry` wrap when `autoRetry=true` |
|
|
997
|
-
| `src/runtime/task-runner.ts` | Emit retry attempt events; correlation context wrap |
|
|
998
|
-
| `src/ui/heartbeat-aggregator.ts` (Phase 8) | Switch the internal classifier to `heartbeat-gradient.ts`; emit metrics |
|
|
999
|
-
| `src/ui/run-dashboard.ts` | Pane `6` metrics; help line |
|
|
1000
|
-
| `src/runtime/diagnostic-export.ts` (Phase 8) | Include `metricsSnapshot` field |
|
|
1001
|
-
| `src/schema/config-schema.ts` | Add `reliability` + `otlp` sections |
|
|
1002
|
-
| `src/config/{config.ts,defaults.ts}` | Parse + defaults |
|
|
1003
|
-
| `package.json` | Bump `0.1.34` → `0.1.35` |
|
|
1004
|
-
|
|
1005
|
-
## 5. Risk Assessment
|
|
1006
|
-
|
|
1007
|
-
| Risk | Likelihood | Impact | Mitigation |
|
|
1008
|
-
|---|---|---|---|
|
|
1009
|
-
| Correlation propagation touches almost every module | High | Med | AsyncLocalStorage propagates automatically, so nothing is passed by hand; test isolation across async boundaries |
|
|
1010
|
-
| `executeWithRetry` double-executes a task on poorly idempotent ops | Med | **High** | Default off (D5); D19 state-machine rules (no `failed → running` transition); explicit user opt-in; documentation warns about the idempotency requirement |
|
|
1011
|
-
| Crash recovery races with a new run starting under the same runId | Low | High | D20 combinator: status==="running" AND no async.pid alive AND heartbeat dead; reuse existing async.pid liveness check; recovery prompt blocks until the user confirms |
|
|
1012
|
-
| Heartbeat watcher poll burns CPU | Low | Low | 5s default conservative; configurable; only iterate active runs (`status === "running"`) |
|
|
1013
|
-
| MetricRegistry memory leak with high-cardinality labels | Med | Med | Cap label count per metric (warn at 1000); document anti-pattern |
|
|
1014
|
-
| OTLP export network failure spam logs | Low | Low | Swallow errors via `logInternalError`; circuit-breaker after 5 consecutive fails |
|
|
1015
|
-
| Histogram quantile inaccurate with fixed buckets | Med | Low | Document approximation; allow custom buckets per metric |
|
|
1016
|
-
| Background watcher leak if session_shutdown is missed | Low | Med | Per-session pattern (D17): dispose ordering tested in 9.5.B; idempotent dispose |
|
|
1017
|
-
| `events.jsonl` corruption blocks recovery | Low | High | Recovery validate seq monotonic via `scanSequence`; fallback "cancel run" if event log unreadable |
|
|
1018
|
-
| Metric sink file lock contention | Low | Low | `appendFileSync` synchronous within process; cross-process not supported (document) |
|
|
1019
|
-
| Retry policy over-aggressive → task storm | Med | Med | Conservative default maxAttempts=3; jitter prevents a thundering herd |
|
|
1020
|
-
| Deadletter false positive on transient errors | Med | Med | Threshold default 3 attempts; user override per task; deadletter reversible (manual reset) |
|
|
1021
|
-
| **`events.off` does not exist** on the ExtensionAPI EventBus | Mitigated | Was High | **D18**: the 9.0.E preflight test verifies this; capture the unsubscribe fn returned by `events.on()`, matching the existing pattern in `src/ui/render-scheduler.ts` |
|
|
1022
|
-
| **Naming mismatch `WorkerHeartbeat` vs actual `WorkerHeartbeatState`** | Mitigated | Was High | 9.0.E preflight test verifies field names; explicit import from `worker-heartbeat.ts` (NOT an alias) |
|
|
1023
|
-
| **MetricRegistry singleton state leak across sessions** | Mitigated | Was Med | **D17**: per-session instance pattern; dispose in session_shutdown |
|
|
1024
|
-
| **DiagnosticReport schema breaking** (extra `metricsSnapshot` field) | Mitigated | Was Med | **D21**: `schemaVersion: 2` bump; field optional (undefined for v1 readers); secret redaction recursive |
|
|
1025
|
-
| **Deadletter trigger ambiguity** (3 paths conflated) | Mitigated | Was Med | **D22**: 3 explicit trigger paths kept separate in code (not one mega-handler) |
|
|
1026
|
-
| **Recovery race with the existing async.pid liveness check** | Mitigated | Was High | **D20** combinator reuses existing logic; the new path does not override the existing async.pid check |
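The D20 combinator named in the table can be read as a single predicate. The sketch below is illustrative only: the `RunProbe` fields and the injected `isPidAlive` callback are assumptions rather than the real manifest schema. In Node, a liveness probe is commonly built on `process.kill(pid, 0)`.

```ts
// Assumed, simplified view of a run; field names are hypothetical.
interface RunProbe {
  status: "queued" | "running" | "completed" | "failed" | "cancelled";
  asyncPid?: number;
  lastHeartbeatMs: number;
}

// A run counts as interrupted only when all three D20 conditions hold:
// status is "running", no live async pid, and the heartbeat is dead.
function isInterrupted(
  run: RunProbe,
  nowMs: number,
  deadAfterMs: number,
  isPidAlive: (pid: number) => boolean,
): boolean {
  if (run.status !== "running") return false;
  if (run.asyncPid !== undefined && isPidAlive(run.asyncPid)) return false;
  return nowMs - run.lastHeartbeatMs > deadAfterMs;
}
```

Requiring all three conditions is what keeps a freshly started run with a live pid from being treated as a recovery candidate.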
|
|
1027
|
-
|
|
1028
|
-
## 6. Testing Strategy
|
|
1029
|
-
|
|
1030
|
-
**Unit-level (~70 cases):** see section 9.5.B for details.
|
|
1031
|
-
|
|
1032
|
-
**Integration (~7 scenarios):** see section 9.5.B.
|
|
1033
|
-
|
|
1034
|
-
**Performance budget:**
|
|
1035
|
-
- Counter inc < 1μs.
|
|
1036
|
-
- Histogram observe < 5μs.
|
|
1037
|
-
- Full registry snapshot < 50ms for 100 metrics.
|
|
1038
|
-
- Heartbeat watcher tick < 100ms for 50 active runs.
|
|
1039
|
-
- Retry backoff jitter calculation < 1μs.
|
|
1040
|
-
- Crash recovery detection < 200ms for 50 runs.
|
|
1041
|
-
|
|
1042
|
-
**Property-based (optional):**
|
|
1043
|
-
- Histogram quantile monotonicity (q1 < q2 ⇒ result(q1) ≤ result(q2)).
|
|
1044
|
-
- Retry executor convergence (eventually success or give up within maxAttempts).
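The monotonicity property is easy to see on a fixed-bucket histogram. The following is a sketch under stated assumptions: `BucketHistogram` is a hypothetical class, the bucket bounds are illustrative, and the quantile is approximated by the upper bound of the bucket that contains the requested rank (no interpolation).

```ts
// Hypothetical fixed-bucket histogram; not the pi-crew implementation.
class BucketHistogram {
  private readonly upperBounds: number[];
  private readonly counts: number[];
  private total = 0;

  constructor(upperBounds: number[]) {
    this.upperBounds = upperBounds;
    // One extra slot counts observations above the last bound.
    this.counts = new Array<number>(upperBounds.length + 1).fill(0);
  }

  observe(value: number): void {
    let i = this.upperBounds.findIndex((bound) => value <= bound);
    if (i === -1) i = this.upperBounds.length;
    this.counts[i] += 1;
    this.total += 1;
  }

  // Upper bound of the bucket holding the q-th ranked observation.
  quantile(q: number): number {
    if (this.total === 0) return NaN;
    const rank = Math.max(1, Math.ceil(q * this.total));
    let seen = 0;
    for (let i = 0; i < this.counts.length; i++) {
      seen += this.counts[i];
      if (seen >= rank) {
        return i < this.upperBounds.length ? this.upperBounds[i] : Infinity;
      }
    }
    return Infinity;
  }
}
```

Because the rank grows with `q` and buckets are scanned in ascending order, a larger `q` can never return a smaller bound.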
|
|
1045
|
-
|
|
1046
|
-
**Smoke manual (10 scenarios):** see section 9.5.B.
|
|
1047
|
-
|
|
1048
|
-
## 7. Open Questions (Pre-decide before Wave 1)
|
|
1049
|
-
|
|
1050
|
-
| P | Question | Proposed default | Impact |
|
|
1051
|
-
|---|---|---|---|
|
|
1052
|
-
| **P1** | Correlation ID format? | `{runId}:{taskId}:{spanCounter}` (D3) | Human-readable, deterministic |
|
|
1053
|
-
| **P2** | Retry policy default config | `maxAttempts=3, backoffMs=1000, jitterRatio=0.3` (D6) | Industry standard |
|
|
1054
|
-
| **P3** | Crash recovery: auto-resume vs prompt? | **Prompt** via Phase 8 ConfirmOverlay (D7) | Avoid replay risk |
|
|
1055
|
-
| **P4** | Metric retention window default | 1h streaming, 24h JSONL (D8) | Cover 95% debug needs |
|
|
1056
|
-
| **P5** | Histogram bucket strategy | Fixed exponential (D2) | Simple, predictable |
|
|
1057
|
-
| **P6** | OTLP export priority | Implement, default-off (D10) | Enables teams that already have an observability stack |
|
|
1058
|
-
| **P7** | Deadletter threshold default | >3 messages/hour alert (D11) | Conservative, false-positive minimal |
|
|
1059
|
-
| **P8** | Background watcher polling interval | 5s default, 1-60s configurable (D9) | Balance responsiveness vs CPU |
|
|
1060
|
-
|
|
1061
|
-
**All P1-P8 decisions defaulted in D-table (section 1.B).** Users can override them via config, but the defaults are sane.
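To make the P2 defaults concrete, here is a sketch of how `maxAttempts=3, backoffMs=1000, jitterRatio=0.3` could translate into delays. Exponential doubling, the `RetryPolicy` shape, and the injectable `random` and `sleep` parameters are assumptions for illustration; the plan only states "backoff + jitter".

```ts
// Assumed policy shape mirroring the P2 defaults.
interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  jitterRatio: number;
}

const defaultPolicy: RetryPolicy = { maxAttempts: 3, backoffMs: 1000, jitterRatio: 0.3 };

// Delay before retry number `retry` (1-based). Doubling per retry is an assumption.
function retryDelayMs(policy: RetryPolicy, retry: number, random: () => number = Math.random): number {
  const base = policy.backoffMs * 2 ** (retry - 1);
  const jitter = (random() * 2 - 1) * policy.jitterRatio; // within ±jitterRatio
  return Math.round(base * (1 + jitter));
}

// Call fn until it succeeds or maxAttempts is reached; rethrows the last error.
async function executeWithRetry<T>(
  fn: (attempt: number) => Promise<T>,
  policy: RetryPolicy = defaultPolicy,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => { setTimeout(resolve, ms); }),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < policy.maxAttempts) await sleep(retryDelayMs(policy, attempt));
    }
  }
  throw lastError;
}
```

The loop is bounded by `maxAttempts`, so it always terminates with either a result or the last error, which is the convergence property listed in the testing strategy.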
|
|
1062
|
-
|
|
1063
|
-
## 8. Dependencies & Sequencing
|
|
1064
|
-
|
|
1065
|
-
```
|
|
1066
|
-
Phase 7 (DONE) ──► Phase 8 (Operator UX) ──► Phase 9 Wave 1 (Foundation)
|
|
1067
|
-
│ │
|
|
1068
|
-
│ ┌────────┼────────┐
|
|
1069
|
-
▼ ▼ ▼ ▼
|
|
1070
|
-
ConfirmOverlay reuse → 9.1 Reliability 9.2 Telemetry 9.3 Exporters
|
|
1071
|
-
NotificationRouter reuse │ │ │
|
|
1072
|
-
diagnostic-export extend └──────────────┼──────────────┘
|
|
1073
|
-
▼
|
|
1074
|
-
9.4 UI/Commands
|
|
1075
|
-
│
|
|
1076
|
-
▼
|
|
1077
|
-
9.5 Wiring + Tests
|
|
1078
|
-
```
|
|
1079
|
-
|
|
1080
|
-
**Hard prerequisites Phase 8:**
|
|
1081
|
-
- ✅ `NotificationRouter` (Phase 8.3.A) — used by 9.1.A/9.1.C/9.1.D for alerts.
|
|
1082
|
-
- ✅ `ConfirmOverlay` (Phase 8.0) — used by 9.1.C recovery prompt.
|
|
1083
|
-
- ✅ `diagnostic-export.ts` (Phase 8.2.D) — extended in 9.4.C.
|
|
1084
|
-
|
|
1085
|
-
**Parallelization opportunity:** Wave 2 and Wave 3 can run in parallel (they share only the Wave 1 foundation).
|
|
1086
|
-
|
|
1087
|
-
## 9. Effort Summary
|
|
1088
|
-
|
|
1089
|
-
| Wave | Items | Dev-days | Parallelizable |
|
|
1090
|
-
|---|---|---|---|
|
|
1091
|
-
| 1 | 9.0.A → B → C → D → **E preflight** | 4 | No (sequential foundation; 9.0.E gates Wave 2) |
|
|
1092
|
-
| 2 | 9.1.A + 9.1.B + 9.1.C + 9.1.D | 5 | Partial (4 streams; .D depends on .A/.B) |
|
|
1093
|
-
| 3 | 9.2.A + 9.2.B + 9.2.C + 9.2.D | 4 | Yes (4 streams, low overlap) |
|
|
1094
|
-
| 4 | 9.3.A + 9.3.B + 9.3.C + 9.4.A + 9.4.B + 9.4.C | 3 | Partial (5 streams; UI track 9.4.A→B→C critical path 2.5d) |
|
|
1095
|
-
| 5 | 9.5.A + 9.5.B | 3 | No |
|
|
1096
|
-
| **Total** | **19 sub-phases** | **19.5-22.5** | — |
|
|
1097
|
-
|
|
1098
|
-
**Compared with Phase 8 (14-18 dev-days):** Phase 9 is ~25% larger and carries higher risk because it touches the state machine.
|
|
1099
|
-
|
|
1100
|
-
## 10. Acceptance Checklist (Wave 5 exit criteria)
|
|
1101
|
-
|
|
1102
|
-
- [x] All checkboxes 9.0 → 9.5 (including the 9.0.E preflight) ticked `[x]`.
|
|
1103
|
-
- [x] `npm test` pass: **389 unit** + **45 integration**, 0 fail (2026-04-29).
|
|
1104
|
-
- [x] `npm run typecheck` clean.
|
|
1105
|
-
- [x] Manual smoke 10 scenarios pass.
|
|
1106
|
-
- [x] Performance budget met: counter 0.597µs, histogram 0.551µs, snapshot 0.159ms, heartbeat watcher 61.777ms/50 runs, recovery detect 27.036ms/50 runs.
|
|
1107
|
-
- [x] No regression: Phase 7+8 tests still pass (full suite clean).
|
|
1108
|
-
- [x] Config breaking? **No.** Schema additive (`reliability`, `otlp`, `observability` sections optional).
|
|
1109
|
-
- [x] Default behavior unchanged: `autoRetry=false`, `autoRecover=false`, `otlp.enabled=false`, `observability.enabled` default `true` (sink/watcher gated by telemetry).
|
|
1110
|
-
- [ ] Bump package version for next release (current workspace remained on `0.1.35`; release not requested in this Phase 9 implementation turn).
|
|
1111
|
-
- [x] Migration guide in the README/release notes section.
|
|
1112
|
-
- [x] **D18 verified**: 0 `events.off?.` references in Phase 9 code; all subscriptions use returned unsubscribe fn.
|
|
1113
|
-
- [x] **D17 verified**: 0 module-level `globalRegistry`/singleton patterns; all observability state per-session, disposed in session_shutdown.
|
|
1114
|
-
- [x] **D21 verified**: DiagnosticReport schemaVersion=2 when metricsSnapshot is present; schemaVersion undefined for Phase 8 reports.
|
|
1115
|
-
- [x] **No listener leak** test: 3x session_start/shutdown cycles → 0 residual subscriptions on `pi.events`.
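The D18 and no-listener-leak items both rest on the "`on()` returns an unsubscribe function" pattern. The sketch below uses a hypothetical `TinyEventBus` as a stand-in for the real EventBus, whose internals are not shown in this plan; the event names are also made up.

```ts
type Listener = (payload: unknown) => void;

// Hypothetical stand-in bus: on() returns an unsubscribe fn and there is no off().
class TinyEventBus {
  private listeners = new Map<string, Set<Listener>>();

  on(event: string, listener: Listener): () => void {
    const set = this.listeners.get(event) ?? new Set<Listener>();
    set.add(listener);
    this.listeners.set(event, set);
    return () => { set.delete(listener); };
  }

  emit(event: string, payload: unknown): void {
    this.listeners.get(event)?.forEach((listener) => listener(payload));
  }

  // Total listeners across all events, used for the leak check.
  listenerCount(): number {
    let n = 0;
    this.listeners.forEach((set) => { n += set.size; });
    return n;
  }
}

// Subscribe to several events and return one dispose() that unsubscribes them all.
function wireListeners(bus: TinyEventBus, events: string[], listener: Listener): { dispose: () => void } {
  const unsubscribers = events.map((event) => bus.on(event, listener));
  return { dispose: () => { unsubscribers.forEach((unsubscribe) => unsubscribe()); } };
}
```

A leak test in this style repeats wire-then-dispose a few times and asserts that the listener count returns to zero.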
|
|
1116
|
-
|
|
1117
|
-
## 11. Out of Scope (defer Phase 10+)
|
|
1118
|
-
|
|
1119
|
-
- Multi-host metric aggregation (cluster-wide registry).
|
|
1120
|
-
- Slack/Discord webhook adapter (router supports custom sink, not built-in).
|
|
1121
|
-
- t-digest histogram algorithm (defer; fixed buckets sufficient).
|
|
1122
|
-
- Tracing UI (only metrics + correlation propagation in 9; trace viewer Phase 10).
|
|
1123
|
-
- Auto-tuning retry policy (ML-based) — stay manual config Phase 9.
|
|
1124
|
-
- Metric drift detection / anomaly alert beyond simple threshold.
|
|
1125
|
-
- Custom event-to-metric mapping via DSL (hardcoded core only).
|
|
1126
|
-
- pprof profiling export.
|
|
1127
|
-
- Cross-language metric sharing (Pi-only Phase 9).
|
|
1128
|
-
|
|
1129
|
-
## 12. Path X Roadmap Summary
|
|
1130
|
-
|
|
1131
|
-
| Phase | Theme | Effort | Status |
|
|
1132
|
-
|---|---|---|---|
|
|
1133
|
-
| 6 | `.crew/` migration + autonomous policy | ~12d | ✅ DONE |
|
|
1134
|
-
| 7 | UI Optimization (snapshot cache + render scheduler + 4 panes) | ~18d | ✅ DONE |
|
|
1135
|
-
| **8** | **Operator Experience (Theme A)** | **14-18d** | ✅ **DONE** (verified 351 unit + 44 integration pass, version 0.1.34, all 17 sub-phases shipped) |
|
|
1136
|
-
| **9** | **Observability + Reliability (Theme B+C)** | **19.5-22.5d** | ✅ **IMPLEMENTED** (verified 389 unit + 45 integration pass in workspace) |
|
|
1137
|
-
| 10+ | TBD: Performance baseline (Theme D), distributed coordination, multi-host | — | Future |
|
|
1138
|
-
|
|
1139
|
-
**Path X total to Phase 9 done: ~63-67 dev-days** (Phase 6+7+8 done = 44d; Phase 9 = 19.5-22.5d remaining).
|
|
1140
|
-
|
|
1141
|
-
## 13. Implementation Kickoff Checklist (Pre-Wave 1)
|
|
1142
|
-
|
|
1143
|
-
Before starting Phase 9 Wave 1, verify:
|
|
1144
|
-
|
|
1145
|
-
- [x] Phase 8 has shipped (`NotificationRouter`, `ConfirmOverlay`, `MailboxDetailOverlay/Compose/Preview/AgentPicker`, `heartbeat-aggregator.ts`, `health-pane.ts`, `diagnostic-export.ts`, `notification-sink.ts` available; existence verified and tests pass).
|
|
1146
|
-
- [x] `npm test` baseline pass (351 unit + 44 integration from Phase 8, verified 2026-04-29).
|
|
1147
|
-
- [x] `npm run typecheck` clean (verified Phase 8).
|
|
1148
|
-
- [x] P1-P8 defaults reviewed (section 7); already defaulted in the D-table.
|
|
1149
|
-
- [x] New branch skipped intentionally; user requested no separate branch.
|
|
1150
|
-
- [x] Read `src/state/event-log.ts` to understand the sequence cursor pattern; confirmed `seq` metadata + `sequencePath()` + `scanSequence()` + `sequenceCache` infrastructure present.
|
|
1151
|
-
- [x] Read `src/runtime/worker-heartbeat.ts` to identify the actual interface name; confirmed `WorkerHeartbeatState` (NOT "WorkerHeartbeat") + helper `isWorkerHeartbeatStale`.
|
|
1152
|
-
- [x] Read `src/runtime/diagnostic-export.ts` — confirmed Phase 8 file structure (`DiagnosticReport` interface + `redactSecrets` regex `/(token|key|password|secret|credential|auth)/i`).
|
|
1153
|
-
- [x] Verify ExtensionAPI surface — confirmed `EventBus.on()` returns unsubscribe fn (via `node_modules/@mariozechner/pi-coding-agent/dist/core/event-bus.d.ts`); **NO `events.off()` exists** → use returned unsubscribe (D18).
|
|
1154
|
-
- [x] Read `src/runtime/team-runner.ts:executeTeamRun` to identify the correlation wrap point.
|
|
1155
|
-
- [x] Confirm Node.js >= 20 (AsyncLocalStorage stable since Node 16; package engines require Node >=20).
|
|
1156
|
-
- [x] Decide whether OTLP export ships in Phase 9 or is deferred to Phase 10 (shipped default-off per D10).
|
|
1157
|
-
- [x] **Wave 1 entry gate: 9.0.E preflight test pass**; block Wave 2 if it fails.
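The checklist mentions the `redactSecrets` regex and, elsewhere, recursive redaction. A sketch of what key-based recursive redaction could look like is given here. Only the regex comes from the plan; matching on key names and the `"[REDACTED]"` placeholder are assumptions.

```ts
// Regex as quoted in the checklist; everything else is an assumption.
const SECRET_KEY_PATTERN = /(token|key|password|secret|credential|auth)/i;

// Walk arrays and plain objects; replace values whose key name looks secret.
function redactSecrets(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redactSecrets(item));
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SECRET_KEY_PATTERN.test(key) ? "[REDACTED]" : redactSecrets(inner);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

Recursion matters once `metricsSnapshot` is nested inside the report, since a top-level-only pass would miss secrets in label objects.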
|
|
1158
|
-
|
|
1159
|
-
**Ready to implement Phase 9 Path X. Phase 8 verified DONE.**
|
|
1160
|
-
|
|
1161
|
-
---
|
|
1162
|
-
|
|
1163
|
-
**Note on Theme B vs Theme C balance:** This Phase 9 combines two themes because of 5 critical synergies (section 1.A). If effort blows up during Wave 2/3, it can be split:
|
|
1164
|
-
- Phase 9a = B only (Wave 1 + Wave 3 + 9.4.A/B + part 9.5) ~12.5 dev-days (incl. 9.0.E preflight).
|
|
1165
|
-
- Phase 9b = C only (Wave 1 reuse + Wave 2 + part 9.4.C + part 9.5) ~10 dev-days.
|
|
1166
|
-
|
|
1167
|
-
The split decision will be made only once real data from Wave 1 progress is available.
|
|
1168
|
-
|
|
1169
|
-
---
|
|
1170
|
-
|
|
1171
|
-
## Appendix A — Review Fixes Applied (2026-04-29)
|
|
1172
|
-
|
|
1173
|
-
The plan was updated post-review, with the following blocking issues resolved:
|
|
1174
|
-
|
|
1175
|
-
| Issue | Fix | Reference |
|
|
1176
|
-
|---|---|---|
|
|
1177
|
-
| `WorkerHeartbeat` vs actual `WorkerHeartbeatState` | Replace all references; explicit import | 9.0.D, 9.1.A, D-decisions |
|
|
1178
|
-
| `events.off?.()` does not exist on EventBus | Use `events.on()` returned unsubscribe fn pattern | 9.2.A, D18, 9.0.E preflight |
|
|
1179
|
-
| MetricRegistry singleton dispose semantics ambiguous | Per-session instance pattern (consistent Phase 8) | 9.0.B, 9.5.A, D17 |
|
|
1180
|
-
| 9.0.E preflight ExtensionAPI verification was missing | Added new sub-phase + test file | 9.0.E (NEW) |
|
|
1181
|
-
| Retry executor state-machine semantics were unclear | Document attempts[] + no `failed → running` transition | 9.1.B, D19 |
|
|
1182
|
-
| Crash recovery race with async.pid liveness | Combinator clause uses existing logic | 9.1.C, D20 |
|
|
1183
|
-
| Deadletter trigger 3 paths conflate | Separate explicit paths (a/b/c) | 9.1.D, D22 |
|
|
1184
|
-
| DiagnosticReport schema breaking | schemaVersion: 2 + redactSecrets recursive | 9.4.C, D21 |
|
|
1185
|
-
| `renderMetricsPane` signature diverged from the Phase 8 pattern | Change to `(snapshot, opts: { registry })` | 9.4.B |
|
|
1186
|
-
| Naming convention regex redundant | Tighten `^crew\.[a-z]+\.[a-z][a-z_]*$` | 9.0.B, D13 |
|
|
1187
|
-
| 9.1.A `for (const task of /* loaded.tasks */)` placeholder | Resolved with `loadRunManifestById(...).tasks` | 9.1.A skeleton |
|
|
1188
|
-
| 9.5.A wire pseudocode `..., registry` placeholder | Clearly specified the `MetricFileSinkOptions` interface | 9.2.D, 9.5.A |
|
|
1189
|
-
| Phase 8 status label said "NEXT" although it was already DONE | Update Path X table → ✅ DONE | Section 12 |
|
|
1190
|
-
| Acceptance no-listener-leak test was missing | Added 3x cycle test | Section 10 |
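The tightened naming regex from the table (`^crew\.[a-z]+\.[a-z][a-z_]*$`) can be exercised directly. The `assertMetricName` helper below is a hypothetical wrapper for illustration; only the pattern itself comes from the plan.

```ts
// Pattern as listed in the review-fix table (9.0.B, D13).
const METRIC_NAME_PATTERN = /^crew\.[a-z]+\.[a-z][a-z_]*$/;

// Hypothetical guard: returns the name unchanged or throws on a violation.
function assertMetricName(name: string): string {
  if (!METRIC_NAME_PATTERN.test(name)) {
    throw new Error(`invalid metric name: ${name}`);
  }
  return name;
}
```

Under this pattern `crew.run.count` and `crew.task.retry_count` are accepted, while names with uppercase letters, a missing `crew.` prefix, or a leading underscore in the last segment are rejected.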
|