npm - hatch3r - Versions diffs - 1.9.0 → 2.0.0 - Mend

hatch3r 1.9.0 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (288) hide show

package/dist/content/commands/hatch3r-healthcheck.md CHANGED Viewed

@@ -1,31 +1,84 @@
 ---
 id: hatch3r-healthcheck
 type: command
-orchestrator: false
+orchestrator: true
+agentPipeline: [hatch3r-implementer, hatch3r-ui, hatch3r-security]
 description: Open a QA and reliability epic surveying coverage gaps, flaky tests, and regression blind spots with one testing sub-issue per module plus cross-module wiring audit
 tags: [maintenance]
 quality_charter: agents/shared/quality-charter.md
 efficiency_patterns: agents/shared/efficiency-patterns.md
 cache_friendly: true
 parallel_tool_default: true
+efficiency_tier: deep
+triage_tiers: [2, 3]
+supports_resume: true
+sub_agents_spawned:
+  count: 3
+  rationale: Module-taxonomy discovery and audit-sub-issue authoring delegate to `hatch3r-implementer`; the two cross-cutting QA axes fan out in parallel to `hatch3r-ui` (CQ1 — accessibility / axe-core / design-token / four-state coverage gaps) and `hatch3r-security` (CQ3 — dependency-CVE + supply-chain regression risks). Fan-out is disjoint across the two audit axes; serialization would not preserve P8 B2 task decomposition. Cost-dominance per CONSTITUTION §2 P8 — token cost never serializes independent work.
 ---
 ## §0 Detect Ambiguity (P8 B1)
-Before any action, scan the user's request and provided context for unresolved questions in scope, acceptance criteria, irreversibility, or constraint conflicts (contradictory inputs, missing target, unknown convention). If any are found, ask the user via the platform-native question tool per `agents/shared/user-question-protocol.md` — do not proceed under silent assumption. This is the default path, not an exception. Acceptable to proceed without asking ONLY when scope is single-target, single-concern, and the brief alone is testable. Any residual ambiguity discovered mid-workflow invokes the same protocol.
+> Orchestration boilerplate: see `commands/shared/orchestration-frame.md` → §0 Detect Ambiguity (P8 B1). Triggers: contradictory inputs, missing target, unknown convention.
 ## Agent Pipeline
-This command creates QA audit issues and epics. It does not spawn implementation sub-agents.
+This command discovers the module taxonomy via static analysis, then delegates issue-body authoring and two cross-cutting audit axes to parallel sub-agents via the Task tool. Pipeline:
 | Stage | Agent(s) | Parallel | Required |
 |-------|----------|----------|----------|
 | 1. Context & Pre-flight | Orchestrator (inline) | No | Yes |
-| 2. Issue Creation | Orchestrator (GitHub MCP) | No | Yes |
-| 3. Board Sync | Orchestrator (Projects v2 sync) | No | Yes |
+| 2. Module Audit Authoring | `hatch3r-implementer` (one Task call per module sub-issue body) | Yes (across modules) | Yes |
+| 3. Cross-Cutting QA Axes | `hatch3r-ui` (CQ1) + `hatch3r-security` (CQ3, supply-chain slice) (parallel sub-issue authoring) | Yes | Yes |
+| 4. Issue Creation | Orchestrator (GitHub MCP) | No | Yes |
+| 5. Board Sync | Orchestrator (Projects v2 sync) | No | Yes |
+**Parallel-safety conditions** (per `rules/hatch3r-agent-orchestration.md` §Parallel Safety): every parallel fan-out above holds all three — read-only or disjoint writes, deterministic aggregation, no shared mutable state.
 All issue operations MUST follow the Projects v2 Enforcement rules defined in `hatch3r-board-shared`.
+Sub-agent fan-out scales with module count per `rules/fan-out-discipline.md` (P8 B2). For each discovered module, a `hatch3r-implementer` Task call authors that module's audit sub-issue body in parallel; the two cross-cutting audits (`hatch3r-ui` for CQ1 accessibility coverage, `hatch3r-security` for the CQ3 supply-chain slice) run as one parallel batch.
+## Triage
+Classify the healthcheck request before fan-out:
+- **Tier 2 (standard)**: single repository with discovered module count <=8; parallel module sub-agents bounded by `max_phase4_parallel`.
+- **Tier 3 (deep)**: monorepo with module count >8 OR cross-module wiring depth >=3; same fan-out shape, longer review loop.
+Tier is derived from Module Discovery output (Step 2). Tier 1 is not supported — single-target QA fixes belong to `hatch3r-quick-change`.
+### Pre-Execution Cost Preview
+Before the first sub-agent dispatch (Step 4 module audit-authoring fan-out), surface the cost preview so a wide module fan-out is never started blind. Emit the `cost_estimate` block per `rules/hatch3r-cost-visibility.md` Pre-Execution Estimate, calibrated to the Tier derived from module count:
+```yaml
+cost_estimate:
+  expected_sa_count: <module count + 2 cross-cutting axes; Tier 2 ~module-count<=8, Tier 3 module-count>8, bounded by max_phase4_parallel per batch>
+  estimated_input_tokens_static_frame: <int>
+  estimated_web_research_queries: <int>
+  triage_tier: standard | deep
+  estimated_duration_min: <int>
+```
+Post-execution actuals + delta land in the Step 6 finalization summary's Fan-out + Cost section per `rules/hatch3r-cost-visibility.md` Post-Execution Actuals. Token telemetry sources from `src/pipeline/observability.ts`.
+### Effort Override (Decision 17)
+Auto-tiering derives from discovered module count, which can misclassify — a monorepo with many small modules over-scored, or a dense single-package repo under-scored. The user override is the recovery path mandated by hatch3r's universal `--effort` override contract ("User overridable via `--effort` flag"):
+- `--effort=standard|deep` forces the named tier, bypassing the module-count auto-classification. `--effort=light` is rejected — Tier 1 is unsupported here (single-target QA fixes route to `hatch3r-quick-change`).
+- The override wins over the auto-detected tier; record both the auto-detected tier and the override in the run context so the Cost estimate block reports the budget delta.
+- No override passed → the module-count auto-classification stands.
+## Confidence Propagation Contract
+Every sub-agent delegation prompt in this command MUST include the confidence expression requirement below (verbatim). Sub-agents are invoked with the `quality_charter: agents/shared/quality-charter.md` reference in their frontmatter, but the orchestrator repeats the directive to override runtime prompt defaults per the charter §1 rule.
+> Confidence expression requirement: rate every recommendation and finding as high/medium/low confidence per the quality charter (`agents/shared/quality-charter.md`). High = verified against current code. Medium = pattern-based, not fully verified. Low = best judgment, recommend human review.
+Downstream propagation: every authored module-audit sub-issue body and each cross-cutting axis finding MUST carry a high/medium/low confidence rating sourced from the authoring sub-agent. Dropping the signal between stages is a gate failure.
 # Healthcheck — Full Product QA & Testing Audit
 Create a healthcheck epic on **{owner}/{repo}** with one sub-issue per logical project module, plus cross-module wiring and vision/roadmap alignment audits. Each sub-issue is a deep static-analysis audit task that, when picked up by the board workflow, produces a findings epic with actionable sub-issues for achieving full QA and testing coverage. The command only creates the initial audit epic — it does NOT execute any audits.
@@ -313,6 +366,53 @@ All issue and epic operations in this command MUST follow the Projects v2 Enforc
 ---
+## Resumability (Decision 27/30)
+healthcheck is long-running — module discovery (Step 2) seeds a per-module hatch3r-implementer fan-out for audit sub-issue authoring (Step 4) bounded by `max_phase4_parallel`, alongside parallel hatch3r-ui (CQ1) + hatch3r-security (CQ3 supply-chain slice) cross-cutting axes (Step 5), then Step 6 batch-creates GitHub issues and Step 7 syncs Projects v2 board state. Per hatch3r's workspace-checkpointed resumability contract, checkpoint progress so an interrupted run re-enters at the last completed step rather than re-creating issues or re-running implementers for modules already audited.
+> Orchestration boilerplate: see `commands/shared/orchestration-frame.md` → Checkpoint Contract. Per-command slots: workspace `.healthcheck-workspace/`; step range the Step 1 → Step 7 progression; `wave` = per-module implementer-batch index across modules and the cross-cutting axes batch; snapshot/rollback paths any module-audit-spec writes under `docs/audits/`. Write points: after Step 2 module discovery locks `discoveredModules`, after each Step 4 implementer batch returns per `max_phase4_parallel` slot (so completed audit-sub-issue bodies survive a crash and are not re-authored), after the Step 5 cross-cutting axes batch returns, after each Step 6 GitHub issue create call records its `issueId` in `createdIssueIds` (so already-created issues survive a crash and are not re-created — the resume path skips issues with an entry in `createdIssueIds`), after Step 6 epic-link creation, and after Step 7 Projects v2 board sync completes.
+---
+## Per-Turn Pipeline-State Header (Bypass Protection)
+> Orchestration boilerplate: see `commands/shared/orchestration-frame.md` → Per-Turn Pipeline-State Header. Phase mapping for healthcheck: `1` = scope + maturity-tier detection, `2` = specialist sub-agent dispatch across health dimensions, `3` = severity-graded aggregation + finding-registry update, `4` = epic/issue write + iteration-summary.
+## End-of-Turn Delegation Attestation (Bypass Protection)
+> Orchestration boilerplate: see `commands/shared/orchestration-frame.md` → End-of-Turn Delegation Attestation. Per-command mutated-file slot: findings epic, child issues, registry updates.
+## Iteration Summary (mandatory output)
+Emit the canonical 9-section iteration summary per `rules/hatch3r-iteration-summary.md` as the final user-facing output. The validation gate at `.claude/rules/capability-lifecycle.md` blocks SUCCESS declarations without this block (CONSTITUTION §6 Decision 23).
+The 9 sections:
+1. **Request** — verbatim restatement of the user's ask in one sentence.
+2. **Fan-out + Cost** — `sub_agents_spawned: { count, rationale }` plus the `cost_estimate` / `cost_actuals` / `delta` blocks (see Cost Visibility below).
+3. **Web Research** — every URL fetched with access date + trust tier per `agents/shared/rigor-contract.md` (0 acceptable when no research was needed).
+4. **Files Mutated** — list with diff summary (lines added / removed / files created).
+5. **Gates Passed / Failed** — explicit list per `.claude/rules/capability-lifecycle.md` Gate Checklist.
+6. **Pillar Impact Attribution** — `progress_toward_pillar: <axis>.<pillar_id>+<delta>` per CONSTITUTION §6 Decision 17.
+7. **Verification Commands** — exact commands run with exit codes plus key output lines (≤200 chars).
+8. **Open Questions / Blockers** — explicit `None` if fully closed.
+9. **Learnings Captured** — IDs of any learnings written to `.hatch3r/learnings/` this run per `rules/hatch3r-learning-system.md`.
+### Cost Visibility (Decision 24)
+> Orchestration boilerplate: see `commands/shared/orchestration-frame.md` → Cost Estimate for the 5-field `cost_estimate` schema and the post-execution `cost_actuals` + `delta` contract; both land in Section 2 above.
+## Cost estimate (Decision 24)
+This command emits cost transparency per `rules/hatch3r-cost-visibility.md` and CONSTITUTION §6 Decision 24/29:
+- **Pre-execution `cost_estimate`** — emitted in the Pre-Execution Cost Preview above before the first module audit-authoring dispatch (Step 4).
+- **Post-execution `cost_actuals` + `delta`** — appended to the Step 6 finalization summary's Fan-out + Cost section per `rules/hatch3r-iteration-summary.md` §2.
+Per-tier `expected_sa_count` calibration (from frontmatter `sub_agents_spawned.count: 3`, which is the static floor; actual fan-out scales with discovered module count per `rules/fan-out-discipline.md` P8 B2): one `hatch3r-implementer` Task per module sub-issue body + `hatch3r-ui` (CQ1) + `hatch3r-security` (CQ3 supply-chain slice) for the two cross-cutting axes. Tier 2 (module count ≤8) and Tier 3 (module count >8) both bound the parallel module batch by `max_phase4_parallel`. Deltas beyond 25% absolute value carry `flagged_for_review: true`. Token telemetry sources from `src/pipeline/observability.ts`; estimation primitives from `src/pipeline/costEstimator.ts`.
+---
 ## Error Handling
 - `search_issues` failure: retry once, then warn and proceed (assume no existing healthcheck).

package/dist/content/commands/hatch3r-incident-response.md ADDED Viewed

@@ -0,0 +1,228 @@
+---
+id: hatch3r-incident-response
+type: command
+orchestrator: true
+agentPipeline: [hatch3r-incident-responder, hatch3r-reliability]
+description: Drive a live production incident through a structured lifecycle -- triage + topology, bounded-autonomy mitigation, stakeholder communication, then a blameless post-mortem with runbook -- via delegated sub-agents.
+tags: [devops, orchestration]
+quality_charter: agents/shared/quality-charter.md
+efficiency_patterns: agents/shared/efficiency-patterns.md
+cache_friendly: true
+parallel_tool_default: true
+efficiency_tier: standard
+triage_tiers: [1, 2, 3]
+sub_agents_spawned:
+  count: 2
+  rationale: One hatch3r-incident-responder specialist drives the live lifecycle (triage → bounded-autonomy mitigation → communication → blameless post-mortem); one hatch3r-reliability specialist runs the post-incident telemetry/SLO reconstruction in parallel once the incident is stabilized. Tier 1 spawns only the incident-response specialist (count 1); a security-suspected incident adds hatch3r-security. Cost-dominance per CONSTITUTION §2 P8 — token cost never serializes independent work.
+---
+## §0 Detect Ambiguity (P8 B1)
+Before any action, scan the incident report for unresolved questions in scope, impact, irreversibility, or constraint conflicts (user-facing vs internal-only, blast radius unknown, rollback safety unverified, stakeholder-notification scope unspecified, or a mitigation that writes data / changes a schema with downstream consumers). If any are found, ask via the platform-native question tool per `agents/shared/user-question-protocol.md` — do not proceed under silent assumption. This is the default path, not an exception. Live incidents are high-blast-radius, so irreversibility detection on every proposed mitigation is mandatory. Residual ambiguity discovered mid-incident invokes the same protocol.
+## Agent Pipeline
+| Stage | Agent(s) | Parallel | Required |
+|-------|----------|----------|----------|
+| 1. Triage + topology + mitigate + communicate | `hatch3r-incident-responder` (executes `skills/hatch3r-incident-response/SKILL.md` Steps 1-3) | No | Yes |
+| 2. Post-incident telemetry/SLO reconstruction | `hatch3r-reliability` (CQ4) | Yes (with Stage 3 drafting) | Tier 3, or any P0/P1 with an SLO-burn |
+| 3. Blameless post-mortem + runbook + follow-ups | `hatch3r-incident-responder` (SKILL.md Steps 5-6) | No | When post-mortem required (P0/P1) |
+**Parallel-safety conditions** (per `rules/hatch3r-agent-orchestration.md` §Parallel Safety): the Stage 2 reliability reconstruction is read-only against telemetry while Stage 3 drafts the post-mortem — disjoint writes, deterministic aggregation (the reconstruction feeds the post-mortem root-cause section), no shared mutable state.
+---
+# Incident Response — Triage, Mitigate, Communicate, Learn
+Drives a live production incident end-to-end through delegated sub-agents. The orchestrator never edits files or applies mitigations inline; it delegates the live lifecycle to `hatch3r-incident-responder`, runs the post-incident reliability reconstruction in parallel, and integrates the blameless post-mortem.
+The detailed runbook — severity table, Bounded Autonomy & Escalation matrix, Telemetry Sources adapter, topology-capture, and the six-step post-mortem template — lives in `skills/hatch3r-incident-response/SKILL.md`. This command orchestrates that runbook through sub-agents; it does not restate it.
+**When to use this command vs. the `hatch3r-incident-response` skill vs. the `hatch3r-incident-responder` agent:**
+- Use this **command** when: a live incident is open and the response is nontrivial (multi-service blast radius, a mitigation that needs a human gate, or a P0/P1 requiring incident-command discipline and a post-incident reliability reconstruction).
+- Use the **skill** directly when: you are running the runbook yourself inline and want the step-by-step procedure without sub-agent delegation overhead.
+- Use the **agent** directly when: another orchestrator (e.g. a reviewer pass) needs the incident-response specialist for post-incident reconstruction only.
+---
+## Token-Saving Directives
+1. **Read telemetry once per scope.** The incident-response specialist captures the topology + telemetry snapshot once (Stage 1); pass it into the Stage 2 reliability prompt rather than re-querying.
+2. **Targeted reads only.** Read only files on the failure path identified during triage — not the full codebase.
+3. **Structured output only.** Every sub-agent prompt requires structured markdown output — no prose dumps.
+---
+## Confidence Propagation Contract
+Every sub-agent delegation prompt in this command MUST include the confidence expression requirement below (verbatim). Sub-agents carry the `quality_charter: agents/shared/quality-charter.md` reference in frontmatter, but the orchestrator repeats the directive to override runtime prompt defaults per the charter §1 rule.
+> Confidence expression requirement: rate every recommendation and finding as high/medium/low confidence per the quality charter (`agents/shared/quality-charter.md`). High = verified against live telemetry. Medium = topology/pattern-based, not directly reproduced. Low = best judgment, recommend human review.
+Downstream propagation: every status update, every mitigation gate, and the post-mortem root-cause section MUST carry a high/medium/low rating sourced from the upstream sub-agent. Dropping the signal between stages is a gate failure. A Low-confidence root cause blocks closing the incident.
+---
+## Workflow
+Execute these steps in order. **Do not skip any step.** Ask the user at every checkpoint marked ASK, using the platform-native question tool per `agents/shared/user-question-protocol.md`.
+## Step 0: Triage
+Classify the incident before delegating, using the `skills/hatch3r-incident-response/SKILL.md` Step 1 severity table:
+- **Tier 1 (P3 / minor):** single contained flow, reversible mitigation, no stakeholder paging. Spawn only `hatch3r-incident-responder`; skip Stage 2 reliability reconstruction. Post-mortem optional (recommended only if recurrence-prone).
+- **Tier 2 (P2 / partial degradation):** limited blast radius, reversible mitigation acceptable with a diff preview. Spawn `hatch3r-incident-responder`; run the post-mortem (Stage 3). Add Stage 2 reliability reconstruction if an SLO burned.
+- **Tier 3 (P0/P1 / major incident):** outage, security incident, or wide blast radius. Full pipeline — incident-response specialist with incident-command discipline (no autonomous mutation on P0; human gate on P0/P1 mitigations), parallel `hatch3r-reliability` reconstruction, and a mandatory blameless post-mortem.
+Severity-to-tier is recomputed as blast radius is confirmed: an unconfirmed blast radius classifies upward (P3→P2, P2→P1), never downward.
+### Step 0.5: Emit Pre-Execution Cost Preview
+Before the first sub-agent dispatch (Step 1), surface the cost preview so a delegated incident response is never started blind. Emit the `cost_estimate` block per `rules/hatch3r-cost-visibility.md` Pre-Execution Estimate, calibrated to the Step 0 tier:
+```yaml
+cost_estimate:
+  expected_sa_count: <Tier 1 ~1, Tier 2 ~1-2, Tier 3 ~2 (3 if security-suspected)>
+  estimated_input_tokens_static_frame: <int>
+  estimated_web_research_queries: <int>      # 0 when no research is needed
+  triage_tier: light | standard | deep
+  estimated_duration_min: <int>
+```
+Post-execution actuals + delta land in the iteration summary's Fan-out + Cost section per `rules/hatch3r-cost-visibility.md` Post-Execution Actuals. Token telemetry sources from `src/pipeline/observability.ts`; estimation primitives from `src/pipeline/costEstimator.ts`.
+### Effort Override (Decision 17)
+Auto-tiering can misclassify — a contained nuisance scored Deep, or a creeping outage scored Light. The user override is the recovery path mandated by hatch3r's universal `--effort` override contract ("User overridable via `--effort` flag"):
+- `--effort=light|standard|deep` forces the named tier, bypassing the Step 0 auto-classification.
+- The override wins over the auto-detected tier; record both so the cost estimate block reports the budget delta.
+- The override does NOT suppress the severity-upgrade safety rule: a `--effort=light` run whose blast radius confirms P0/P1 still runs the Tier-3 incident-command discipline (no autonomous mutation on P0; human gate on mitigation). Safety dominates the cost override.
+- No override passed → the Step 0 auto-classification stands.
+---
+### Step 1: Triage + Mitigate + Communicate (Live Lifecycle)
+Spawn `hatch3r-incident-responder` via the Task tool (`subagent_type: "generalPurpose"`) to execute `skills/hatch3r-incident-response/SKILL.md` Steps 1-4 (classify severity, capture topology, mitigate under the Bounded Autonomy & Escalation matrix, communicate to stakeholders).
+The specialist prompt MUST include: the incident brief (symptoms, detection time, observed impact, affected environment, any recent deploys/config changes), the Step 0 tier + severity, all `scope: always` rule directives from `rules/`, a `correlation_id` (UUID v4 per `rules/hatch3r-agent-orchestration.md` → Correlation ID), the confidence expression requirement above, and the bounded-autonomy gate contract (verbatim):
+> Bounded-autonomy gate: prefer the reversible mitigation (flag flip, kill-switch, config revert, scale-up, deploy rollback) over an irreversible one. Emit a diff preview (exact command/flag/config delta) before executing any auto-applied mutation. On a P0 incident, do NOT self-execute — investigate, build the timeline, propose the diff, and return for human approval. On P1, auto-apply only high-confidence reversible actions with a diff preview; medium/low-confidence or irreversible actions escalate to a human gate. Record every action in the incident timeline with actor, timestamp, and gate decision.
+**ASK (mitigation gate — fires on every P0, and on any P1/irreversible action):** "Incident severity {P0-P3}. Proposed mitigation: {one-line + diff preview} (confidence {high/medium/low}, reversible: {yes/no}). Apply? (apply / adjust mitigation / escalate to on-call / investigate further)". For reversible high-confidence mitigations on P2/P3, the specialist may auto-apply with a diff preview and report it — no ASK required.
+After the specialist returns, verify the mitigation against telemetry (error rate dropped, affected flow recovered) before declaring the incident stabilized. If the mitigation introduced a new issue, roll it back immediately and re-derive — per the skill's Error Handling.
+---
+### Step 2: Post-Incident Reliability Reconstruction (Tier 3 / SLO-burn; parallel with Step 3)
+Once the incident is stabilized, spawn `hatch3r-reliability` via the Task tool to reconstruct which CQ4 floors held at incident time — SLO burn, span coverage on the failing path, RED/USE signal availability, resilience-pattern presence on the implicated outbound call. This runs read-only against telemetry, in parallel with the Step 3 post-mortem drafting.
+The reliability prompt MUST include: the stabilized incident summary + topology map from Step 1, the failing service + route, all `scope: always` rule directives, the `correlation_id`, and the confidence expression requirement. Its output feeds the post-mortem's root-cause and action-item sections (e.g. "readiness probe gated on liveness signal — add dependency-health gate" as a follow-up).
+Skip this stage for Tier 1, and for Tier 2 incidents where no SLO burned.
+---
+### Step 3: Blameless Post-Mortem + Runbook + Follow-Ups
+Spawn `hatch3r-incident-responder` to execute `skills/hatch3r-incident-response/SKILL.md` Steps 5-6: write the blameless post-mortem (summary, timeline, root cause, impact, action items, lessons), author an alert-linked runbook for the failure mode, and file one follow-up issue per action item via the project's platform CLI.
+The specialist prompt MUST include: the Step 1 timeline + mitigation record, the Step 2 reliability reconstruction (when run), all `scope: always` rule directives, the `correlation_id`, the confidence expression requirement, and the blameless-post-mortem contract (verbatim):
+> Blameless post-mortem contract: assume every responder acted on the best information available. Focus on contributing causes, not individual fault. Do not name individuals as the cause. The root-cause section carries a confidence rating; a Low-confidence root cause keeps the post-mortem open (do not declare the incident closed). Strip secrets, PII, and proprietary code from the document.
+Skip the post-mortem for Tier 1 incidents unless the failure mode is recurrence-prone.
+---
+### Step 4: Summary + Git Action
+1. Present a concise completion summary:
+```
+Incident Response Complete:
+  Severity:        {P0-P3}
+  Blast radius:    {impacted node | upstream callers | downstream deps}
+  Mitigation:      {one-line — reversible/irreversible, gate decision}
+  Recovery:        {telemetry-verified: error rate dropped / flow recovered}
+  Post-mortem:     {path/issue — blameless, root cause confidence high/medium/low}
+  Follow-ups:      {N issues filed}
+  Confidence:      {high/medium/low — overall incident verdict}
+```
+2. **ASK:** "Incident stabilized and post-mortem drafted. How should I handle the post-mortem + follow-up artifacts in git? (a) commit only, (b) commit and push, (c) skip git — leave in working tree". Applied mitigations on live infrastructure are NOT a git action — they are already recorded in the incident timeline.
+Commit message format: `docs: post-mortem for {incident-slug}` (post-mortem + runbook are documentation/follow-up artifacts). For pushes, fall back to `git push -u origin {branch}` when no upstream exists.
+---
+## Per-Turn Pipeline-State Header (Bypass Protection)
+> Orchestration boilerplate: see `commands/shared/orchestration-frame.md` → Per-Turn Pipeline-State Header. Phase mapping for incident-response: `1` = triage + topology + mitigate + communicate (incident-response specialist), `2` = post-incident reliability reconstruction (reliability), `3` = blameless post-mortem + runbook + follow-ups (incident-response specialist), `4` = summary + git + iteration-summary. Tier 1 runs are exempt per the Tier 1 exemption.
+## End-of-Turn Delegation Attestation (Bypass Protection)
+> Orchestration boilerplate: see `commands/shared/orchestration-frame.md` → End-of-Turn Delegation Attestation. Per-command mutated-file slot: post-mortem document, runbook, follow-up issue drafts, config/flag diffs authored for review. This command has no Tier-1 inline carve-out for file mutations: post-mortem and runbook authoring always flow through the `hatch3r-incident-responder` sub-agent.
+## Iteration Summary (mandatory output)
+Emit the canonical 9-section iteration summary per `rules/hatch3r-iteration-summary.md` as the final user-facing output. The validation gate at `.claude/rules/capability-lifecycle.md` blocks SUCCESS declarations without this block (CONSTITUTION §6 Decision 23).
+The 9 sections:
+1. **Request** — verbatim restatement of the user's ask in one sentence.
+2. **Fan-out + Cost** — `sub_agents_spawned: { count, rationale }` plus the `cost_estimate` / `cost_actuals` / `delta` blocks (see Cost Visibility below).
+3. **Web Research** — every URL fetched with access date + trust tier per `agents/shared/rigor-contract.md` (0 acceptable when no research was needed).
+4. **Files Mutated** — list with diff summary (lines added / removed / files created).
+5. **Gates Passed / Failed** — explicit list per `.claude/rules/capability-lifecycle.md` Gate Checklist.
+6. **Pillar Impact Attribution** — `progress_toward_pillar: <axis>.<pillar_id>+<delta>` per CONSTITUTION §6 Decision 17.
+7. **Verification Commands** — exact commands run with exit codes plus key output lines (≤200 chars).
+8. **Open Questions / Blockers** — explicit `None` if fully closed.
+9. **Learnings Captured** — IDs of any learnings written to `.hatch3r/learnings/` this run per `rules/hatch3r-learning-system.md`.
+### Cost Visibility (Decision 24)
+> Orchestration boilerplate: see `commands/shared/orchestration-frame.md` → Cost Estimate for the 5-field `cost_estimate` schema and the post-execution `cost_actuals` + `delta` contract; both land in Section 2 above.
+---
+## Error Handling
+- **Cannot reproduce the incident locally:** use production telemetry to build the timeline per the skill's Error Handling; record the local-reproduction gap as a post-mortem action item.
+- **Mitigation introduces a new issue:** roll back the mitigation immediately, reassess, apply a more targeted fix; document both the original incident and the mitigation regression in the post-mortem.
+- **Specialist sub-agent failure (Step 1):** the incident is live — surface the partial state and **ASK** immediately (provide missing context / escalate to on-call human / abort delegation and hand the live incident to the operator). Never silently retry a live-mitigation step.
+- **Root cause unconfirmed (all hypotheses Low-confidence):** do not close the incident. State the verdict ("Root cause unconfirmed; top hypothesis confidence=low") and keep the post-mortem open with an investigation action item.
+- **Root cause spans multiple services or teams:** document the cross-service dependency chain, assign follow-ups to the responsible teams, and recommend a joint post-mortem per the skill's Error Handling.
+- **Suspected security breach surfaced mid-incident:** add `hatch3r-security` to the pipeline for the threat assessment; this command retains ownership of the timeline and mitigation discipline.
+## Resumability (Decision 27/30)
+A live incident is long-running and a responder hand-off mid-incident is common, so checkpoint the lifecycle — a resumed run re-enters at the last completed stage rather than re-applying a mitigation already executed or re-filing follow-up issues already filed. Applied live-infra mitigations are recorded in the incident timeline, not the checkpoint, so resumption never re-executes a flag flip or rollback.
+> Orchestration boilerplate: see `commands/shared/orchestration-frame.md` → Checkpoint Contract. Per-command slots: workspace `.incident-workspace/`; step range the Step 0 → Step 4 progression; `wave` = the post-mortem drafting iteration; snapshot/rollback paths every authored artifact (post-mortem document, runbook, follow-up drafts). Write points: after the Step 0 triage, after the Step 1 mitigation record (the mitigation timeline is the source of truth — the checkpoint references it, never re-executes it), after the Step 2 reliability reconstruction, and after the Step 3 post-mortem + follow-ups.
+## Guardrails
+- **Reversibility-first.** Prefer reversible mitigations; an irreversible action escalates one severity band and always routes to a human gate.
+- **No autonomous mutation on P0.** P0 incidents: investigate, build the timeline, propose the diff, page for approval — never self-execute.
+- **Diff preview before apply.** Any auto-applied mutation emits the exact change before execution, never after.
+- **Always delegate.** All file mutation (post-mortem, runbook, follow-up drafts) flows through `hatch3r-incident-responder` via the Task tool — no inline edits from the orchestrator turn.
+- **Blameless post-mortems.** Never assign individual blame; focus on contributing causes.
+- **Confidence propagation.** Every status update, mitigation gate, and post-mortem root-cause section carries a confidence rating from the upstream sub-agent. Dropping the signal is a gate failure.
+- **Hygiene.** Strip secrets, PII, and proprietary code from the post-mortem, the incident channel, and logs.
+- **This command composes existing hatch3r artifacts** (`hatch3r-incident-responder` agent + skill, `hatch3r-reliability`) — it orchestrates the runbook through sub-agents; it does not replace the skill or restate the runbook.
+---
+## References
+- `skills/hatch3r-incident-response/SKILL.md` — the runbook this command orchestrates (severity table, Bounded Autonomy & Escalation matrix, Telemetry Sources, topology capture, six-step post-mortem); accessed 2026-06-02, trust tier: official-docs (in-repo canonical).
+- `agents/hatch3r-incident-responder.md` — the specialist this command delegates the live lifecycle and post-mortem to; accessed 2026-06-02, trust tier: official-docs (in-repo canonical).
+- `commands/hatch3r-bug-pipeline.md` — orchestrator command structure + Per-Turn Header / Delegation Attestation / Iteration Summary / Cost Visibility block patterns mirrored here; accessed 2026-06-02, trust tier: official-docs (in-repo canonical).
+- PagerDuty — "Incident Response Documentation: Severity Levels" (https://response.pagerduty.com/before/severity_levels/) — accessed 2026-06-02, PagerDuty, **official-docs**. Source for the severity-to-response escalation mapping (SEV-1/SEV-2 → major-incident response with incident-commander paging) that the Step 0 tiering and Step 1 mitigation gate map onto the skill's P0-P3 table.
+- Atlassian — "The Atlassian Incident Management Handbook" (https://www.atlassian.com/incident-management/handbook) — accessed 2026-06-02, Atlassian, **official-docs**. Source for incident-command authority (single owner empowered to coordinate, page, and gate) and the blameless-post-mortem-for-SEV2+ practice with a post-incident review within 24-48 hours encoded in Step 3.

package/dist/content/commands/hatch3r-migration-plan.md CHANGED Viewed

@@ -9,15 +9,17 @@ quality_charter: agents/shared/quality-charter.md
 efficiency_patterns: agents/shared/efficiency-patterns.md
 cache_friendly: true
 parallel_tool_default: true
+efficiency_tier: deep
 triage_tiers: [1, 2, 3]
+supports_resume: true
 sub_agents_spawned:
   count: 3
-  rationale: Two parallel hatch3r-researcher modes (changelog-analysis + breaking-change-inventory) in Step 3 followed by a hatch3r-architect for codebase impact mapping and a hatch3r-docs-writer for the plan; serialization only on the research → impact-mapping dependency edge.
+  rationale: Two parallel hatch3r-researcher modes (changelog-analysis + breaking-change-inventory) in Step 3 followed by a hatch3r-architect for codebase impact mapping and a hatch3r-docs-writer for the plan; serialization only on the research → impact-mapping dependency edge. Cost-dominance per CONSTITUTION §2 P8 — token cost never serializes independent work.
 ---
 ## §0 Detect Ambiguity (P8 B1)
-Before any action, scan the user's request and provided context for unresolved questions in scope, acceptance criteria, irreversibility, or constraint conflicts (contradictory inputs, missing target, unknown convention). If any are found, ask the user via the platform-native question tool per `agents/shared/user-question-protocol.md` — do not proceed under silent assumption. This is the default path, not an exception. Acceptable to proceed without asking ONLY when scope is single-target, single-concern, and the brief alone is testable. Any residual ambiguity discovered mid-workflow invokes the same protocol.
+> Orchestration boilerplate: see `commands/shared/orchestration-frame.md` → §0 Detect Ambiguity (P8 B1). Triggers: contradictory inputs, missing target, unknown convention.
 ## Agent Pipeline
@@ -27,6 +29,8 @@ Before any action, scan the user's request and provided context for unresolved q
 | 2. Impact Analysis | `hatch3r-architect` | No | Yes |
 | 3. Plan Generation | `hatch3r-docs-writer` | No | Yes |
+**Parallel-safety conditions** (per `rules/hatch3r-agent-orchestration.md` §Parallel Safety): every parallel fan-out above holds all three — read-only or disjoint writes, deterministic aggregation, no shared mutable state.
 # Migration Plan — Dependency or Framework Upgrade from Assessment to Phased Execution
 Take a dependency or framework upgrade target and produce a complete migration plan (`docs/migrations/`), rollback procedures for each phase, and structured `todo.md` entries ready for `hatch3r-board-fill`. Spawns parallel researcher sub-agents (dependency changelog analysis, breaking change inventory) followed by an architect for codebase impact mapping, then a docs-writer for plan generation. AI proposes all outputs; user confirms before any files are written. Optionally chains into `hatch3r-board-fill` to create GitHub issues immediately.
@@ -46,6 +50,12 @@ Take a dependency or framework upgrade target and produce a complete migration p
 ---
+## Confidence Propagation Contract
+> Orchestration boilerplate: see `commands/shared/orchestration-frame.md` → Confidence Propagation Contract. Readiness kind: plan.
+---
 ## Workflow
 Execute these steps in order. **Do not skip any step.** Ask the user at every checkpoint marked with ASK.
@@ -60,6 +70,25 @@ Classify the migration request before delegating:
 If Tier 1, run a condensed pipeline that skips the architect when no breaking changes exist. If Tier 2, run the standard pipeline below. If Tier 3, run the full pipeline including incremental-vs-direct trade-off analysis and confirm phasing with the user before writing files.
+### Step 0.5: Emit Pre-Execution Cost Preview
+Before the first researcher dispatch (Step 3), surface the cost preview so a multi-researcher migration-planning run is never started blind. Emit the `cost_estimate` block per `rules/hatch3r-cost-visibility.md` Pre-Execution Estimate, calibrated to the Step 0 triage tier:
+```yaml
+cost_estimate:
+  expected_sa_count: <triage tier → Tier 1 ~1, Tier 2 ~3, Tier 3 up to 3>
+  estimated_input_tokens_static_frame: <int>
+  estimated_web_research_queries: <int>
+  triage_tier: light | standard | deep
+  estimated_duration_min: <int>
+```
+Post-execution actuals + delta land in the iteration summary's Fan-out + Cost section per `rules/hatch3r-cost-visibility.md` Post-Execution Actuals. Token telemetry sources from `src/pipeline/observability.ts`.
+### Effort Override (Decision 17)
+> Orchestration boilerplate: see `commands/shared/orchestration-frame.md` → Effort Override (Decision 17). Misclassification example: a minor version bump scored as Deep, or a framework migration scored as Light.
 ---
 ### Step 1: Gather Migration Target
@@ -101,7 +130,7 @@ After the migration brief is confirmed, probe for missing context. Analyze the b
    - **Bundle/binary size**: Are there known size regressions in the target version?
    - **Type system**: Does the upgrade introduce stricter types or remove type exports?
-Skip dimensions that the migration brief already addresses clearly.
+Skip dimensions that the migration brief already addresses with a stated answer.
 **ASK:** "Before research begins, I have {N} questions to confirm coverage of all migration dimensions:
 {numbered question list — each with the dimension label and why the answer matters}
@@ -343,6 +372,53 @@ If yes, instruct the user to invoke the `hatch3r-board-fill` command. Board-fill
 ---
+## Resumability (Decision 27/30)
+migration-plan is long-running — a Tier 3 multi-major-version or framework migration fans out two parallel hatch3r-researcher modes (dependency-changelog, breaking-change-inventory) in Step 3, then runs hatch3r-architect for codebase impact mapping (Step 4) and hatch3r-docs-writer for phased plan generation (Step 5). Per hatch3r's workspace-checkpointed resumability contract, checkpoint progress so an interrupted run re-enters at the last completed step rather than re-running the changelog research and re-deriving the breaking-change inventory.
+> Orchestration boilerplate: see `commands/shared/orchestration-frame.md` → Checkpoint Contract. Per-command slots: workspace `.migration-plan-workspace/`; step range the Step 0 → Step 8 progression; `wave` = researcher-batch index across the 2 parallel modes; snapshot/rollback paths `docs/migrations/`, `docs/adr/`, and `todo.md`. Write points: after Step 1 migration-target context locks, after Step 2 scope ASK, after the Step 3 two-researcher fan-out returns, after Step 4 architect impact-mapping returns, after Step 5 docs-writer plan synthesis is confirmed by ASK, after each Step 6 file write (`docs/migrations/`, `docs/adr/`), after Step 7 todo.md phased-entry generation, and after the optional Step 8 chain-to-`hatch3r-board-fill` handoff.
+---
+## Per-Turn Pipeline-State Header (Bypass Protection)
+> Orchestration boilerplate: see `commands/shared/orchestration-frame.md` → Per-Turn Pipeline-State Header. Phase mapping for migration-plan: `1` = source/target intake + scope detection, `2` = researcher sub-agent dispatch (consumer enumeration, expand-contract phasing), `3` = plan synthesis + rollback drafting, `4` = plan write + iteration-summary. Tier 1 runs are exempt per the Tier 1 exemption.
+## End-of-Turn Delegation Attestation (Bypass Protection)
+> Orchestration boilerplate: see `commands/shared/orchestration-frame.md` → End-of-Turn Delegation Attestation. Per-command mutated-file slot: plan document, phase specs, rollback scripts.
+## Iteration Summary (mandatory output)
+Emit the canonical 9-section iteration summary per `rules/hatch3r-iteration-summary.md` as the final user-facing output. The validation gate at `.claude/rules/capability-lifecycle.md` blocks SUCCESS declarations without this block (CONSTITUTION §6 Decision 23).
+The 9 sections:
+1. **Request** — verbatim restatement of the user's ask in one sentence.
+2. **Fan-out + Cost** — `sub_agents_spawned: { count, rationale }` plus the `cost_estimate` / `cost_actuals` / `delta` blocks (see Cost Visibility below).
+3. **Web Research** — every URL fetched with access date + trust tier per `agents/shared/rigor-contract.md` (0 acceptable when no research was needed).
+4. **Files Mutated** — list with diff summary (lines added / removed / files created).
+5. **Gates Passed / Failed** — explicit list per `.claude/rules/capability-lifecycle.md` Gate Checklist.
+6. **Pillar Impact Attribution** — `progress_toward_pillar: <axis>.<pillar_id>+<delta>` per CONSTITUTION §6 Decision 17.
+7. **Verification Commands** — exact commands run with exit codes plus key output lines (≤200 chars).
+8. **Open Questions / Blockers** — explicit `None` if fully closed.
+9. **Learnings Captured** — IDs of any learnings written to `.hatch3r/learnings/` this run per `rules/hatch3r-learning-system.md`.
+### Cost Visibility (Decision 24)
+> Orchestration boilerplate: see `commands/shared/orchestration-frame.md` → Cost Estimate for the 5-field `cost_estimate` schema and the post-execution `cost_actuals` + `delta` contract; both land in Section 2 above.
+## Cost estimate (Decision 24)
+This command emits cost transparency per `rules/hatch3r-cost-visibility.md` and CONSTITUTION §6 Decision 24/29:
+- **Pre-execution `cost_estimate`** — emitted in Step 0.5 before the first researcher dispatch.
+- **Post-execution `cost_actuals` + `delta`** — appended to the iteration summary's Fan-out + Cost section per `rules/hatch3r-iteration-summary.md` §2.
+Per-tier `expected_sa_count` calibration (from frontmatter `sub_agents_spawned.count: 3` × tier heuristic in `rules/hatch3r-cost-visibility.md` Pre-Execution Estimate): Tier 1 ≈ 1 (two researchers, architect skipped when no breaking changes); Tier 2 ≈ 3 (two parallel researchers + architect, docs-writer); Tier 3 up to 3 (same fan-out, deeper changelog + incremental-vs-direct analysis). Deltas beyond 25% absolute value carry `flagged_for_review: true`. Token telemetry sources from `src/pipeline/observability.ts`; estimation primitives from `src/pipeline/costEstimator.ts`.
+---
 ## Error Handling
 - **No changelog available:** Fall back to git diff of the source repository between version tags. If unavailable, rely on community migration guide researcher output only and warn the user that the breaking change inventory may be incomplete.