npm - @brunosps00/dev-workflow - Versions diffs - 0.11.0 → 0.15.0 - Mend

@brunosps00/dev-workflow 0.11.0 → 0.15.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (127) hide show

package/scaffold/skills/dw-incident-response/SKILL.md ADDED Viewed

@@ -0,0 +1,164 @@
+---
+name: dw-incident-response
+description: Use when a production incident is reported, when writing a postmortem, or when an on-call handoff is needed. Five-phase guided workflow (triage → investigation → resolution → communication → postmortem) with checkpoints between phases and structured output files persisted to .dw/incidents/.
+---
+# Incident Response
+> **Inspired by** [`wilsto/claude-code-starter-kit/incident-response`](https://github.com/wilsto/claude-code-starter-kit) (MIT). Five-phase workflow structure and runbook templates adapted from that skill; specifics rewritten for dev-workflow's `.dw/` namespace and command surface.
+> wilsto credits the original `wshobson/agents` plugin `incident-response` (v1.3.0). Attribution chain: wshobson → wilsto → dev-workflow.
+## When to use
+- A production incident is declared (SEV-1 through SEV-3).
+- You need to write a postmortem after an incident.
+- You're generating or updating a runbook for a service.
+- You need an on-call handoff template.
+- `/dw-bugfix` detects severity `critical` + production marker — auto-escalates here.
+## Key concepts
+### Severity classification
+| Severity | Criteria | Response time | Example |
+|----------|----------|---------------|---------|
+| **SEV-1 (Critical)** | Service down, data loss, security breach | Immediate (page) | Payment system offline |
+| **SEV-2 (Major)** | Significant degradation, partial outage | < 30 min | API latency 10× normal |
+| **SEV-3 (Minor)** | Limited impact, workaround exists | < 4 hours | One endpoint returning 500s |
+| **SEV-4 (Low)** | Cosmetic, non-urgent | Next business day | Dashboard chart broken |
+See `references/severity-and-triage.md` for full criteria and triage commands per stack.
+### Behavioral rules
+1. **Execute phases in order** — never skip a phase.
+2. **Write output files after each phase** — they are the record of truth for the next phase.
+3. **STOP at checkpoints** — wait for user confirmation before proceeding.
+4. **Halt on failure** — if a step fails, do not continue to the next phase.
+5. **File-based context** — read previous phase outputs rather than relying on conversation memory.
+## Entry questions
+Before starting any phase:
+1. **What's happening?** Describe the incident in 1–2 sentences. What's broken and what's the user impact?
+2. **Severity?** SEV-1 / SEV-2 / SEV-3 / SEV-4 per the table above.
+3. **Mode?**
+   - **Full workflow** — all 5 phases (triage → postmortem).
+   - **Postmortem only** — incident already resolved; skip to Phase 5.
+   - **Runbook generation** — produce a runbook template for a service (no live incident).
+## The five phases
+Each phase writes to `.dw/incidents/<YYYY-MM-DD>-<slug>/`. Slug is auto-generated from incident title (kebab-case, ≤30 chars).
+### Phase 1 — Detection & Triage
+**Output:** `.dw/incidents/<date>-<slug>/01-triage.md`
+Steps:
+1. Classify severity using the table above.
+2. Assess blast radius: which services, how many users affected, revenue impact if known.
+3. Identify immediate mitigation: rollback, feature flag toggle, traffic redirect.
+See `references/severity-and-triage.md` for diagnostic commands per stack (Kubernetes, Docker, generic HTTP).
+**Checkpoint:** present triage summary. Wait for user confirmation before moving to investigation.
+### Phase 2 — Investigation & Root Cause
+**Output:** `.dw/incidents/<date>-<slug>/02-investigation.md`
+Steps:
+1. Build timeline: when did it start? What changed?
+2. Correlate signals: metrics spike + deploy + error logs.
+3. Hypothesis testing: one theory at a time; verify each before moving on.
+4. Identify root cause: not the first symptom, but the underlying assumption that broke.
+Common forensic tools:
+- `git bisect` for regressions.
+- Recent deploy log: `git log --oneline --since="24 hours ago"`.
+- For live monitoring during investigation, `/dw-debug-protocol` flaky-investigation patterns apply.
+**Checkpoint:** present root-cause hypothesis. Wait for user confirmation before applying fix.
+### Phase 3 — Resolution & Recovery
+**Output:** `.dw/incidents/<date>-<slug>/03-resolution.md`
+Steps:
+1. Apply fix: hotfix branch → fast PR (via `/dw-generate-pr`) → deploy.
+2. Verify: health checks green, error rate back to baseline.
+3. Monitor 30 minutes post-fix for SEV-1/2 to confirm stability.
+**Checkpoint:** confirm full recovery before drafting communications.
+### Phase 4 — Communication
+**Output:** `.dw/incidents/<date>-<slug>/04-communication.md`
+Two communications generated using the templates in `references/communication-templates.md`:
+- **Initial notification** (sent during the incident; updated every 30 min for SEV-1/2).
+- **Resolution notification** (sent when phase 3 confirms recovery).
+### Phase 5 — Postmortem
+**Output:** `.dw/incidents/<date>-<slug>/05-postmortem.md`
+Generate a **blameless** postmortem using `references/postmortem-template.md`. Sections:
+- Summary (2–3 sentences).
+- Timeline (per-minute events from alert to all-clear).
+- Root cause (technical, no blame).
+- Impact (users affected, revenue, error-budget consumed).
+- What went well / What went wrong.
+- Action items (owner + due date + priority — see `references/blameless-discipline.md` for the quality bar).
+**Quality bar for action items:** see `references/blameless-discipline.md`. "Improve monitoring" does NOT count. "Add Datadog SLO alert at p99 > 800ms with on-call routing by 2026-06-01, owner: @bruno" counts.
+## Required reading by context
+| Doing what | Read |
+|------------|------|
+| Live incident — triage | `references/severity-and-triage.md` |
+| Writing the postmortem | `references/postmortem-template.md` + `references/blameless-discipline.md` |
+| Drafting incident communications | `references/communication-templates.md` |
+| Generating a runbook (no live incident) | `references/runbook-templates.md` |
+| On-call handoff document | `references/runbook-templates.md` (handoff section) |
+## Common pitfalls
+Detailed in `references/blameless-discipline.md`:
+1. **Skipping triage** — jumping to debug without assessing severity/blast-radius wastes the wrong hours.
+2. **Blame culture** — postmortems focused on "who did it" hide mistakes and incidents recur.
+3. **No action items** — postmortem filed, forgotten, same incident in 3 months.
+4. **Communicating too late** — users discover the outage before the team acknowledges; trust erodes.
+## Integration with dev-workflow commands
+- `/dw-bugfix` with severity `critical` + production marker → offers to escalate here.
+- `/dw-autopilot --incident "X"` → runs this workflow end-to-end for declared incidents.
+- `/dw-analyze-project` reads `.dw/incidents/` on next execution to surface recurring failure patterns. 3+ incidents touching the same area → flag as "structural problem; needs design review" and propose constitution principles based on observed patterns.
+- `/dw-generate-pr` is the fix-deployment path during Phase 3.
+- `/dw-adr` is the right tool when the postmortem leads to a deliberate architectural change.
+## Output directory layout
+```
+.dw/incidents/
+├── 2026-05-12-checkout-payment-outage/
+│   ├── 01-triage.md
+│   ├── 02-investigation.md
+│   ├── 03-resolution.md
+│   ├── 04-communication.md
+│   └── 05-postmortem.md
+└── 2026-05-08-search-index-stale/
+    └── 05-postmortem.md     # postmortem-only mode
+```
+Files are committed to the repo alongside code — incidents are part of the project history, not ephemeral chat.
+## Why this skill exists
+dev-workflow's existing surface is "build feature → ship." Nothing covered "production broke, what now?" Teams improvised postmortems, action items got lost, and the same bug recurred. This skill closes that loop: structured response in the moment, blameless reflection after, and cross-incident learning that feeds back into the project's constitution.

package/scaffold/skills/dw-incident-response/references/blameless-discipline.md ADDED Viewed

@@ -0,0 +1,126 @@
+# Blameless discipline — the principles behind postmortems
+The postmortem template imposes structure. This reference imposes the discipline that makes the structure useful.
+## Why blameless
+Postmortems with blame produce three failure modes:
+1. **Hidden mistakes** — people learn to omit information that might reflect badly on them.
+2. **Compliance theater** — "lessons learned" becomes a ritual, not a tool.
+3. **Recurring incidents** — the same kind of failure happens 6 months later because the underlying system never changed.
+Blameless framing forces the conversation toward what's actually fixable: systems, processes, assumptions, tooling. Individual mistakes are the entry point; the question is always "why did the system make this mistake easy/likely?"
+## The 5-whys protocol
+Stack five "why" questions until you reach an assumption or design choice that's actually changeable:
+> **Symptom:** payment endpoint returned 500 for 47 minutes.
+> **Why?** The new deploy introduced a regression in the order serializer.
+> **Why?** The serializer started reading a field that the upstream API stopped sending.
+> **Why?** The upstream API changed its response format two weeks ago and we didn't know.
+> **Why?** We don't have contract tests against the upstream API.
+> **Why?** Contract tests were considered too expensive to set up; the trade-off was never revisited.
+**Root cause:** absence of contract tests for upstream APIs (a deliberate-but-stale design decision), not "the developer didn't check the response format."
+If you reach "operator error" at any why, you stopped too early. Operator error happens because a system permits it.
+## Quality bar for action items
+The single most common postmortem failure mode: action items written vaguely so they can be "done" without changing anything.
+### Bad action items
+- "Improve monitoring."
+- "Add more tests."
+- "Document the runbook."
+- "Make the system more resilient."
+- "Better communication during incidents."
+These are wishes, not actions. They can't be tracked. They will not happen.
+### Good action items
+Every action item has THREE components: owner, due date, measurable outcome.
+| Action | Owner | Due | Measurable outcome |
+|--------|-------|-----|-------------------|
+| Add Datadog SLO alert at p99 > 800ms with PagerDuty routing to the payments on-call schedule | @bruno | 2026-06-01 | Alert exists in Datadog UI and fires successfully in test |
+| Add idempotency-key header to `POST /api/orders` with 24h deduplication window | @maria | 2026-05-25 | Two identical requests with same key return the same order ID; verified by integration test |
+| Open ADR documenting decision to add circuit breaker on Stripe API calls | @bruno | 2026-05-19 | ADR exists in `.dw/spec/<prd>/adrs/`, status: Accepted |
+**Test:** can a third party verify the action item is done without asking the owner? If yes, it's good. If no, rewrite.
+## Cognitive analysis — beyond the trigger
+Two questions to push past "the code was wrong":
+### 1. Why didn't we catch this earlier?
+This is about the blind spot. The bug existed; we didn't see it. What's the gap in our:
+- Tests (unit, integration, contract, E2E)?
+- Monitoring (metric, alert, dashboard)?
+- Process (code review, deploy checks, canary)?
+- Documentation (knew but forgot, never knew)?
+Fix the blind spot, not just the bug.
+### 2. What ELSE could be hiding behind this gap?
+If contract tests against the upstream API are missing → the payment incident is one instance. What ELSE depends on that API in ways the missing tests would catch?
+The action item is to add the contract tests, not "fix the payment serializer."
+## Cross-incident learning
+`/dw-analyze-project` reads `.dw/incidents/` on subsequent runs. Patterns to watch:
+- **3+ incidents in the same module** (e.g., billing): structural problem. Open a design-review issue, not another action item.
+- **3+ incidents with the same root-cause class** (e.g., contract drift, missing idempotency): a constitution principle is needed (`/dw-adr` + add to `.dw/constitution.md`).
+- **Time clustering** (multiple incidents during the same week): possible stress on the team's review/deploy capacity — process issue, not technical.
+These patterns are MORE valuable than individual postmortems. Single incidents tell you what broke; patterns tell you what's fragile.
+## Common pitfalls (the four big ones)
+### Pitfall 1: Skipping triage
+**Symptom:** jumping straight to debugging without assessing severity and blast radius.
+**Consequence:** wrong priority — might fix a low-impact bug while a high-impact issue festers.
+**Fix:** always classify severity first. Two minutes of triage saves hours of misguided investigation.
+### Pitfall 2: Blame culture
+**Symptom:** postmortem focuses on "who did it" instead of "why did the system allow it?"
+**Consequence:** people hide mistakes; incidents recur.
+**Fix:** blameless framing. Focus on systemic fixes — better monitoring, safer deploys, guardrails, removed footguns.
+### Pitfall 3: No action items
+**Symptom:** postmortem written, filed, forgotten.
+**Consequence:** same incident in 3 months.
+**Fix:** every postmortem has concrete action items with owners, due dates, measurable outcomes (see "Quality bar" above). Track them in the same system as feature work.
+### Pitfall 4: Communicating too late
+**Symptom:** users discover the outage before the team acknowledges it.
+**Consequence:** trust erosion + support ticket flood.
+**Fix:** first communication within 15 min for SEV-1/2, even if it's "We're investigating." Status page updates every 30 min until resolved.
+## When the discipline bends
+- **Internal-only tools:** the communication discipline can be lighter (no public status page).
+- **Compliance-driven postmortems** (SOC 2, HIPAA): may require additional fields or sign-off chains beyond the template. Add them as a project-specific extension.
+- **Trivial near-misses** (caught in staging before production): consider a "lite" postmortem — 1 page with timeline + lesson + action items. Not full structure.
+In all bend cases, document the deviation in the postmortem itself. "Skipped public communication because internal-only tool" is fine; just say it.
+## Reference reading
+The discipline above is distilled from:
+- Google SRE Book — [Incident Response](https://sre.google/sre-book/managing-incidents/) and [Postmortem Culture](https://sre.google/sre-book/postmortem-culture/).
+- Etsy — [Debriefing Facilitation Guide](https://github.com/etsy/DebriefingFacilitationGuide) (the original blameless postmortem playbook).
+- PagerDuty — [Incident Response Documentation](https://response.pagerduty.com/).
+The dev-workflow version adapts these to a single-team or small-org scale; large enterprises may need more layers.

package/scaffold/skills/dw-incident-response/references/communication-templates.md ADDED Viewed

@@ -0,0 +1,107 @@
+# Communication templates
+Two messages get drafted during Phase 4: an initial notification (sent during the incident) and a resolution notification (sent at all-clear). Both go through the same channels — status page, support team, internal Slack/Teams, customer email if applicable.
+## Initial notification
+Sent as soon as the incident is acknowledged. Update every 30 min for SEV-1/2 until resolved.
+```markdown
+## Incident: <Title>
+**Severity:** SEV-<X>
+**Status:** Investigating | Mitigated | Resolved
+**Impact:** <What users experience — be specific>
+**Started:** YYYY-MM-DD HH:MM UTC
+**Next update:** HH:MM UTC
+We are aware of <symptom> affecting <surface>. The team is actively investigating.
+<Optional: known workaround if any.>
+We will post the next update by HH:MM UTC.
+```
+### Update cadence
+- **SEV-1:** every 15 min until mitigated, then every 30 min until resolved.
+- **SEV-2:** every 30 min throughout.
+- **SEV-3:** every 2 hours.
+- **SEV-4:** initial + resolution; no interim updates.
+### Status progression
+The `Status` field moves through three values:
+1. **Investigating** — we know it's happening; cause unknown.
+2. **Mitigated** — symptoms are reduced or contained; root cause may still be unresolved.
+3. **Resolved** — symptoms gone, monitoring shows baseline, no follow-up expected.
+### Tone
+- Direct. No "we're experiencing a slight disruption."
+- Specific. Say which feature; don't generalize.
+- Honest. If you don't know the cause, say so.
+- No blame. Don't mention which team owns the area in public-facing comms.
+## Resolution notification
+Sent when Phase 3 confirms recovery.
+```markdown
+## Resolved: <Title>
+**Duration:** Xh Ym (from HH:MM to HH:MM UTC)
+**Root cause:** <1-2 sentences, technical but accessible>
+**Fix applied:** <what was done — link to deploy / PR if public-facing isn't a concern>
+**Postmortem:** Will be published within 48 hours at <link>
+We apologize for the disruption. Customers affected by <specific issue> can <specific recovery action if applicable>.
+Questions: <support@example.com> or <#incident-channel>.
+```
+### What to include
+- Specific duration in minutes/hours.
+- Root cause that's accurate but readable by non-engineers.
+- Concrete remedy (refund window, retry-now action, etc.) if customers need to do something.
+- Postmortem link or commitment to publish.
+### What NOT to include
+- Names of individuals.
+- Detailed stack traces or internal tool names.
+- Speculation on prevention before the postmortem is done.
+- Apologies that promise "this will never happen again" — incidents do recur; promise to learn.
+## Internal-only versions
+For internal tools without external users, drop the apology section and add:
+```markdown
+## Internal incident — <Title>
+**Severity:** SEV-X
+**Affected team(s):** <list>
+**Duration:** Xh Ym
+**Cause:** <technical>
+**Fix:** <commit/deploy>
+**Action items:** see postmortem at <link>
+```
+Shorter, more technical, no need to be customer-friendly.
+## Status page updates
+If using a status page (Statuspage, Cachet, Better Uptime, etc.):
+- Component-level granularity: mark which specific service is degraded, not "platform down."
+- Match the severity to the visual indicator (degraded vs partial outage vs major outage).
+- Close the incident on the status page only after the resolution notification confirms recovery.
+## When the communication discipline bends
+- **Internal-only tools:** initial + resolution only; no 30-min updates needed.
+- **Security incidents:** legal/compliance may need to approve every external message. Build that handoff into Phase 4.
+- **Customer-specific incidents** (affects 1-2 enterprise customers, not the whole user base): direct contact to those customers replaces public status page.
+Document the deviation in `04-communication.md`.

package/scaffold/skills/dw-incident-response/references/postmortem-template.md ADDED Viewed

@@ -0,0 +1,133 @@
+# Postmortem template — blameless, action-oriented
+Use this template for the Phase 5 output (`05-postmortem.md`). Every field is mandatory unless explicitly marked optional.
+The discipline behind the template lives in `blameless-discipline.md` — read both.
+## Template
+```markdown
+# Postmortem: <Incident title>
+**Date:** YYYY-MM-DD
+**Duration:** Xh Ym (from detection to all-clear)
+**Severity:** SEV-X
+**Authors:** @name1, @name2
+**Status:** Draft | Reviewed | Published
+## Summary
+[2-3 sentences. What happened? What was the user-visible impact? How was it resolved?
+No blame, no causes yet — just the observable narrative.]
+## Timeline
+All times UTC. Aim for ±5 min granularity; tighter for SEV-1/2.
+| Time (UTC) | Event |
+|------------|-------|
+| HH:MM | First alert fired (or first user report) |
+| HH:MM | On-call acknowledged |
+| HH:MM | War room opened / incident channel created |
+| HH:MM | Initial triage complete; severity confirmed |
+| HH:MM | Hypothesis: <X> |
+| HH:MM | Hypothesis rejected — observed <Y> |
+| HH:MM | Root cause identified |
+| HH:MM | Fix applied: <commit SHA / deploy>
+| HH:MM | Verified recovery (health checks green, error rate baseline)
+| HH:MM | All-clear confirmed; communications sent |
+## Root cause
+[Technical explanation. Walk through the chain:
+1. What was the proximate cause? (the immediate trigger)
+2. What was the underlying assumption that broke?
+3. Why didn't existing safeguards catch it?
+No blame. No "X did Y wrong." Frame as: "the system allowed Z because A was assumed."]
+## Impact
+- **Users affected:** N (counting method: <how we estimated>)
+- **Geography:** [if applicable]
+- **Revenue impact:** $X estimated (calculation: <method>)
+- **Error budget consumed:** X% of monthly SLO
+- **Data impact:** [data lost / corrupted? recoverable?]
+## Detection
+- **How was it detected?** [alert / user report / scheduled check]
+- **Time to detection (TTD):** X minutes
+- **Was the detection signal adequate?** [yes/no — and what would improve it]
+## What went well
+Concrete things to keep:
+- [item — specific, not "we worked as a team"]
+- [item]
+- [item]
+## What went wrong
+Process / tooling / assumptions that failed:
+- [item — specific failure mode, no blame]
+- [item]
+- [item]
+## Where we got lucky
+[Optional. What worked by accident that we shouldn't rely on next time?
+e.g., "The bug only fired during business hours because of <X>; could have been overnight and worse."]
+## Action items
+Quality bar: each item has **owner**, **due date**, and **measurable outcome**. "Improve monitoring" does NOT count. Read `blameless-discipline.md` for the bar.
+| Action | Owner | Due date | Priority | Tracking |
+|--------|-------|----------|----------|----------|
+| Add Datadog SLO alert at p99 > 800ms with PagerDuty routing | @bruno | 2026-06-01 | P1 | Linear ENG-1234 |
+| Add idempotency key to /api/orders POST | @maria | 2026-05-25 | P1 | GitHub #5678 |
+| Update runbook with new mitigation steps | @bruno | 2026-05-19 | P2 | Linear ENG-1235 |
+| Constitution principle: every state-changing endpoint requires audit log | @team | 2026-06-15 | P3 | ADR-042 |
+## Constitution implications
+[Did this incident reveal a principle the team should commit to? If yes, an ADR
+should be opened via `/dw-adr`. Examples:
+- "Every write to billing tables requires audit log entry"
+- "Database migrations must run in a separate maintenance window"
+- "External API calls must have circuit breakers"
+Link the ADR slug here.]
+## Cross-incident pattern
+[Have we seen this category of incident before? If yes, list the prior incidents.
+3+ similar incidents = structural problem; needs design review, not just an action item.]
+## References
+- Incident channel: [Slack link]
+- Diagnostic dashboards: [Grafana / Datadog links]
+- Affected commits: [SHA range]
+- Related ADRs: [list]
+- Related runbook: `<path>` (updated in this incident if applicable)
+```
+## Quality checklist before publishing
+- [ ] Summary is 2-3 sentences — not a wall of text.
+- [ ] Timeline events have specific times (not "around lunch time").
+- [ ] Root cause section explains the ASSUMPTION, not just the trigger.
+- [ ] Every action item has owner + due date + measurable outcome.
+- [ ] Action items are filed in the actual tracker (Linear/Jira/GitHub) with links.
+- [ ] No names appear in "what went wrong" — only systems, processes, and assumptions.
+- [ ] Cross-incident pattern section reviewed against `.dw/incidents/` for prior occurrences.
+- [ ] If a constitution principle emerged, ADR opened via `/dw-adr` and linked.
+## Publishing
+- Postmortems for SEV-1/2 are shared with stakeholders within 48 hours.
+- Postmortems for SEV-3 are reviewed in the next team retro.
+- Action items track in the project's normal tooling — not in the postmortem file.
+- The postmortem file in `.dw/incidents/` is the canonical write-up; everything else (Slack threads, status pages) eventually points back here.