@brunosps00/dev-workflow 0.11.0 → 0.15.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (127)
  1. package/README.md +54 -5
  2. package/lib/constants.js +20 -20
  3. package/lib/init.js +24 -1
  4. package/lib/migrate-skills.js +129 -0
  5. package/lib/removed-bundled-skills.js +16 -0
  6. package/lib/uninstall.js +6 -2
  7. package/lib/utils.js +43 -1
  8. package/package.json +1 -1
  9. package/scaffold/en/agent-instructions.md +68 -0
  10. package/scaffold/en/commands/dw-autopilot.md +1 -1
  11. package/scaffold/en/commands/dw-brainstorm.md +1 -1
  12. package/scaffold/en/commands/dw-bugfix.md +4 -3
  13. package/scaffold/en/commands/dw-code-review.md +1 -0
  14. package/scaffold/en/commands/dw-create-tasks.md +6 -0
  15. package/scaffold/en/commands/dw-create-techspec.md +1 -1
  16. package/scaffold/en/commands/dw-deps-audit.md +1 -1
  17. package/scaffold/en/commands/dw-fix-qa.md +1 -1
  18. package/scaffold/en/commands/dw-functional-doc.md +2 -2
  19. package/scaffold/en/commands/dw-help.md +2 -2
  20. package/scaffold/en/commands/dw-redesign-ui.md +7 -7
  21. package/scaffold/en/commands/dw-run-qa.md +5 -4
  22. package/scaffold/en/commands/dw-run-task.md +2 -2
  23. package/scaffold/en/templates/constitution-template.md +1 -1
  24. package/scaffold/pt-br/agent-instructions.md +68 -0
  25. package/scaffold/pt-br/commands/dw-autopilot.md +1 -1
  26. package/scaffold/pt-br/commands/dw-brainstorm.md +1 -1
  27. package/scaffold/pt-br/commands/dw-bugfix.md +4 -3
  28. package/scaffold/pt-br/commands/dw-code-review.md +1 -0
  29. package/scaffold/pt-br/commands/dw-create-tasks.md +6 -0
  30. package/scaffold/pt-br/commands/dw-create-techspec.md +1 -1
  31. package/scaffold/pt-br/commands/dw-deps-audit.md +1 -1
  32. package/scaffold/pt-br/commands/dw-fix-qa.md +1 -1
  33. package/scaffold/pt-br/commands/dw-functional-doc.md +2 -2
  34. package/scaffold/pt-br/commands/dw-help.md +2 -2
  35. package/scaffold/pt-br/commands/dw-redesign-ui.md +7 -7
  36. package/scaffold/pt-br/commands/dw-run-qa.md +5 -4
  37. package/scaffold/pt-br/commands/dw-run-task.md +2 -2
  38. package/scaffold/pt-br/templates/constitution-template.md +1 -1
  39. package/scaffold/skills/dw-council/SKILL.md +1 -1
  40. package/scaffold/skills/dw-incident-response/SKILL.md +164 -0
  41. package/scaffold/skills/dw-incident-response/references/blameless-discipline.md +126 -0
  42. package/scaffold/skills/dw-incident-response/references/communication-templates.md +107 -0
  43. package/scaffold/skills/dw-incident-response/references/postmortem-template.md +133 -0
  44. package/scaffold/skills/dw-incident-response/references/runbook-templates.md +169 -0
  45. package/scaffold/skills/dw-incident-response/references/severity-and-triage.md +186 -0
  46. package/scaffold/skills/dw-llm-eval/SKILL.md +148 -0
  47. package/scaffold/skills/dw-llm-eval/references/agent-eval.md +252 -0
  48. package/scaffold/skills/dw-llm-eval/references/judge-calibration.md +169 -0
  49. package/scaffold/skills/dw-llm-eval/references/oracle-ladder.md +171 -0
  50. package/scaffold/skills/dw-llm-eval/references/rag-metrics.md +186 -0
  51. package/scaffold/skills/dw-llm-eval/references/reference-dataset.md +190 -0
  52. package/scaffold/skills/dw-testing-discipline/SKILL.md +171 -0
  53. package/scaffold/skills/dw-testing-discipline/references/agent-guardrails.md +170 -0
  54. package/scaffold/skills/dw-testing-discipline/references/anti-patterns.md +336 -0
  55. package/scaffold/skills/dw-testing-discipline/references/core-rules.md +128 -0
  56. package/scaffold/skills/dw-testing-discipline/references/flaky-discipline.md +163 -0
  57. package/scaffold/skills/dw-testing-discipline/references/patterns.md +241 -0
  58. package/scaffold/skills/dw-testing-discipline/references/playwright-recipes.md +282 -0
  59. package/scaffold/skills/{webapp-testing → dw-testing-discipline}/references/security-boundary.md +1 -1
  60. package/scaffold/skills/dw-ui-discipline/SKILL.md +150 -0
  61. package/scaffold/skills/dw-ui-discipline/references/accessibility-floor.md +225 -0
  62. package/scaffold/skills/dw-ui-discipline/references/curated-defaults.md +195 -0
  63. package/scaffold/skills/dw-ui-discipline/references/hard-gate.md +162 -0
  64. package/scaffold/skills/dw-ui-discipline/references/state-matrix.md +101 -0
  65. package/scaffold/skills/dw-ui-discipline/references/visual-slop.md +152 -0
  66. package/scaffold/skills/ui-ux-pro-max/LICENSE +0 -21
  67. package/scaffold/skills/ui-ux-pro-max/SKILL.md +0 -659
  68. package/scaffold/skills/ui-ux-pro-max/data/_sync_all.py +0 -414
  69. package/scaffold/skills/ui-ux-pro-max/data/app-interface.csv +0 -31
  70. package/scaffold/skills/ui-ux-pro-max/data/charts.csv +0 -26
  71. package/scaffold/skills/ui-ux-pro-max/data/colors.csv +0 -162
  72. package/scaffold/skills/ui-ux-pro-max/data/design.csv +0 -1776
  73. package/scaffold/skills/ui-ux-pro-max/data/draft.csv +0 -1779
  74. package/scaffold/skills/ui-ux-pro-max/data/google-fonts.csv +0 -1924
  75. package/scaffold/skills/ui-ux-pro-max/data/icons.csv +0 -106
  76. package/scaffold/skills/ui-ux-pro-max/data/landing.csv +0 -35
  77. package/scaffold/skills/ui-ux-pro-max/data/products.csv +0 -162
  78. package/scaffold/skills/ui-ux-pro-max/data/react-performance.csv +0 -45
  79. package/scaffold/skills/ui-ux-pro-max/data/stacks/angular.csv +0 -51
  80. package/scaffold/skills/ui-ux-pro-max/data/stacks/astro.csv +0 -54
  81. package/scaffold/skills/ui-ux-pro-max/data/stacks/flutter.csv +0 -53
  82. package/scaffold/skills/ui-ux-pro-max/data/stacks/html-tailwind.csv +0 -56
  83. package/scaffold/skills/ui-ux-pro-max/data/stacks/jetpack-compose.csv +0 -53
  84. package/scaffold/skills/ui-ux-pro-max/data/stacks/laravel.csv +0 -51
  85. package/scaffold/skills/ui-ux-pro-max/data/stacks/nextjs.csv +0 -53
  86. package/scaffold/skills/ui-ux-pro-max/data/stacks/nuxt-ui.csv +0 -51
  87. package/scaffold/skills/ui-ux-pro-max/data/stacks/nuxtjs.csv +0 -59
  88. package/scaffold/skills/ui-ux-pro-max/data/stacks/react-native.csv +0 -52
  89. package/scaffold/skills/ui-ux-pro-max/data/stacks/react.csv +0 -54
  90. package/scaffold/skills/ui-ux-pro-max/data/stacks/shadcn.csv +0 -61
  91. package/scaffold/skills/ui-ux-pro-max/data/stacks/svelte.csv +0 -54
  92. package/scaffold/skills/ui-ux-pro-max/data/stacks/swiftui.csv +0 -51
  93. package/scaffold/skills/ui-ux-pro-max/data/stacks/threejs.csv +0 -54
  94. package/scaffold/skills/ui-ux-pro-max/data/stacks/vue.csv +0 -50
  95. package/scaffold/skills/ui-ux-pro-max/data/styles.csv +0 -85
  96. package/scaffold/skills/ui-ux-pro-max/data/typography.csv +0 -74
  97. package/scaffold/skills/ui-ux-pro-max/data/ui-reasoning.csv +0 -162
  98. package/scaffold/skills/ui-ux-pro-max/data/ux-guidelines.csv +0 -100
  99. package/scaffold/skills/ui-ux-pro-max/scripts/core.py +0 -262
  100. package/scaffold/skills/ui-ux-pro-max/scripts/design_system.py +0 -1148
  101. package/scaffold/skills/ui-ux-pro-max/scripts/search.py +0 -114
  102. package/scaffold/skills/ui-ux-pro-max/skills/brand/SKILL.md +0 -97
  103. package/scaffold/skills/ui-ux-pro-max/skills/design/SKILL.md +0 -302
  104. package/scaffold/skills/ui-ux-pro-max/skills/design-system/SKILL.md +0 -244
  105. package/scaffold/skills/ui-ux-pro-max/templates/base/quick-reference.md +0 -297
  106. package/scaffold/skills/ui-ux-pro-max/templates/base/skill-content.md +0 -358
  107. package/scaffold/skills/ui-ux-pro-max/templates/platforms/agent.json +0 -21
  108. package/scaffold/skills/ui-ux-pro-max/templates/platforms/augment.json +0 -18
  109. package/scaffold/skills/ui-ux-pro-max/templates/platforms/claude.json +0 -21
  110. package/scaffold/skills/ui-ux-pro-max/templates/platforms/codebuddy.json +0 -21
  111. package/scaffold/skills/ui-ux-pro-max/templates/platforms/codex.json +0 -21
  112. package/scaffold/skills/ui-ux-pro-max/templates/platforms/continue.json +0 -21
  113. package/scaffold/skills/ui-ux-pro-max/templates/platforms/copilot.json +0 -21
  114. package/scaffold/skills/ui-ux-pro-max/templates/platforms/cursor.json +0 -21
  115. package/scaffold/skills/ui-ux-pro-max/templates/platforms/droid.json +0 -21
  116. package/scaffold/skills/ui-ux-pro-max/templates/platforms/gemini.json +0 -21
  117. package/scaffold/skills/ui-ux-pro-max/templates/platforms/kilocode.json +0 -21
  118. package/scaffold/skills/ui-ux-pro-max/templates/platforms/kiro.json +0 -21
  119. package/scaffold/skills/ui-ux-pro-max/templates/platforms/opencode.json +0 -21
  120. package/scaffold/skills/ui-ux-pro-max/templates/platforms/qoder.json +0 -21
  121. package/scaffold/skills/ui-ux-pro-max/templates/platforms/roocode.json +0 -21
  122. package/scaffold/skills/ui-ux-pro-max/templates/platforms/trae.json +0 -21
  123. package/scaffold/skills/ui-ux-pro-max/templates/platforms/warp.json +0 -18
  124. package/scaffold/skills/ui-ux-pro-max/templates/platforms/windsurf.json +0 -21
  125. package/scaffold/skills/webapp-testing/SKILL.md +0 -138
  126. package/scaffold/skills/webapp-testing/assets/test-helper.js +0 -56
  127. package/scaffold/skills/{webapp-testing → dw-testing-discipline}/references/three-workflow-patterns.md +0 -0
package/scaffold/skills/dw-incident-response/references/runbook-templates.md
@@ -0,0 +1,169 @@
+ # Runbook templates + on-call handoff
+
+ Two contexts where this reference applies:
+ 1. **Generating a runbook** (entry-mode 3 in the skill — no live incident, just producing operational docs).
+ 2. **On-call handoff** at the end of a shift.
+
+ ## Runbook: service outage template
+
+ ```markdown
+ ## Runbook — <Service Name> outage
+
+ ### Quick diagnosis (< 5 min)
+
+ 1. Health check: `curl -sf https://<service>/health | jq .`
+ 2. Pod status (K8s): `kubectl get pods -l app=<service>`
+ 3. Recent deploys: `kubectl rollout history deployment/<service>`
+ 4. Recent logs: `kubectl logs -l app=<service> --tail=50 --since=5m`
+ 5. Metrics dashboard: <Grafana / Datadog link>
+
+ ### Common failure modes
+
+ | Symptom | Likely cause | Fix |
+ |---------|--------------|-----|
+ | OOMKilled | Memory leak or under-sized pod | Scale up replicas; increase memory limit; investigate leak |
+ | CrashLoopBackOff | Config error, missing env var | Check logs for startup errors; verify ConfigMap/Secret |
+ | ImagePullBackOff | Bad image tag or registry auth | Verify tag exists; check registry credentials |
+ | 503s from healthy pods | Downstream dependency down | Check the dependency (DB, cache, queue) |
+ | Slow responses (high p99) | DB connection saturation; N+1 query; cold cache | Check DB connections; tail slow query log; check cache hit rate |
+ | Latency spike after deploy | New code path slower than expected | Roll back; profile the new code path |
+
+ ### Rollback procedure
+
+ ```bash
+ # K8s
+ kubectl rollout undo deployment/<service>
+
+ # Confirm
+ kubectl rollout status deployment/<service>
+ ```
+
+ ### Escalation chain
+
+ - **L1 (on-call engineer):** <name / PagerDuty schedule>
+ - **L2 (team lead):** <name>
+ - **L3 (infrastructure / platform team):** <team channel>
+ - **L4 (CTO / VP Eng):** <name> (SEV-1 only)
+
+ ### Known dependencies
+
+ - Database: <DB name + version>
+ - Cache: <Redis / Memcached>
+ - Queue: <SQS / RabbitMQ / Kafka>
+ - External APIs: <list with SLAs>
+
+ If a dependency is down, the service may degrade gracefully — see `<degraded-mode>` section in the architecture doc.
+
+ ### Related runbooks
+
+ - `<dependency-A>` runbook
+ - `<related-service-B>` runbook
+ ```
+
+ ## Runbook: database incident template
+
+ ```markdown
+ ## Runbook — Database (<engine, version>) incident
+
+ ### Quick diagnosis (Postgres examples; adapt for MySQL/etc.)
+
+ ```sql
+ -- Active connections vs baseline
+ SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
+ -- Expected baseline: <N>; alert if > <2N>
+
+ -- Long-running queries
+ SELECT pid, now() - query_start AS duration, query
+ FROM pg_stat_activity
+ WHERE state = 'active' AND now() - query_start > interval '30 seconds'
+ ORDER BY duration DESC;
+
+ -- Lock waits
+ SELECT * FROM pg_locks WHERE NOT granted;
+
+ -- Replication lag (if applicable)
+ SELECT * FROM pg_stat_replication;
+ ```
+
+ ### Common DB failure modes
+
+ | Symptom | Likely cause | Mitigation |
+ |---------|--------------|------------|
+ | Connection pool exhaustion | App leak; spike in traffic | Restart app pods to release connections; scale app; investigate leak |
+ | Lock contention | Long-running transaction blocking writers | Identify holder via pg_stat_activity; terminate if safe (`pg_terminate_backend`) |
+ | Disk full | Unbounded growth; WAL retention | Free space; review retention; vacuum |
+ | Replication lag | Network or replica overload | Check replica health; review primary write rate |
+ | Slow query (p99 spike) | Missing index; bad plan after stats change | `EXPLAIN ANALYZE` the slow query; check `pg_stat_user_indexes` |
+
+ ### Emergency procedures
+
+ - **Terminate a stuck query:** `SELECT pg_terminate_backend(<pid>);` — only after confirming it's safe (it won't roll back distributed transactions cleanly).
+ - **Failover to replica:** `<runbook-specific commands>`.
+ - **Restore from snapshot:** see `<DR runbook>` — only for data corruption / loss.
+
+ ### Escalation
+
+ - **DB on-call:** <name / schedule>
+ - **DBA team:** <channel>
+ - **Vendor support:** <ticket portal> (RDS, CloudSQL, etc.)
+ ```
+
+ ## On-call handoff template
+
+ End of every shift. Sent in the team's on-call channel + emailed to the incoming engineer.
+
+ ```markdown
+ ## On-call handoff — <Date>
+
+ **Outgoing:** @<name>
+ **Incoming:** @<name>
+
+ ### Active incidents
+
+ - [None] OR list with status, severity, channel link
+
+ ### Ongoing investigations (carried over)
+
+ - **<service/area>:** <one-line status, what's been tried, next step>
+ - **<issue>:** <what's blocking>
+
+ ### Recent changes (last 24h)
+
+ - **<service>:** deploy at HH:MM UTC — <commit/PR>
+ - **<config>:** change at HH:MM — <what>
+ - **<infra>:** change at HH:MM — <what>
+
+ ### Known issues (workarounds active)
+
+ - **<symptom>:** workaround: <action>. Tracking: <issue link>.
+
+ ### Upcoming events
+
+ - **<date HH:MM UTC>:** scheduled maintenance — <service> — owner <name>
+ - **<date>:** traffic spike expected (launch, marketing campaign, etc.)
+
+ ### Notes for the new shift
+
+ - [free-form: anything the incoming engineer should know that didn't fit above]
+ ```
+
+ ## Generating runbooks proactively
+
+ When the skill is invoked in entry-mode 3 (runbook generation, no live incident):
+
+ 1. Ask: "Which service / area is the runbook for?"
+ 2. Look up the service in `.dw/intel/` — pull the stack, dependencies, recent deploys.
+ 3. Use the appropriate template above (service outage / DB / on-call handoff).
+ 4. Fill in concrete commands, dashboard links, on-call info.
+ 5. Save to `.dw/runbooks/<service-name>.md`.
+ 6. Suggest adding the runbook path to the relevant `.dw/rules/<service>.md`.
+
+ Runbooks should be **executable** — every command listed must work without modification when copy-pasted by a tired engineer at 3am.
+
+ ## Maintenance cadence
+
+ - Runbooks reviewed quarterly.
+ - Updated immediately after any incident where the runbook was wrong or missing.
+ - Owner per runbook = the team that owns the service.
+
+ A stale runbook is worse than no runbook (false confidence). If you can't keep one current, delete it and rely on the incident-response workflow alone.
package/scaffold/skills/dw-incident-response/references/severity-and-triage.md
@@ -0,0 +1,186 @@
+ # Severity classification + triage commands
+
+ ## Severity criteria — extended version
+
+ ### SEV-1 (Critical)
+
+ **Trigger:** any of
+ - Production service fully down or returning >50% errors.
+ - Data loss in progress or already occurred.
+ - Security breach detected (active credential leak, RCE, unauthorized data access).
+ - Compliance violation (PCI, HIPAA, GDPR exposure).
+
+ **Response:** page on-call immediately. CEO/CTO notified within 30 min if customer-facing. War room opened.
+
+ ### SEV-2 (Major)
+
+ **Trigger:** any of
+ - Significant feature degradation affecting >25% of users.
+ - Latency 10× normal or higher.
+ - Background jobs queue backing up beyond SLA.
+ - Partial outage of a critical dependency (auth, payments, search).
+
+ **Response:** on-call investigates within 30 min. Stakeholders notified within 1 hour.
+
+ ### SEV-3 (Minor)
+
+ **Trigger:** any of
+ - Single endpoint returning errors with a workaround available.
+ - Performance regression affecting <25% of users.
+ - Non-critical feature broken.
+
+ **Response:** investigated within 4 hours. Communicated in standup/Slack channel.
+
+ ### SEV-4 (Low)
+
+ **Trigger:** any of
+ - Cosmetic issue.
+ - Non-user-facing bug.
+ - Minor UX papercut.
+
+ **Response:** logged as a normal bug. Fixed in next routine deploy.
+
+ ## Blast radius assessment
+
+ Before declaring severity, answer:
+ 1. **Which services** are affected?
+ 2. **How many users** are seeing the issue? Estimate from error rate × DAU.
+ 3. **What's the revenue impact** per hour (if customer-facing)?
+ 4. **What downstream systems** depend on the affected service?
+ 5. **Is the issue spreading** or contained?
+
+ Document each in `01-triage.md`. The numbers go into the postmortem.
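The "error rate × DAU" estimate from question 2 can be sketched in a few lines. A minimal sketch: the `feature_reach` parameter and the function name are illustrative additions, not part of the skill — they just make explicit that an error rate only hits the users who touch the broken path.

```python
def estimate_affected_users(error_rate: float, dau: int,
                            feature_reach: float = 1.0) -> int:
    """Rough blast-radius estimate: error rate x DAU.

    error_rate    -- fraction of requests failing (0.0-1.0)
    dau           -- daily active users of the service
    feature_reach -- fraction of DAU that touches the broken path
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be a fraction between 0 and 1")
    return round(error_rate * feature_reach * dau)

# e.g. 12% of requests failing on a checkout path used by 40% of 50k DAU
print(estimate_affected_users(0.12, 50_000, feature_reach=0.4))  # → 2400
```

An order-of-magnitude number is enough for severity classification; refine it later for the postmortem.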
+
+ ## Triage commands by stack
+
+ ### Kubernetes
+
+ ```bash
+ # Find non-running pods
+ kubectl get pods -A | grep -v -E 'Running|Completed'
+
+ # Pods consuming most memory
+ kubectl top pods --sort-by=memory --all-namespaces | head -20
+
+ # Recent logs from a specific service
+ kubectl logs -l app=<service> --tail=100 --since=10m
+
+ # Recent deploys
+ kubectl rollout history deployment/<service>
+
+ # Rollback the most recent deploy
+ kubectl rollout undo deployment/<service>
+ ```
+
+ ### Docker / Docker Compose
+
+ ```bash
+ # Containers that exited
+ docker ps -a --filter "status=exited" --format "table {{.Names}}\t{{.Status}}"
+
+ # Recent container logs
+ docker logs <container> --tail=100 --since=10m
+
+ # Container resource usage
+ docker stats --no-stream
+
+ # Recent compose deploys (assumes versioned compose files)
+ git log --oneline --since="24 hours ago" -- docker-compose.yml
+ ```
+
+ ### Generic (any service with HTTP health endpoint)
+
+ ```bash
+ # Health check
+ curl -sf https://<host>/health | jq .
+
+ # Compare healthy vs unhealthy timing
+ time curl -sf https://<host>/health
+ time curl -sf https://<host>/api/<critical-endpoint>
+
+ # Recent application deploys (any git-tracked deployment)
+ git log --oneline --since="2 hours ago"
+
+ # Recent infra changes
+ git -C /path/to/infra log --oneline --since="24 hours ago"
+ ```
+
+ ### Database (Postgres)
+
+ ```sql
+ -- Active connections (compare against normal baseline)
+ SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
+
+ -- Long-running queries (>30s)
+ SELECT pid, now() - query_start AS duration, query, state
+ FROM pg_stat_activity
+ WHERE state = 'active' AND now() - query_start > interval '30 seconds'
+ ORDER BY duration DESC;
+
+ -- Lock contention
+ SELECT blocked_locks.pid AS blocked_pid,
+        blocking_locks.pid AS blocking_pid,
+        blocked_activity.query AS blocked_query,
+        blocking_activity.query AS blocking_query
+ FROM pg_catalog.pg_locks blocked_locks
+ JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
+ JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
+   AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
+   AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
+   AND blocking_locks.pid != blocked_locks.pid
+ JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
+ WHERE NOT blocked_locks.granted;
+
+ -- Terminate a stuck query (last resort; investigate first)
+ SELECT pg_terminate_backend(<pid>);
+ ```
+
+ ### Generic application metrics
+
+ If your APM provides these (Datadog, New Relic, Sentry):
+ - Error rate per endpoint over the last hour vs the prior week's baseline.
+ - p50/p95/p99 latency per endpoint.
+ - Throughput (requests per second).
+ - Saturation (queue depth, connection pool usage).
+
+ The Google SRE "Four Golden Signals" (latency, errors, traffic, saturation) cover most cases.
+
+ ## Immediate mitigation patterns
+
+ Before debugging root cause, can you reduce blast radius now? Try in order:
+
+ 1. **Rollback the most recent deploy** if timing matches.
+ 2. **Toggle feature flag** for the affected feature.
+ 3. **Redirect traffic** away from the unhealthy instance / region.
+ 4. **Rate-limit** the abusive client or queue depth.
+ 5. **Scale up** if it's resource starvation (CPU, memory, connection pool).
+
+ If none apply, escalate to investigation phase. Don't burn time forcing mitigation that doesn't fit.
+
+ ## What to record in `01-triage.md`
+
+ ```markdown
+ # Triage — <incident title>
+
+ ## Detected
+ - **Time:** YYYY-MM-DD HH:MM UTC
+ - **Detected by:** alert / user report / monitoring dashboard / customer support
+ - **Severity:** SEV-X
+
+ ## Symptoms
+ - [observable behavior]
+ - [error rates, latency, etc.]
+
+ ## Blast radius
+ - Services: [list]
+ - Users affected: [estimate]
+ - Revenue impact: [if known]
+ - Downstream impact: [systems that depend on this]
+
+ ## Initial mitigation
+ - [what was tried]
+ - [what worked / didn't]
+
+ ## Next steps
+ Proceeding to Phase 2 (investigation) after user confirmation.
+ ```
package/scaffold/skills/dw-llm-eval/SKILL.md
@@ -0,0 +1,148 @@
+ ---
+ name: dw-llm-eval
+ description: Use when authoring or reviewing AI/LLM features (chat, RAG, summarization, classifiers, agents) — enforces an oracle ladder (climb from exact match up to LLM-as-judge), reference-dataset discipline, judge calibration (Spearman ≥0.80), and trajectory-vs-outcome agent eval so AI features ship with measurable behavior instead of "looks good to me" QA.
+ ---
+
+ # LLM Evaluation
+
+ > Adapted patterns from [`langchain-ai/agentevals`](https://github.com/langchain-ai/agentevals) (MIT) for trajectory-match modes, plus general LLM-eval discipline from OpenAI evals cookbook, Anthropic's evals guidance, and the broader open evaluations literature. Material rewritten in our voice.
+
+ ## When this skill applies
+
+ - Any feature that uses an LLM in production: chat, summarization, classification, RAG (retrieval-augmented generation), agents, tool-use, structured extraction, code generation.
+ - `/dw-create-tasks` when the PRD mentions an AI feature — eval planning becomes a mandatory subtask.
+ - `/dw-code-review` when the diff touches AI feature code paths.
+ - `/dw-run-qa --ai` when validating an AI feature against its reference dataset.
+
+ If the feature is fully deterministic (no LLM in the loop), use `dw-testing-discipline` instead — Iron rules and 25 anti-patterns. This skill is specifically for entropy-tolerant systems.
+
+ ## First principle
+
+ > Tests for deterministic code assert exact outputs.
+ > Tests for LLM features assert behaviors within tolerance.
+ > The discipline is choosing the right tolerance — and proving it's not "anything passes."
+
+ ## The oracle ladder
+
+ Five rungs, climb from CHEAPEST/STRICTEST to MOST EXPENSIVE/SUBJECTIVE. Always start at the bottom; only climb when the lower rung can't cover the case.
+
+ | Rung | What it checks | Cost | When to use |
+ |------|----------------|------|-------------|
+ | 1. **Exact match** | `output === expected` | ~free | Structured outputs (function calls, JSON with stable shape, classifications) |
+ | 2. **Schema validation** | Output matches JSON schema / type contract | ~free | Output shape matters; specific values vary |
+ | 3. **Outcome state** | Side effect produced the expected change (DB row, file written, tool called) | cheap | Agents, tool-use, RAG with concrete answers |
+ | 4. **LLM-as-judge** | A different model grades the output against a rubric | medium ($$$) | Subjective quality (helpfulness, tone, faithfulness) where no rule can decide |
+ | 5. **Human review** | Domain expert scores | expensive | Calibration of rung 4; high-stakes outputs; edge cases |
+
+ **Rule:** never reach for rung 4 before checking if rungs 1-3 can cover the case. Every rung up costs an order of magnitude more (latency, money, calibration effort) — and adds entropy.
+
+ See `references/oracle-ladder.md` for examples per rung and the climbing decision tree.
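The two cheapest rungs need almost no machinery. A minimal sketch of rungs 1 and 2 — the helper names are illustrative, and the schema check is done by hand with stdlib `json` rather than a schema library, to keep the example dependency-free:

```python
import json

def rung1_exact(output: str, expected: str) -> bool:
    """Rung 1: byte-for-byte comparison for structured outputs."""
    return output == expected

def rung2_schema(output: str, required: dict[str, type]) -> bool:
    """Rung 2: shape check only -- keys and types; values may vary."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(isinstance(data.get(k), t) for k, t in required.items())

# Classification output: exact match is enough
assert rung1_exact('{"label": "spam"}', '{"label": "spam"}')

# Extraction output: the shape is the contract, the values vary per input
assert rung2_schema('{"name": "Ada", "age": 36}', {"name": str, "age": int})
assert not rung2_schema('{"name": "Ada"}', {"name": str, "age": int})
```

If these two rungs plus an outcome-state check cover the feature, stop climbing.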
+
+ ## LLM-as-judge discipline (when rung 4 is needed)
+
+ Without calibration, LLM-as-judge produces noise dressed as signal. Three non-negotiables:
+
+ 1. **Calibrate against humans** — ≥20 human-graded cases, compute Spearman correlation against LLM-as-judge. Target ≥0.80. Below that, reject the judge configuration.
+ 2. **Use a different model than the system under test** — same model judging itself produces false positives. Pair: GPT-4 generates → Claude judges. Or vice versa.
+ 3. **Rubric, not free-form** — provide the judge a structured rubric (criteria + scale + examples) instead of "rate quality 1-10."
+
+ See `references/judge-calibration.md` for the full calibration recipe, rubric templates, and the "judge drift" monitoring pattern.
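The Spearman check from non-negotiable 1 is small enough to implement with the stdlib (Spearman's rho is just Pearson correlation on the rank vectors). A sketch with a toy 8-case sample — the skill itself requires ≥20 human-graded cases, and the scores here are made up for illustration:

```python
from statistics import mean

def _ranks(xs):
    """Average 1-based ranks; ties share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(human, judge):
    """Spearman rho = Pearson correlation of the rank vectors.
    Assumes scores are not all identical (variance would be zero)."""
    rh, rj = _ranks(human), _ranks(judge)
    mh, mj = mean(rh), mean(rj)
    cov = sum((a - mh) * (b - mj) for a, b in zip(rh, rj))
    sh = sum((a - mh) ** 2 for a in rh) ** 0.5
    sj = sum((b - mj) ** 2 for b in rj) ** 0.5
    return cov / (sh * sj)

human = [5, 4, 4, 2, 1, 3, 5, 2]   # expert scores on the same cases
judge = [5, 5, 4, 2, 1, 3, 4, 2]   # LLM-as-judge scores
rho = spearman(human, judge)
print(f"{rho:.2f}", "ACCEPT" if rho >= 0.80 else "REJECT judge config")  # → 0.90 ACCEPT
```

In practice `scipy.stats.spearmanr` does the same computation; the point is that the gate is one number, checked before the judge's scores are trusted.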
+
+ ## Reference dataset principle
+
+ > 20 unambiguous cases drawn from real production failures beat 200 synthetic perfect cases.
+
+ The dataset is the bedrock. Without a reference set, every "improvement" is anecdote.
+
+ Structure:
+ ```
+ .dw/eval/datasets/<feature-name>/
+ ├── cases.jsonl # input + expected (or rubric reference) per line
+ ├── README.md # provenance, sample size, when last reviewed
+ └── runs/<YYYY-MM-DD>.jsonl # results of each eval run
+ ```
+
+ See `references/reference-dataset.md` for case-design principles, sampling from production, and when to expand the set.
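What a `cases.jsonl` line and its loader might look like. The field names (`id`, `input`, `expected`, `source`) are an assumption — the skill only mandates "input + expected (or rubric reference) per line" — but a `source` field makes the production-vs-synthetic anti-pattern gate below checkable:

```python
import io
import json

# Hypothetical cases.jsonl contents (would be read from the dataset dir)
cases_jsonl = io.StringIO(
    '{"id": "prod-311", "input": "refund order #88", "expected": "refund_tool", "source": "production"}\n'
    '{"id": "syn-002", "input": "hi", "expected": "smalltalk", "source": "synthetic"}\n'
)

cases = [json.loads(line) for line in cases_jsonl if line.strip()]

# Enforce the review gate: at least 20% of cases from real user inputs
prod_share = sum(c["source"] == "production" for c in cases) / len(cases)
assert prod_share >= 0.20, "synthetic-only dataset — would be REJECTED in review"
print(f"{len(cases)} cases, {prod_share:.0%} from production")  # → 2 cases, 50% from production
```

Each eval run appends one result line per case to `runs/<YYYY-MM-DD>.jsonl`, so regressions are diffable across dates.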
+
+ ## RAG evaluation
+
+ Three orthogonal metrics — measure all three, not just one:
+
+ | Metric | What it measures | Tool |
+ |--------|-----------------|------|
+ | **Retrieval precision@k** | Of the top-K retrieved chunks, how many were relevant | Exact match against labeled ground-truth |
+ | **Answer faithfulness** | Does the answer cite only what the retrieved context supports? | LLM-as-judge with rubric |
+ | **Context utilization** | Did the answer USE the retrieved context, or hallucinate around it? | Heuristic + LLM-as-judge |
+
+ Precision alone misses hallucination. Faithfulness alone misses retrieval failure. Context utilization alone misses both. See `references/rag-metrics.md` for the full implementation.
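Of the three metrics, only precision@k is fully deterministic (the other two need a calibrated judge), so it is the one worth showing inline. A minimal sketch with hypothetical chunk ids:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk ids found in the labeled
    ground-truth set."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(chunk_id in relevant for chunk_id in retrieved[:k]) / k

retrieved = ["doc-4", "doc-9", "doc-1", "doc-7"]   # ranker output, best first
relevant = {"doc-4", "doc-1", "doc-2"}             # labeled ground truth
print(precision_at_k(retrieved, relevant, k=3))    # 2 of top 3 relevant → 0.6666666666666666
```

Run it per case in the reference dataset and report the mean; a drop after a retriever change is a retrieval regression even if the judge scores look stable.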
+
+ ## Agent / tool-use evaluation
+
+ Two questions distinguish good agent eval from bad:
+
+ ### Question 1: outcome or trajectory?
+
+ | Approach | What it checks | Failure mode |
+ |----------|---------------|--------------|
+ | **Outcome-only** | Did the agent achieve the goal? Was the final state correct? | Misses "ghost actions" — agent did the right thing for the wrong reasons |
+ | **Trajectory** | Did the agent take the expected sequence of steps / tool calls? | Punishes legitimate creativity — agent solved it via a different valid path |
+
+ **Recommendation:** outcome-only with side-effect assertion as default. Trajectory match for cases where the path matters (e.g., "must call `get-user` before `update-user`").
+
+ ### Question 2: which trajectory match mode?
+
+ When trajectory matching IS the right call, four modes are available:
+
+ - **Strict** — same tool calls, same order, same arguments. Use when both sequence and parameters are part of the contract.
+ - **Unordered** — same tool calls, any order. Use when concurrent calls are valid.
+ - **Subset** — actual trajectory contains a subset of reference calls. Use to enforce "don't exceed expected tool use" (frugality / cost).
+ - **Superset** — actual contains all reference calls plus possibly more. Use when specific tools are mandatory but extras are acceptable.
+
+ See `references/agent-eval.md` for examples and the decision tree.
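The four modes reduce to simple multiset comparisons over the tool-call list. A sketch comparing call names only — the real agentevals matchers also compare arguments in strict mode, which is omitted here for brevity:

```python
from collections import Counter

def trajectory_match(actual: list[str], reference: list[str], mode: str) -> bool:
    """Match an agent's tool-call trajectory against a reference one."""
    a, r = Counter(actual), Counter(reference)
    if mode == "strict":
        return actual == reference               # same calls, same order
    if mode == "unordered":
        return a == r                            # same calls, any order
    if mode == "subset":
        return all(a[k] <= r[k] for k in a)      # no call beyond the reference
    if mode == "superset":
        return all(a[k] >= r[k] for k in r)      # every mandatory call present
    raise ValueError(f"unknown mode: {mode}")

ref = ["get-user", "update-user"]
assert trajectory_match(["get-user", "update-user"], ref, "strict")
assert trajectory_match(["update-user", "get-user"], ref, "unordered")
assert trajectory_match(["get-user"], ref, "subset")            # frugal agent: ok
assert trajectory_match(ref + ["send-email"], ref, "superset")  # extras allowed
assert not trajectory_match(ref + ["send-email"], ref, "subset")
```

Note that unordered mode would accept `update-user` before `get-user`, so the "must call `get-user` first" contract above needs strict mode.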
+
+ ## Required reading by context
+
+ | Doing what | Read |
+ |------------|------|
+ | Designing an eval suite for an AI feature | `references/oracle-ladder.md` (climb the ladder) |
+ | Using LLM-as-judge | `references/judge-calibration.md` (mandatory before relying on it) |
+ | Building / curating a reference dataset | `references/reference-dataset.md` |
+ | RAG-specific feature | `references/rag-metrics.md` |
+ | Agent / tool-use feature | `references/agent-eval.md` |
+
+ ## Anti-patterns (will block in `/dw-code-review`)
+
+ - **LLM-as-judge without calibration evidence.** PR adds LLM-as-judge but the calibration Spearman score is missing or < 0.80. REJECTED.
+ - **Same-model judge.** Judge model is the same as the system under test. REJECTED unless explicitly documented (and even then, results are suspect).
+ - **Single-rung eval.** Feature ships with only LLM-as-judge; no rung 1-3 grounding. REJECTED — the cheap rungs catch the loud failures.
+ - **Synthetic-only dataset.** No traceable production-failure source for any case. REJECTED — confirm at least 20% of cases come from real user inputs.
+ - **"Looks good to me" QA.** No reference dataset, no metric, no rubric — just sampling output and calling it good. REJECTED.
+ - **Coverage as metric.** Quoting "we tested 50 prompts" without saying what was measured. The number is meaningless without the metric.
+
+ ## Integration with dev-workflow commands
+
+ - `/dw-create-tasks`: when the PRD has an AI feature requirement, an eval-plan subtask is mandatory. The task references this skill's oracle ladder.
+ - `/dw-code-review`: AI feature PRs require a reference dataset + ≥2 oracle rungs (lower rungs FIRST). The constitution gate also applies — if the project has principles about AI feature reliability, they're enforced here.
+ - `/dw-run-qa --ai`: new mode (when this skill is bundled) — runs the reference dataset against the current implementation, logs to `QA/logs/ai/<feature>-<date>.jsonl`, computes precision@k / faithfulness / outcome accuracy per the feature type.
+ - `/dw-bugfix` when the bug is an AI failure mode (hallucination, tool misuse, classification error): adds the failing case to the reference dataset BEFORE fixing — the case is now a regression test forever.
+
+ ## When the discipline bends
+
+ - **Prototype / spike phase**: skip calibration; document as "spike — eval added before merge to main."
+ - **Internal-only AI feature with low blast radius** (e.g., classifier for internal CRM tags): rung 1-3 only is fine; LLM-as-judge may be overkill.
+ - **Real-time features where eval can't run synchronously**: shadow-eval pattern — run the eval async on a sample of production traffic; alert on regression.
+
+ In all bend cases, document the deviation in the techspec / PR. "Skipped judge calibration because internal-only feature affecting <100 users" is fine; just say it.
+
+ ## Why this approach
+
+ Two failure modes drive most AI feature regressions:
+
+ 1. **No measurement** — team ships, suspects it's worse, can't prove it, debate.
+ 2. **Wrong measurement** — team measures LLM-as-judge only, judge drifts with the model, scores rise while real quality falls.
+
+ The oracle ladder fixes both: forces measurement, forces ANCHORED measurement (lower rungs are deterministic; upper rungs are calibrated against them).
+
+ ## Bottom line
+
+ > An AI feature without an eval suite is a feature you can't ship safely. An eval suite without calibration is a number you can't trust. Build the dataset from real failures, climb the ladder from cheap to expensive, calibrate the judge against humans, and re-run before every model swap. The discipline is small; the absence of it is one of the largest sources of "we shipped and don't know if it's worse" experiences in the industry.