@brunosps00/dev-workflow 0.11.0 → 0.15.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (127)
  1. package/README.md +54 -5
  2. package/lib/constants.js +20 -20
  3. package/lib/init.js +24 -1
  4. package/lib/migrate-skills.js +129 -0
  5. package/lib/removed-bundled-skills.js +16 -0
  6. package/lib/uninstall.js +6 -2
  7. package/lib/utils.js +43 -1
  8. package/package.json +1 -1
  9. package/scaffold/en/agent-instructions.md +68 -0
  10. package/scaffold/en/commands/dw-autopilot.md +1 -1
  11. package/scaffold/en/commands/dw-brainstorm.md +1 -1
  12. package/scaffold/en/commands/dw-bugfix.md +4 -3
  13. package/scaffold/en/commands/dw-code-review.md +1 -0
  14. package/scaffold/en/commands/dw-create-tasks.md +6 -0
  15. package/scaffold/en/commands/dw-create-techspec.md +1 -1
  16. package/scaffold/en/commands/dw-deps-audit.md +1 -1
  17. package/scaffold/en/commands/dw-fix-qa.md +1 -1
  18. package/scaffold/en/commands/dw-functional-doc.md +2 -2
  19. package/scaffold/en/commands/dw-help.md +2 -2
  20. package/scaffold/en/commands/dw-redesign-ui.md +7 -7
  21. package/scaffold/en/commands/dw-run-qa.md +5 -4
  22. package/scaffold/en/commands/dw-run-task.md +2 -2
  23. package/scaffold/en/templates/constitution-template.md +1 -1
  24. package/scaffold/pt-br/agent-instructions.md +68 -0
  25. package/scaffold/pt-br/commands/dw-autopilot.md +1 -1
  26. package/scaffold/pt-br/commands/dw-brainstorm.md +1 -1
  27. package/scaffold/pt-br/commands/dw-bugfix.md +4 -3
  28. package/scaffold/pt-br/commands/dw-code-review.md +1 -0
  29. package/scaffold/pt-br/commands/dw-create-tasks.md +6 -0
  30. package/scaffold/pt-br/commands/dw-create-techspec.md +1 -1
  31. package/scaffold/pt-br/commands/dw-deps-audit.md +1 -1
  32. package/scaffold/pt-br/commands/dw-fix-qa.md +1 -1
  33. package/scaffold/pt-br/commands/dw-functional-doc.md +2 -2
  34. package/scaffold/pt-br/commands/dw-help.md +2 -2
  35. package/scaffold/pt-br/commands/dw-redesign-ui.md +7 -7
  36. package/scaffold/pt-br/commands/dw-run-qa.md +5 -4
  37. package/scaffold/pt-br/commands/dw-run-task.md +2 -2
  38. package/scaffold/pt-br/templates/constitution-template.md +1 -1
  39. package/scaffold/skills/dw-council/SKILL.md +1 -1
  40. package/scaffold/skills/dw-incident-response/SKILL.md +164 -0
  41. package/scaffold/skills/dw-incident-response/references/blameless-discipline.md +126 -0
  42. package/scaffold/skills/dw-incident-response/references/communication-templates.md +107 -0
  43. package/scaffold/skills/dw-incident-response/references/postmortem-template.md +133 -0
  44. package/scaffold/skills/dw-incident-response/references/runbook-templates.md +169 -0
  45. package/scaffold/skills/dw-incident-response/references/severity-and-triage.md +186 -0
  46. package/scaffold/skills/dw-llm-eval/SKILL.md +148 -0
  47. package/scaffold/skills/dw-llm-eval/references/agent-eval.md +252 -0
  48. package/scaffold/skills/dw-llm-eval/references/judge-calibration.md +169 -0
  49. package/scaffold/skills/dw-llm-eval/references/oracle-ladder.md +171 -0
  50. package/scaffold/skills/dw-llm-eval/references/rag-metrics.md +186 -0
  51. package/scaffold/skills/dw-llm-eval/references/reference-dataset.md +190 -0
  52. package/scaffold/skills/dw-testing-discipline/SKILL.md +171 -0
  53. package/scaffold/skills/dw-testing-discipline/references/agent-guardrails.md +170 -0
  54. package/scaffold/skills/dw-testing-discipline/references/anti-patterns.md +336 -0
  55. package/scaffold/skills/dw-testing-discipline/references/core-rules.md +128 -0
  56. package/scaffold/skills/dw-testing-discipline/references/flaky-discipline.md +163 -0
  57. package/scaffold/skills/dw-testing-discipline/references/patterns.md +241 -0
  58. package/scaffold/skills/dw-testing-discipline/references/playwright-recipes.md +282 -0
  59. package/scaffold/skills/{webapp-testing → dw-testing-discipline}/references/security-boundary.md +1 -1
  60. package/scaffold/skills/dw-ui-discipline/SKILL.md +150 -0
  61. package/scaffold/skills/dw-ui-discipline/references/accessibility-floor.md +225 -0
  62. package/scaffold/skills/dw-ui-discipline/references/curated-defaults.md +195 -0
  63. package/scaffold/skills/dw-ui-discipline/references/hard-gate.md +162 -0
  64. package/scaffold/skills/dw-ui-discipline/references/state-matrix.md +101 -0
  65. package/scaffold/skills/dw-ui-discipline/references/visual-slop.md +152 -0
  66. package/scaffold/skills/ui-ux-pro-max/LICENSE +0 -21
  67. package/scaffold/skills/ui-ux-pro-max/SKILL.md +0 -659
  68. package/scaffold/skills/ui-ux-pro-max/data/_sync_all.py +0 -414
  69. package/scaffold/skills/ui-ux-pro-max/data/app-interface.csv +0 -31
  70. package/scaffold/skills/ui-ux-pro-max/data/charts.csv +0 -26
  71. package/scaffold/skills/ui-ux-pro-max/data/colors.csv +0 -162
  72. package/scaffold/skills/ui-ux-pro-max/data/design.csv +0 -1776
  73. package/scaffold/skills/ui-ux-pro-max/data/draft.csv +0 -1779
  74. package/scaffold/skills/ui-ux-pro-max/data/google-fonts.csv +0 -1924
  75. package/scaffold/skills/ui-ux-pro-max/data/icons.csv +0 -106
  76. package/scaffold/skills/ui-ux-pro-max/data/landing.csv +0 -35
  77. package/scaffold/skills/ui-ux-pro-max/data/products.csv +0 -162
  78. package/scaffold/skills/ui-ux-pro-max/data/react-performance.csv +0 -45
  79. package/scaffold/skills/ui-ux-pro-max/data/stacks/angular.csv +0 -51
  80. package/scaffold/skills/ui-ux-pro-max/data/stacks/astro.csv +0 -54
  81. package/scaffold/skills/ui-ux-pro-max/data/stacks/flutter.csv +0 -53
  82. package/scaffold/skills/ui-ux-pro-max/data/stacks/html-tailwind.csv +0 -56
  83. package/scaffold/skills/ui-ux-pro-max/data/stacks/jetpack-compose.csv +0 -53
  84. package/scaffold/skills/ui-ux-pro-max/data/stacks/laravel.csv +0 -51
  85. package/scaffold/skills/ui-ux-pro-max/data/stacks/nextjs.csv +0 -53
  86. package/scaffold/skills/ui-ux-pro-max/data/stacks/nuxt-ui.csv +0 -51
  87. package/scaffold/skills/ui-ux-pro-max/data/stacks/nuxtjs.csv +0 -59
  88. package/scaffold/skills/ui-ux-pro-max/data/stacks/react-native.csv +0 -52
  89. package/scaffold/skills/ui-ux-pro-max/data/stacks/react.csv +0 -54
  90. package/scaffold/skills/ui-ux-pro-max/data/stacks/shadcn.csv +0 -61
  91. package/scaffold/skills/ui-ux-pro-max/data/stacks/svelte.csv +0 -54
  92. package/scaffold/skills/ui-ux-pro-max/data/stacks/swiftui.csv +0 -51
  93. package/scaffold/skills/ui-ux-pro-max/data/stacks/threejs.csv +0 -54
  94. package/scaffold/skills/ui-ux-pro-max/data/stacks/vue.csv +0 -50
  95. package/scaffold/skills/ui-ux-pro-max/data/styles.csv +0 -85
  96. package/scaffold/skills/ui-ux-pro-max/data/typography.csv +0 -74
  97. package/scaffold/skills/ui-ux-pro-max/data/ui-reasoning.csv +0 -162
  98. package/scaffold/skills/ui-ux-pro-max/data/ux-guidelines.csv +0 -100
  99. package/scaffold/skills/ui-ux-pro-max/scripts/core.py +0 -262
  100. package/scaffold/skills/ui-ux-pro-max/scripts/design_system.py +0 -1148
  101. package/scaffold/skills/ui-ux-pro-max/scripts/search.py +0 -114
  102. package/scaffold/skills/ui-ux-pro-max/skills/brand/SKILL.md +0 -97
  103. package/scaffold/skills/ui-ux-pro-max/skills/design/SKILL.md +0 -302
  104. package/scaffold/skills/ui-ux-pro-max/skills/design-system/SKILL.md +0 -244
  105. package/scaffold/skills/ui-ux-pro-max/templates/base/quick-reference.md +0 -297
  106. package/scaffold/skills/ui-ux-pro-max/templates/base/skill-content.md +0 -358
  107. package/scaffold/skills/ui-ux-pro-max/templates/platforms/agent.json +0 -21
  108. package/scaffold/skills/ui-ux-pro-max/templates/platforms/augment.json +0 -18
  109. package/scaffold/skills/ui-ux-pro-max/templates/platforms/claude.json +0 -21
  110. package/scaffold/skills/ui-ux-pro-max/templates/platforms/codebuddy.json +0 -21
  111. package/scaffold/skills/ui-ux-pro-max/templates/platforms/codex.json +0 -21
  112. package/scaffold/skills/ui-ux-pro-max/templates/platforms/continue.json +0 -21
  113. package/scaffold/skills/ui-ux-pro-max/templates/platforms/copilot.json +0 -21
  114. package/scaffold/skills/ui-ux-pro-max/templates/platforms/cursor.json +0 -21
  115. package/scaffold/skills/ui-ux-pro-max/templates/platforms/droid.json +0 -21
  116. package/scaffold/skills/ui-ux-pro-max/templates/platforms/gemini.json +0 -21
  117. package/scaffold/skills/ui-ux-pro-max/templates/platforms/kilocode.json +0 -21
  118. package/scaffold/skills/ui-ux-pro-max/templates/platforms/kiro.json +0 -21
  119. package/scaffold/skills/ui-ux-pro-max/templates/platforms/opencode.json +0 -21
  120. package/scaffold/skills/ui-ux-pro-max/templates/platforms/qoder.json +0 -21
  121. package/scaffold/skills/ui-ux-pro-max/templates/platforms/roocode.json +0 -21
  122. package/scaffold/skills/ui-ux-pro-max/templates/platforms/trae.json +0 -21
  123. package/scaffold/skills/ui-ux-pro-max/templates/platforms/warp.json +0 -18
  124. package/scaffold/skills/ui-ux-pro-max/templates/platforms/windsurf.json +0 -21
  125. package/scaffold/skills/webapp-testing/SKILL.md +0 -138
  126. package/scaffold/skills/webapp-testing/assets/test-helper.js +0 -56
  127. package/scaffold/skills/{webapp-testing → dw-testing-discipline}/references/three-workflow-patterns.md +0 -0
package/scaffold/skills/dw-incident-response/references/runbook-templates.md
@@ -0,0 +1,169 @@
+ # Runbook templates + on-call handoff
+
+ Two contexts where this reference applies:
+ 1. **Generating a runbook** (entry-mode 3 in the skill — no live incident, just producing operational docs).
+ 2. **On-call handoff** at the end of a shift.
+
+ ## Runbook: service outage template
+
+ ```markdown
+ ## Runbook — <Service Name> outage
+
+ ### Quick diagnosis (< 5 min)
+
+ 1. Health check: `curl -sf https://<service>/health | jq .`
+ 2. Pod status (K8s): `kubectl get pods -l app=<service>`
+ 3. Recent deploys: `kubectl rollout history deployment/<service>`
+ 4. Recent logs: `kubectl logs -l app=<service> --tail=50 --since=5m`
+ 5. Metrics dashboard: <Grafana / Datadog link>
+
+ ### Common failure modes
+
+ | Symptom | Likely cause | Fix |
+ |---------|--------------|-----|
+ | OOMKilled | Memory leak or under-sized pod | Scale up replicas; increase memory limit; investigate leak |
+ | CrashLoopBackOff | Config error, missing env var | Check logs for startup errors; verify ConfigMap/Secret |
+ | ImagePullBackOff | Bad image tag or registry auth | Verify tag exists; check registry credentials |
+ | 503s from healthy pods | Downstream dependency down | Check the dependency (DB, cache, queue) |
+ | Slow responses (high p99) | DB connection saturation; N+1 query; cold cache | Check DB connections; tail slow query log; check cache hit rate |
+ | Latency spike after deploy | New code path slower than expected | Roll back; profile the new code path |
+
+ ### Rollback procedure
+
+ ```bash
+ # K8s
+ kubectl rollout undo deployment/<service>
+
+ # Confirm
+ kubectl rollout status deployment/<service>
+ ```
+
+ ### Escalation chain
+
+ - **L1 (on-call engineer):** <name / PagerDuty schedule>
+ - **L2 (team lead):** <name>
+ - **L3 (infrastructure / platform team):** <team channel>
+ - **L4 (CTO / VP Eng):** <name> (SEV-1 only)
+
+ ### Known dependencies
+
+ - Database: <DB name + version>
+ - Cache: <Redis / Memcached>
+ - Queue: <SQS / RabbitMQ / Kafka>
+ - External APIs: <list with SLAs>
+
+ If a dependency is down, the service may degrade gracefully — see `<degraded-mode>` section in the architecture doc.
+
+ ### Related runbooks
+
+ - `<dependency-A>` runbook
+ - `<related-service-B>` runbook
+ ```
+
+ ## Runbook: database incident template
+
+ ```markdown
+ ## Runbook — Database (<engine, version>) incident
+
+ ### Quick diagnosis (Postgres examples; adapt for MySQL/etc.)
+
+ ```sql
+ -- Active connections vs baseline
+ SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
+ -- Expected baseline: <N>; alert if > <2N>
+
+ -- Long-running queries
+ SELECT pid, now() - query_start AS duration, query
+ FROM pg_stat_activity
+ WHERE state = 'active' AND now() - query_start > interval '30 seconds'
+ ORDER BY duration DESC;
+
+ -- Lock waits
+ SELECT * FROM pg_locks WHERE NOT granted;
+
+ -- Replication lag (if applicable)
+ SELECT * FROM pg_stat_replication;
+ ```
+
+ ### Common DB failure modes
+
+ | Symptom | Likely cause | Mitigation |
+ |---------|--------------|------------|
+ | Connection pool exhaustion | App leak; spike in traffic | Restart app pods to release connections; scale app; investigate leak |
+ | Lock contention | Long-running transaction blocking writers | Identify holder via pg_stat_activity; terminate if safe (`pg_terminate_backend`) |
+ | Disk full | Unbounded growth; WAL retention | Free space; review retention; vacuum |
+ | Replication lag | Network or replica overload | Check replica health; review primary write rate |
+ | Slow query (p99 spike) | Missing index; bad plan after stats change | `EXPLAIN ANALYZE` the slow query; check `pg_stat_user_indexes` |
+
+ ### Emergency procedures
+
+ - **Terminate a stuck query:** `SELECT pg_terminate_backend(<pid>);` — only after confirming it's safe (it won't roll back distributed transactions cleanly).
+ - **Failover to replica:** `<runbook-specific commands>`.
+ - **Restore from snapshot:** see `<DR runbook>` — only for data corruption / loss.
+
+ ### Escalation
+
+ - **DB on-call:** <name / schedule>
+ - **DBA team:** <channel>
+ - **Vendor support:** <ticket portal> (RDS, CloudSQL, etc.)
+ ```
+
+ ## On-call handoff template
+
+ End of every shift. Sent in the team's on-call channel + emailed to the incoming engineer.
+
+ ```markdown
+ ## On-call handoff — <Date>
+
+ **Outgoing:** @<name>
+ **Incoming:** @<name>
+
+ ### Active incidents
+
+ - [None] OR list with status, severity, channel link
+
+ ### Ongoing investigations (carried over)
+
+ - **<service/area>:** <one-line status, what's been tried, next step>
+ - **<issue>:** <what's blocking>
+
+ ### Recent changes (last 24h)
+
+ - **<service>:** deploy at HH:MM UTC — <commit/PR>
+ - **<config>:** change at HH:MM — <what>
+ - **<infra>:** change at HH:MM — <what>
+
+ ### Known issues (workarounds active)
+
+ - **<symptom>:** workaround: <action>. Tracking: <issue link>.
+
+ ### Upcoming events
+
+ - **<date HH:MM UTC>:** scheduled maintenance — <service> — owner <name>
+ - **<date>:** traffic spike expected (launch, marketing campaign, etc.)
+
+ ### Notes for the new shift
+
+ - [free-form: anything the incoming engineer should know that didn't fit above]
+ ```
+
+ ## Generating runbooks proactively
+
+ When the skill is invoked in entry-mode 3 (runbook generation, no live incident):
+
+ 1. Ask: "Which service / area is the runbook for?"
+ 2. Look up the service in `.dw/intel/` — pull the stack, dependencies, recent deploys.
+ 3. Use the appropriate template above (service outage / DB / on-call handoff).
+ 4. Fill in concrete commands, dashboard links, on-call info.
+ 5. Save to `.dw/runbooks/<service-name>.md`.
+ 6. Suggest adding the runbook path to the relevant `.dw/rules/<service>.md`.
+
+ Runbooks should be **executable** — every command listed must work without modification when copy-pasted by a tired engineer at 3am.
+
+ ## Maintenance cadence
+
+ - Runbooks reviewed quarterly.
+ - Updated immediately after any incident where the runbook was wrong or missing.
+ - Owner per runbook = the team that owns the service.
+
+ A stale runbook is worse than no runbook (false confidence). If you can't keep one current, delete it and rely on the incident-response workflow alone.
package/scaffold/skills/dw-incident-response/references/severity-and-triage.md
@@ -0,0 +1,186 @@
+ # Severity classification + triage commands
+
+ ## Severity criteria — extended version
+
+ ### SEV-1 (Critical)
+
+ **Trigger:** any of
+ - Production service fully down or returning >50% errors.
+ - Data loss in progress or already occurred.
+ - Security breach detected (active credential leak, RCE, unauthorized data access).
+ - Compliance violation (PCI, HIPAA, GDPR exposure).
+
+ **Response:** page on-call immediately. CEO/CTO notified within 30 min if customer-facing. War room opened.
+
+ ### SEV-2 (Major)
+
+ **Trigger:** any of
+ - Significant feature degradation affecting >25% of users.
+ - Latency 10× normal or higher.
+ - Background jobs queue backing up beyond SLA.
+ - Partial outage of a critical dependency (auth, payments, search).
+
+ **Response:** on-call investigates within 30 min. Stakeholders notified within 1 hour.
+
+ ### SEV-3 (Minor)
+
+ **Trigger:** any of
+ - Single endpoint returning errors with a workaround available.
+ - Performance regression affecting <25% of users.
+ - Non-critical feature broken.
+
+ **Response:** investigated within 4 hours. Communicated in standup/Slack channel.
+
+ ### SEV-4 (Low)
+
+ **Trigger:** any of
+ - Cosmetic issue.
+ - Non-user-facing bug.
+ - Minor UX papercut.
+
+ **Response:** logged as a normal bug. Fixed in next routine deploy.
+
+ ## Blast radius assessment
+
+ Before declaring severity, answer:
+ 1. **Which services** are affected?
+ 2. **How many users** are seeing the issue? Estimate from error rate × DAU.
+ 3. **What's the revenue impact** per hour (if customer-facing)?
+ 4. **What downstream systems** depend on the affected service?
+ 5. **Is the issue spreading** or contained?
+
+ Document each in `01-triage.md`. The numbers go into the postmortem.
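The "error rate × DAU" estimate from question 2 can be sketched in a few lines. A minimal sketch: the `feature_reach` parameter and the function name are illustrative additions, not part of the skill — they just make explicit that an error rate only hits the users who touch the broken path.

```python
def estimate_affected_users(error_rate: float, dau: int,
                            feature_reach: float = 1.0) -> int:
    """Rough blast-radius estimate: error rate x DAU.

    error_rate    -- fraction of requests failing (0.0-1.0)
    dau           -- daily active users of the service
    feature_reach -- fraction of DAU that touches the broken path
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be a fraction between 0 and 1")
    return round(error_rate * feature_reach * dau)

# e.g. 12% of requests failing on a checkout path used by 40% of 50k DAU
print(estimate_affected_users(0.12, 50_000, feature_reach=0.4))  # → 2400
```

An order-of-magnitude number is enough for severity classification; refine it later for the postmortem.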
+
+ ## Triage commands by stack
+
+ ### Kubernetes
+
+ ```bash
+ # Find non-running pods
+ kubectl get pods -A | grep -v -E 'Running|Completed'
+
+ # Pods consuming most memory
+ kubectl top pods --sort-by=memory --all-namespaces | head -20
+
+ # Recent logs from a specific service
+ kubectl logs -l app=<service> --tail=100 --since=10m
+
+ # Recent deploys
+ kubectl rollout history deployment/<service>
+
+ # Rollback the most recent deploy
+ kubectl rollout undo deployment/<service>
+ ```
+
+ ### Docker / Docker Compose
+
+ ```bash
+ # Containers that exited
+ docker ps -a --filter "status=exited" --format "table {{.Names}}\t{{.Status}}"
+
+ # Recent container logs
+ docker logs <container> --tail=100 --since=10m
+
+ # Container resource usage
+ docker stats --no-stream
+
+ # Recent compose deploys (assumes versioned compose files)
+ git log --oneline --since="24 hours ago" -- docker-compose.yml
+ ```
+
+ ### Generic (any service with HTTP health endpoint)
+
+ ```bash
+ # Health check
+ curl -sf https://<host>/health | jq .
+
+ # Compare healthy vs unhealthy timing
+ time curl -sf https://<host>/health
+ time curl -sf https://<host>/api/<critical-endpoint>
+
+ # Recent application deploys (any git-tracked deployment)
+ git log --oneline --since="2 hours ago"
+
+ # Recent infra changes
+ git -C /path/to/infra log --oneline --since="24 hours ago"
+ ```
+
+ ### Database (Postgres)
+
+ ```sql
+ -- Active connections (compare against normal baseline)
+ SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
+
+ -- Long-running queries (>30s)
+ SELECT pid, now() - query_start AS duration, query, state
+ FROM pg_stat_activity
+ WHERE state = 'active' AND now() - query_start > interval '30 seconds'
+ ORDER BY duration DESC;
+
+ -- Lock contention
+ SELECT blocked_locks.pid AS blocked_pid,
+        blocking_locks.pid AS blocking_pid,
+        blocked_activity.query AS blocked_query,
+        blocking_activity.query AS blocking_query
+ FROM pg_catalog.pg_locks blocked_locks
+ JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
+ JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
+   AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
+   AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
+   AND blocking_locks.pid != blocked_locks.pid
+ JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
+ WHERE NOT blocked_locks.granted;
+
+ -- Terminate a stuck query (last resort; investigate first)
+ SELECT pg_terminate_backend(<pid>);
+ ```
+
+ ### Generic application metrics
+
+ If your APM provides these (Datadog, New Relic, Sentry):
+ - Error rate per endpoint over the last hour vs the prior week's baseline.
+ - p50/p95/p99 latency per endpoint.
+ - Throughput (requests per second).
+ - Saturation (queue depth, connection pool usage).
+
+ The Google SRE "Four Golden Signals" (latency, errors, traffic, saturation) cover most cases.
+
+ ## Immediate mitigation patterns
+
+ Before debugging root cause, can you reduce blast radius now? Try in order:
+
+ 1. **Rollback the most recent deploy** if timing matches.
+ 2. **Toggle feature flag** for the affected feature.
+ 3. **Redirect traffic** away from the unhealthy instance / region.
+ 4. **Rate-limit** the abusive client or queue depth.
+ 5. **Scale up** if it's resource starvation (CPU, memory, connection pool).
+
+ If none apply, escalate to investigation phase. Don't burn time forcing mitigation that doesn't fit.
+
+ ## What to record in `01-triage.md`
+
+ ```markdown
+ # Triage — <incident title>
+
+ ## Detected
+ - **Time:** YYYY-MM-DD HH:MM UTC
+ - **Detected by:** alert / user report / monitoring dashboard / customer support
+ - **Severity:** SEV-X
+
+ ## Symptoms
+ - [observable behavior]
+ - [error rates, latency, etc.]
+
+ ## Blast radius
+ - Services: [list]
+ - Users affected: [estimate]
+ - Revenue impact: [if known]
+ - Downstream impact: [systems that depend on this]
+
+ ## Initial mitigation
+ - [what was tried]
+ - [what worked / didn't]
+
+ ## Next steps
+ Proceeding to Phase 2 (investigation) after user confirmation.
+ ```
package/scaffold/skills/dw-llm-eval/SKILL.md
@@ -0,0 +1,148 @@
+ ---
+ name: dw-llm-eval
+ description: Use when authoring or reviewing AI/LLM features (chat, RAG, summarization, classifiers, agents) — enforces an oracle ladder (climb from exact match up to LLM-as-judge), reference-dataset discipline, judge calibration (Spearman ≥0.80), and trajectory-vs-outcome agent eval so AI features ship with measurable behavior instead of "looks good to me" QA.
+ ---
+
+ # LLM Evaluation
+
+ > Adapted patterns from [`langchain-ai/agentevals`](https://github.com/langchain-ai/agentevals) (MIT) for trajectory-match modes, plus general LLM-eval discipline from OpenAI evals cookbook, Anthropic's evals guidance, and the broader open evaluations literature. Material rewritten in our voice.
+
+ ## When this skill applies
+
+ - Any feature that uses an LLM in production: chat, summarization, classification, RAG (retrieval-augmented generation), agents, tool-use, structured extraction, code generation.
+ - `/dw-create-tasks` when the PRD mentions an AI feature — eval planning becomes a mandatory subtask.
+ - `/dw-code-review` when the diff touches AI feature code paths.
+ - `/dw-run-qa --ai` when validating an AI feature against its reference dataset.
+
+ If the feature is fully deterministic (no LLM in the loop), use `dw-testing-discipline` instead — Iron rules and 25 anti-patterns. This skill is specifically for entropy-tolerant systems.
+
+ ## First principle
+
+ > Tests for deterministic code assert exact outputs.
+ > Tests for LLM features assert behaviors within tolerance.
+ > The discipline is choosing the right tolerance — and proving it's not "anything passes."
+
+ ## The oracle ladder
+
+ Five rungs, climb from CHEAPEST/STRICTEST to MOST EXPENSIVE/SUBJECTIVE. Always start at the bottom; only climb when the lower rung can't cover the case.
+
+ | Rung | What it checks | Cost | When to use |
+ |------|----------------|------|-------------|
+ | 1. **Exact match** | `output === expected` | ~free | Structured outputs (function calls, JSON with stable shape, classifications) |
+ | 2. **Schema validation** | Output matches JSON schema / type contract | ~free | Output shape matters; specific values vary |
+ | 3. **Outcome state** | Side effect produced the expected change (DB row, file written, tool called) | cheap | Agents, tool-use, RAG with concrete answers |
+ | 4. **LLM-as-judge** | A different model grades the output against a rubric | medium ($$$) | Subjective quality (helpfulness, tone, faithfulness) where no rule can decide |
+ | 5. **Human review** | Domain expert scores | expensive | Calibration of rung 4; high-stakes outputs; edge cases |
+
+ **Rule:** never reach for rung 4 before checking if rungs 1-3 can cover the case. Every rung up costs an order of magnitude more (latency, money, calibration effort) — and adds entropy.
+
+ See `references/oracle-ladder.md` for examples per rung and the climbing decision tree.
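The two cheapest rungs need almost no machinery. A minimal sketch of rungs 1 and 2 — the helper names are illustrative, and the schema check is done by hand with stdlib `json` rather than a schema library, to keep the example dependency-free:

```python
import json

def rung1_exact(output: str, expected: str) -> bool:
    """Rung 1: byte-for-byte comparison for structured outputs."""
    return output == expected

def rung2_schema(output: str, required: dict[str, type]) -> bool:
    """Rung 2: shape check only -- keys and types; values may vary."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(isinstance(data.get(k), t) for k, t in required.items())

# Classification output: exact match is enough
assert rung1_exact('{"label": "spam"}', '{"label": "spam"}')

# Extraction output: the shape is the contract, the values vary per input
assert rung2_schema('{"name": "Ada", "age": 36}', {"name": str, "age": int})
assert not rung2_schema('{"name": "Ada"}', {"name": str, "age": int})
```

If these two rungs plus an outcome-state check cover the feature, stop climbing.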
+
+ ## LLM-as-judge discipline (when rung 4 is needed)
+
+ Without calibration, LLM-as-judge produces noise dressed as signal. Three non-negotiables:
+
+ 1. **Calibrate against humans** — ≥20 human-graded cases, compute Spearman correlation against LLM-as-judge. Target ≥0.80. Below that, reject the judge configuration.
+ 2. **Use a different model than the system under test** — same model judging itself produces false positives. Pair: GPT-4 generates → Claude judges. Or vice versa.
+ 3. **Rubric, not free-form** — provide the judge a structured rubric (criteria + scale + examples) instead of "rate quality 1-10."
+
+ See `references/judge-calibration.md` for the full calibration recipe, rubric templates, and the "judge drift" monitoring pattern.
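The Spearman check from non-negotiable 1 is small enough to implement with the stdlib (Spearman's rho is just Pearson correlation on the rank vectors). A sketch with a toy 8-case sample — the skill itself requires ≥20 human-graded cases, and the scores here are made up for illustration:

```python
from statistics import mean

def _ranks(xs):
    """Average 1-based ranks; ties share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(human, judge):
    """Spearman rho = Pearson correlation of the rank vectors.
    Assumes scores are not all identical (variance would be zero)."""
    rh, rj = _ranks(human), _ranks(judge)
    mh, mj = mean(rh), mean(rj)
    cov = sum((a - mh) * (b - mj) for a, b in zip(rh, rj))
    sh = sum((a - mh) ** 2 for a in rh) ** 0.5
    sj = sum((b - mj) ** 2 for b in rj) ** 0.5
    return cov / (sh * sj)

human = [5, 4, 4, 2, 1, 3, 5, 2]   # expert scores on the same cases
judge = [5, 5, 4, 2, 1, 3, 4, 2]   # LLM-as-judge scores
rho = spearman(human, judge)
print(f"{rho:.2f}", "ACCEPT" if rho >= 0.80 else "REJECT judge config")  # → 0.90 ACCEPT
```

In practice `scipy.stats.spearmanr` does the same computation; the point is that the gate is one number, checked before the judge's scores are trusted.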
+
+ ## Reference dataset principle
+
+ > 20 unambiguous cases drawn from real production failures beat 200 synthetic perfect cases.
+
+ The dataset is the bedrock. Without a reference set, every "improvement" is anecdote.
+
+ Structure:
+ ```
+ .dw/eval/datasets/<feature-name>/
+ ├── cases.jsonl # input + expected (or rubric reference) per line
+ ├── README.md # provenance, sample size, when last reviewed
+ └── runs/<YYYY-MM-DD>.jsonl # results of each eval run
+ ```
+
+ See `references/reference-dataset.md` for case-design principles, sampling from production, and when to expand the set.
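What a `cases.jsonl` line and its loader might look like. The field names (`id`, `input`, `expected`, `source`) are an assumption — the skill only mandates "input + expected (or rubric reference) per line" — but a `source` field makes the production-vs-synthetic anti-pattern gate below checkable:

```python
import io
import json

# Hypothetical cases.jsonl contents (would be read from the dataset dir)
cases_jsonl = io.StringIO(
    '{"id": "prod-311", "input": "refund order #88", "expected": "refund_tool", "source": "production"}\n'
    '{"id": "syn-002", "input": "hi", "expected": "smalltalk", "source": "synthetic"}\n'
)

cases = [json.loads(line) for line in cases_jsonl if line.strip()]

# Enforce the review gate: at least 20% of cases from real user inputs
prod_share = sum(c["source"] == "production" for c in cases) / len(cases)
assert prod_share >= 0.20, "synthetic-only dataset — would be REJECTED in review"
print(f"{len(cases)} cases, {prod_share:.0%} from production")  # → 2 cases, 50% from production
```

Each eval run appends one result line per case to `runs/<YYYY-MM-DD>.jsonl`, so regressions are diffable across dates.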
+
+ ## RAG evaluation
+
+ Three orthogonal metrics — measure all three, not just one:
+
+ | Metric | What it measures | Tool |
+ |--------|-----------------|------|
+ | **Retrieval precision@k** | Of the top-K retrieved chunks, how many were relevant | Exact match against labeled ground-truth |
+ | **Answer faithfulness** | Does the answer cite only what the retrieved context supports? | LLM-as-judge with rubric |
+ | **Context utilization** | Did the answer USE the retrieved context, or hallucinate around it? | Heuristic + LLM-as-judge |
+
+ Precision alone misses hallucination. Faithfulness alone misses retrieval failure. Context utilization alone misses both. See `references/rag-metrics.md` for the full implementation.
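Of the three metrics, only precision@k is fully deterministic (the other two need a calibrated judge), so it is the one worth showing inline. A minimal sketch with hypothetical chunk ids:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk ids found in the labeled
    ground-truth set."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(chunk_id in relevant for chunk_id in retrieved[:k]) / k

retrieved = ["doc-4", "doc-9", "doc-1", "doc-7"]   # ranker output, best first
relevant = {"doc-4", "doc-1", "doc-2"}             # labeled ground truth
print(precision_at_k(retrieved, relevant, k=3))    # 2 of top 3 relevant → 0.6666666666666666
```

Run it per case in the reference dataset and report the mean; a drop after a retriever change is a retrieval regression even if the judge scores look stable.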
+
+ ## Agent / tool-use evaluation
+
+ Two questions distinguish good agent eval from bad:
+
+ ### Question 1: outcome or trajectory?
+
+ | Approach | What it checks | Failure mode |
+ |----------|---------------|--------------|
+ | **Outcome-only** | Did the agent achieve the goal? Was the final state correct? | Misses "ghost actions" — agent did the right thing for the wrong reasons |
+ | **Trajectory** | Did the agent take the expected sequence of steps / tool calls? | Punishes legitimate creativity — agent solved it via a different valid path |
+
+ **Recommendation:** outcome-only with side-effect assertion as default. Trajectory match for cases where the path matters (e.g., "must call `get-user` before `update-user`").
+
+ ### Question 2: which trajectory match mode?
+
+ When trajectory matching IS the right call, four modes are available:
+
+ - **Strict** — same tool calls, same order, same arguments. Use when both sequence and parameters are part of the contract.
+ - **Unordered** — same tool calls, any order. Use when concurrent calls are valid.
+ - **Subset** — actual trajectory contains a subset of reference calls. Use to enforce "don't exceed expected tool use" (frugality / cost).
+ - **Superset** — actual contains all reference calls plus possibly more. Use when specific tools are mandatory but extras are acceptable.
+
+ See `references/agent-eval.md` for examples and the decision tree.
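The four modes reduce to simple multiset comparisons over the tool-call list. A sketch comparing call names only — the real agentevals matchers also compare arguments in strict mode, which is omitted here for brevity:

```python
from collections import Counter

def trajectory_match(actual: list[str], reference: list[str], mode: str) -> bool:
    """Match an agent's tool-call trajectory against a reference one."""
    a, r = Counter(actual), Counter(reference)
    if mode == "strict":
        return actual == reference               # same calls, same order
    if mode == "unordered":
        return a == r                            # same calls, any order
    if mode == "subset":
        return all(a[k] <= r[k] for k in a)      # no call beyond the reference
    if mode == "superset":
        return all(a[k] >= r[k] for k in r)      # every mandatory call present
    raise ValueError(f"unknown mode: {mode}")

ref = ["get-user", "update-user"]
assert trajectory_match(["get-user", "update-user"], ref, "strict")
assert trajectory_match(["update-user", "get-user"], ref, "unordered")
assert trajectory_match(["get-user"], ref, "subset")            # frugal agent: ok
assert trajectory_match(ref + ["send-email"], ref, "superset")  # extras allowed
assert not trajectory_match(ref + ["send-email"], ref, "subset")
```

Note that unordered mode would accept `update-user` before `get-user`, so the "must call `get-user` first" contract above needs strict mode.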
+
+ ## Required reading by context
+
+ | Doing what | Read |
+ |------------|------|
+ | Designing an eval suite for an AI feature | `references/oracle-ladder.md` (climb the ladder) |
+ | Using LLM-as-judge | `references/judge-calibration.md` (mandatory before relying on it) |
+ | Building / curating a reference dataset | `references/reference-dataset.md` |
+ | RAG-specific feature | `references/rag-metrics.md` |
+ | Agent / tool-use feature | `references/agent-eval.md` |
+
+ ## Anti-patterns (will block in `/dw-code-review`)
+
+ - **LLM-as-judge without calibration evidence.** PR adds LLM-as-judge but the calibration Spearman score is missing or < 0.80. REJECTED.
+ - **Same-model judge.** Judge model is the same as the system under test. REJECTED unless explicitly documented (and even then, results are suspect).
+ - **Single-rung eval.** Feature ships with only LLM-as-judge; no rung 1-3 grounding. REJECTED — the cheap rungs catch the loud failures.
+ - **Synthetic-only dataset.** No traceable production-failure source for any case. REJECTED — confirm at least 20% of cases come from real user inputs.
+ - **"Looks good to me" QA.** No reference dataset, no metric, no rubric — just sampling output and calling it good. REJECTED.
+ - **Coverage as metric.** Quoting "we tested 50 prompts" without saying what was measured. The number is meaningless without the metric.
+
+ ## Integration with dev-workflow commands
+
+ - `/dw-create-tasks`: when the PRD has an AI feature requirement, an eval-plan subtask is mandatory. The task references this skill's oracle ladder.
+ - `/dw-code-review`: AI feature PRs require a reference dataset + ≥2 oracle rungs (lower rungs FIRST). The constitution gate also applies — if the project has principles about AI feature reliability, they're enforced here.
+ - `/dw-run-qa --ai`: new mode (when this skill is bundled) — runs the reference dataset against the current implementation, logs to `QA/logs/ai/<feature>-<date>.jsonl`, computes precision@k / faithfulness / outcome accuracy per the feature type.
+ - `/dw-bugfix` when the bug is an AI failure mode (hallucination, tool misuse, classification error): adds the failing case to the reference dataset BEFORE fixing — the case is now a regression test forever.
+
+ ## When the discipline bends
+
+ - **Prototype / spike phase**: skip calibration; document as "spike — eval added before merge to main."
+ - **Internal-only AI feature with low blast radius** (e.g., classifier for internal CRM tags): rung 1-3 only is fine; LLM-as-judge may be overkill.
+ - **Real-time features where eval can't run synchronously**: shadow-eval pattern — run the eval async on a sample of production traffic; alert on regression.
+
+ In all bend cases, document the deviation in the techspec / PR. "Skipped judge calibration because internal-only feature affecting <100 users" is fine; just say it.
+
+ ## Why this approach
+
+ Two failure modes drive most AI feature regressions:
+
+ 1. **No measurement** — team ships, suspects it's worse, can't prove it, debate.
+ 2. **Wrong measurement** — team measures LLM-as-judge only, judge drifts with the model, scores rise while real quality falls.
+
+ The oracle ladder fixes both: forces measurement, forces ANCHORED measurement (lower rungs are deterministic; upper rungs are calibrated against them).
+
+ ## Bottom line
+
+ > An AI feature without an eval suite is a feature you can't ship safely. An eval suite without calibration is a number you can't trust. Build the dataset from real failures, climb the ladder from cheap to expensive, calibrate the judge against humans, and re-run before every model swap. The discipline is small; the absence of it is one of the largest sources of "we shipped and don't know if it's worse" experiences in the industry.