@forwardimpact/pathway 0.4.0 → 0.6.0

Files changed (245)
  1. package/bin/{pathway.js → fit-pathway.js} +65 -153
  2. package/package.json +18 -41
  3. package/{app → src}/commands/agent.js +5 -2
  4. package/{app → src}/commands/behaviour.js +1 -1
  5. package/{app → src}/commands/command-factory.js +2 -2
  6. package/{app → src}/commands/discipline.js +1 -1
  7. package/{app → src}/commands/driver.js +2 -2
  8. package/{app → src}/commands/grade.js +2 -2
  9. package/{app → src}/commands/job.js +3 -3
  10. package/{app → src}/commands/serve.js +26 -4
  11. package/{app → src}/commands/site.js +24 -4
  12. package/{app → src}/commands/skill.js +3 -3
  13. package/{app → src}/commands/stage.js +1 -1
  14. package/{app → src}/commands/track.js +2 -2
  15. package/{app → src}/components/card.js +11 -1
  16. package/{app → src}/components/checklist.js +1 -1
  17. package/src/components/code-display.js +153 -0
  18. package/{app → src}/components/comparison-radar.js +1 -1
  19. package/{app → src}/components/detail.js +1 -1
  20. package/src/components/markdown-textarea.js +153 -0
  21. package/{app → src}/components/skill-matrix.js +1 -1
  22. package/{app → src}/css/bundles/app.css +14 -0
  23. package/{app → src}/css/components/badges.css +15 -8
  24. package/{app → src}/css/components/forms.css +23 -13
  25. package/{app → src}/css/components/surfaces.css +49 -3
  26. package/{app → src}/css/components/typography.css +1 -2
  27. package/{app → src}/css/pages/agent-builder.css +11 -102
  28. package/{app → src}/css/pages/detail.css +11 -1
  29. package/{app → src}/css/tokens.css +3 -0
  30. package/{app → src}/formatters/agent/dom.js +26 -71
  31. package/{app → src}/formatters/agent/profile.js +11 -6
  32. package/{app → src}/formatters/behaviour/dom.js +1 -1
  33. package/{app → src}/formatters/discipline/dom.js +1 -1
  34. package/{app → src}/formatters/driver/dom.js +1 -1
  35. package/{app → src}/formatters/grade/dom.js +7 -7
  36. package/{app → src}/formatters/grade/markdown.js +1 -1
  37. package/{app → src}/formatters/interview/dom.js +1 -1
  38. package/{app → src}/formatters/interview/markdown.js +1 -1
  39. package/{app → src}/formatters/interview/shared.js +3 -3
  40. package/{app → src}/formatters/job/description.js +1 -1
  41. package/{app → src}/formatters/job/dom.js +3 -3
  42. package/{app → src}/formatters/job/markdown.js +1 -1
  43. package/{app → src}/formatters/json-ld.js +1 -1
  44. package/{app → src}/formatters/progress/shared.js +3 -3
  45. package/{app → src}/formatters/skill/dom.js +69 -57
  46. package/{app → src}/formatters/skill/markdown.js +1 -1
  47. package/{app → src}/formatters/skill/shared.js +5 -3
  48. package/{app → src}/formatters/stage/microdata.js +2 -2
  49. package/{app → src}/formatters/stage/shared.js +3 -3
  50. package/{app → src}/formatters/tool/shared.js +6 -0
  51. package/{app → src}/formatters/track/dom.js +1 -1
  52. package/{app → src}/formatters/track/markdown.js +1 -1
  53. package/{app → src}/formatters/track/shared.js +4 -1
  54. package/{app → src}/handout-main.js +16 -12
  55. package/src/handout.html +43 -0
  56. package/{app → src}/index.html +23 -2
  57. package/{app → src}/lib/card-mappers.js +28 -1
  58. package/{app → src}/lib/job-cache.js +1 -1
  59. package/{app → src}/lib/render.js +1 -1
  60. package/{app → src}/pages/agent-builder.js +120 -76
  61. package/{app → src}/pages/assessment-results.js +1 -1
  62. package/{app → src}/pages/interview.js +1 -1
  63. package/{app → src}/pages/job-builder.js +1 -1
  64. package/{app → src}/pages/job.js +1 -1
  65. package/{app → src}/pages/landing.js +5 -2
  66. package/{app → src}/pages/self-assessment.js +1 -1
  67. package/{app → src}/pages/skill.js +1 -1
  68. package/{app → src}/pages/stage.js +5 -5
  69. package/{app → src}/pages/tool.js +1 -1
  70. package/{app → src}/slide-main.js +2 -2
  71. package/{app → src}/slides/chapter.js +8 -8
  72. package/{app → src}/slides/index.js +3 -3
  73. package/{app → src}/slides/job.js +1 -1
  74. package/{app → src}/slides/overview.js +9 -9
  75. package/{app → src}/slides/skill.js +1 -0
  76. package/{app → src}/slides.html +16 -1
  77. package/templates/agent.template.md +44 -13
  78. package/templates/job.template.md +14 -20
  79. package/templates/skill.template.md +20 -23
  80. package/LICENSE +0 -201
  81. package/README.md +0 -104
  82. package/app/components/markdown-textarea.js +0 -132
  83. package/app/handout.html +0 -28
  84. package/app/model/agent.js +0 -738
  85. package/app/model/checklist.js +0 -103
  86. package/app/model/derivation.js +0 -766
  87. package/app/model/index-generator.js +0 -65
  88. package/app/model/interview.js +0 -539
  89. package/app/model/job.js +0 -228
  90. package/app/model/levels.js +0 -601
  91. package/app/model/loader.js +0 -599
  92. package/app/model/matching.js +0 -888
  93. package/app/model/modifiers.js +0 -158
  94. package/app/model/profile.js +0 -259
  95. package/app/model/progression.js +0 -507
  96. package/app/model/schema-validation.js +0 -438
  97. package/app/model/validation.js +0 -2130
  98. package/examples/agents/.claude/skills/architecture-design/SKILL.md +0 -130
  99. package/examples/agents/.claude/skills/cloud-platforms/SKILL.md +0 -131
  100. package/examples/agents/.claude/skills/code-quality-review/SKILL.md +0 -108
  101. package/examples/agents/.claude/skills/devops-cicd/SKILL.md +0 -142
  102. package/examples/agents/.claude/skills/full-stack-development/SKILL.md +0 -134
  103. package/examples/agents/.claude/skills/sre-practices/SKILL.md +0 -163
  104. package/examples/agents/.claude/skills/technical-debt-management/SKILL.md +0 -164
  105. package/examples/agents/.github/agents/se-platform-code.agent.md +0 -132
  106. package/examples/agents/.github/agents/se-platform-plan.agent.md +0 -131
  107. package/examples/agents/.github/agents/se-platform-review.agent.md +0 -136
  108. package/examples/agents/.vscode/settings.json +0 -8
  109. package/examples/behaviours/_index.yaml +0 -8
  110. package/examples/behaviours/outcome_ownership.yaml +0 -43
  111. package/examples/behaviours/polymathic_knowledge.yaml +0 -41
  112. package/examples/behaviours/precise_communication.yaml +0 -39
  113. package/examples/behaviours/relentless_curiosity.yaml +0 -37
  114. package/examples/behaviours/systems_thinking.yaml +0 -40
  115. package/examples/capabilities/_index.yaml +0 -8
  116. package/examples/capabilities/business.yaml +0 -189
  117. package/examples/capabilities/delivery.yaml +0 -303
  118. package/examples/capabilities/people.yaml +0 -68
  119. package/examples/capabilities/reliability.yaml +0 -412
  120. package/examples/capabilities/scale.yaml +0 -378
  121. package/examples/copilot-setup-steps.yaml +0 -25
  122. package/examples/devcontainer.yaml +0 -21
  123. package/examples/disciplines/_index.yaml +0 -6
  124. package/examples/disciplines/data_engineering.yaml +0 -78
  125. package/examples/disciplines/engineering_management.yaml +0 -63
  126. package/examples/disciplines/software_engineering.yaml +0 -78
  127. package/examples/drivers.yaml +0 -202
  128. package/examples/framework.yaml +0 -69
  129. package/examples/grades.yaml +0 -115
  130. package/examples/questions/behaviours/outcome_ownership.yaml +0 -51
  131. package/examples/questions/behaviours/polymathic_knowledge.yaml +0 -47
  132. package/examples/questions/behaviours/precise_communication.yaml +0 -54
  133. package/examples/questions/behaviours/relentless_curiosity.yaml +0 -50
  134. package/examples/questions/behaviours/systems_thinking.yaml +0 -52
  135. package/examples/questions/skills/architecture_design.yaml +0 -53
  136. package/examples/questions/skills/cloud_platforms.yaml +0 -47
  137. package/examples/questions/skills/code_quality.yaml +0 -48
  138. package/examples/questions/skills/data_modeling.yaml +0 -45
  139. package/examples/questions/skills/devops.yaml +0 -46
  140. package/examples/questions/skills/full_stack_development.yaml +0 -47
  141. package/examples/questions/skills/sre_practices.yaml +0 -43
  142. package/examples/questions/skills/stakeholder_management.yaml +0 -48
  143. package/examples/questions/skills/team_collaboration.yaml +0 -42
  144. package/examples/questions/skills/technical_writing.yaml +0 -42
  145. package/examples/self-assessments.yaml +0 -64
  146. package/examples/stages.yaml +0 -131
  147. package/examples/tracks/_index.yaml +0 -5
  148. package/examples/tracks/platform.yaml +0 -49
  149. package/examples/tracks/sre.yaml +0 -48
  150. package/examples/vscode-settings.yaml +0 -17
  151. /package/{app → src}/commands/index.js +0 -0
  152. /package/{app → src}/commands/init.js +0 -0
  153. /package/{app → src}/commands/interview.js +0 -0
  154. /package/{app → src}/commands/progress.js +0 -0
  155. /package/{app → src}/commands/questions.js +0 -0
  156. /package/{app → src}/commands/tool.js +0 -0
  157. /package/{app → src}/components/action-buttons.js +0 -0
  158. /package/{app → src}/components/behaviour-profile.js +0 -0
  159. /package/{app → src}/components/builder.js +0 -0
  160. /package/{app → src}/components/error-page.js +0 -0
  161. /package/{app → src}/components/grid.js +0 -0
  162. /package/{app → src}/components/list.js +0 -0
  163. /package/{app → src}/components/modifier-table.js +0 -0
  164. /package/{app → src}/components/nav.js +0 -0
  165. /package/{app → src}/components/progression-table.js +0 -0
  166. /package/{app → src}/components/radar-chart.js +0 -0
  167. /package/{app → src}/css/base.css +0 -0
  168. /package/{app → src}/css/bundles/handout.css +0 -0
  169. /package/{app → src}/css/bundles/slides.css +0 -0
  170. /package/{app → src}/css/components/buttons.css +0 -0
  171. /package/{app → src}/css/components/layout.css +0 -0
  172. /package/{app → src}/css/components/nav.css +0 -0
  173. /package/{app → src}/css/components/progress.css +0 -0
  174. /package/{app → src}/css/components/states.css +0 -0
  175. /package/{app → src}/css/components/tables.css +0 -0
  176. /package/{app → src}/css/components/utilities.css +0 -0
  177. /package/{app → src}/css/pages/assessment-results.css +0 -0
  178. /package/{app → src}/css/pages/interview-builder.css +0 -0
  179. /package/{app → src}/css/pages/job-builder.css +0 -0
  180. /package/{app → src}/css/pages/landing.css +0 -0
  181. /package/{app → src}/css/pages/lifecycle.css +0 -0
  182. /package/{app → src}/css/pages/progress-builder.css +0 -0
  183. /package/{app → src}/css/pages/self-assessment.css +0 -0
  184. /package/{app → src}/css/reset.css +0 -0
  185. /package/{app → src}/css/views/handout.css +0 -0
  186. /package/{app → src}/css/views/print.css +0 -0
  187. /package/{app → src}/css/views/slide-animations.css +0 -0
  188. /package/{app → src}/css/views/slide-base.css +0 -0
  189. /package/{app → src}/css/views/slide-sections.css +0 -0
  190. /package/{app → src}/css/views/slide-tables.css +0 -0
  191. /package/{app → src}/formatters/agent/skill.js +0 -0
  192. /package/{app → src}/formatters/behaviour/markdown.js +0 -0
  193. /package/{app → src}/formatters/behaviour/microdata.js +0 -0
  194. /package/{app → src}/formatters/behaviour/shared.js +0 -0
  195. /package/{app → src}/formatters/discipline/markdown.js +0 -0
  196. /package/{app → src}/formatters/discipline/microdata.js +0 -0
  197. /package/{app → src}/formatters/discipline/shared.js +0 -0
  198. /package/{app → src}/formatters/driver/microdata.js +0 -0
  199. /package/{app → src}/formatters/driver/shared.js +0 -0
  200. /package/{app → src}/formatters/grade/microdata.js +0 -0
  201. /package/{app → src}/formatters/grade/shared.js +0 -0
  202. /package/{app → src}/formatters/index.js +0 -0
  203. /package/{app → src}/formatters/microdata-shared.js +0 -0
  204. /package/{app → src}/formatters/progress/dom.js +0 -0
  205. /package/{app → src}/formatters/progress/markdown.js +0 -0
  206. /package/{app → src}/formatters/questions/json.js +0 -0
  207. /package/{app → src}/formatters/questions/markdown.js +0 -0
  208. /package/{app → src}/formatters/questions/shared.js +0 -0
  209. /package/{app → src}/formatters/questions/yaml.js +0 -0
  210. /package/{app → src}/formatters/shared.js +0 -0
  211. /package/{app → src}/formatters/skill/microdata.js +0 -0
  212. /package/{app → src}/formatters/stage/dom.js +0 -0
  213. /package/{app → src}/formatters/stage/index.js +0 -0
  214. /package/{app → src}/formatters/track/microdata.js +0 -0
  215. /package/{app → src}/lib/cli-output.js +0 -0
  216. /package/{app → src}/lib/error-boundary.js +0 -0
  217. /package/{app → src}/lib/errors.js +0 -0
  218. /package/{app → src}/lib/form-controls.js +0 -0
  219. /package/{app → src}/lib/markdown.js +0 -0
  220. /package/{app → src}/lib/radar.js +0 -0
  221. /package/{app → src}/lib/reactive.js +0 -0
  222. /package/{app → src}/lib/router-core.js +0 -0
  223. /package/{app → src}/lib/router-pages.js +0 -0
  224. /package/{app → src}/lib/router-slides.js +0 -0
  225. /package/{app → src}/lib/state.js +0 -0
  226. /package/{app → src}/lib/template-loader.js +0 -0
  227. /package/{app → src}/lib/utils.js +0 -0
  228. /package/{app → src}/lib/yaml-loader.js +0 -0
  229. /package/{app → src}/main.js +0 -0
  230. /package/{app → src}/pages/behaviour.js +0 -0
  231. /package/{app → src}/pages/discipline.js +0 -0
  232. /package/{app → src}/pages/driver.js +0 -0
  233. /package/{app → src}/pages/grade.js +0 -0
  234. /package/{app → src}/pages/interview-builder.js +0 -0
  235. /package/{app → src}/pages/progress-builder.js +0 -0
  236. /package/{app → src}/pages/progress.js +0 -0
  237. /package/{app → src}/pages/track.js +0 -0
  238. /package/{app → src}/slides/behaviour.js +0 -0
  239. /package/{app → src}/slides/discipline.js +0 -0
  240. /package/{app → src}/slides/driver.js +0 -0
  241. /package/{app → src}/slides/grade.js +0 -0
  242. /package/{app → src}/slides/interview.js +0 -0
  243. /package/{app → src}/slides/progress.js +0 -0
  244. /package/{app → src}/slides/track.js +0 -0
  245. /package/{app → src}/types.js +0 -0
package/examples/capabilities/reliability.yaml +0 -412
@@ -1,412 +0,0 @@
- # yaml-language-server: $schema=https://schema.forwardimpact.team/json/capability.schema.json
-
- name: Reliability
- emoji: 🛡️
- displayOrder: 5
- description: |
- Ensuring systems are dependable, secure, and observable.
- Includes DevOps practices, security, monitoring, incident response,
- and infrastructure management.
- professionalResponsibilities:
- awareness:
- Follow security and operational guidelines, escalate issues appropriately,
- and participate in on-call rotations with guidance
- foundational:
- Implement reliability practices in your code, create basic monitoring, and
- contribute effectively to incident response
- working:
- Design for reliability, implement comprehensive monitoring and alerting,
- lead incident response, and drive post-incident improvements
- practitioner:
- Establish SLOs/SLIs across teams, build resilient systems, lead reliability
- initiatives for your area, mentor engineers on reliability practices, and
- drive reliability culture
- expert:
- Shape reliability strategy across the business unit, lead critical incident
- management, pioneer new reliability practices, and be the authority on
- system resilience
- managementResponsibilities:
- awareness:
- Understand reliability requirements and support incident escalation
- processes
- foundational:
- Ensure team follows reliability practices, manage on-call schedules, and
- facilitate incident retrospectives
- working:
- Own team reliability outcomes, manage incident response rotations, staff
- reliability initiatives, and champion operational excellence
- practitioner:
- Drive reliability culture across teams, establish SLOs and incident
- management processes for your area, and own cross-team reliability outcomes
- expert:
- Shape reliability strategy across the business unit, lead critical incident
- management at executive level, and own enterprise reliability outcomes
- skills:
- - id: devops
- name: DevOps & CI/CD
- human:
- description:
- Building and maintaining deployment pipelines, infrastructure, and
- operational practices
- levelDescriptions:
- awareness:
- You understand CI/CD concepts (build, test, deploy) and can trigger
- and monitor pipelines others have built. You follow deployment
- procedures.
- foundational:
- You configure basic CI/CD pipelines, understand containerization
- (Docker), and can troubleshoot common build and deployment failures.
- working:
- You build complete CI/CD pipelines end-to-end, manage infrastructure
- as code (Terraform, CloudFormation), implement monitoring, and design
- deployment strategies for your services.
- practitioner:
- You design deployment strategies for complex multi-service systems
- across teams, optimize pipeline performance and reliability, define
- DevOps practices for your area, and mentor engineers on
- infrastructure.
- expert:
- You shape DevOps culture and practices across the business unit. You
- introduce innovative approaches to deployment and infrastructure,
- solve large-scale DevOps challenges, and are recognized externally.
- agent:
- name: devops-cicd
- description: |
- Guide for building CI/CD pipelines, managing infrastructure as code,
- and implementing deployment best practices.
- useWhen: |
- Setting up pipelines, containerizing applications, or configuring
- infrastructure.
- stages:
- specify:
- focus: |
- Define CI/CD and infrastructure requirements.
- Clarify deployment strategy and operational needs.
- activities:
- - Document deployment frequency requirements
- - Identify rollback and recovery requirements
- - Specify monitoring and alerting needs
- - Define security and compliance constraints
- - Mark ambiguities with [NEEDS CLARIFICATION]
- ready:
- - Deployment requirements are documented
- - Recovery requirements are specified
- - Monitoring needs are identified
- - Compliance constraints are clear
- plan:
- focus: |
- Plan CI/CD pipeline architecture and infrastructure requirements.
- Consider deployment strategies and monitoring needs.
- activities:
- - Define pipeline stages (build, test, deploy)
- - Identify infrastructure requirements
- - Plan deployment strategy (rolling, blue-green, canary)
- - Consider monitoring and alerting needs
- - Plan secret management approach
- ready:
- - Pipeline architecture is documented
- - Deployment strategy is chosen and justified
- - Infrastructure requirements are identified
- - Monitoring approach is defined
- code:
- focus: |
- Implement CI/CD pipelines and infrastructure as code. Follow
- best practices for containerization and deployment automation.
- activities:
- - Configure CI/CD pipeline stages
- - Implement infrastructure as code (Terraform, CloudFormation)
- - Create Dockerfiles with security best practices
- - Set up monitoring and alerting
- - Configure secret management
- - Implement deployment automation
- ready:
- - Pipeline runs on every commit
- - Tests run before deployment
- - Deployments are automated
- - Infrastructure is version controlled
- - Secrets are managed securely
- - Monitoring is in place
- review:
- focus: |
- Verify pipeline reliability, security, and operational readiness.
- Ensure rollback procedures work and documentation is complete.
- activities:
- - Verify pipeline runs successfully end-to-end
- - Test rollback procedures
- - Review security configurations
- - Validate monitoring and alerts
- - Check documentation completeness
- ready:
- - Pipeline is tested and reliable
- - Rollback procedure is documented and tested
- - Alerts are configured and tested
- - Runbooks exist for common issues
- deploy:
- focus: |
- Deploy pipeline and infrastructure changes to production.
- Verify operational readiness.
- activities:
- - Deploy pipeline configuration to production
- - Verify deployment workflows work correctly
- - Confirm monitoring and alerting are operational
- - Run deployment through the new pipeline
- ready:
- - Pipeline deployed and operational
- - Workflows tested in production
- - Monitoring confirms healthy operation
- - First deployment through pipeline succeeded
- toolReferences:
- - name: Terraform
- url: https://developer.hashicorp.com/terraform/docs
- description: Infrastructure as code tool
- useWhen: Provisioning and managing cloud infrastructure
- - name: Docker
- url: https://docs.docker.com/
- description: Container platform
- useWhen: Containerizing applications or managing container environments
- implementationReference: |
- ## CI/CD Pipeline Stages
-
- ### Build
- - Install dependencies
- - Compile/transpile code
- - Generate artifacts
- - Cache dependencies for speed
-
- ### Test
- - Run unit tests
- - Run integration tests
- - Static analysis and linting
- - Security scanning
-
- ### Deploy
- - Deploy to staging environment
- - Run smoke tests
- - Deploy to production
- - Verify deployment health
-
- ## Infrastructure as Code
-
- ### Terraform
- ```hcl
- # Define resources declaratively
- resource "aws_instance" "example" {
- ami = "ami-0c55b159cbfafe1f0"
- instance_type = "t2.micro"
- }
- ```
-
- ### Docker
- ```dockerfile
- FROM node:18-alpine
- WORKDIR /app
- COPY package*.json ./
- RUN npm ci --only=production
- COPY . .
- CMD ["node", "server.js"]
- ```
-
- ## Deployment Strategies
-
- ### Rolling Deployment
- - Gradual replacement of instances
- - Zero downtime
- - Easy rollback
-
- ### Blue-Green Deployment
- - Two identical environments
- - Switch traffic atomically
- - Fast rollback
-
- ### Canary Deployment
- - Route small percentage to new version
- - Monitor for issues
- - Gradually increase traffic
- - id: sre_practices
- name: Site Reliability Engineering
- human:
- description:
- Ensuring system reliability through observability, incident response,
- and capacity planning
- levelDescriptions:
- awareness:
- You understand SLIs, SLOs, and error budgets conceptually. You can use
- monitoring dashboards and escalate issues appropriately.
- foundational:
- You create basic alerts and dashboards. You participate in on-call
- rotations and contribute to incident response under guidance.
- working:
- You design observability strategies for your services, lead incident
- response, implement resilience testing, and conduct blameless
- post-mortems. You balance reliability investment with feature
- velocity.
- practitioner:
- You define reliability standards across teams in your area, drive
- post-incident improvements systematically, design capacity planning
- processes, and mentor engineers on SRE practices.
- expert:
- You shape reliability culture and standards across the business unit.
- You pioneer new reliability practices, solve large-scale reliability
- challenges, and are recognized as an authority on system resilience.
- agent:
- name: sre-practices
- description: |
- Guide for ensuring system reliability through observability, incident
- response, and capacity planning.
- useWhen: |
- Designing monitoring, handling incidents, setting SLOs, or improving
- system resilience.
- stages:
- specify:
- focus: |
- Define reliability requirements and SLO targets.
- Identify critical user journeys that need protection.
- activities:
- - Identify critical user journeys and business impact
- - Document reliability requirements (availability, latency)
- - Define SLO targets with stakeholder agreement
- - Specify acceptable error budgets
- - Mark ambiguities with [NEEDS CLARIFICATION]
- ready:
- - Critical user journeys are identified
- - Reliability requirements are documented
- - SLO targets are defined
- - Error budgets are agreed
- plan:
- focus: |
- Define reliability requirements, SLIs/SLOs, and observability
- strategy. Plan for resilience and capacity needs.
- activities:
- - Define SLIs for key user journeys
- - Set SLOs with stakeholder agreement
- - Plan observability strategy (metrics, logs, traces)
- - Identify failure modes and resilience patterns
- - Define alerting thresholds
- ready:
- - SLIs defined for key user journeys
- - SLOs set with stakeholder agreement
- - Monitoring strategy is planned
- - Failure modes are identified
- - Alerting thresholds are defined
- code:
- focus: |
- Implement observability, resilience patterns, and operational
- tooling. Build systems that fail gracefully and recover quickly.
- activities:
- - Implement metrics, logging, and tracing
- - Configure alerts based on SLOs
- - Implement resilience patterns (timeouts, retries, circuit
- breakers)
- - Create runbooks for common issues
- - Set up error budget tracking
- ready:
- - Comprehensive monitoring is in place
- - Alerts are actionable and low-noise
- - Resilience patterns are implemented
- - Runbooks exist for common issues
- - Error budget tracking is in place
- review:
- focus: |
- Verify reliability implementation meets SLOs and operational
- readiness. Ensure incident response procedures are in place.
- activities:
- - Validate SLOs are measurable
- - Test failure scenarios
- - Review runbook completeness
- - Verify incident response procedures
- - Check alert quality and coverage
- ready:
- - SLOs are measurable and validated
- - Failure scenarios are tested
- - Incident response process documented
- - Post-mortem culture established
- - Disaster recovery approach is tested
- deploy:
- focus: |
- Deploy reliability infrastructure and verify production
- monitoring. Ensure on-call readiness.
- activities:
- - Deploy monitoring and alerting to production
- - Verify dashboards and alerts work correctly
- - Confirm on-call rotation is ready
- - Run production readiness review
- ready:
- - Monitoring is live in production
- - Alerts fire correctly for SLO breaches
- - On-call team is trained and ready
- - Production readiness review is complete
- implementationReference: |
- ## Service Level Concepts
-
- ### SLI (Service Level Indicator)
- Quantitative measure of service behavior:
- - Request latency (p50, p95, p99)
- - Error rate (% of failed requests)
- - Availability (% of successful requests)
- - Throughput (requests per second)
-
- ### SLO (Service Level Objective)
- Target value for an SLI:
- - "99.9% of requests complete in < 200ms"
- - "Error rate < 0.1% over 30 days"
- - "99.95% availability monthly"
-
- ### Error Budget
- Allowed unreliability: 100% - SLO
- - 99.9% SLO = 0.1% error budget
- - ~43 minutes downtime per month
- - Spend on features or reliability
-
- ## Observability
-
- ### Three Pillars
- - **Metrics**: Aggregated numeric data (counters, gauges, histograms)
- - **Logs**: Discrete event records with context
- - **Traces**: Request flow across services
-
- ### Alerting Principles
- - Alert on symptoms, not causes
- - Every alert should be actionable
- - Reduce noise ruthlessly
- - Page only for user-impacting issues
- - Use severity levels appropriately
-
- ## Incident Response
-
- ### Incident Lifecycle
- 1. **Detection**: Automated alerts or user reports
- 2. **Triage**: Assess severity and impact
- 3. **Mitigation**: Stop the bleeding first
- 4. **Resolution**: Fix the underlying issue
- 5. **Post-mortem**: Learn and improve
-
- ### During an Incident
- - Communicate early and often
- - Focus on mitigation before root cause
- - Document actions in real-time
- - Escalate when needed
- - Update stakeholders regularly
-
- ## Post-Mortem Process
-
- ### Blameless Culture
- - Focus on systems, not individuals
- - Assume good intentions
- - Ask "how did the system allow this?"
- - Share findings openly
-
- ### Post-Mortem Template
- 1. Incident summary
- 2. Timeline of events
- 3. Root cause analysis
- 4. What went well
- 5. What could be improved
- 6. Action items with owners
-
- ## Resilience Patterns
-
- - **Timeouts**: Don't wait forever
- - **Retries**: With exponential backoff
- - **Circuit breakers**: Fail fast when downstream is unhealthy
- - **Bulkheads**: Isolate failures
- - **Graceful degradation**: Partial functionality over total failure
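
Note on the removed reference text: the deleted file states that a 99.9% SLO leaves a 0.1% error budget (about 43 minutes of downtime per month) and that retries should use exponential backoff. The Python sketch below works through both ideas. It is illustrative only and is not part of the package: the function names `error_budget_minutes` and `backoff_delays`, the 30-day window, and the base, factor, and cap values are all assumptions chosen for the example.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    Assumption: a "month" is a 30-day window, which reproduces the
    "~43 minutes" figure quoted for a 99.9% SLO.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    budget_fraction = (100.0 - slo_percent) / 100.0  # error budget = 100% - SLO
    window_minutes = window_days * 24 * 60
    return budget_fraction * window_minutes


def backoff_delays(base: float, factor: float, retries: int, cap: float) -> list[float]:
    """Delay in seconds before each retry: base * factor**attempt, capped.

    Hypothetical helper; production code usually also adds random jitter,
    which is omitted here to keep the output deterministic.
    """
    if retries < 0:
        raise ValueError("retries must be non-negative")
    return [min(cap, base * factor**attempt) for attempt in range(retries)]


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.95):
        minutes = error_budget_minutes(slo)
        print(f"{slo}% SLO: {minutes:.1f} minutes of error budget per 30 days")
    print("retry delays (s):", backoff_delays(base=0.5, factor=2.0, retries=5, cap=5.0))
```

The cap keeps late retries from waiting unboundedly long, which pairs naturally with the "Timeouts" item in the same list.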