@forwardimpact/schema 0.1.0
- package/bin/fit-schema.js +260 -0
- package/examples/behaviours/_index.yaml +8 -0
- package/examples/behaviours/outcome_ownership.yaml +43 -0
- package/examples/behaviours/polymathic_knowledge.yaml +41 -0
- package/examples/behaviours/precise_communication.yaml +39 -0
- package/examples/behaviours/relentless_curiosity.yaml +37 -0
- package/examples/behaviours/systems_thinking.yaml +40 -0
- package/examples/capabilities/_index.yaml +8 -0
- package/examples/capabilities/business.yaml +189 -0
- package/examples/capabilities/delivery.yaml +305 -0
- package/examples/capabilities/people.yaml +68 -0
- package/examples/capabilities/reliability.yaml +414 -0
- package/examples/capabilities/scale.yaml +378 -0
- package/examples/copilot-setup-steps.yaml +25 -0
- package/examples/devcontainer.yaml +21 -0
- package/examples/disciplines/_index.yaml +6 -0
- package/examples/disciplines/data_engineering.yaml +78 -0
- package/examples/disciplines/engineering_management.yaml +63 -0
- package/examples/disciplines/software_engineering.yaml +78 -0
- package/examples/drivers.yaml +202 -0
- package/examples/framework.yaml +69 -0
- package/examples/grades.yaml +115 -0
- package/examples/questions/behaviours/outcome_ownership.yaml +51 -0
- package/examples/questions/behaviours/polymathic_knowledge.yaml +47 -0
- package/examples/questions/behaviours/precise_communication.yaml +54 -0
- package/examples/questions/behaviours/relentless_curiosity.yaml +50 -0
- package/examples/questions/behaviours/systems_thinking.yaml +52 -0
- package/examples/questions/skills/architecture_design.yaml +53 -0
- package/examples/questions/skills/cloud_platforms.yaml +47 -0
- package/examples/questions/skills/code_quality.yaml +48 -0
- package/examples/questions/skills/data_modeling.yaml +45 -0
- package/examples/questions/skills/devops.yaml +46 -0
- package/examples/questions/skills/full_stack_development.yaml +47 -0
- package/examples/questions/skills/sre_practices.yaml +43 -0
- package/examples/questions/skills/stakeholder_management.yaml +48 -0
- package/examples/questions/skills/team_collaboration.yaml +42 -0
- package/examples/questions/skills/technical_writing.yaml +42 -0
- package/examples/self-assessments.yaml +64 -0
- package/examples/stages.yaml +139 -0
- package/examples/tracks/_index.yaml +5 -0
- package/examples/tracks/platform.yaml +49 -0
- package/examples/tracks/sre.yaml +48 -0
- package/examples/vscode-settings.yaml +21 -0
- package/lib/index-generator.js +65 -0
- package/lib/index.js +44 -0
- package/lib/levels.js +601 -0
- package/lib/loader.js +599 -0
- package/lib/modifiers.js +23 -0
- package/lib/schema-validation.js +438 -0
- package/lib/validation.js +2130 -0
- package/package.json +49 -0
- package/schema/json/behaviour-questions.schema.json +68 -0
- package/schema/json/behaviour.schema.json +73 -0
- package/schema/json/capability.schema.json +220 -0
- package/schema/json/defs.schema.json +132 -0
- package/schema/json/discipline.schema.json +132 -0
- package/schema/json/drivers.schema.json +48 -0
- package/schema/json/framework.schema.json +55 -0
- package/schema/json/grades.schema.json +121 -0
- package/schema/json/index.schema.json +18 -0
- package/schema/json/self-assessments.schema.json +52 -0
- package/schema/json/skill-questions.schema.json +68 -0
- package/schema/json/stages.schema.json +84 -0
- package/schema/json/track.schema.json +100 -0
- package/schema/rdf/pathway.ttl +2362 -0
@@ -0,0 +1,414 @@
+# yaml-language-server: $schema=https://schema.forwardimpact.team/json/capability.schema.json
+
+name: Reliability
+emojiIcon: 🛡️
+displayOrder: 5
+description: |
+  Ensuring systems are dependable, secure, and observable.
+  Includes DevOps practices, security, monitoring, incident response,
+  and infrastructure management.
+professionalResponsibilities:
+  awareness:
+    Follow security and operational guidelines, escalate issues appropriately,
+    and participate in on-call rotations with guidance
+  foundational:
+    Implement reliability practices in your code, create basic monitoring, and
+    contribute effectively to incident response
+  working:
+    Design for reliability, implement comprehensive monitoring and alerting,
+    lead incident response, and drive post-incident improvements
+  practitioner:
+    Establish SLOs/SLIs across teams, build resilient systems, lead reliability
+    initiatives for your area, mentor engineers on reliability practices, and
+    drive reliability culture
+  expert:
+    Shape reliability strategy across the business unit, lead critical incident
+    management, pioneer new reliability practices, and be the authority on
+    system resilience
+managementResponsibilities:
+  awareness:
+    Understand reliability requirements and support incident escalation
+    processes
+  foundational:
+    Ensure team follows reliability practices, manage on-call schedules, and
+    facilitate incident retrospectives
+  working:
+    Own team reliability outcomes, manage incident response rotations, staff
+    reliability initiatives, and champion operational excellence
+  practitioner:
+    Drive reliability culture across teams, establish SLOs and incident
+    management processes for your area, and own cross-team reliability outcomes
+  expert:
+    Shape reliability strategy across the business unit, lead critical incident
+    management at executive level, and own enterprise reliability outcomes
+skills:
+  - id: devops
+    name: DevOps & CI/CD
+    human:
+      description:
+        Building and maintaining deployment pipelines, infrastructure, and
+        operational practices
+      levelDescriptions:
+        awareness:
+          You understand CI/CD concepts (build, test, deploy) and can trigger
+          and monitor pipelines others have built. You follow deployment
+          procedures.
+        foundational:
+          You configure basic CI/CD pipelines, understand containerization
+          (Docker), and can troubleshoot common build and deployment failures.
+        working:
+          You build complete CI/CD pipelines end-to-end, manage infrastructure
+          as code (Terraform, CloudFormation), implement monitoring, and design
+          deployment strategies for your services.
+        practitioner:
+          You design deployment strategies for complex multi-service systems
+          across teams, optimize pipeline performance and reliability, define
+          DevOps practices for your area, and mentor engineers on
+          infrastructure.
+        expert:
+          You shape DevOps culture and practices across the business unit. You
+          introduce innovative approaches to deployment and infrastructure,
+          solve large-scale DevOps challenges, and are recognized externally.
+    agent:
+      name: devops-cicd
+      description: |
+        Guide for building CI/CD pipelines, managing infrastructure as code,
+        and implementing deployment best practices.
+      useWhen: |
+        Setting up pipelines, containerizing applications, or configuring
+        infrastructure.
+      stages:
+        specify:
+          focus: |
+            Define CI/CD and infrastructure requirements.
+            Clarify deployment strategy and operational needs.
+          activities:
+            - Document deployment frequency requirements
+            - Identify rollback and recovery requirements
+            - Specify monitoring and alerting needs
+            - Define security and compliance constraints
+            - Mark ambiguities with [NEEDS CLARIFICATION]
+          ready:
+            - Deployment requirements are documented
+            - Recovery requirements are specified
+            - Monitoring needs are identified
+            - Compliance constraints are clear
+        plan:
+          focus: |
+            Plan CI/CD pipeline architecture and infrastructure requirements.
+            Consider deployment strategies and monitoring needs.
+          activities:
+            - Define pipeline stages (build, test, deploy)
+            - Identify infrastructure requirements
+            - Plan deployment strategy (rolling, blue-green, canary)
+            - Consider monitoring and alerting needs
+            - Plan secret management approach
+          ready:
+            - Pipeline architecture is documented
+            - Deployment strategy is chosen and justified
+            - Infrastructure requirements are identified
+            - Monitoring approach is defined
+        code:
+          focus: |
+            Implement CI/CD pipelines and infrastructure as code. Follow
+            best practices for containerization and deployment automation.
+          activities:
+            - Configure CI/CD pipeline stages
+            - Implement infrastructure as code (Terraform, CloudFormation)
+            - Create Dockerfiles with security best practices
+            - Set up monitoring and alerting
+            - Configure secret management
+            - Implement deployment automation
+          ready:
+            - Pipeline runs on every commit
+            - Tests run before deployment
+            - Deployments are automated
+            - Infrastructure is version controlled
+            - Secrets are managed securely
+            - Monitoring is in place
+        review:
+          focus: |
+            Verify pipeline reliability, security, and operational readiness.
+            Ensure rollback procedures work and documentation is complete.
+          activities:
+            - Verify pipeline runs successfully end-to-end
+            - Test rollback procedures
+            - Review security configurations
+            - Validate monitoring and alerts
+            - Check documentation completeness
+          ready:
+            - Pipeline is tested and reliable
+            - Rollback procedure is documented and tested
+            - Alerts are configured and tested
+            - Runbooks exist for common issues
+        deploy:
+          focus: |
+            Deploy pipeline and infrastructure changes to production.
+            Verify operational readiness.
+          activities:
+            - Deploy pipeline configuration to production
+            - Verify deployment workflows work correctly
+            - Confirm monitoring and alerting are operational
+            - Run deployment through the new pipeline
+          ready:
+            - Pipeline deployed and operational
+            - Workflows tested in production
+            - Monitoring confirms healthy operation
+            - First deployment through pipeline succeeded
+    toolReferences:
+      - name: Terraform
+        url: https://developer.hashicorp.com/terraform/docs
+        simpleIcon: terraform
+        description: Infrastructure as code tool
+        useWhen: Provisioning and managing cloud infrastructure
+      - name: Docker
+        url: https://docs.docker.com/
+        simpleIcon: docker
+        description: Container platform
+        useWhen: Containerizing applications or managing container environments
+    implementationReference: |
+      ## CI/CD Pipeline Stages
+
+      ### Build
+      - Install dependencies
+      - Compile/transpile code
+      - Generate artifacts
+      - Cache dependencies for speed
+
+      ### Test
+      - Run unit tests
+      - Run integration tests
+      - Static analysis and linting
+      - Security scanning
+
+      ### Deploy
+      - Deploy to staging environment
+      - Run smoke tests
+      - Deploy to production
+      - Verify deployment health
+
+      ## Infrastructure as Code
+
+      ### Terraform
+      ```hcl
+      # Define resources declaratively
+      resource "aws_instance" "example" {
+        ami = "ami-0c55b159cbfafe1f0"
+        instance_type = "t2.micro"
+      }
+      ```
+
+      ### Docker
+      ```dockerfile
+      FROM node:18-alpine
+      WORKDIR /app
+      COPY package*.json ./
+      RUN npm ci --only=production
+      COPY . .
+      CMD ["node", "server.js"]
+      ```
+
+      ## Deployment Strategies
+
+      ### Rolling Deployment
+      - Gradual replacement of instances
+      - Zero downtime
+      - Easy rollback
+
+      ### Blue-Green Deployment
+      - Two identical environments
+      - Switch traffic atomically
+      - Fast rollback
+
+      ### Canary Deployment
+      - Route small percentage to new version
+      - Monitor for issues
+      - Gradually increase traffic
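Reviewer note (not part of the package): the canary steps listed above (route a small percentage, monitor, gradually increase) can be sketched as a small decision loop. This is a minimal Python illustration under assumptions: the names `next_canary_weight` and `rollout`, the 5% starting weight, the doubling step, and the 1% error-rate threshold are all hypothetical and do not come from the diff.

```python
def next_canary_weight(current: int, error_rate: float,
                       max_error_rate: float = 0.01) -> int:
    """Return the next canary traffic percentage; 0 means roll back."""
    if error_rate > max_error_rate:
        # Monitoring shows a problem: send all traffic back to the stable version.
        return 0
    # Healthy: double the canary's share, capped at full traffic.
    return min(100, current * 2)


def rollout(observed_error_rates, start: int = 5):
    """Walk through one rollout, given the error rate observed at each step."""
    weight = start
    history = [weight]
    for rate in observed_error_rates:
        weight = next_canary_weight(weight, rate)
        history.append(weight)
        if weight in (0, 100):  # rolled back or fully promoted
            break
    return history


if __name__ == "__main__":
    print(rollout([0.001] * 5))    # healthy rollout
    print(rollout([0.001, 0.05]))  # error spike triggers a rollback
```

A real rollout would read the error rate from monitoring and apply the weight through a load balancer or service mesh; both are left out here on purpose.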
+  - id: sre_practices
+    name: Site Reliability Engineering
+    human:
+      description:
+        Ensuring system reliability through observability, incident response,
+        and capacity planning
+      levelDescriptions:
+        awareness:
+          You understand SLIs, SLOs, and error budgets conceptually. You can use
+          monitoring dashboards and escalate issues appropriately.
+        foundational:
+          You create basic alerts and dashboards. You participate in on-call
+          rotations and contribute to incident response under guidance.
+        working:
+          You design observability strategies for your services, lead incident
+          response, implement resilience testing, and conduct blameless
+          post-mortems. You balance reliability investment with feature
+          velocity.
+        practitioner:
+          You define reliability standards across teams in your area, drive
+          post-incident improvements systematically, design capacity planning
+          processes, and mentor engineers on SRE practices.
+        expert:
+          You shape reliability culture and standards across the business unit.
+          You pioneer new reliability practices, solve large-scale reliability
+          challenges, and are recognized as an authority on system resilience.
+    agent:
+      name: sre-practices
+      description: |
+        Guide for ensuring system reliability through observability, incident
+        response, and capacity planning.
+      useWhen: |
+        Designing monitoring, handling incidents, setting SLOs, or improving
+        system resilience.
+      stages:
+        specify:
+          focus: |
+            Define reliability requirements and SLO targets.
+            Identify critical user journeys that need protection.
+          activities:
+            - Identify critical user journeys and business impact
+            - Document reliability requirements (availability, latency)
+            - Define SLO targets with stakeholder agreement
+            - Specify acceptable error budgets
+            - Mark ambiguities with [NEEDS CLARIFICATION]
+          ready:
+            - Critical user journeys are identified
+            - Reliability requirements are documented
+            - SLO targets are defined
+            - Error budgets are agreed
+        plan:
+          focus: |
+            Define reliability requirements, SLIs/SLOs, and observability
+            strategy. Plan for resilience and capacity needs.
+          activities:
+            - Define SLIs for key user journeys
+            - Set SLOs with stakeholder agreement
+            - Plan observability strategy (metrics, logs, traces)
+            - Identify failure modes and resilience patterns
+            - Define alerting thresholds
+          ready:
+            - SLIs defined for key user journeys
+            - SLOs set with stakeholder agreement
+            - Monitoring strategy is planned
+            - Failure modes are identified
+            - Alerting thresholds are defined
+        code:
+          focus: |
+            Implement observability, resilience patterns, and operational
+            tooling. Build systems that fail gracefully and recover quickly.
+          activities:
+            - Implement metrics, logging, and tracing
+            - Configure alerts based on SLOs
+            - Implement resilience patterns (timeouts, retries, circuit
+              breakers)
+            - Create runbooks for common issues
+            - Set up error budget tracking
+          ready:
+            - Comprehensive monitoring is in place
+            - Alerts are actionable and low-noise
+            - Resilience patterns are implemented
+            - Runbooks exist for common issues
+            - Error budget tracking is in place
+        review:
+          focus: |
+            Verify reliability implementation meets SLOs and operational
+            readiness. Ensure incident response procedures are in place.
+          activities:
+            - Validate SLOs are measurable
+            - Test failure scenarios
+            - Review runbook completeness
+            - Verify incident response procedures
+            - Check alert quality and coverage
+          ready:
+            - SLOs are measurable and validated
+            - Failure scenarios are tested
+            - Incident response process documented
+            - Post-mortem culture established
+            - Disaster recovery approach is tested
+        deploy:
+          focus: |
+            Deploy reliability infrastructure and verify production
+            monitoring. Ensure on-call readiness.
+          activities:
+            - Deploy monitoring and alerting to production
+            - Verify dashboards and alerts work correctly
+            - Confirm on-call rotation is ready
+            - Run production readiness review
+          ready:
+            - Monitoring is live in production
+            - Alerts fire correctly for SLO breaches
+            - On-call team is trained and ready
+            - Production readiness review is complete
+    implementationReference: |
+      ## Service Level Concepts
+
+      ### SLI (Service Level Indicator)
+      Quantitative measure of service behavior:
+      - Request latency (p50, p95, p99)
+      - Error rate (% of failed requests)
+      - Availability (% of successful requests)
+      - Throughput (requests per second)
+
+      ### SLO (Service Level Objective)
+      Target value for an SLI:
+      - "99.9% of requests complete in < 200ms"
+      - "Error rate < 0.1% over 30 days"
+      - "99.95% availability monthly"
+
+      ### Error Budget
+      Allowed unreliability: 100% - SLO
+      - 99.9% SLO = 0.1% error budget
+      - ~43 minutes downtime per month
+      - Spend on features or reliability
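Reviewer note (not part of the package): the error-budget arithmetic above can be checked with a few lines of Python. This is a sketch under one assumption, that "per month" means a 30-day window; the function name `error_budget_minutes` is hypothetical.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unreliability in the window: (100% - SLO) of its length."""
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return window_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.95):
        print(f"{slo}% SLO -> {error_budget_minutes(slo):.1f} minutes per 30 days")
```

For a 99.9% SLO this gives roughly 43.2 minutes, matching the "~43 minutes" figure in the list.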
+
+      ## Observability
+
+      ### Three Pillars
+      - **Metrics**: Aggregated numeric data (counters, gauges, histograms)
+      - **Logs**: Discrete event records with context
+      - **Traces**: Request flow across services
+
+      ### Alerting Principles
+      - Alert on symptoms, not causes
+      - Every alert should be actionable
+      - Reduce noise ruthlessly
+      - Page only for user-impacting issues
+      - Use severity levels appropriately
+
+      ## Incident Response
+
+      ### Incident Lifecycle
+      1. **Detection**: Automated alerts or user reports
+      2. **Triage**: Assess severity and impact
+      3. **Mitigation**: Stop the bleeding first
+      4. **Resolution**: Fix the underlying issue
+      5. **Post-mortem**: Learn and improve
+
+      ### During an Incident
+      - Communicate early and often
+      - Focus on mitigation before root cause
+      - Document actions in real-time
+      - Escalate when needed
+      - Update stakeholders regularly
+
+      ## Post-Mortem Process
+
+      ### Blameless Culture
+      - Focus on systems, not individuals
+      - Assume good intentions
+      - Ask "how did the system allow this?"
+      - Share findings openly
+
+      ### Post-Mortem Template
+      1. Incident summary
+      2. Timeline of events
+      3. Root cause analysis
+      4. What went well
+      5. What could be improved
+      6. Action items with owners
+
+      ## Resilience Patterns
+
+      - **Timeouts**: Don't wait forever
+      - **Retries**: With exponential backoff
+      - **Circuit breakers**: Fail fast when downstream is unhealthy
+      - **Bulkheads**: Isolate failures
+      - **Graceful degradation**: Partial functionality over total failure
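Reviewer note (not part of the package): two of the resilience patterns listed above, retries with exponential backoff and circuit breakers, can be sketched in plain Python. Everything named here (`retry_with_backoff`, `CircuitBreaker`, `CircuitOpenError`, and the default thresholds and delays) is a hypothetical illustration, not an API from the package or from any library.

```python
import time


def retry_with_backoff(fn, attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep):
    """Call fn, retrying on any exception with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (0.1, 0.2, 0.4, ...) up to max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast while the downstream is assumed unhealthy.
                raise CircuitOpenError("circuit is open")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The `sleep` and `clock` parameters are injectable so the behavior can be exercised without real waiting. Production code would usually add jitter to the backoff delay and restrict retries to errors known to be transient.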