@forwardimpact/map 0.12.0 → 0.13.0
- package/README.md +1 -1
- package/bin/fit-map.js +12 -12
- package/package.json +9 -6
- package/schema/json/discipline.schema.json +2 -6
- package/schema/rdf/discipline.ttl +6 -19
- package/src/index-generator.js +67 -38
- package/src/index.js +10 -25
- package/src/loader.js +407 -562
- package/src/schema-validation.js +327 -307
- package/examples/behaviours/_index.yaml +0 -8
- package/examples/behaviours/outcome_ownership.yaml +0 -43
- package/examples/behaviours/polymathic_knowledge.yaml +0 -41
- package/examples/behaviours/precise_communication.yaml +0 -39
- package/examples/behaviours/relentless_curiosity.yaml +0 -37
- package/examples/behaviours/systems_thinking.yaml +0 -40
- package/examples/capabilities/_index.yaml +0 -8
- package/examples/capabilities/business.yaml +0 -205
- package/examples/capabilities/delivery.yaml +0 -1001
- package/examples/capabilities/people.yaml +0 -68
- package/examples/capabilities/reliability.yaml +0 -349
- package/examples/capabilities/scale.yaml +0 -1672
- package/examples/copilot-setup-steps.yaml +0 -25
- package/examples/devcontainer.yaml +0 -21
- package/examples/disciplines/_index.yaml +0 -6
- package/examples/disciplines/data_engineering.yaml +0 -68
- package/examples/disciplines/engineering_management.yaml +0 -61
- package/examples/disciplines/software_engineering.yaml +0 -68
- package/examples/drivers.yaml +0 -202
- package/examples/framework.yaml +0 -73
- package/examples/levels.yaml +0 -115
- package/examples/questions/behaviours/outcome_ownership.yaml +0 -228
- package/examples/questions/behaviours/polymathic_knowledge.yaml +0 -275
- package/examples/questions/behaviours/precise_communication.yaml +0 -248
- package/examples/questions/behaviours/relentless_curiosity.yaml +0 -248
- package/examples/questions/behaviours/systems_thinking.yaml +0 -238
- package/examples/questions/capabilities/business.yaml +0 -107
- package/examples/questions/capabilities/delivery.yaml +0 -101
- package/examples/questions/capabilities/people.yaml +0 -106
- package/examples/questions/capabilities/reliability.yaml +0 -105
- package/examples/questions/capabilities/scale.yaml +0 -104
- package/examples/questions/skills/architecture_design.yaml +0 -115
- package/examples/questions/skills/cloud_platforms.yaml +0 -105
- package/examples/questions/skills/code_quality.yaml +0 -162
- package/examples/questions/skills/data_modeling.yaml +0 -107
- package/examples/questions/skills/devops.yaml +0 -111
- package/examples/questions/skills/full_stack_development.yaml +0 -118
- package/examples/questions/skills/sre_practices.yaml +0 -113
- package/examples/questions/skills/stakeholder_management.yaml +0 -116
- package/examples/questions/skills/team_collaboration.yaml +0 -106
- package/examples/questions/skills/technical_writing.yaml +0 -110
- package/examples/self-assessments.yaml +0 -64
- package/examples/stages.yaml +0 -191
- package/examples/tracks/_index.yaml +0 -5
- package/examples/tracks/platform.yaml +0 -47
- package/examples/tracks/sre.yaml +0 -46
- package/examples/vscode-settings.yaml +0 -21
@@ -1,68 +0,0 @@
-# yaml-language-server: $schema=https://www.forwardimpact.team/schema/json/capability.schema.json
-
-name: People
-emojiIcon: 👥
-ordinalRank: 6
-description: |
-  Growing individuals and building effective teams.
-  Includes mentoring, coaching, hiring, performance management,
-  and creating inclusive environments.
-professionalResponsibilities:
-  awareness:
-    You contribute positively to team dynamics, are open to feedback, and learn
-    actively from colleagues
-  foundational:
-    You support teammates through pair programming, knowledge sharing, and
-    constructive code reviews
-  working:
-    You mentor junior engineers on technical topics, contribute to hiring
-    through interviews, and actively build team knowledge
-  practitioner:
-    You coach multiple engineers on career growth, lead hiring for technical
-    roles across your area, and shape team technical culture
-  expert:
-    You develop technical leaders, shape engineering talent strategy across the
-    business unit, and build high-performing engineering teams
-managementResponsibilities:
-  awareness:
-    You build positive relationships with team members and seek feedback on your
-    leadership
-  foundational:
-    You conduct effective 1:1s, provide regular feedback, support individual
-    development, and recognize contributions
-  working:
-    You manage team performance, own hiring decisions, create inclusive team
-    environments, and handle difficult conversations
-  practitioner:
-    You develop and retain talent across teams, build leadership pipelines for
-    your area, make promotion decisions, and shape team culture
-  expert:
-    You develop senior leaders, shape talent strategy across the business unit,
-    build high-performing teams, and own succession planning
-skills:
-  - id: team_collaboration
-    name: Team Collaboration
-    isHumanOnly: true
-    human:
-      description: Working effectively with others to achieve shared goals
-      proficiencyDescriptions:
-        awareness:
-          You participate constructively in team activities, communicate
-          clearly, and ask for help when stuck. You are reliable and follow
-          through on commitments.
-        foundational:
-          You collaborate effectively on shared work, support teammates
-          proactively, share knowledge freely, and give and receive feedback
-          constructively.
-        working:
-          You facilitate collaboration across the team, resolve minor conflicts
-          before they escalate, enable others to succeed, and contribute
-          positively to team dynamics and morale.
-        practitioner:
-          You build high-performing teams across your area, navigate complex
-          interpersonal dynamics, foster psychological safety, and create
-          environments where diverse perspectives are valued and heard.
-        expert:
-          You create collaborative culture across the business unit. You
-          transform dysfunctional team dynamics, are recognized for building
-          exceptional teams, and mentor others on collaboration excellence.
@@ -1,349 +0,0 @@
-# yaml-language-server: $schema=https://www.forwardimpact.team/schema/json/capability.schema.json
-
-id: reliability
-name: Reliability
-emojiIcon: 🛡️
-ordinalRank: 8
-description: |
-  Ensuring systems are dependable, secure, and observable.
-  Includes DevOps practices, security, monitoring, incident response,
-  and infrastructure management.
-professionalResponsibilities:
-  awareness:
-    You follow security and operational guidelines, escalate issues
-    appropriately, and participate in on-call rotations with guidance
-  foundational:
-    You implement reliability practices in your code, create basic monitoring,
-    and contribute effectively to incident response
-  working:
-    You design for reliability, implement comprehensive monitoring and alerting,
-    lead incident response, and drive post-incident improvements
-  practitioner:
-    You establish SLOs/SLIs across teams, build resilient systems, lead
-    reliability initiatives for your area, mentor engineers on reliability
-    practices, and drive reliability culture
-  expert:
-    You shape reliability strategy across the business unit, lead critical
-    incident management, pioneer new reliability practices, and are the
-    authority on system resilience
-managementResponsibilities:
-  awareness:
-    You understand reliability requirements and support incident escalation
-    processes
-  foundational:
-    You ensure your team follows reliability practices, manage on-call
-    schedules, and facilitate incident retrospectives
-  working:
-    You own team reliability outcomes, manage incident response rotations, staff
-    reliability initiatives, and champion operational excellence
-  practitioner:
-    You drive reliability culture across teams, establish SLOs and incident
-    management processes for your area, and own cross-team reliability outcomes
-  expert:
-    You shape reliability strategy across the business unit, lead critical
-    incident management at executive level, and own enterprise reliability
-    outcomes
-skills:
-  - id: service_management
-    name: Service Management
-    isHumanOnly: true
-    human:
-      description:
-        Managing services throughout their lifecycle from design to retirement,
-        focusing on value delivery to users
-      proficiencyDescriptions:
-        awareness:
-          You understand service lifecycle concepts (design, deploy, operate,
-          retire) and follow service management processes established by others.
-        foundational:
-          You document services you own, participate in service reviews, handle
-          basic service requests, and understand SLAs for your services.
-        working:
-          You design service offerings with clear value propositions, manage
-          service level agreements, improve service delivery based on user
-          feedback, and communicate service status proactively.
-        practitioner:
-          You lead service management practices for multiple services across
-          teams, optimize service portfolios for your area, balance service
-          investments, and train engineers on service-oriented thinking.
-        expert:
-          You shape service management strategy across the business unit. You
-          drive service excellence culture, innovate on service delivery
-          approaches, and are recognized as a service management authority.
-  - id: sre_practices
-    name: Site Reliability Engineering
-    human:
-      description:
-        Ensuring system reliability through observability, incident response,
-        and capacity planning
-      proficiencyDescriptions:
-        awareness:
-          You understand SLIs, SLOs, and error budgets conceptually. You can use
-          monitoring dashboards and escalate issues appropriately.
-        foundational:
-          You create basic alerts and dashboards. You participate in on-call
-          rotations and contribute to incident response under guidance.
-        working:
-          You design observability strategies for your services, lead incident
-          response, implement resilience testing, and conduct blameless
-          post-mortems. You balance reliability investment with feature
-          velocity.
-        practitioner:
-          You define reliability standards across teams in your area, drive
-          post-incident improvements systematically, design capacity planning
-          processes, and mentor engineers on SRE practices.
-        expert:
-          You shape reliability culture and standards across the business unit.
-          You pioneer new reliability practices, solve large-scale reliability
-          challenges, and are recognized as an authority on system resilience.
-    agent:
-      name: sre-practices
-      description: |
-        Guide for ensuring system reliability through observability, incident
-        response, and capacity planning.
-      useWhen: |
-        Designing monitoring, handling incidents, setting SLOs, or improving
-        system resilience.
-      stages:
-        specify:
-          focus: |
-            Define reliability requirements and SLO targets.
-            Identify critical user journeys that need protection.
-          readChecklist:
-            - Identify critical user journeys and business impact
-            - Document reliability requirements (availability, latency)
-            - Define SLO targets with stakeholder agreement
-            - Specify acceptable error budgets
-            - Mark ambiguities with [NEEDS CLARIFICATION]
-          confirmChecklist:
-            - Critical user journeys are identified
-            - Reliability requirements are documented
-            - SLO targets are defined
-            - Error budgets are agreed
-        plan:
-          focus: |
-            Define reliability requirements, SLIs/SLOs, and observability
-            strategy. Plan for resilience and capacity needs.
-          readChecklist:
-            - Define SLIs for key user journeys
-            - Set SLOs with stakeholder agreement
-            - Plan observability strategy (metrics, logs, traces)
-            - Identify failure modes and resilience patterns
-            - Define alerting thresholds
-          confirmChecklist:
-            - SLIs defined for key user journeys
-            - SLOs set with stakeholder agreement
-            - Monitoring strategy is planned
-            - Failure modes are identified
-            - Alerting thresholds are defined
-        onboard:
-          focus: |
-            Set up the observability and reliability tooling. Install
-            tracing, logging, and monitoring libraries, and configure
-            the observability backend.
-          readChecklist:
-            - Install observability tools (OpenTelemetry, Pino/structlog)
-            - Configure OTLP exporter endpoint and credentials
-            - Set up structured logging configuration
-            - Verify traces and logs reach the observability backend
-            - Create runbook template directory
-          confirmChecklist:
-            - OpenTelemetry SDK installed and configured
-            - OTLP exporter sends data to backend
-            - Structured logging produces valid JSON
-            - Test trace and log entries appear in backend
-            - Runbook directory structure created
-        code:
-          focus: |
-            Implement observability, resilience patterns, and operational
-            tooling. Build systems that fail gracefully and recover quickly.
-          readChecklist:
-            - Implement metrics, logging, and tracing
-            - Configure alerts based on SLOs
-            - Implement resilience patterns (timeouts, retries, circuit
-              breakers)
-            - Create runbooks for common issues
-            - Set up error budget tracking
-          confirmChecklist:
-            - Comprehensive monitoring is in place
-            - Alerts are actionable and low-noise
-            - Resilience patterns are implemented
-            - Runbooks exist for common issues
-            - Error budget tracking is in place
-        review:
-          focus: |
-            Verify reliability implementation meets SLOs and operational
-            readiness. Ensure incident response procedures are in place.
-          readChecklist:
-            - Validate SLOs are measurable
-            - Test failure scenarios
-            - Review runbook completeness
-            - Verify incident response procedures
-            - Check alert quality and coverage
-          confirmChecklist:
-            - SLOs are measurable and validated
-            - Failure scenarios are tested
-            - Incident response process documented
-            - Post-mortem culture established
-            - Disaster recovery approach is tested
-        deploy:
-          focus: |
-            Deploy reliability infrastructure and verify production
-            monitoring. Ensure on-call readiness.
-          readChecklist:
-            - Deploy monitoring and alerting to production
-            - Verify dashboards and alerts work correctly
-            - Confirm on-call rotation is ready
-            - Run production readiness review
-          confirmChecklist:
-            - Monitoring is live in production
-            - Alerts fire correctly for SLO breaches
-            - On-call team is trained and ready
-            - Production readiness review is complete
-    toolReferences:
-      - name: OpenTelemetry
-        url: https://opentelemetry.io/docs/
-        simpleIcon: opentelemetry
-        description:
-          Vendor-neutral observability framework for traces, metrics, and logs
-        useWhen:
-          Instrumenting applications for distributed tracing and observability
-      - name: Pino
-        url: https://getpino.io/
-        simpleIcon: nodedotjs
-        description: Fast, low-overhead structured logging for Node.js
-        useWhen: Adding structured logging to JavaScript applications
-      - name: structlog
-        url: https://www.structlog.org/
-        simpleIcon: python
-        description: Structured logging library for Python
-        useWhen: Adding structured logging to Python applications
-    instructions: |
-      ## Step 1: Define SLIs and SLOs
-
-      Identify what matters to users. For each critical user
-      journey (page load, checkout, API), define an SLI (what to
-      measure) and SLO (target threshold). Calculate error budgets
-      from SLOs.
-
-      ## Step 2: Instrument Application
-
-      Add distributed tracing with OpenTelemetry. Configure the
-      OTLP exporter to send traces to your observability backend.
-      Use auto-instrumentation for common libraries.
-
-      ## Step 3: Add Structured Logging
-
-      Replace free-text logging with structured JSON logs. Use Pino
-      for Node.js or structlog for Python. Always include context
-      fields (userId, orderId, correlationId) for queryability.
-
-      ## Step 4: Create Runbook Template
-
-      For each alert, document symptoms, diagnosis steps,
-      mitigation actions, and escalation criteria. Keep runbooks
-      co-located with the alerting configuration.
-    installScript: |
-      set -e
-      npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
-      npm install @opentelemetry/exporter-trace-otlp-http pino
-      pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp structlog
-      python -c "import opentelemetry; import structlog"
-    implementationReference: |
-      ## SLI/SLO Table
-
-      | User Journey | SLI | SLO |
-      |--------------|-----|-----|
-      | Page load | Latency p99 | < 500ms for 99.9% of requests |
-      | Checkout | Success rate | > 99.95% of transactions succeed |
-      | API | Availability | 99.9% uptime (43 min/month budget) |
-
-      ## OpenTelemetry — Node.js
-
-      ```javascript
-      const { NodeSDK } = require('@opentelemetry/sdk-node')
-      const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node')
-      const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http')
-
-      const sdk = new NodeSDK({
-        traceExporter: new OTLPTraceExporter({
-          url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces'
-        }),
-        instrumentations: [getNodeAutoInstrumentations()]
-      })
-      sdk.start()
-      ```
-
-      ## OpenTelemetry — Python
-
-      ```python
-      from opentelemetry import trace
-      from opentelemetry.sdk.trace import TracerProvider
-      from opentelemetry.sdk.trace.export import BatchSpanProcessor
-      from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
-
-      provider = TracerProvider()
-      provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
-      trace.set_tracer_provider(provider)
-      ```
-
-      ## Structured Logging — Pino (Node.js)
-
-      ```javascript
-      const pino = require('pino')
-      const logger = pino({ level: process.env.LOG_LEVEL || 'info' })
-
-      logger.info({ userId: 123, action: 'checkout' }, 'Processing checkout')
-      logger.error({ err, orderId }, 'Checkout failed')
-      ```
-
-      ## Structured Logging — structlog (Python)
-
-      ```python
-      import structlog
-
-      structlog.configure(processors=[
-          structlog.processors.TimeStamper(fmt="iso"),
-          structlog.processors.JSONRenderer()
-      ])
-      logger = structlog.get_logger()
-      logger.info("processing_checkout", user_id=123)
-      ```
-
-      ## Runbook Template
-
-      ```markdown
-      # Runbook: High Error Rate
-
-      ## Symptoms
-      - Error rate > 0.1% for 5+ minutes
-
-      ## Diagnosis
-      1. Check application logs for error patterns
-      2. Check dependency health endpoints
-      3. Check recent deployments
-
-      ## Mitigation
-      1. If recent deploy: Roll back
-      2. If dependency issue: Enable circuit breaker
-      3. If load spike: Scale up
-
-      ## Escalation
-      If not resolved in 15 min, escalate to team lead.
-      ```
-
-      ## Verification
-
-      Your reliability setup is working when:
-      - Traces appear in your observability backend
-      - Structured logs contain consistent fields (correlation IDs, user context)
-      - Runbooks exist for known failure modes
-      - Team knows how to respond to common alerts
-
-      ## Common Pitfalls
-
-      - **Missing environment variables**: OTLP exporters fail silently
-      - **No correlation IDs**: Cannot trace requests across services
-      - **Unstructured logs**: Free-text logs are hard to query
-      - **Alert fatigue**: Too many alerts drown out real issues
-      - **No runbooks**: Alerts fire but responders don't know what to do
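
Note on the removed SLI/SLO table: it pairs "99.9% uptime" with a "43 min/month budget", and the removed instructions say to calculate error budgets from SLOs without showing the arithmetic. The sketch below is one plausible way to do that calculation. The function name `error_budget_minutes` and the 30-day window are assumptions made for illustration; neither appears in the package.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability SLO permits.

    Hypothetical helper: the 30-day default window is an assumption,
    chosen because the removed table quotes a per-month budget.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    window_minutes = window_days * 24 * 60  # total minutes in the window
    # The budget is the fraction of the window the SLO allows to be "bad".
    return window_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    for slo in (99.9, 99.95):
        print(f"{slo}% over 30 days: {error_budget_minutes(slo):.1f} min")
```

Under these assumptions a 99.9% target allows roughly 43.2 minutes per 30 days, which is consistent with the 43 min/month figure in the removed table.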