@forwardimpact/schema 0.8.3 → 0.9.0
- package/examples/capabilities/delivery.yaml +821 -172
- package/examples/capabilities/reliability.yaml +165 -285
- package/examples/capabilities/scale.yaml +1344 -103
- package/package.json +1 -1
- package/schema/json/capability.schema.json +12 -4
- package/schema/rdf/capability.ttl +34 -8
- package/src/loader.js +5 -0
- package/src/validation.js +44 -0
@@ -1,8 +1,9 @@
 # yaml-language-server: $schema=https://www.forwardimpact.team/schema/json/capability.schema.json
 
+id: reliability
 name: Reliability
 emojiIcon: 🛡️
-ordinalRank:
+ordinalRank: 8
 description: |
   Ensuring systems are dependable, secure, and observable.
   Includes DevOps practices, security, monitoring, incident response,
@@ -42,225 +43,32 @@ managementResponsibilities:
     Shape reliability strategy across the business unit, lead critical incident
     management at executive level, and own enterprise reliability outcomes
 skills:
-  - id:
-    name:
+  - id: service_management
+    name: Service Management
+    isHumanOnly: true
     human:
       description:
-
-
+        Managing services throughout their lifecycle from design to retirement,
+        focusing on value delivery to users
       levelDescriptions:
         awareness:
-          You understand
-          and
-          procedures.
+          You understand service lifecycle concepts (design, deploy, operate,
+          retire) and follow service management processes established by others.
         foundational:
-          You
-
+          You document services you own, participate in service reviews, handle
+          basic service requests, and understand SLAs for your services.
         working:
-          You
-
-
+          You design service offerings with clear value propositions, manage
+          service level agreements, improve service delivery based on user
+          feedback, and communicate service status proactively.
         practitioner:
-          You
-
-
-          infrastructure.
+          You lead service management practices for multiple services across
+          teams, optimize service portfolios for your area, balance service
+          investments, and train engineers on service-oriented thinking.
         expert:
-          You shape
-
-
-    agent:
-      name: devops-cicd
-      description: |
-        Guide for building CI/CD pipelines, managing infrastructure as code,
-        and implementing deployment best practices.
-      useWhen: |
-        Setting up pipelines, containerizing applications, or configuring
-        infrastructure.
-      stages:
-        specify:
-          focus: |
-            Define CI/CD and infrastructure requirements.
-            Clarify deployment strategy and operational needs.
-          readChecklist:
-            - Document deployment frequency requirements
-            - Identify rollback and recovery requirements
-            - Specify monitoring and alerting needs
-            - Define security and compliance constraints
-            - Mark ambiguities with [NEEDS CLARIFICATION]
-          confirmChecklist:
-            - Deployment requirements are documented
-            - Recovery requirements are specified
-            - Monitoring needs are identified
-            - Compliance constraints are clear
-        plan:
-          focus: |
-            Plan CI/CD pipeline architecture and infrastructure requirements.
-            Consider deployment strategies and monitoring needs.
-          readChecklist:
-            - Define pipeline stages (build, test, deploy)
-            - Identify infrastructure requirements
-            - Plan deployment strategy (rolling, blue-green, canary)
-            - Consider monitoring and alerting needs
-            - Plan secret management approach
-          confirmChecklist:
-            - Pipeline architecture is documented
-            - Deployment strategy is chosen and justified
-            - Infrastructure requirements are identified
-            - Monitoring approach is defined
-        onboard:
-          focus: |
-            Set up the CI/CD and infrastructure development environment.
-            Install pipeline tools, container runtime, and IaC tooling.
-          readChecklist:
-            - Install container runtime (Docker/Colima)
-            - Install infrastructure as code tools (Terraform)
-            - Configure CI/CD service credentials
-            - Initialize Terraform workspace and backend
-            - Set up secret management for pipeline credentials
-          confirmChecklist:
-            - Container runtime builds and runs images
-            - Terraform initialized with providers
-            - CI/CD credentials configured securely
-            - Pipeline configuration file created
-            - Secret management is in place
-        code:
-          focus: |
-            Implement CI/CD pipelines and infrastructure as code. Follow
-            best practices for containerization and deployment automation.
-          readChecklist:
-            - Configure CI/CD pipeline stages
-            - Implement infrastructure as code (Terraform, CloudFormation)
-            - Create Dockerfiles with security best practices
-            - Set up monitoring and alerting
-            - Configure secret management
-            - Implement deployment automation
-          confirmChecklist:
-            - Pipeline runs on every commit
-            - Tests run before deployment
-            - Deployments are automated
-            - Infrastructure is version controlled
-            - Secrets are managed securely
-            - Monitoring is in place
-        review:
-          focus: |
-            Verify pipeline reliability, security, and operational readiness.
-            Ensure rollback procedures work and documentation is complete.
-          readChecklist:
-            - Verify pipeline runs successfully end-to-end
-            - Test rollback procedures
-            - Review security configurations
-            - Validate monitoring and alerts
-            - Check documentation completeness
-          confirmChecklist:
-            - Pipeline is tested and reliable
-            - Rollback procedure is documented and tested
-            - Alerts are configured and tested
-            - Runbooks exist for common issues
-        deploy:
-          focus: |
-            Deploy pipeline and infrastructure changes to production.
-            Verify operational readiness.
-          readChecklist:
-            - Deploy pipeline configuration to production
-            - Verify deployment workflows work correctly
-            - Confirm monitoring and alerting are operational
-            - Run deployment through the new pipeline
-          confirmChecklist:
-            - Pipeline deployed and operational
-            - Workflows tested in production
-            - Monitoring confirms healthy operation
-            - First deployment through pipeline succeeded
-    toolReferences:
-      - name: Terraform
-        url: https://developer.hashicorp.com/terraform/docs
-        simpleIcon: terraform
-        description: Infrastructure as code tool
-        useWhen: Provisioning and managing cloud infrastructure as code
-      - name: Colima
-        url: https://github.com/abiosoft/colima
-        simpleIcon: docker
-        description: Container runtime for macOS with Docker-compatible CLI
-        useWhen:
-          Running containers locally, building images, or containerizing
-          applications
-    implementationReference: |
-      ## CI/CD Pipeline Stages
-
-      ### Build
-      - Install dependencies
-      - Compile/transpile code
-      - Generate artifacts
-      - Cache dependencies for speed
-
-      ### Test
-      - Run unit tests
-      - Run integration tests
-      - Static analysis and linting
-      - Security scanning
-
-      ### Deploy
-      - Deploy to staging environment
-      - Run smoke tests
-      - Deploy to production
-      - Verify deployment health
-
-      ## Infrastructure as Code
-
-      ### Terraform Workflow
-      ```bash
-      terraform init # Initialize providers and backend
-      terraform plan # Preview changes before applying
-      terraform apply # Apply changes to infrastructure
-      terraform destroy # Tear down infrastructure
-      ```
-
-      ### Terraform in CI/CD Pipelines
-      1. **Plan on PR**: Run `terraform plan` on pull requests
-      2. **Review output**: Require approval for destructive changes
-      3. **Apply on merge**: Run `terraform apply` after merge to main
-      4. **State locking**: Use remote backend with locking (S3 + DynamoDB)
-
-      ### Terraform Structure
-      ```hcl
-      # main.tf - Define resources declaratively
-      resource "aws_instance" "app" {
-        ami = var.ami_id
-        instance_type = var.instance_type
-        tags = { Environment = var.environment }
-      }
-
-      # variables.tf - Parameterize for environments
-      variable "environment" { type = string }
-      variable "instance_type" { default = "t3.micro" }
-      ```
-
-      ### Docker
-      ```dockerfile
-      FROM node:18-alpine
-      WORKDIR /app
-      COPY package*.json ./
-      RUN npm ci --only=production
-      COPY . .
-      CMD ["node", "server.js"]
-      ```
-
-      ## Deployment Strategies
-
-      ### Rolling Deployment
-      - Gradual replacement of instances
-      - Zero downtime
-      - Easy rollback
-
-      ### Blue-Green Deployment
-      - Two identical environments
-      - Switch traffic atomically
-      - Fast rollback
-
-      ### Canary Deployment
-      - Route small percentage to new version
-      - Monitor for issues
-      - Gradually increase traffic
+          You shape service management strategy across the business unit. You
+          drive service excellence culture, innovate on service delivery
+          approaches, and are recognized as a service management authority.
   - id: sre_practices
     name: Site Reliability Engineering
     human:
@@ -329,21 +137,21 @@ skills:
             - Alerting thresholds are defined
         onboard:
           focus: |
-            Set up the observability and
-
-
+            Set up the observability and reliability tooling. Install
+            tracing, logging, and monitoring libraries, and configure
+            the observability backend.
           readChecklist:
-            - Install
-            - Configure
-            - Set up logging
-            -
-            -
+            - Install observability tools (OpenTelemetry, Pino/structlog)
+            - Configure OTLP exporter endpoint and credentials
+            - Set up structured logging configuration
+            - Verify traces and logs reach the observability backend
+            - Create runbook template directory
           confirmChecklist:
-            -
-            -
-            -
-            -
-            -
+            - OpenTelemetry SDK installed and configured
+            - OTLP exporter sends data to backend
+            - Structured logging produces valid JSON
+            - Test trace and log entries appear in backend
+            - Runbook directory structure created
         code:
           focus: |
             Implement observability, resilience patterns, and operational
@@ -391,78 +199,150 @@ skills:
             - Alerts fire correctly for SLO breaches
             - On-call team is trained and ready
             - Production readiness review is complete
+    toolReferences:
+      - name: OpenTelemetry
+        url: https://opentelemetry.io/docs/
+        simpleIcon: opentelemetry
+        description:
+          Vendor-neutral observability framework for traces, metrics, and logs
+        useWhen:
+          Instrumenting applications for distributed tracing and observability
+      - name: Pino
+        url: https://getpino.io/
+        simpleIcon: nodedotjs
+        description: Fast, low-overhead structured logging for Node.js
+        useWhen: Adding structured logging to JavaScript applications
+      - name: structlog
+        url: https://www.structlog.org/
+        simpleIcon: python
+        description: Structured logging library for Python
+        useWhen: Adding structured logging to Python applications
+    instructions: |
+      ## Step 1: Define SLIs and SLOs
+
+      Identify what matters to users. For each critical user
+      journey (page load, checkout, API), define an SLI (what to
+      measure) and SLO (target threshold). Calculate error budgets
+      from SLOs.
+
+      ## Step 2: Instrument Application
+
+      Add distributed tracing with OpenTelemetry. Configure the
+      OTLP exporter to send traces to your observability backend.
+      Use auto-instrumentation for common libraries.
+
+      ## Step 3: Add Structured Logging
+
+      Replace free-text logging with structured JSON logs. Use Pino
+      for Node.js or structlog for Python. Always include context
+      fields (userId, orderId, correlationId) for queryability.
+
+      ## Step 4: Create Runbook Template
+
+      For each alert, document symptoms, diagnosis steps,
+      mitigation actions, and escalation criteria. Keep runbooks
+      co-located with the alerting configuration.
+    installScript: |
+      set -e
+      npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
+      npm install @opentelemetry/exporter-trace-otlp-http pino
+      pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp structlog
+      python -c "import opentelemetry; import structlog"
     implementationReference: |
-      ##
+      ## SLI/SLO Table
+
+      | User Journey | SLI | SLO |
+      |--------------|-----|-----|
+      | Page load | Latency p99 | < 500ms for 99.9% of requests |
+      | Checkout | Success rate | > 99.95% of transactions succeed |
+      | API | Availability | 99.9% uptime (43 min/month budget) |
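A minimal sketch, assuming a 30-day month, of how the 43 min/month budget in the API row above follows from the 99.9% availability target. `errorBudgetMinutes` is a hypothetical helper for illustration and is not part of the package.

```javascript
// Hypothetical helper (not part of @forwardimpact/schema): converts an
// availability SLO into the downtime allowed over a window.
function errorBudgetMinutes(slo, windowDays = 30) {
  if (!(slo > 0 && slo < 1)) {
    throw new RangeError('slo must be a fraction between 0 and 1')
  }
  const windowMinutes = windowDays * 24 * 60
  return (1 - slo) * windowMinutes
}

// 99.9% over an assumed 30-day month: 0.001 * 43,200 minutes
console.log(errorBudgetMinutes(0.999).toFixed(1)) // prints "43.2"
```

The table rounds this value down to 43 minutes; a different window length (for example 28 or 31 days) would shift the budget slightly.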
+
+      ## OpenTelemetry — Node.js
+
+      ```javascript
+      const { NodeSDK } = require('@opentelemetry/sdk-node')
+      const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node')
+      const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http')
+
+      const sdk = new NodeSDK({
+        traceExporter: new OTLPTraceExporter({
+          url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces'
+        }),
+        instrumentations: [getNodeAutoInstrumentations()]
+      })
+      sdk.start()
+      ```
 
-
-      Quantitative measure of service behavior:
-      - Request latency (p50, p95, p99)
-      - Error rate (% of failed requests)
-      - Availability (% of successful requests)
-      - Throughput (requests per second)
+      ## OpenTelemetry — Python
 
-
-
-
-
-
+      ```python
+      from opentelemetry import trace
+      from opentelemetry.sdk.trace import TracerProvider
+      from opentelemetry.sdk.trace.export import BatchSpanProcessor
+      from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
 
-
-
-
-
-
+      provider = TracerProvider()
+      provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
+      trace.set_tracer_provider(provider)
+      ```
+
+      ## Structured Logging — Pino (Node.js)
 
-
+      ```javascript
+      const pino = require('pino')
+      const logger = pino({ level: process.env.LOG_LEVEL || 'info' })
 
-
-
-
-      - **Traces**: Request flow across services
+      logger.info({ userId: 123, action: 'checkout' }, 'Processing checkout')
+      logger.error({ err, orderId }, 'Checkout failed')
+      ```
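Structured logs are only queryable if every entry carries the same context fields. The sketch below illustrates that idea with a dependency-free stand-in. It is not Pino's API: `createLogger`, `child`, and the `correlationId` value are hypothetical names chosen for illustration.

```javascript
// Dependency-free illustration of structured JSON logging with bound context.
// Not Pino's API: createLogger and child are hypothetical names.
function createLogger(context = {}, write = (line) => console.log(line)) {
  const emit = (level) => (fields, msg) =>
    write(JSON.stringify({ level, time: new Date().toISOString(), ...context, ...fields, msg }))
  return {
    info: emit('info'),
    error: emit('error'),
    // Returns a logger whose entries always include the extra context fields.
    child: (extra) => createLogger({ ...context, ...extra }, write),
  }
}

// Bind a correlation ID once per request so every entry can be queried by it.
const requestLogger = createLogger().child({ correlationId: 'req-42' })
requestLogger.info({ userId: 123, action: 'checkout' }, 'Processing checkout')
```

Binding the context once, rather than repeating it at each call site, is what keeps the fields consistent across a request.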
 
-
-      - Alert on symptoms, not causes
-      - Every alert should be actionable
-      - Reduce noise ruthlessly
-      - Page only for user-impacting issues
-      - Use severity levels appropriately
+      ## Structured Logging — structlog (Python)
 
-
+      ```python
+      import structlog
+
+      structlog.configure(processors=[
+          structlog.processors.TimeStamper(fmt="iso"),
+          structlog.processors.JSONRenderer()
+      ])
+      logger = structlog.get_logger()
+      logger.info("processing_checkout", user_id=123)
+      ```
 
-
-      1. **Detection**: Automated alerts or user reports
-      2. **Triage**: Assess severity and impact
-      3. **Mitigation**: Stop the bleeding first
-      4. **Resolution**: Fix the underlying issue
-      5. **Post-mortem**: Learn and improve
+      ## Runbook Template
 
-
-
-      - Focus on mitigation before root cause
-      - Document actions in real-time
-      - Escalate when needed
-      - Update stakeholders regularly
+      ```markdown
+      # Runbook: High Error Rate
 
-      ##
+      ## Symptoms
+      - Error rate > 0.1% for 5+ minutes
+
+      ## Diagnosis
+      1. Check application logs for error patterns
+      2. Check dependency health endpoints
+      3. Check recent deployments
+
+      ## Mitigation
+      1. If recent deploy: Roll back
+      2. If dependency issue: Enable circuit breaker
+      3. If load spike: Scale up
+
+      ## Escalation
+      If not resolved in 15 min, escalate to team lead.
+      ```
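The runbook's symptom ("Error rate > 0.1% for 5+ minutes") describes a sustained-threshold condition. A minimal sketch of such a check, assuming one error-rate sample per minute; `isSustainedBreach` and the sample values are hypothetical and not taken from the package.

```javascript
// Hypothetical alert condition: true when the error rate stays above the
// threshold for at least `minutes` consecutive one-minute samples.
function isSustainedBreach(errorRates, threshold = 0.001, minutes = 5) {
  let streak = 0
  for (const rate of errorRates) {
    streak = rate > threshold ? streak + 1 : 0
    if (streak >= minutes) return true
  }
  return false
}

// One-minute error-rate samples (fractions of failed requests), made up for illustration.
console.log(isSustainedBreach([0.002, 0.003, 0.002, 0.004, 0.002])) // prints true
console.log(isSustainedBreach([0.002, 0.003, 0.0005, 0.004, 0.002])) // prints false
```

Requiring consecutive samples keeps a single bad minute from paging anyone, which fits the alert-fatigue concern in this capability.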
 
-
-      - Focus on systems, not individuals
-      - Assume good intentions
-      - Ask "how did the system allow this?"
-      - Share findings openly
+      ## Verification
 
-
-
-
-
-
-      5. What could be improved
-      6. Action items with owners
+      Your reliability setup is working when:
+      - Traces appear in your observability backend
+      - Structured logs contain consistent fields (correlation IDs, user context)
+      - Runbooks exist for known failure modes
+      - Team knows how to respond to common alerts
 
-      ##
+      ## Common Pitfalls
 
-      - **
-      - **
-      - **
-      - **
-      - **
+      - **Missing environment variables**: OTLP exporters fail silently
+      - **No correlation IDs**: Cannot trace requests across services
+      - **Unstructured logs**: Free-text logs are hard to query
+      - **Alert fatigue**: Too many alerts drown out real issues
+      - **No runbooks**: Alerts fire but responders don't know what to do
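One way to guard against the first pitfall above is to check required settings at startup instead of letting an exporter fall back to a default. This is a sketch under that assumption: `requireEnv` is a hypothetical helper, and `OTEL_EXPORTER_OTLP_ENDPOINT` is the variable used in the Node.js example in this diff.

```javascript
// Hypothetical startup check: fail fast when required settings are absent.
function requireEnv(names, env = process.env) {
  const missing = names.filter((name) => !env[name])
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`)
  }
  return Object.fromEntries(names.map((name) => [name, env[name]]))
}

// Passing an explicit object keeps the example independent of the real environment.
const config = requireEnv(['OTEL_EXPORTER_OTLP_ENDPOINT'], {
  OTEL_EXPORTER_OTLP_ENDPOINT: 'http://localhost:4318',
})
console.log(config.OTEL_EXPORTER_OTLP_ENDPOINT)
```

Whether to fail fast or fall back to a local default is a design choice; failing fast suits production, while a fallback can be convenient in local development.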