@forwardimpact/schema 0.8.3 → 0.9.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,8 +1,9 @@
  # yaml-language-server: $schema=https://www.forwardimpact.team/schema/json/capability.schema.json
 
+ id: reliability
  name: Reliability
  emojiIcon: 🛡️
- ordinalRank: 5
+ ordinalRank: 8
  description: |
  Ensuring systems are dependable, secure, and observable.
  Includes DevOps practices, security, monitoring, incident response,
@@ -42,225 +43,32 @@ managementResponsibilities:
  Shape reliability strategy across the business unit, lead critical incident
  management at executive level, and own enterprise reliability outcomes
  skills:
- - id: devops
- name: DevOps & CI/CD
+ - id: service_management
+ name: Service Management
+ isHumanOnly: true
  human:
  description:
- Building and maintaining deployment pipelines, infrastructure, and
- operational practices
+ Managing services throughout their lifecycle from design to retirement,
+ focusing on value delivery to users
  levelDescriptions:
  awareness:
- You understand CI/CD concepts (build, test, deploy) and can trigger
- and monitor pipelines others have built. You follow deployment
- procedures.
+ You understand service lifecycle concepts (design, deploy, operate,
+ retire) and follow service management processes established by others.
  foundational:
- You configure basic CI/CD pipelines, understand containerization
- (Docker), and can troubleshoot common build and deployment failures.
+ You document services you own, participate in service reviews, handle
+ basic service requests, and understand SLAs for your services.
  working:
- You build complete CI/CD pipelines end-to-end, manage infrastructure
- as code (Terraform, CloudFormation), implement monitoring, and design
- deployment strategies for your services.
+ You design service offerings with clear value propositions, manage
+ service level agreements, improve service delivery based on user
+ feedback, and communicate service status proactively.
  practitioner:
- You design deployment strategies for complex multi-service systems
- across teams, optimize pipeline performance and reliability, define
- DevOps practices for your area, and mentor engineers on
- infrastructure.
+ You lead service management practices for multiple services across
+ teams, optimize service portfolios for your area, balance service
+ investments, and train engineers on service-oriented thinking.
  expert:
- You shape DevOps culture and practices across the business unit. You
- introduce innovative approaches to deployment and infrastructure,
- solve large-scale DevOps challenges, and are recognized externally.
- agent:
- name: devops-cicd
- description: |
- Guide for building CI/CD pipelines, managing infrastructure as code,
- and implementing deployment best practices.
- useWhen: |
- Setting up pipelines, containerizing applications, or configuring
- infrastructure.
- stages:
- specify:
- focus: |
- Define CI/CD and infrastructure requirements.
- Clarify deployment strategy and operational needs.
- readChecklist:
- - Document deployment frequency requirements
- - Identify rollback and recovery requirements
- - Specify monitoring and alerting needs
- - Define security and compliance constraints
- - Mark ambiguities with [NEEDS CLARIFICATION]
- confirmChecklist:
- - Deployment requirements are documented
- - Recovery requirements are specified
- - Monitoring needs are identified
- - Compliance constraints are clear
- plan:
- focus: |
- Plan CI/CD pipeline architecture and infrastructure requirements.
- Consider deployment strategies and monitoring needs.
- readChecklist:
- - Define pipeline stages (build, test, deploy)
- - Identify infrastructure requirements
- - Plan deployment strategy (rolling, blue-green, canary)
- - Consider monitoring and alerting needs
- - Plan secret management approach
- confirmChecklist:
- - Pipeline architecture is documented
- - Deployment strategy is chosen and justified
- - Infrastructure requirements are identified
- - Monitoring approach is defined
- onboard:
- focus: |
- Set up the CI/CD and infrastructure development environment.
- Install pipeline tools, container runtime, and IaC tooling.
- readChecklist:
- - Install container runtime (Docker/Colima)
- - Install infrastructure as code tools (Terraform)
- - Configure CI/CD service credentials
- - Initialize Terraform workspace and backend
- - Set up secret management for pipeline credentials
- confirmChecklist:
- - Container runtime builds and runs images
- - Terraform initialized with providers
- - CI/CD credentials configured securely
- - Pipeline configuration file created
- - Secret management is in place
- code:
- focus: |
- Implement CI/CD pipelines and infrastructure as code. Follow
- best practices for containerization and deployment automation.
- readChecklist:
- - Configure CI/CD pipeline stages
- - Implement infrastructure as code (Terraform, CloudFormation)
- - Create Dockerfiles with security best practices
- - Set up monitoring and alerting
- - Configure secret management
- - Implement deployment automation
- confirmChecklist:
- - Pipeline runs on every commit
- - Tests run before deployment
- - Deployments are automated
- - Infrastructure is version controlled
- - Secrets are managed securely
- - Monitoring is in place
- review:
- focus: |
- Verify pipeline reliability, security, and operational readiness.
- Ensure rollback procedures work and documentation is complete.
- readChecklist:
- - Verify pipeline runs successfully end-to-end
- - Test rollback procedures
- - Review security configurations
- - Validate monitoring and alerts
- - Check documentation completeness
- confirmChecklist:
- - Pipeline is tested and reliable
- - Rollback procedure is documented and tested
- - Alerts are configured and tested
- - Runbooks exist for common issues
- deploy:
- focus: |
- Deploy pipeline and infrastructure changes to production.
- Verify operational readiness.
- readChecklist:
- - Deploy pipeline configuration to production
- - Verify deployment workflows work correctly
- - Confirm monitoring and alerting are operational
- - Run deployment through the new pipeline
- confirmChecklist:
- - Pipeline deployed and operational
- - Workflows tested in production
- - Monitoring confirms healthy operation
- - First deployment through pipeline succeeded
- toolReferences:
- - name: Terraform
- url: https://developer.hashicorp.com/terraform/docs
- simpleIcon: terraform
- description: Infrastructure as code tool
- useWhen: Provisioning and managing cloud infrastructure as code
- - name: Colima
- url: https://github.com/abiosoft/colima
- simpleIcon: docker
- description: Container runtime for macOS with Docker-compatible CLI
- useWhen:
- Running containers locally, building images, or containerizing
- applications
- implementationReference: |
- ## CI/CD Pipeline Stages
-
- ### Build
- - Install dependencies
- - Compile/transpile code
- - Generate artifacts
- - Cache dependencies for speed
-
- ### Test
- - Run unit tests
- - Run integration tests
- - Static analysis and linting
- - Security scanning
-
- ### Deploy
- - Deploy to staging environment
- - Run smoke tests
- - Deploy to production
- - Verify deployment health
-
- ## Infrastructure as Code
-
- ### Terraform Workflow
- ```bash
- terraform init # Initialize providers and backend
- terraform plan # Preview changes before applying
- terraform apply # Apply changes to infrastructure
- terraform destroy # Tear down infrastructure
- ```
-
- ### Terraform in CI/CD Pipelines
- 1. **Plan on PR**: Run `terraform plan` on pull requests
- 2. **Review output**: Require approval for destructive changes
- 3. **Apply on merge**: Run `terraform apply` after merge to main
- 4. **State locking**: Use remote backend with locking (S3 + DynamoDB)
-
- ### Terraform Structure
- ```hcl
- # main.tf - Define resources declaratively
- resource "aws_instance" "app" {
- ami = var.ami_id
- instance_type = var.instance_type
- tags = { Environment = var.environment }
- }
-
- # variables.tf - Parameterize for environments
- variable "environment" { type = string }
- variable "instance_type" { default = "t3.micro" }
- ```
-
- ### Docker
- ```dockerfile
- FROM node:18-alpine
- WORKDIR /app
- COPY package*.json ./
- RUN npm ci --only=production
- COPY . .
- CMD ["node", "server.js"]
- ```
-
- ## Deployment Strategies
-
- ### Rolling Deployment
- - Gradual replacement of instances
- - Zero downtime
- - Easy rollback
-
- ### Blue-Green Deployment
- - Two identical environments
- - Switch traffic atomically
- - Fast rollback
-
- ### Canary Deployment
- - Route small percentage to new version
- - Monitor for issues
- - Gradually increase traffic
+ You shape service management strategy across the business unit. You
+ drive service excellence culture, innovate on service delivery
+ approaches, and are recognized as a service management authority.
  - id: sre_practices
  name: Site Reliability Engineering
  human:
@@ -329,21 +137,21 @@ skills:
  - Alerting thresholds are defined
  onboard:
  focus: |
- Set up the observability and monitoring environment.
- Install monitoring tools, configure dashboards, and verify
- instrumentation libraries work.
+ Set up the observability and reliability tooling. Install
+ tracing, logging, and monitoring libraries, and configure
+ the observability backend.
  readChecklist:
- - Install monitoring/observability tools (Prometheus, Grafana)
- - Configure tracing library (OpenTelemetry)
- - Set up logging framework with structured output
- - Configure dashboard templates for SLI tracking
- - Verify instrumentation exports metrics correctly
+ - Install observability tools (OpenTelemetry, Pino/structlog)
+ - Configure OTLP exporter endpoint and credentials
+ - Set up structured logging configuration
+ - Verify traces and logs reach the observability backend
+ - Create runbook template directory
  confirmChecklist:
- - Monitoring tools installed and accessible
- - Tracing library integrated and exporting spans
- - Logging outputs structured data
- - Dashboard templates render correctly
- - Instrumentation verified with test traffic
+ - OpenTelemetry SDK installed and configured
+ - OTLP exporter sends data to backend
+ - Structured logging produces valid JSON
+ - Test trace and log entries appear in backend
+ - Runbook directory structure created
  code:
  focus: |
  Implement observability, resilience patterns, and operational
@@ -391,78 +199,150 @@ skills:
  - Alerts fire correctly for SLO breaches
  - On-call team is trained and ready
  - Production readiness review is complete
+ toolReferences:
+ - name: OpenTelemetry
+ url: https://opentelemetry.io/docs/
+ simpleIcon: opentelemetry
+ description:
+ Vendor-neutral observability framework for traces, metrics, and logs
+ useWhen:
+ Instrumenting applications for distributed tracing and observability
+ - name: Pino
+ url: https://getpino.io/
+ simpleIcon: nodedotjs
+ description: Fast, low-overhead structured logging for Node.js
+ useWhen: Adding structured logging to JavaScript applications
+ - name: structlog
+ url: https://www.structlog.org/
+ simpleIcon: python
+ description: Structured logging library for Python
+ useWhen: Adding structured logging to Python applications
+ instructions: |
+ ## Step 1: Define SLIs and SLOs
+
+ Identify what matters to users. For each critical user
+ journey (page load, checkout, API), define an SLI (what to
+ measure) and SLO (target threshold). Calculate error budgets
+ from SLOs.
+
+ ## Step 2: Instrument Application
+
+ Add distributed tracing with OpenTelemetry. Configure the
+ OTLP exporter to send traces to your observability backend.
+ Use auto-instrumentation for common libraries.
+
+ ## Step 3: Add Structured Logging
+
+ Replace free-text logging with structured JSON logs. Use Pino
+ for Node.js or structlog for Python. Always include context
+ fields (userId, orderId, correlationId) for queryability.
+
+ ## Step 4: Create Runbook Template
+
+ For each alert, document symptoms, diagnosis steps,
+ mitigation actions, and escalation criteria. Keep runbooks
+ co-located with the alerting configuration.
+ installScript: |
+ set -e
+ npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
+ npm install @opentelemetry/exporter-trace-otlp-http pino
+ pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp structlog
+ python -c "import opentelemetry; import structlog"
  implementationReference: |
- ## Service Level Concepts
+ ## SLI/SLO Table
+
+ | User Journey | SLI | SLO |
+ |--------------|-----|-----|
+ | Page load | Latency p99 | < 500ms for 99.9% of requests |
+ | Checkout | Success rate | > 99.95% of transactions succeed |
+ | API | Availability | 99.9% uptime (43 min/month budget) |
+
+ ## OpenTelemetry — Node.js
+
+ ```javascript
+ const { NodeSDK } = require('@opentelemetry/sdk-node')
+ const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node')
+ const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http')
+
+ const sdk = new NodeSDK({
+ traceExporter: new OTLPTraceExporter({
+ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces'
+ }),
+ instrumentations: [getNodeAutoInstrumentations()]
+ })
+ sdk.start()
+ ```
 
- ### SLI (Service Level Indicator)
- Quantitative measure of service behavior:
- - Request latency (p50, p95, p99)
- - Error rate (% of failed requests)
- - Availability (% of successful requests)
- - Throughput (requests per second)
+ ## OpenTelemetry — Python
 
- ### SLO (Service Level Objective)
- Target value for an SLI:
- - "99.9% of requests complete in < 200ms"
- - "Error rate < 0.1% over 30 days"
- - "99.95% availability monthly"
+ ```python
+ from opentelemetry import trace
+ from opentelemetry.sdk.trace import TracerProvider
+ from opentelemetry.sdk.trace.export import BatchSpanProcessor
+ from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
 
- ### Error Budget
- Allowed unreliability: 100% - SLO
- - 99.9% SLO = 0.1% error budget
- - ~43 minutes downtime per month
- - Spend on features or reliability
+ provider = TracerProvider()
+ provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
+ trace.set_tracer_provider(provider)
+ ```
+
+ ## Structured Logging — Pino (Node.js)
 
- ## Observability
+ ```javascript
+ const pino = require('pino')
+ const logger = pino({ level: process.env.LOG_LEVEL || 'info' })
 
- ### Three Pillars
- - **Metrics**: Aggregated numeric data (counters, gauges, histograms)
- - **Logs**: Discrete event records with context
- - **Traces**: Request flow across services
+ logger.info({ userId: 123, action: 'checkout' }, 'Processing checkout')
+ logger.error({ err, orderId }, 'Checkout failed')
+ ```
 
- ### Alerting Principles
- - Alert on symptoms, not causes
- - Every alert should be actionable
- - Reduce noise ruthlessly
- - Page only for user-impacting issues
- - Use severity levels appropriately
+ ## Structured Logging — structlog (Python)
 
- ## Incident Response
+ ```python
+ import structlog
+
+ structlog.configure(processors=[
+ structlog.processors.TimeStamper(fmt="iso"),
+ structlog.processors.JSONRenderer()
+ ])
+ logger = structlog.get_logger()
+ logger.info("processing_checkout", user_id=123)
+ ```
 
- ### Incident Lifecycle
- 1. **Detection**: Automated alerts or user reports
- 2. **Triage**: Assess severity and impact
- 3. **Mitigation**: Stop the bleeding first
- 4. **Resolution**: Fix the underlying issue
- 5. **Post-mortem**: Learn and improve
+ ## Runbook Template
 
- ### During an Incident
- - Communicate early and often
- - Focus on mitigation before root cause
- - Document actions in real-time
- - Escalate when needed
- - Update stakeholders regularly
+ ```markdown
+ # Runbook: High Error Rate
 
- ## Post-Mortem Process
+ ## Symptoms
+ - Error rate > 0.1% for 5+ minutes
+
+ ## Diagnosis
+ 1. Check application logs for error patterns
+ 2. Check dependency health endpoints
+ 3. Check recent deployments
+
+ ## Mitigation
+ 1. If recent deploy: Roll back
+ 2. If dependency issue: Enable circuit breaker
+ 3. If load spike: Scale up
+
+ ## Escalation
+ If not resolved in 15 min, escalate to team lead.
+ ```
 
- ### Blameless Culture
- - Focus on systems, not individuals
- - Assume good intentions
- - Ask "how did the system allow this?"
- - Share findings openly
+ ## Verification
 
- ### Post-Mortem Template
- 1. Incident summary
- 2. Timeline of events
- 3. Root cause analysis
- 4. What went well
- 5. What could be improved
- 6. Action items with owners
+ Your reliability setup is working when:
+ - Traces appear in your observability backend
+ - Structured logs contain consistent fields (correlation IDs, user context)
+ - Runbooks exist for known failure modes
+ - Team knows how to respond to common alerts
 
- ## Resilience Patterns
+ ## Common Pitfalls
 
- - **Timeouts**: Don't wait forever
- - **Retries**: With exponential backoff
- - **Circuit breakers**: Fail fast when downstream is unhealthy
- - **Bulkheads**: Isolate failures
- - **Graceful degradation**: Partial functionality over total failure
+ - **Missing environment variables**: OTLP exporters fail silently
+ - **No correlation IDs**: Cannot trace requests across services
+ - **Unstructured logs**: Free-text logs are hard to query
+ - **Alert fatigue**: Too many alerts drown out real issues
+ - **No runbooks**: Alerts fire but responders don't know what to do
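
The "43 min/month budget" in the new SLI/SLO table follows from the error-budget arithmetic described in Step 1 of the added instructions: the budget is the allowed unreliability, (1 - SLO), applied to the measurement window. A minimal sketch of that calculation, assuming a 30-day window (the function name is illustrative, not part of the package):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime for the window: (1 - SLO) of its total minutes."""
    return (1.0 - slo) * window_days * 24 * 60

# 99.9% availability over a 30-day month:
print(round(error_budget_minutes(0.999), 1))  # 43.2
```

Rounded to the nearest minute this matches the table's 43-minute figure; tightening the SLO to 99.95% halves the budget to about 21.6 minutes.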