agentic-team-templates 0.3.0
- package/README.md +280 -0
- package/bin/cli.js +5 -0
- package/package.json +47 -0
- package/src/index.js +521 -0
- package/templates/_shared/code-quality.md +162 -0
- package/templates/_shared/communication.md +114 -0
- package/templates/_shared/core-principles.md +62 -0
- package/templates/_shared/git-workflow.md +165 -0
- package/templates/_shared/security-fundamentals.md +173 -0
- package/templates/blockchain/.cursorrules/defi-patterns.md +520 -0
- package/templates/blockchain/.cursorrules/gas-optimization.md +339 -0
- package/templates/blockchain/.cursorrules/overview.md +130 -0
- package/templates/blockchain/.cursorrules/security.md +318 -0
- package/templates/blockchain/.cursorrules/smart-contracts.md +364 -0
- package/templates/blockchain/.cursorrules/testing.md +415 -0
- package/templates/blockchain/.cursorrules/web3-integration.md +538 -0
- package/templates/blockchain/CLAUDE.md +389 -0
- package/templates/cli-tools/.cursorrules/architecture.md +412 -0
- package/templates/cli-tools/.cursorrules/arguments.md +406 -0
- package/templates/cli-tools/.cursorrules/distribution.md +546 -0
- package/templates/cli-tools/.cursorrules/error-handling.md +455 -0
- package/templates/cli-tools/.cursorrules/overview.md +136 -0
- package/templates/cli-tools/.cursorrules/testing.md +537 -0
- package/templates/cli-tools/.cursorrules/user-experience.md +545 -0
- package/templates/cli-tools/CLAUDE.md +356 -0
- package/templates/data-engineering/.cursorrules/data-modeling.md +367 -0
- package/templates/data-engineering/.cursorrules/data-quality.md +455 -0
- package/templates/data-engineering/.cursorrules/overview.md +85 -0
- package/templates/data-engineering/.cursorrules/performance.md +339 -0
- package/templates/data-engineering/.cursorrules/pipeline-design.md +280 -0
- package/templates/data-engineering/.cursorrules/security.md +460 -0
- package/templates/data-engineering/.cursorrules/testing.md +452 -0
- package/templates/data-engineering/CLAUDE.md +974 -0
- package/templates/devops-sre/.cursorrules/capacity-planning.md +653 -0
- package/templates/devops-sre/.cursorrules/change-management.md +584 -0
- package/templates/devops-sre/.cursorrules/chaos-engineering.md +651 -0
- package/templates/devops-sre/.cursorrules/disaster-recovery.md +641 -0
- package/templates/devops-sre/.cursorrules/incident-management.md +565 -0
- package/templates/devops-sre/.cursorrules/observability.md +714 -0
- package/templates/devops-sre/.cursorrules/overview.md +230 -0
- package/templates/devops-sre/.cursorrules/postmortems.md +588 -0
- package/templates/devops-sre/.cursorrules/runbooks.md +760 -0
- package/templates/devops-sre/.cursorrules/slo-sli.md +617 -0
- package/templates/devops-sre/.cursorrules/toil-reduction.md +567 -0
- package/templates/devops-sre/CLAUDE.md +1007 -0
- package/templates/documentation/.cursorrules/adr.md +277 -0
- package/templates/documentation/.cursorrules/api-documentation.md +411 -0
- package/templates/documentation/.cursorrules/code-comments.md +253 -0
- package/templates/documentation/.cursorrules/maintenance.md +260 -0
- package/templates/documentation/.cursorrules/overview.md +82 -0
- package/templates/documentation/.cursorrules/readme-standards.md +306 -0
- package/templates/documentation/CLAUDE.md +120 -0
- package/templates/fullstack/.cursorrules/api-contracts.md +331 -0
- package/templates/fullstack/.cursorrules/architecture.md +298 -0
- package/templates/fullstack/.cursorrules/overview.md +109 -0
- package/templates/fullstack/.cursorrules/shared-types.md +348 -0
- package/templates/fullstack/.cursorrules/testing.md +386 -0
- package/templates/fullstack/CLAUDE.md +349 -0
- package/templates/ml-ai/.cursorrules/data-engineering.md +483 -0
- package/templates/ml-ai/.cursorrules/deployment.md +601 -0
- package/templates/ml-ai/.cursorrules/model-development.md +538 -0
- package/templates/ml-ai/.cursorrules/monitoring.md +658 -0
- package/templates/ml-ai/.cursorrules/overview.md +131 -0
- package/templates/ml-ai/.cursorrules/security.md +637 -0
- package/templates/ml-ai/.cursorrules/testing.md +678 -0
- package/templates/ml-ai/CLAUDE.md +1136 -0
- package/templates/mobile/.cursorrules/navigation.md +246 -0
- package/templates/mobile/.cursorrules/offline-first.md +302 -0
- package/templates/mobile/.cursorrules/overview.md +71 -0
- package/templates/mobile/.cursorrules/performance.md +345 -0
- package/templates/mobile/.cursorrules/testing.md +339 -0
- package/templates/mobile/CLAUDE.md +233 -0
- package/templates/platform-engineering/.cursorrules/ci-cd.md +778 -0
- package/templates/platform-engineering/.cursorrules/developer-experience.md +632 -0
- package/templates/platform-engineering/.cursorrules/infrastructure-as-code.md +600 -0
- package/templates/platform-engineering/.cursorrules/kubernetes.md +710 -0
- package/templates/platform-engineering/.cursorrules/observability.md +747 -0
- package/templates/platform-engineering/.cursorrules/overview.md +215 -0
- package/templates/platform-engineering/.cursorrules/security.md +855 -0
- package/templates/platform-engineering/.cursorrules/testing.md +878 -0
- package/templates/platform-engineering/CLAUDE.md +850 -0
- package/templates/utility-agent/.cursorrules/action-control.md +284 -0
- package/templates/utility-agent/.cursorrules/context-management.md +186 -0
- package/templates/utility-agent/.cursorrules/hallucination-prevention.md +253 -0
- package/templates/utility-agent/.cursorrules/overview.md +78 -0
- package/templates/utility-agent/.cursorrules/token-optimization.md +369 -0
- package/templates/utility-agent/CLAUDE.md +513 -0
- package/templates/web-backend/.cursorrules/api-design.md +255 -0
- package/templates/web-backend/.cursorrules/authentication.md +309 -0
- package/templates/web-backend/.cursorrules/database-patterns.md +298 -0
- package/templates/web-backend/.cursorrules/error-handling.md +366 -0
- package/templates/web-backend/.cursorrules/overview.md +69 -0
- package/templates/web-backend/.cursorrules/security.md +358 -0
- package/templates/web-backend/.cursorrules/testing.md +395 -0
- package/templates/web-backend/CLAUDE.md +366 -0
- package/templates/web-frontend/.cursorrules/accessibility.md +296 -0
- package/templates/web-frontend/.cursorrules/component-patterns.md +204 -0
- package/templates/web-frontend/.cursorrules/overview.md +72 -0
- package/templates/web-frontend/.cursorrules/performance.md +325 -0
- package/templates/web-frontend/.cursorrules/state-management.md +227 -0
- package/templates/web-frontend/.cursorrules/styling.md +271 -0
- package/templates/web-frontend/.cursorrules/testing.md +311 -0
- package/templates/web-frontend/CLAUDE.md +399 -0
@@ -0,0 +1,230 @@
# DevOps/SRE Overview

Staff-level guidelines for Site Reliability Engineering and operational excellence.

## Scope

This template applies to:

- Site Reliability Engineering (SRE) practices and culture
- Production operations and 24/7 system reliability
- Incident management, response, and postmortems
- Monitoring, alerting, and observability strategies
- SLO/SLI definition and error budget management
- Capacity planning and performance engineering
- Disaster recovery and business continuity
- Toil reduction and operational automation
- Change management and safe deployments
- Chaos engineering and resilience testing

## Core Principles

### 1. Reliability is a Feature

Users don't distinguish between "the app is slow" and "the app is broken." Reliability directly impacts user experience, trust, and business outcomes.

- Treat reliability work as product work
- Measure reliability from the user's perspective
- Invest in reliability proportional to business impact
- Make reliability visible to stakeholders

### 2. Error Budgets Over Perfection

100% reliability is the wrong target. Perfect reliability means zero innovation.

- Define explicit reliability targets (SLOs)
- Use error budgets to balance reliability and velocity
- When budget is healthy, take risks and move fast
- When budget is low, prioritize reliability work
- Error budget is a currency, not a constraint

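The error budget arithmetic behind these bullets can be sketched as follows. This is a minimal illustration, not part of the template: the `ErrorBudget` class, the `release_posture` helper, the posture names, and the 25% low-water mark are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (illustrative names)."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests served in the window
    failed_requests: int  # requests that violated the SLO in the window

    def __post_init__(self) -> None:
        # A 100% target leaves no budget to spend, so it is rejected here.
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 = untouched budget, 0.0 = fully spent, negative = overspent.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


def release_posture(budget: ErrorBudget, low_water_mark: float = 0.25) -> str:
    """Map remaining budget to a posture. The 25% threshold is an assumption."""
    if budget.remaining_fraction <= 0:
        return "freeze"
    if budget.remaining_fraction < low_water_mark:
        return "prioritize-reliability"
    return "ship"
```

For example, a 99.9% SLO over one million requests permits roughly 1,000 failures; with 250 failures recorded, about three quarters of the budget remains and `release_posture` returns `"ship"`.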
### 3. Automate Toil Away

Toil is the kind of work tied to running a service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as the service grows.

- If you're doing it manually more than twice, automate it
- Measure and track toil as a metric
- Allocate engineering time specifically for toil reduction
- Target < 50% of SRE time spent on toil

### 4. Observability First

You can't fix what you can't measure. You can't improve what you can't observe.

- Instrument everything from day one
- Logs, metrics, and traces are not optional
- Design systems to be debuggable
- Correlate signals across the stack

### 5. Blameless Culture

Incidents are learning opportunities, not blame games. Human error is a symptom of system problems.

- Focus on systems, not individuals
- Ask "how did the system allow this?" not "who caused this?"
- Share postmortems widely
- Celebrate learning from failures

## Project Structure

```
sre/
├── monitoring/                 # Observability configuration
│   ├── prometheus/             # Prometheus rules and alerts
│   │   ├── rules/
│   │   └── alerts/
│   ├── grafana/                # Dashboard definitions
│   │   └── dashboards/
│   ├── loki/                   # Log aggregation config
│   └── alertmanager/           # Alert routing
│
├── runbooks/                   # Operational runbooks
│   ├── services/               # Per-service runbooks
│   │   ├── api-server.md
│   │   └── database.md
│   ├── alerts/                 # Per-alert runbooks
│   └── procedures/             # General procedures
│       ├── incident-response.md
│       └── on-call-handoff.md
│
├── slos/                       # SLO definitions
│   ├── api.yaml
│   ├── frontend.yaml
│   └── error-budget-policy.yaml
│
├── incident-response/          # Incident management
│   ├── templates/
│   │   ├── incident-doc.md
│   │   └── postmortem.md
│   ├── severity-definitions.yaml
│   └── escalation-policy.yaml
│
├── chaos/                      # Chaos engineering
│   ├── experiments/
│   └── game-days/
│
├── load-testing/               # Performance testing
│   ├── k6/
│   └── scenarios/
│
├── disaster-recovery/          # DR documentation
│   ├── runbooks/
│   ├── backup-policies/
│   └── failover-procedures/
│
└── docs/                       # SRE documentation
    ├── on-call-guide.md
    ├── escalation-policy.md
    └── service-catalog.md
```

## Technology Stack

| Layer | Primary | Alternatives |
|-------|---------|--------------|
| Metrics Collection | Prometheus | Datadog, InfluxDB, VictoriaMetrics |
| Metrics Visualization | Grafana | Datadog, Kibana, Chronograf |
| Log Aggregation | Loki | Elasticsearch, Splunk, Datadog Logs |
| Distributed Tracing | Jaeger, Tempo | Zipkin, X-Ray, Honeycomb |
| Alerting | Alertmanager | PagerDuty, Datadog, Opsgenie |
| Incident Management | PagerDuty, Incident.io | Opsgenie, Squadcast, Rootly |
| Status Pages | Statuspage.io | Instatus, Cachet, Better Uptime |
| On-Call Management | PagerDuty | Opsgenie, VictorOps, Squadcast |
| Chaos Engineering | Chaos Mesh, Litmus | Gremlin, AWS FIS, Pumba |
| Load Testing | k6 | Locust, Gatling, JMeter |
| Synthetic Monitoring | Grafana Synthetic Monitoring | Datadog Synthetics, Pingdom |

## Staff Engineer Responsibilities

### Technical Leadership

- Define and evolve organization-wide reliability standards
- Establish SLO frameworks and error budget policies
- Design observability architectures that scale
- Make build vs. buy decisions for SRE tooling
- Drive cultural shift toward reliability ownership

### Cross-Team Enablement

- Create reusable monitoring and alerting patterns
- Build self-service observability for development teams
- Establish incident response procedures and training
- Design runbook templates and standards
- Lead chaos engineering and game day initiatives

### Operational Excellence

- Own and improve the incident management process
- Drive postmortem quality and action item follow-through
- Reduce mean time to detection (MTTD) and recovery (MTTR)
- Eliminate recurring incidents through systemic fixes
- Balance on-call health with operational coverage

### Strategic Thinking

- Align reliability investments with business priorities
- Plan capacity for multi-year growth projections
- Design disaster recovery strategies
- Evaluate emerging SRE tools and practices
- Manage technical debt in operational systems

## Key Metrics

### Reliability Metrics

- **Availability**: Percentage of successful requests
- **Latency**: Response time at various percentiles (p50, p95, p99)
- **Error Rate**: Percentage of failed requests
- **Throughput**: Requests processed per unit time

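As a rough sketch of how the four reliability metrics above relate to raw request data, the following computes them from an in-memory sample. The `Request` tuple, the `summarize` function, and the choice of the nearest-rank percentile method are assumptions for illustration; a production setup would more likely derive these from a metrics backend.

```python
import math
from typing import NamedTuple, Sequence


class Request(NamedTuple):
    """One observed request (hypothetical record shape)."""

    latency_ms: float
    ok: bool


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests: Sequence[Request], window_seconds: float) -> dict:
    """Compute availability, error rate, throughput, and latency percentiles."""
    total = len(requests)
    if total == 0 or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    failed = sum(1 for r in requests if not r.ok)
    latencies = [r.latency_ms for r in requests]
    return {
        "availability_pct": 100 * (total - failed) / total,
        "error_rate_pct": 100 * failed / total,
        "throughput_rps": total / window_seconds,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }
```

Availability and error rate are complements here because both are counted over the same set of requests; that only holds when "failed" and "successful" are defined against the same success criterion.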
### Operational Metrics

- **MTTD**: Mean Time to Detect incidents
- **MTTR**: Mean Time to Resolve incidents
- **MTBF**: Mean Time Between Failures
- **Change Failure Rate**: Percentage of changes causing incidents

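One possible way to derive these operational metrics from incident records is sketched below. The `Incident` fields are hypothetical, and the exact definitions are assumptions that teams often vary: here MTTD runs from incident start to detection, MTTR from start to resolution, and MTBF is the mean healthy gap between one incident's resolution and the next incident's start.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Incident:
    """Timestamps for one incident (hypothetical record shape)."""

    started: datetime
    detected: datetime
    resolved: datetime


def _mean_delta(deltas: Iterable[timedelta]) -> timedelta:
    # statistics.mean raises StatisticsError when there are no data points.
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mttd(incidents: Sequence[Incident]) -> timedelta:
    return _mean_delta(i.detected - i.started for i in incidents)


def mttr(incidents: Sequence[Incident]) -> timedelta:
    return _mean_delta(i.resolved - i.started for i in incidents)


def mtbf(incidents: Sequence[Incident]) -> timedelta:
    # Needs at least two incidents to have one gap to average.
    ordered = sorted(incidents, key=lambda i: i.started)
    return _mean_delta(
        nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])
    )


def change_failure_rate_pct(changes_total: int, changes_causing_incidents: int) -> float:
    if changes_total <= 0:
        raise ValueError("changes_total must be positive")
    return 100 * changes_causing_incidents / changes_total
```

Whichever definitions are chosen, applying them consistently matters more than the specific choice, since these numbers are mainly useful as trends over time.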
### On-Call Health Metrics

- **Pages per shift**: Target < 10
- **Pages per night**: Target < 2
- **False positive rate**: Target < 10%
- **Alert noise ratio**: Actionable vs total alerts

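The targets above could be checked automatically at the end of each shift. The sketch below is one hedged way to do that: the `ShiftStats` fields and the violation labels are invented for the example, and the alert noise ratio is reported without a threshold because the list above does not state one.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ShiftStats:
    """Hypothetical per-shift counters; field names are illustrative."""

    pages: int
    night_pages: int
    false_positive_pages: int
    total_alerts: int
    actionable_alerts: int


def false_positive_rate_pct(stats: ShiftStats) -> float:
    if stats.pages == 0:
        return 0.0
    return 100 * stats.false_positive_pages / stats.pages


def actionable_ratio(stats: ShiftStats) -> float:
    """Share of all alerts that were actionable (1.0 when nothing fired)."""
    if stats.total_alerts == 0:
        return 1.0
    return stats.actionable_alerts / stats.total_alerts


def target_violations(stats: ShiftStats) -> List[str]:
    """Return the names of the targets this shift missed."""
    violations = []
    if stats.pages >= 10:                      # target: < 10 pages per shift
        violations.append("pages_per_shift")
    if stats.night_pages >= 2:                 # target: < 2 pages per night
        violations.append("pages_per_night")
    if false_positive_rate_pct(stats) >= 10:   # target: < 10% false positives
        violations.append("false_positive_rate")
    return violations
```

A shift with 8 pages, 1 at night, and 2 false positives would pass the paging targets but miss the false positive target, since 2 of 8 pages is 25%.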
### Toil Metrics

- **Toil percentage**: Time spent on toil vs engineering work
- **Manual intervention rate**: Human touchpoints per deployment
- **Automation coverage**: Percentage of runbooks with automation

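The three toil metrics reduce to simple ratios. The helpers below are a minimal sketch under the assumption that hours, touchpoints, and runbook counts are already being tracked somewhere; all function and parameter names are hypothetical.

```python
def toil_percentage(toil_hours: float, engineering_hours: float) -> float:
    """Toil as a percentage of all tracked time (toil plus engineering work)."""
    total = toil_hours + engineering_hours
    if total <= 0:
        raise ValueError("no tracked time")
    return 100 * toil_hours / total


def manual_intervention_rate(manual_touchpoints: int, deployments: int) -> float:
    """Average number of human touchpoints per deployment."""
    if deployments <= 0:
        raise ValueError("deployments must be positive")
    return manual_touchpoints / deployments


def automation_coverage_pct(runbooks_total: int, runbooks_automated: int) -> float:
    """Percentage of runbooks that have automation attached."""
    if runbooks_total <= 0:
        raise ValueError("runbooks_total must be positive")
    return 100 * runbooks_automated / runbooks_total
```

With 12 hours of toil and 28 hours of engineering work in a week, `toil_percentage(12, 28)` gives 30.0, which is inside the < 50% target stated under "Automate Toil Away".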
## Anti-Patterns to Avoid

### Alert on Everything

❌ **Wrong**: Create alerts for every possible metric "just in case"

✅ **Right**: Every alert must be actionable, urgent, and relevant. If the on-call doesn't need to do something immediately, it shouldn't page.

### SLO as Ceiling

❌ **Wrong**: Treat SLOs as minimum acceptable reliability

✅ **Right**: SLOs define the target; error budget is the tool for balancing reliability with velocity

### Postmortem Graveyard

❌ **Wrong**: Write postmortems that go into a folder and are never read

✅ **Right**: Track action items, measure recurring incidents, share learnings organization-wide

### Hero Culture

❌ **Wrong**: Rely on specific engineers who "know the system" to fix everything

✅ **Right**: Document everything, share knowledge, ensure any on-call can handle any incident

### Manual Everything

❌ **Wrong**: Keep critical procedures as tribal knowledge

✅ **Right**: Automate recovery, create runbooks, build self-healing systems