@patricio0312rev/skillset 0.1.0
- package/CHANGELOG.md +29 -0
- package/LICENSE +21 -0
- package/README.md +176 -0
- package/bin/cli.js +37 -0
- package/package.json +55 -0
- package/src/commands/init.js +301 -0
- package/src/index.js +168 -0
- package/src/lib/config.js +200 -0
- package/src/lib/generator.js +166 -0
- package/src/utils/display.js +95 -0
- package/src/utils/readme.js +196 -0
- package/src/utils/tool-specific.js +233 -0
- package/templates/ai-engineering/agent-orchestration-planner/SKILL.md +266 -0
- package/templates/ai-engineering/cost-latency-optimizer/SKILL.md +270 -0
- package/templates/ai-engineering/doc-to-vector-dataset-generator/SKILL.md +239 -0
- package/templates/ai-engineering/evaluation-harness/SKILL.md +219 -0
- package/templates/ai-engineering/guardrails-safety-filter-builder/SKILL.md +226 -0
- package/templates/ai-engineering/llm-debugger/SKILL.md +283 -0
- package/templates/ai-engineering/prompt-regression-tester/SKILL.md +216 -0
- package/templates/ai-engineering/prompt-template-builder/SKILL.md +393 -0
- package/templates/ai-engineering/rag-pipeline-builder/SKILL.md +244 -0
- package/templates/ai-engineering/tool-function-schema-designer/SKILL.md +219 -0
- package/templates/architecture/adr-writer/SKILL.md +250 -0
- package/templates/architecture/api-versioning-deprecation-planner/SKILL.md +331 -0
- package/templates/architecture/domain-model-boundaries-mapper/SKILL.md +300 -0
- package/templates/architecture/migration-planner/SKILL.md +376 -0
- package/templates/architecture/performance-budget-setter/SKILL.md +318 -0
- package/templates/architecture/reliability-strategy-builder/SKILL.md +286 -0
- package/templates/architecture/rfc-generator/SKILL.md +362 -0
- package/templates/architecture/scalability-playbook/SKILL.md +279 -0
- package/templates/architecture/system-design-generator/SKILL.md +339 -0
- package/templates/architecture/tech-debt-prioritizer/SKILL.md +329 -0
- package/templates/backend/api-contract-normalizer/SKILL.md +487 -0
- package/templates/backend/api-endpoint-generator/SKILL.md +415 -0
- package/templates/backend/auth-module-builder/SKILL.md +99 -0
- package/templates/backend/background-jobs-designer/SKILL.md +166 -0
- package/templates/backend/caching-strategist/SKILL.md +190 -0
- package/templates/backend/error-handling-standardizer/SKILL.md +174 -0
- package/templates/backend/rate-limiting-abuse-protection/SKILL.md +147 -0
- package/templates/backend/rbac-permissions-builder/SKILL.md +158 -0
- package/templates/backend/service-layer-extractor/SKILL.md +269 -0
- package/templates/backend/webhook-receiver-hardener/SKILL.md +211 -0
- package/templates/ci-cd/artifact-sbom-publisher/SKILL.md +236 -0
- package/templates/ci-cd/caching-strategy-optimizer/SKILL.md +195 -0
- package/templates/ci-cd/deployment-checklist-generator/SKILL.md +381 -0
- package/templates/ci-cd/github-actions-pipeline-creator/SKILL.md +348 -0
- package/templates/ci-cd/monorepo-ci-optimizer/SKILL.md +298 -0
- package/templates/ci-cd/preview-environments-builder/SKILL.md +187 -0
- package/templates/ci-cd/quality-gates-enforcer/SKILL.md +342 -0
- package/templates/ci-cd/release-automation-builder/SKILL.md +281 -0
- package/templates/ci-cd/rollback-workflow-builder/SKILL.md +372 -0
- package/templates/ci-cd/secrets-env-manager/SKILL.md +242 -0
- package/templates/db-management/backup-restore-runbook-generator/SKILL.md +505 -0
- package/templates/db-management/data-integrity-auditor/SKILL.md +505 -0
- package/templates/db-management/data-retention-archiving-planner/SKILL.md +430 -0
- package/templates/db-management/data-seeding-fixtures-builder/SKILL.md +375 -0
- package/templates/db-management/db-performance-watchlist/SKILL.md +425 -0
- package/templates/db-management/etl-sync-job-builder/SKILL.md +457 -0
- package/templates/db-management/multi-tenant-safety-checker/SKILL.md +398 -0
- package/templates/db-management/prisma-migration-assistant/SKILL.md +379 -0
- package/templates/db-management/schema-consistency-checker/SKILL.md +440 -0
- package/templates/db-management/sql-query-optimizer/SKILL.md +324 -0
- package/templates/foundation/changelog-writer/SKILL.md +431 -0
- package/templates/foundation/code-formatter-installer/SKILL.md +320 -0
- package/templates/foundation/codebase-summarizer/SKILL.md +360 -0
- package/templates/foundation/dependency-doctor/SKILL.md +163 -0
- package/templates/foundation/dev-environment-bootstrapper/SKILL.md +259 -0
- package/templates/foundation/dev-onboarding-builder/SKILL.md +556 -0
- package/templates/foundation/docs-starter-kit/SKILL.md +574 -0
- package/templates/foundation/explaining-code/SKILL.md +13 -0
- package/templates/foundation/git-hygiene-enforcer/SKILL.md +455 -0
- package/templates/foundation/project-scaffolder/SKILL.md +65 -0
- package/templates/foundation/project-scaffolder/references/templates.md +126 -0
- package/templates/foundation/repo-structure-linter/SKILL.md +0 -0
- package/templates/foundation/repo-structure-linter/references/conventions.md +98 -0
- package/templates/frontend/animation-micro-interaction-pack/SKILL.md +41 -0
- package/templates/frontend/component-scaffold-generator/SKILL.md +562 -0
- package/templates/frontend/design-to-component-translator/SKILL.md +547 -0
- package/templates/frontend/form-wizard-builder/SKILL.md +553 -0
- package/templates/frontend/frontend-refactor-planner/SKILL.md +37 -0
- package/templates/frontend/i18n-frontend-implementer/SKILL.md +44 -0
- package/templates/frontend/modal-drawer-system/SKILL.md +377 -0
- package/templates/frontend/page-layout-builder/SKILL.md +630 -0
- package/templates/frontend/state-ux-flow-builder/SKILL.md +23 -0
- package/templates/frontend/table-builder/SKILL.md +350 -0
- package/templates/performance/alerting-dashboard-builder/SKILL.md +162 -0
- package/templates/performance/backend-latency-profiler-helper/SKILL.md +108 -0
- package/templates/performance/caching-cdn-strategy-planner/SKILL.md +150 -0
- package/templates/performance/capacity-planning-helper/SKILL.md +242 -0
- package/templates/performance/core-web-vitals-tuner/SKILL.md +126 -0
- package/templates/performance/incident-runbook-generator/SKILL.md +162 -0
- package/templates/performance/load-test-scenario-builder/SKILL.md +256 -0
- package/templates/performance/observability-setup/SKILL.md +232 -0
- package/templates/performance/postmortem-writer/SKILL.md +203 -0
- package/templates/performance/structured-logging-standardizer/SKILL.md +122 -0
- package/templates/security/auth-security-reviewer/SKILL.md +428 -0
- package/templates/security/dependency-vulnerability-triage/SKILL.md +495 -0
- package/templates/security/input-validation-sanitization-auditor/SKILL.md +76 -0
- package/templates/security/pii-redaction-logging-policy-builder/SKILL.md +65 -0
- package/templates/security/rbac-policy-tester/SKILL.md +80 -0
- package/templates/security/secrets-scanner/SKILL.md +462 -0
- package/templates/security/secure-headers-csp-builder/SKILL.md +404 -0
- package/templates/security/security-incident-playbook-generator/SKILL.md +76 -0
- package/templates/security/security-pr-checklist-skill/SKILL.md +62 -0
- package/templates/security/threat-model-generator/SKILL.md +394 -0
- package/templates/testing/contract-testing-builder/SKILL.md +492 -0
- package/templates/testing/coverage-strategist/SKILL.md +436 -0
- package/templates/testing/e2e-test-builder/SKILL.md +382 -0
- package/templates/testing/flaky-test-detective/SKILL.md +416 -0
- package/templates/testing/integration-test-builder/SKILL.md +525 -0
- package/templates/testing/mocking-assistant/SKILL.md +383 -0
- package/templates/testing/snapshot-test-refactorer/SKILL.md +375 -0
- package/templates/testing/test-data-factory-builder/SKILL.md +449 -0
- package/templates/testing/test-reporting-triage-skill/SKILL.md +469 -0
- package/templates/testing/unit-test-generator/SKILL.md +548 -0
@@ -0,0 +1,286 @@
---
name: reliability-strategy-builder
description: Implements reliability patterns including circuit breakers, retries, fallbacks, bulkheads, and SLO definitions. Provides failure mode analysis and incident response plans. Use for "SRE", "reliability", "resilience", or "failure handling".
---

# Reliability Strategy Builder

Build resilient systems with proper failure handling and SLOs.

## Reliability Patterns

### 1. Circuit Breaker

Prevent cascading failures by stopping requests to failing services.

```typescript
class CircuitBreaker {
  private state: "closed" | "open" | "half-open" = "closed";
  private failureCount = 0;
  private lastFailureTime?: Date;

  // Consecutive failures that trip the breaker, and how long it stays open
  // before a trial request is allowed through
  private readonly failureThreshold = 5;
  private readonly resetTimeoutMs = 60000; // 1 minute

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.shouldAttemptReset()) {
        // Let one trial request probe whether the service has recovered
        this.state = "half-open";
      } else {
        // Fail fast without calling the failing service
        throw new Error("Circuit breaker is OPEN");
      }
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failureCount = 0;
    this.state = "closed";
  }

  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = new Date();

    // A failed trial request in half-open reopens the circuit immediately
    if (this.state === "half-open" || this.failureCount >= this.failureThreshold) {
      this.state = "open";
    }
  }

  private shouldAttemptReset(): boolean {
    if (!this.lastFailureTime) return false;
    const elapsed = Date.now() - this.lastFailureTime.getTime();
    return elapsed > this.resetTimeoutMs;
  }
}
```

### 2. Retry with Backoff

Handle transient failures with exponential backoff.

```typescript
// Resolves after `ms` milliseconds; the retry loop below relies on it
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseDelay = 1000
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Out of retries: surface the last error to the caller
      if (attempt === maxRetries) throw error;

      // Exponential backoff: 1s, 2s, 4s
      const delay = baseDelay * Math.pow(2, attempt);
      await sleep(delay);
    }
  }
  // Reached only when maxRetries is negative, so the loop never runs
  throw new Error("Max retries exceeded");
}
```
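
When many clients retry on the same fixed schedule, their retries can arrive in synchronized bursts. One common remedy is to randomize each delay ("jitter"). The sketch below is a hypothetical helper, not part of the original skill: `backoffWithJitter`, `maxDelayMs`, and the injectable `random` parameter are names assumed here for illustration.

```typescript
// Hypothetical helper: exponential backoff with "full jitter".
// `random` is injectable so the result can be made deterministic in tests.
function backoffWithJitter(
  attempt: number,
  baseDelayMs = 1000,
  maxDelayMs = 30000,
  random: () => number = Math.random
): number {
  // Exponential ceiling: baseDelayMs * 2^attempt, capped at maxDelayMs
  const ceiling = Math.min(maxDelayMs, baseDelayMs * Math.pow(2, attempt));
  // Pick a delay uniformly from [0, ceiling)
  return Math.floor(random() * ceiling);
}
```

In a retry loop like `retryWithBackoff`, the returned value could stand in for the fixed `baseDelay * Math.pow(2, attempt)` delay. The cap keeps late attempts from waiting indefinitely long.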

### 3. Fallback Pattern

Provide degraded functionality when the primary source fails.

```typescript
// Assumes `primaryDb`, `cache`, `logger`, and the `User` type are defined
// elsewhere in the application
async function getUserWithFallback(userId: string): Promise<User> {
  try {
    // Try primary database
    return await primaryDb.users.findById(userId);
  } catch (error) {
    logger.warn("Primary DB failed, using cache");

    // Fallback to cache; a cache failure must not block the final fallback
    try {
      const cached = await cache.get(`user:${userId}`);
      if (cached) return cached;
    } catch (cacheError) {
      logger.warn("Cache lookup failed, returning minimal user");
    }

    // Final fallback: return minimal user object
    return {
      id: userId,
      name: "Unknown User",
      email: "unavailable",
    };
  }
}
```
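
The primary, cache, default ordering above can be generalized into a small helper that tries a list of sources in turn. This is only a sketch under stated assumptions: `firstAvailable` and `FallbackSource` are hypothetical names, and a source signals "no data" by resolving to `undefined`.

```typescript
// Hypothetical helper: a source either resolves to a value, resolves to
// undefined (no data), or rejects (source unavailable).
type FallbackSource<T> = () => Promise<T | undefined>;

async function firstAvailable<T>(
  sources: FallbackSource<T>[],
  lastResort: T
): Promise<T> {
  for (const source of sources) {
    try {
      const value = await source();
      if (value !== undefined) return value;
    } catch {
      // This source failed; fall through to the next one
    }
  }
  // Every source failed or had no data: return the degraded default
  return lastResort;
}
```

Sources are tried strictly in order, so the cheapest or freshest source should come first. A production version would likely log each failure rather than swallow it silently.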

### 4. Bulkhead Pattern

Isolate failures to prevent resource exhaustion.

```typescript
// Caps concurrent operations per priority class (not OS threads).
// Assumes a `Semaphore` class with `acquire(): Promise<void>` and
// `release(): void` is available elsewhere.
class ThreadPool {
  private pools = new Map<string, Semaphore>();

  constructor() {
    // Separate pools for different operations
    this.pools.set("critical", new Semaphore(100));
    this.pools.set("standard", new Semaphore(50));
    this.pools.set("background", new Semaphore(10));
  }

  async execute<T>(priority: string, operation: () => Promise<T>): Promise<T> {
    const pool = this.pools.get(priority);
    if (!pool) throw new Error(`Unknown priority: ${priority}`);

    // Wait for a free slot in this priority's pool
    await pool.acquire();

    try {
      return await operation();
    } finally {
      // Always free the slot, even if the operation throws
      pool.release();
    }
  }
}
```
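
The bulkhead example relies on a `Semaphore` that is not defined in this file. A minimal counting semaphore that would satisfy it might look like the sketch below. It is illustrative only, assuming a single-threaded event loop and FIFO waiters; it is not the implementation the skill's author had in mind.

```typescript
// Minimal counting semaphore (illustrative sketch)
class Semaphore {
  private available: number;
  private waiters: Array<() => void> = [];

  constructor(permits: number) {
    this.available = permits;
  }

  // Resolves once a permit is available
  acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return Promise.resolve();
    }
    // No permit free: queue until release() hands one over
    return new Promise<void>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  // Hands the permit to the oldest waiter, or returns it to the pool
  release(): void {
    const next = this.waiters.shift();
    if (next) {
      next();
    } else {
      this.available++;
    }
  }
}
```

Passing the permit straight to the next waiter, instead of incrementing the counter first, keeps a newly arriving caller from jumping ahead of the queue.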

## SLO Definitions

### SLO Template

```yaml
service: user-api
slos:
  - name: Availability
    description: API should be available for successful requests
    target: 99.9%
    measurement:
      type: ratio
      success: status_code < 500
      total: all_requests
    window: 30 days

  - name: Latency
    description: 95% of requests complete within 500ms
    target: 95%
    measurement:
      type: percentile
      metric: request_duration_ms
      threshold: 500
      percentile: 95
    window: 7 days

  - name: Error Rate
    description: Less than 1% of requests result in errors
    target: 99%
    measurement:
      type: ratio
      success: status_code < 400 OR status_code IN [401, 403, 404]
      total: all_requests
    window: 24 hours
```

### Error Budget

```
Error Budget = 100% - SLO

Example:
SLO: 99.9% availability
Error Budget: 0.1% = 43.2 minutes of downtime allowed per 30-day month
```
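
The arithmetic behind the 43.2-minute figure can be sketched as follows: a 30-day window has 30 × 24 × 60 = 43,200 minutes, and 0.1% of that is 43.2. The function names here are hypothetical, and `budgetRemaining` assumes a request-based (ratio) SLO like the ones in the template.

```typescript
// Allowed downtime in minutes for a time-based SLO over a window
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return ((100 - sloPercent) / 100) * windowMinutes;
}

// Fraction of a request-based error budget still unspent:
// 1 = untouched, 0 = fully spent, negative = overspent
function budgetRemaining(
  sloPercent: number,
  totalRequests: number,
  failedRequests: number
): number {
  const allowedFailures = ((100 - sloPercent) / 100) * totalRequests;
  if (allowedFailures === 0) return failedRequests === 0 ? 1 : -Infinity;
  return 1 - failedRequests / allowedFailures;
}
```

For example, a 99.9% SLO over one million requests allows about 1,000 failures, so 250 failed requests leave roughly three quarters of the budget.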

## Failure Mode Analysis

```markdown
| Component   | Failure Mode | Impact | Probability | Detection               | Mitigation                     |
| ----------- | ------------ | ------ | ----------- | ----------------------- | ------------------------------ |
| Database    | Unresponsive | HIGH   | Medium      | Health checks every 10s | Circuit breaker, read replicas |
| API Gateway | Overload     | HIGH   | Low         | Request queue depth     | Rate limiting, auto-scaling    |
| Cache       | Eviction     | MEDIUM | High        | Cache hit rate          | Fallback to DB, larger cache   |
| Queue       | Backed up    | LOW    | Medium      | Queue depth metric      | Add workers, DLQ               |
```

## Reliability Checklist

### Infrastructure

- [ ] Load balancer with health checks
- [ ] Multiple availability zones
- [ ] Auto-scaling configured
- [ ] Database replication
- [ ] Regular backups (tested!)

### Application

- [ ] Circuit breakers on external calls
- [ ] Retry logic with backoff
- [ ] Timeouts on all I/O
- [ ] Fallback mechanisms
- [ ] Graceful degradation
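
The checklist item "Timeouts on all I/O" has no accompanying example in this file. One way to bound a promise-based call is to race it against a timer, as in the sketch below. `withTimeout` is a hypothetical helper name; note that it stops waiting but does not cancel the underlying operation.

```typescript
// Hypothetical helper: rejects if `operation` has not settled within `ms`
async function withTimeout<T>(operation: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  try {
    // Whichever promise settles first decides the outcome
    return await Promise.race([operation, timeout]);
  } finally {
    // Clear the timer so it does not keep the process alive
    clearTimeout(timer);
  }
}
```

Because the slow operation keeps running after the timeout fires, APIs that accept a cancellation signal are usually a better fit where they exist.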

### Monitoring

- [ ] SLO dashboard
- [ ] Error budgets tracked
- [ ] Alerting on SLO violations
- [ ] Latency percentiles (p50, p95, p99)
- [ ] Dependency health checks
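
To make the latency percentiles concrete, the sketch below computes p50, p95, or p99 from raw samples using the nearest-rank definition. This is an assumption for illustration: monitoring systems often use histogram buckets or interpolation instead, and `percentile` is a hypothetical helper name.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("No samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  // Copy before sorting so the caller's array is left untouched
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to keep the rank free of rounding drift
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}
```

With only a handful of samples, p95 and p99 collapse onto the maximum, which is one reason tail percentiles need a reasonably large window to be meaningful.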

### Operations

- [ ] Incident response runbook
- [ ] On-call rotation
- [ ] Postmortem template
- [ ] Disaster recovery plan
- [ ] Chaos engineering tests

## Incident Response Plan

### Severity Levels

```
SEV1 (Critical): Complete service outage, data loss
- Response time: <15 minutes
- Page on-call immediately

SEV2 (High): Partial outage, degraded performance
- Response time: <1 hour
- Alert on-call

SEV3 (Medium): Minor issues, workarounds available
- Response time: <4 hours
- Create ticket

SEV4 (Low): Cosmetic issues, no user impact
- Response time: Next business day
- Backlog
```
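
If the severity levels need to drive tooling such as paging rules or overdue checks, they can be encoded as a typed lookup table. The sketch below mirrors the values listed above. The type and field names are assumptions, and "next business day" is represented as `null` because it is not a fixed number of minutes.

```typescript
type Severity = "SEV1" | "SEV2" | "SEV3" | "SEV4";

interface SeverityPolicy {
  description: string;
  // Response-time limit in minutes; null means "next business day"
  responseMinutes: number | null;
  action: "page" | "alert" | "ticket" | "backlog";
}

const SEVERITY_POLICY: Record<Severity, SeverityPolicy> = {
  SEV1: { description: "Complete service outage, data loss", responseMinutes: 15, action: "page" },
  SEV2: { description: "Partial outage, degraded performance", responseMinutes: 60, action: "alert" },
  SEV3: { description: "Minor issues, workarounds available", responseMinutes: 240, action: "ticket" },
  SEV4: { description: "Cosmetic issues, no user impact", responseMinutes: null, action: "backlog" },
};

// True once the response-time limit for the severity has been reached
function isResponseOverdue(severity: Severity, minutesSinceReport: number): boolean {
  const limit = SEVERITY_POLICY[severity].responseMinutes;
  return limit !== null && minutesSinceReport >= limit;
}
```

Using `Record<Severity, SeverityPolicy>` makes the compiler flag any severity level that lacks a policy entry.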

### Incident Response Steps

1. **Acknowledge**: Confirm receipt within SLA
2. **Assess**: Determine severity and impact
3. **Communicate**: Update status page
4. **Mitigate**: Stop the bleeding (rollback, scale, disable)
5. **Resolve**: Fix root cause
6. **Document**: Write postmortem

## Best Practices

1. **Design for failure**: Assume components will fail
2. **Fail fast**: Don't let slow failures cascade
3. **Isolate failures**: Bulkhead pattern
4. **Graceful degradation**: Reduce functionality, don't crash
5. **Monitor SLOs**: Track error budgets
6. **Test failure modes**: Chaos engineering
7. **Document runbooks**: Clear incident response

## Output Checklist

- [ ] Circuit breakers implemented
- [ ] Retry logic with backoff
- [ ] Fallback mechanisms
- [ ] Bulkhead isolation
- [ ] SLOs defined (availability, latency, errors)
- [ ] Error budgets calculated
- [ ] Failure mode analysis
- [ ] Monitoring dashboard
- [ ] Incident response plan
- [ ] Runbooks documented
@@ -0,0 +1,362 @@
---
name: rfc-generator
description: Generates Request for Comments documents for technical proposals including problem statement, solution design, alternatives, risks, and rollout plans. Use for "RFC", "technical proposals", "design docs", or "architecture proposals".
---

# RFC Generator

Create comprehensive technical proposals with RFCs.

## RFC Template

````markdown
# RFC-042: Implement Read Replicas for Analytics

**Status:** Draft | In Review | Accepted | Rejected | Implemented
**Author:** Alice (alice@example.com)
**Reviewers:** Bob, Charlie, David
**Created:** 2024-01-15
**Updated:** 2024-01-20
**Target Date:** Q1 2024

## Summary

Add PostgreSQL read replicas to separate analytical queries from transactional workload, improving database performance and enabling new analytics features.

## Problem Statement

### Current Situation

Our PostgreSQL database serves both transactional (OLTP) and analytical (OLAP) workloads:

- 1000 writes/min (checkout, orders, inventory)
- 5000 reads/min (user browsing, search)
- 500 analytics queries/min (dashboards, reports)

### Issues

1. **Performance degradation**: Analytics queries slow down transactions
2. **Resource contention**: Complex reports consume CPU/memory
3. **Blocking features**: Can't add more dashboards without impacting users
4. **Peak hour problems**: Analytics scheduled during business hours

### Impact

- Checkout p95 latency: 800ms (target: <300ms)
- Database CPU: 75% average, 95% peak
- Customer complaints about slow pages
- Product team blocked on analytics features

### Success Criteria

- Checkout latency <300ms p95
- Database CPU <50%
- Support 2x more analytics queries
- Zero impact on transactional performance

## Proposed Solution

### High-Level Design

```
┌─────────────┐
│   Primary   │────────────────┐
│   (Write)   │                │
└─────────────┘                │
                               ▼
                        ┌─────────────┐
                        │  Replica 1  │
                        │   (Read)    │
                        └─────────────┘
                               ▼
                        ┌─────────────┐
                        │  Replica 2  │
                        │ (Analytics) │
                        └─────────────┘
```

### Architecture

1. **Primary database**: Handles all writes and critical reads
2. **Read Replica 1**: Serves user-facing read queries
3. **Read Replica 2**: Dedicated to analytics/reporting

### Routing Strategy

```typescript
const db = {
  primary: primaryConnection,
  read: replicaConnection,
  analytics: analyticsConnection,
};

// Write
await db.primary.users.create(data);

// Critical read (always fresh)
await db.primary.users.findById(id);

// Non-critical read (can be slightly stale)
await db.read.products.search(query);

// Analytics
await db.analytics.orders.aggregate(pipeline);
```

### Replication

- **Type:** Streaming replication
- **Lag:** <1 second for read replica, <5 seconds acceptable for analytics
- **Monitoring:** Alert if lag >5 seconds

## Detailed Design

### Database Configuration

```yaml
# Primary
max_connections: 200
shared_buffers: 4GB
work_mem: 16MB

# Read Replica
max_connections: 100
shared_buffers: 8GB
work_mem: 32MB

# Analytics Replica
max_connections: 50
shared_buffers: 16GB
work_mem: 64MB
```

### Connection Pooling

```typescript
const pools = {
  primary: new Pool({ max: 20, min: 5 }),
  read: new Pool({ max: 50, min: 10 }),
  analytics: new Pool({ max: 10, min: 2 }),
};
```

### Query Classification

```typescript
enum QueryType {
  WRITE = "primary",
  CRITICAL_READ = "primary",
  READ = "read",
  ANALYTICS = "analytics",
}

function route(queryType: QueryType) {
  return pools[queryType];
}
```

## Alternatives Considered

### Alternative 1: Vertical Scaling

**Approach:** Upgrade to larger database instance

- **Pros:** Simple, no code changes
- **Cons:** Expensive ($500 → $2000/month), doesn't separate workloads, still hits limits
- **Verdict:** Rejected - doesn't solve isolation problem

### Alternative 2: Separate Analytics Database

**Approach:** Copy data to dedicated analytics DB (e.g., ClickHouse)

- **Pros:** Optimal for analytics, no impact on primary
- **Cons:** Complex ETL pipeline, eventual consistency, high maintenance
- **Verdict:** Defer - consider for future if replicas insufficient

### Alternative 3: Materialized Views

**Approach:** Pre-compute analytics results

- **Pros:** Fast queries, no replicas needed
- **Cons:** Limited to known queries, maintenance overhead
- **Verdict:** Complement to replicas, not replacement

## Tradeoffs

### What We're Optimizing For

- Performance isolation
- Cost efficiency
- Quick implementation
- Operational simplicity

### What We're Sacrificing

- Slight data staleness (acceptable for analytics)
- Additional infrastructure complexity
- Higher operational costs

## Risks & Mitigations

### Risk 1: Replication Lag

**Impact:** Analytics sees stale data
**Probability:** Medium
**Mitigation:**

- Monitor lag continuously
- Alert if >5 seconds
- Document expected lag for users

### Risk 2: Configuration Complexity

**Impact:** Routing errors, performance issues
**Probability:** Low
**Mitigation:**

- Comprehensive testing
- Gradual rollout
- Easy rollback mechanism

### Risk 3: Cost Overrun

**Impact:** Budget exceeded
**Probability:** Low
**Mitigation:**

- Use smaller instance for analytics ($300/month)
- Monitor usage
- Right-size after 1 month

## Rollout Plan

### Phase 1: Setup (Week 1-2)

- [ ] Provision read replica 1
- [ ] Provision analytics replica 2
- [ ] Configure replication
- [ ] Verify lag <1 second
- [ ] Load testing

### Phase 2: Read Replica (Week 3)

- [ ] Deploy routing logic
- [ ] Route 10% search queries to replica
- [ ] Monitor errors and latency
- [ ] Ramp to 100%

### Phase 3: Analytics Migration (Week 4-5)

- [ ] Identify analytics queries
- [ ] Update dashboard queries to analytics replica
- [ ] Test reports
- [ ] Migrate all analytics

### Phase 4: Validation (Week 6)

- [ ] Measure checkout latency improvement
- [ ] Verify CPU reduction
- [ ] User acceptance testing
- [ ] Mark as complete

## Success Metrics

### Primary Goals

- ✅ Checkout latency <300ms p95 (currently 800ms)
- ✅ Primary DB CPU <50% (currently 75%)
- ✅ Zero errors from replication lag

### Secondary Goals

- Support 2x analytics queries
- Enable new dashboard features
- Team satisfaction survey >8/10

## Cost Analysis

| Component         | Current     | Proposed      | Delta        |
| ----------------- | ----------- | ------------- | ------------ |
| Primary DB        | $500/mo     | $500/mo       | $0           |
| Read Replica      | -           | $500/mo       | +$500        |
| Analytics Replica | -           | $300/mo       | +$300        |
| **Total**         | **$500/mo** | **$1,300/mo** | **+$800/mo** |

**ROI:** Better performance enables revenue growth; analytics unlocks product insights

## Open Questions

1. What's acceptable replication lag for analytics? (Proposed: <5 sec)
2. How do we handle replica failure? (Proposed: Fallback to primary)
3. Should we add more replicas later? (Proposed: Monitor and decide in Q2)

## Timeline

- Week 1-2: Provisioning and setup
- Week 3: Read replica migration
- Week 4-5: Analytics migration
- Week 6: Validation
- **Total: 6 weeks**

## Appendix

### References

- [PostgreSQL Replication Docs](https://postgresql.org/docs/replication)
- [Cost Analysis Spreadsheet](https://docs.google.com/)
- [Load Test Results](https://example.com)

### Review History

- 2024-01-15: Initial draft (Alice)
- 2024-01-17: Added cost analysis (Bob)
- 2024-01-20: Addressed review comments

````
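
The template's second open question proposes falling back to the primary when a replica fails. A minimal sketch of that routing follows. The `DbConnection` interface and `queryWithFallback` are hypothetical names introduced for illustration, not part of the template or of any particular database client.

```typescript
// Hypothetical connection shape, assumed for this sketch
interface DbConnection {
  query(sql: string): Promise<unknown[]>;
}

// Try the preferred replica first; if it fails, retry once on the primary
async function queryWithFallback(
  preferred: DbConnection,
  primary: DbConnection,
  sql: string
): Promise<unknown[]> {
  try {
    return await preferred.query(sql);
  } catch {
    // Replica unavailable: serve the query from the primary instead
    return primary.query(sql);
  }
}
```

Falling back moves replica traffic onto the primary, which is the load the RFC is trying to remove. A real implementation would probably limit which query types are allowed to fall back and alert whenever it happens.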

## RFC Process

### 1. Draft (1 week)

- Author writes RFC
- Include problem, solution, alternatives
- Share with team for early feedback

### 2. Review (1-2 weeks)

- Distribute to reviewers
- Collect comments
- Address feedback
- Iterate on design

### 3. Approval (1 week)

- Present to architecture review
- Resolve remaining concerns
- Vote: Accept/Reject
- Update status

### 4. Implementation

- Track progress
- Update RFC with learnings
- Mark as implemented

## Best Practices

1. **Clear problem**: Start with why
2. **Concrete solution**: Be specific
3. **Consider alternatives**: Show you explored options
4. **Honest tradeoffs**: Every choice has costs
5. **Measurable success**: Define done
6. **Risk mitigation**: Plan for failure
7. **Iterative**: Update based on feedback

## Output Checklist

- [ ] Problem statement
- [ ] Proposed solution with architecture
- [ ] 2+ alternatives considered
- [ ] Tradeoffs documented
- [ ] Risks with mitigations
- [ ] Rollout plan with phases
- [ ] Success metrics defined
- [ ] Cost analysis
- [ ] Timeline estimated
- [ ] Reviewers assigned