@patricio0312rev/skillset 0.1.0

Files changed (115)
  1. package/CHANGELOG.md +29 -0
  2. package/LICENSE +21 -0
  3. package/README.md +176 -0
  4. package/bin/cli.js +37 -0
  5. package/package.json +55 -0
  6. package/src/commands/init.js +301 -0
  7. package/src/index.js +168 -0
  8. package/src/lib/config.js +200 -0
  9. package/src/lib/generator.js +166 -0
  10. package/src/utils/display.js +95 -0
  11. package/src/utils/readme.js +196 -0
  12. package/src/utils/tool-specific.js +233 -0
  13. package/templates/ai-engineering/agent-orchestration-planner/SKILL.md +266 -0
  14. package/templates/ai-engineering/cost-latency-optimizer/SKILL.md +270 -0
  15. package/templates/ai-engineering/doc-to-vector-dataset-generator/SKILL.md +239 -0
  16. package/templates/ai-engineering/evaluation-harness/SKILL.md +219 -0
  17. package/templates/ai-engineering/guardrails-safety-filter-builder/SKILL.md +226 -0
  18. package/templates/ai-engineering/llm-debugger/SKILL.md +283 -0
  19. package/templates/ai-engineering/prompt-regression-tester/SKILL.md +216 -0
  20. package/templates/ai-engineering/prompt-template-builder/SKILL.md +393 -0
  21. package/templates/ai-engineering/rag-pipeline-builder/SKILL.md +244 -0
  22. package/templates/ai-engineering/tool-function-schema-designer/SKILL.md +219 -0
  23. package/templates/architecture/adr-writer/SKILL.md +250 -0
  24. package/templates/architecture/api-versioning-deprecation-planner/SKILL.md +331 -0
  25. package/templates/architecture/domain-model-boundaries-mapper/SKILL.md +300 -0
  26. package/templates/architecture/migration-planner/SKILL.md +376 -0
  27. package/templates/architecture/performance-budget-setter/SKILL.md +318 -0
  28. package/templates/architecture/reliability-strategy-builder/SKILL.md +286 -0
  29. package/templates/architecture/rfc-generator/SKILL.md +362 -0
  30. package/templates/architecture/scalability-playbook/SKILL.md +279 -0
  31. package/templates/architecture/system-design-generator/SKILL.md +339 -0
  32. package/templates/architecture/tech-debt-prioritizer/SKILL.md +329 -0
  33. package/templates/backend/api-contract-normalizer/SKILL.md +487 -0
  34. package/templates/backend/api-endpoint-generator/SKILL.md +415 -0
  35. package/templates/backend/auth-module-builder/SKILL.md +99 -0
  36. package/templates/backend/background-jobs-designer/SKILL.md +166 -0
  37. package/templates/backend/caching-strategist/SKILL.md +190 -0
  38. package/templates/backend/error-handling-standardizer/SKILL.md +174 -0
  39. package/templates/backend/rate-limiting-abuse-protection/SKILL.md +147 -0
  40. package/templates/backend/rbac-permissions-builder/SKILL.md +158 -0
  41. package/templates/backend/service-layer-extractor/SKILL.md +269 -0
  42. package/templates/backend/webhook-receiver-hardener/SKILL.md +211 -0
  43. package/templates/ci-cd/artifact-sbom-publisher/SKILL.md +236 -0
  44. package/templates/ci-cd/caching-strategy-optimizer/SKILL.md +195 -0
  45. package/templates/ci-cd/deployment-checklist-generator/SKILL.md +381 -0
  46. package/templates/ci-cd/github-actions-pipeline-creator/SKILL.md +348 -0
  47. package/templates/ci-cd/monorepo-ci-optimizer/SKILL.md +298 -0
  48. package/templates/ci-cd/preview-environments-builder/SKILL.md +187 -0
  49. package/templates/ci-cd/quality-gates-enforcer/SKILL.md +342 -0
  50. package/templates/ci-cd/release-automation-builder/SKILL.md +281 -0
  51. package/templates/ci-cd/rollback-workflow-builder/SKILL.md +372 -0
  52. package/templates/ci-cd/secrets-env-manager/SKILL.md +242 -0
  53. package/templates/db-management/backup-restore-runbook-generator/SKILL.md +505 -0
  54. package/templates/db-management/data-integrity-auditor/SKILL.md +505 -0
  55. package/templates/db-management/data-retention-archiving-planner/SKILL.md +430 -0
  56. package/templates/db-management/data-seeding-fixtures-builder/SKILL.md +375 -0
  57. package/templates/db-management/db-performance-watchlist/SKILL.md +425 -0
  58. package/templates/db-management/etl-sync-job-builder/SKILL.md +457 -0
  59. package/templates/db-management/multi-tenant-safety-checker/SKILL.md +398 -0
  60. package/templates/db-management/prisma-migration-assistant/SKILL.md +379 -0
  61. package/templates/db-management/schema-consistency-checker/SKILL.md +440 -0
  62. package/templates/db-management/sql-query-optimizer/SKILL.md +324 -0
  63. package/templates/foundation/changelog-writer/SKILL.md +431 -0
  64. package/templates/foundation/code-formatter-installer/SKILL.md +320 -0
  65. package/templates/foundation/codebase-summarizer/SKILL.md +360 -0
  66. package/templates/foundation/dependency-doctor/SKILL.md +163 -0
  67. package/templates/foundation/dev-environment-bootstrapper/SKILL.md +259 -0
  68. package/templates/foundation/dev-onboarding-builder/SKILL.md +556 -0
  69. package/templates/foundation/docs-starter-kit/SKILL.md +574 -0
  70. package/templates/foundation/explaining-code/SKILL.md +13 -0
  71. package/templates/foundation/git-hygiene-enforcer/SKILL.md +455 -0
  72. package/templates/foundation/project-scaffolder/SKILL.md +65 -0
  73. package/templates/foundation/project-scaffolder/references/templates.md +126 -0
  74. package/templates/foundation/repo-structure-linter/SKILL.md +0 -0
  75. package/templates/foundation/repo-structure-linter/references/conventions.md +98 -0
  76. package/templates/frontend/animation-micro-interaction-pack/SKILL.md +41 -0
  77. package/templates/frontend/component-scaffold-generator/SKILL.md +562 -0
  78. package/templates/frontend/design-to-component-translator/SKILL.md +547 -0
  79. package/templates/frontend/form-wizard-builder/SKILL.md +553 -0
  80. package/templates/frontend/frontend-refactor-planner/SKILL.md +37 -0
  81. package/templates/frontend/i18n-frontend-implementer/SKILL.md +44 -0
  82. package/templates/frontend/modal-drawer-system/SKILL.md +377 -0
  83. package/templates/frontend/page-layout-builder/SKILL.md +630 -0
  84. package/templates/frontend/state-ux-flow-builder/SKILL.md +23 -0
  85. package/templates/frontend/table-builder/SKILL.md +350 -0
  86. package/templates/performance/alerting-dashboard-builder/SKILL.md +162 -0
  87. package/templates/performance/backend-latency-profiler-helper/SKILL.md +108 -0
  88. package/templates/performance/caching-cdn-strategy-planner/SKILL.md +150 -0
  89. package/templates/performance/capacity-planning-helper/SKILL.md +242 -0
  90. package/templates/performance/core-web-vitals-tuner/SKILL.md +126 -0
  91. package/templates/performance/incident-runbook-generator/SKILL.md +162 -0
  92. package/templates/performance/load-test-scenario-builder/SKILL.md +256 -0
  93. package/templates/performance/observability-setup/SKILL.md +232 -0
  94. package/templates/performance/postmortem-writer/SKILL.md +203 -0
  95. package/templates/performance/structured-logging-standardizer/SKILL.md +122 -0
  96. package/templates/security/auth-security-reviewer/SKILL.md +428 -0
  97. package/templates/security/dependency-vulnerability-triage/SKILL.md +495 -0
  98. package/templates/security/input-validation-sanitization-auditor/SKILL.md +76 -0
  99. package/templates/security/pii-redaction-logging-policy-builder/SKILL.md +65 -0
  100. package/templates/security/rbac-policy-tester/SKILL.md +80 -0
  101. package/templates/security/secrets-scanner/SKILL.md +462 -0
  102. package/templates/security/secure-headers-csp-builder/SKILL.md +404 -0
  103. package/templates/security/security-incident-playbook-generator/SKILL.md +76 -0
  104. package/templates/security/security-pr-checklist-skill/SKILL.md +62 -0
  105. package/templates/security/threat-model-generator/SKILL.md +394 -0
  106. package/templates/testing/contract-testing-builder/SKILL.md +492 -0
  107. package/templates/testing/coverage-strategist/SKILL.md +436 -0
  108. package/templates/testing/e2e-test-builder/SKILL.md +382 -0
  109. package/templates/testing/flaky-test-detective/SKILL.md +416 -0
  110. package/templates/testing/integration-test-builder/SKILL.md +525 -0
  111. package/templates/testing/mocking-assistant/SKILL.md +383 -0
  112. package/templates/testing/snapshot-test-refactorer/SKILL.md +375 -0
  113. package/templates/testing/test-data-factory-builder/SKILL.md +449 -0
  114. package/templates/testing/test-reporting-triage-skill/SKILL.md +469 -0
  115. package/templates/testing/unit-test-generator/SKILL.md +548 -0
@@ -0,0 +1,286 @@
1
+ ---
2
+ name: reliability-strategy-builder
3
+ description: Implements reliability patterns including circuit breakers, retries, fallbacks, bulkheads, and SLO definitions. Provides failure mode analysis and incident response plans. Use for "SRE", "reliability", "resilience", or "failure handling".
4
+ ---
5
+
6
+ # Reliability Strategy Builder
7
+
8
+ Build resilient systems with proper failure handling and SLOs.
9
+
10
+ ## Reliability Patterns
11
+
12
+ ### 1. Circuit Breaker
13
+
14
+ Prevent cascading failures by stopping requests to failing services.
15
+
16
+ ```typescript
17
+ class CircuitBreaker {
18
+ private state: "closed" | "open" | "half-open" = "closed";
19
+ private failureCount = 0;
20
+ private lastFailureTime?: Date;
21
+
22
+ async execute<T>(operation: () => Promise<T>): Promise<T> {
23
+ if (this.state === "open") {
24
+ if (this.shouldAttemptReset()) {
25
+ this.state = "half-open";
26
+ } else {
27
+ throw new Error("Circuit breaker is OPEN");
28
+ }
29
+ }
30
+
31
+ try {
32
+ const result = await operation();
33
+ this.onSuccess();
34
+ return result;
35
+ } catch (error) {
36
+ this.onFailure();
37
+ throw error;
38
+ }
39
+ }
40
+
41
+ private onSuccess() {
42
+ this.failureCount = 0;
43
+ this.state = "closed";
44
+ }
45
+
46
+ private onFailure() {
47
+ this.failureCount++;
48
+ this.lastFailureTime = new Date();
49
+
50
+ if (this.failureCount >= 5) {
51
+ this.state = "open";
52
+ }
53
+ }
54
+
55
+ private shouldAttemptReset(): boolean {
56
+ if (!this.lastFailureTime) return false;
57
+ const now = Date.now();
58
+ const elapsed = now - this.lastFailureTime.getTime();
59
+ return elapsed > 60000; // 1 minute
60
+ }
61
+ }
62
+ ```
63
+
64
+ ### 2. Retry with Backoff
65
+
66
+ Handle transient failures with exponential backoff.
67
+
68
+ ```typescript
69
+ async function retryWithBackoff<T>(
70
+ operation: () => Promise<T>,
71
+ maxRetries = 3,
72
+ baseDelay = 1000
73
+ ): Promise<T> {
74
+ for (let attempt = 0; attempt <= maxRetries; attempt++) {
75
+ try {
76
+ return await operation();
77
+ } catch (error) {
78
+ if (attempt === maxRetries) throw error;
79
+
80
+ // Exponential backoff: 1s, 2s, 4s
81
+ const delay = baseDelay * Math.pow(2, attempt);
82
+ await new Promise((resolve) => setTimeout(resolve, delay));
83
+ }
84
+ }
85
+ throw new Error("Max retries exceeded");
86
+ }
87
+ ```
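The delay schedule mentioned in the comment above (1s, 2s, 4s) can be checked in isolation. The following is a minimal sketch, not part of the skill file: `backoffDelays` is a hypothetical helper, and the `maxDelay` cap is an assumption that the original function does not have.

```typescript
// Hypothetical helper: lists the waits retryWithBackoff performs between
// attempts. The final failed attempt rethrows, so it contributes no delay.
function backoffDelays(
  maxRetries = 3,
  baseDelay = 1000,
  maxDelay = Number.POSITIVE_INFINITY // assumed cap; the original is uncapped
): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    // Same formula as the original: baseDelay * 2^attempt
    delays.push(Math.min(baseDelay * Math.pow(2, attempt), maxDelay));
  }
  return delays;
}

console.log(backoffDelays()); // 1000, 2000, 4000
console.log(backoffDelays(5, 1000, 5000)); // 1000, 2000, 4000, 5000, 5000
```

A cap like this keeps the wait bounded when `maxRetries` is raised, since the uncapped delay doubles on every attempt.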
88
+
89
+ ### 3. Fallback Pattern
90
+
91
+ Provide degraded functionality when the primary dependency fails.
92
+
93
+ ```typescript
94
+ async function getUserWithFallback(userId: string): Promise<User> {
95
+ try {
96
+ // Try primary database
97
+ return await primaryDb.users.findById(userId);
98
+ } catch (error) {
99
+ logger.warn("Primary DB failed, using cache", error);
100
+
101
+ // Fallback to cache
102
+ const cached = await cache.get(`user:${userId}`);
103
+ if (cached) return cached;
104
+
105
+ // Final fallback: return minimal user object
106
+ return {
107
+ id: userId,
108
+ name: "Unknown User",
109
+ email: "unavailable",
110
+ };
111
+ }
112
+ }
113
+ ```
114
+
115
+ ### 4. Bulkhead Pattern
116
+
117
+ Isolate failures to prevent resource exhaustion.
118
+
119
+ ```typescript
120
+ class ThreadPool {
121
+ private pools = new Map<string, Semaphore>();
122
+
123
+ constructor() {
124
+ // Separate pools for different operations
125
+ this.pools.set("critical", new Semaphore(100));
126
+ this.pools.set("standard", new Semaphore(50));
127
+ this.pools.set("background", new Semaphore(10));
128
+ }
129
+
130
+ async execute(priority: string, operation: () => Promise<any>) {
131
+ const pool = this.pools.get(priority);
132
+ if (!pool) throw new Error(`Unknown pool: ${priority}`);
133
+ await pool.acquire();
134
+ try {
135
+ return await operation();
136
+ } finally {
137
+ pool.release();
138
+ }
139
+ }
140
+ }
141
+ ```
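The `Semaphore` used above is not defined in this file and is not a JavaScript built-in. Below is a minimal sketch of one possible counting semaphore, assuming only the `acquire()`/`release()` shape that the class above calls. The `waiting` getter is an illustrative addition.

```typescript
// Assumed implementation: a counting semaphore with a FIFO wait queue.
class Semaphore {
  private permits: number;
  private waiters: Array<() => void> = [];

  constructor(permits: number) {
    this.permits = permits;
  }

  // Resolves immediately while permits remain; otherwise queues the caller.
  acquire(): Promise<void> {
    if (this.permits > 0) {
      this.permits--;
      return Promise.resolve();
    }
    return new Promise<void>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  // Hands the permit to the oldest waiter, or returns it to the pool.
  release(): void {
    const next = this.waiters.shift();
    if (next) {
      next();
    } else {
      this.permits++;
    }
  }

  // Number of callers currently queued (useful as a saturation metric).
  get waiting(): number {
    return this.waiters.length;
  }
}

const background = new Semaphore(1);
background.acquire(); // takes the only permit
background.acquire(); // queued until release()
console.log(background.waiting); // 1
background.release();
console.log(background.waiting); // 0
```

Because each pool has its own permit count, a flood of background work can exhaust only the background permits and leaves the critical pool untouched.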
142
+
143
+ ## SLO Definitions
144
+
145
+ ### SLO Template
146
+
147
+ ```yaml
148
+ service: user-api
149
+ slos:
150
+ - name: Availability
151
+ description: 99.9% of requests complete without a server error (5xx)
152
+ target: 99.9%
153
+ measurement:
154
+ type: ratio
155
+ success: status_code < 500
156
+ total: all_requests
157
+ window: 30 days
158
+
159
+ - name: Latency
160
+ description: 95% of requests complete within 500ms
161
+ target: 95%
162
+ measurement:
163
+ type: percentile
164
+ metric: request_duration_ms
165
+ threshold: 500
166
+ percentile: 95
167
+ window: 7 days
168
+
169
+ - name: Error Rate
170
+ description: Less than 1% of requests result in errors
171
+ target: 99%
172
+ measurement:
173
+ type: ratio
174
+ success: status_code < 400 OR status_code IN [401, 403, 404]
175
+ total: all_requests
176
+ window: 24 hours
177
+ ```
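A `ratio` measurement such as the Availability SLO above reduces to successful requests divided by total requests over the window. The sketch below shows that evaluation under stated assumptions: `RatioSlo`, `SloResult` and `evaluateRatioSlo` are hypothetical names, and the request counts would come from whatever metrics store is in use.

```typescript
interface RatioSlo {
  name: string;
  targetPercent: number; // e.g. 99.9
}

interface SloResult {
  attainedPercent: number;
  met: boolean;
  budgetRemaining: number; // fraction of the error budget left; negative when overspent
}

function evaluateRatioSlo(slo: RatioSlo, success: number, total: number): SloResult {
  if (total === 0) {
    // Assumption: a window with no traffic counts as met, with a full budget.
    return { attainedPercent: 100, met: true, budgetRemaining: 1 };
  }
  const failures = total - success;
  const attainedPercent = (success / total) * 100;
  const allowedFailures = total * (1 - slo.targetPercent / 100);
  let budgetRemaining: number;
  if (allowedFailures > 0) {
    budgetRemaining = 1 - failures / allowedFailures;
  } else {
    // A 100% target leaves no budget at all.
    budgetRemaining = failures === 0 ? 1 : Number.NEGATIVE_INFINITY;
  }
  return { attainedPercent, met: attainedPercent >= slo.targetPercent, budgetRemaining };
}

const availability: RatioSlo = { name: "Availability", targetPercent: 99.9 };
// 400 failures out of 1,000,000 requests: SLO met, roughly 60% of the budget left.
console.log(evaluateRatioSlo(availability, 999_600, 1_000_000));
```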
178
+
179
+ ### Error Budget
180
+
181
+ ```
182
+ Error Budget = 100% - SLO
183
+
184
+ Example:
185
+ SLO: 99.9% availability
186
+ Error Budget: 0.1% = 43.2 minutes of downtime allowed per 30-day month
187
+ ```
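The 43.2-minute figure follows from a 30-day window: 30 × 24 × 60 = 43,200 minutes, of which 0.1% may be spent. A small sketch of the arithmetic (the function name is illustrative):

```typescript
// Error budget expressed as minutes of full downtime within the window.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (windowMinutes * (100 - sloPercent)) / 100;
}

console.log(errorBudgetMinutes(99.9, 30).toFixed(1)); // "43.2"
console.log(errorBudgetMinutes(99, 1).toFixed(1)); // "14.4"
```

The second call applies the same formula to a 99% target over a 24-hour window, the window used by the Error Rate SLO in the template.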
188
+
189
+ ## Failure Mode Analysis
190
+
191
+ ```markdown
192
+ | Component | Failure Mode | Impact | Probability | Detection | Mitigation |
193
+ | ----------- | ------------ | ------ | ----------- | ----------------------- | ------------------------------ |
194
+ | Database | Unresponsive | HIGH | Medium | Health checks every 10s | Circuit breaker, read replicas |
195
+ | API Gateway | Overload | HIGH | Low | Request queue depth | Rate limiting, auto-scaling |
196
+ | Cache | Eviction | MEDIUM | High | Cache hit rate | Fallback to DB, larger cache |
197
+ | Queue | Backed up | LOW | Medium | Queue depth metric | Add workers, DLQ |
198
+ ```
199
+
200
+ ## Reliability Checklist
201
+
202
+ ### Infrastructure
203
+
204
+ - [ ] Load balancer with health checks
205
+ - [ ] Multiple availability zones
206
+ - [ ] Auto-scaling configured
207
+ - [ ] Database replication
208
+ - [ ] Regular backups (tested!)
209
+
210
+ ### Application
211
+
212
+ - [ ] Circuit breakers on external calls
213
+ - [ ] Retry logic with backoff
214
+ - [ ] Timeouts on all I/O
215
+ - [ ] Fallback mechanisms
216
+ - [ ] Graceful degradation
217
+
218
+ ### Monitoring
219
+
220
+ - [ ] SLO dashboard
221
+ - [ ] Error budgets tracked
222
+ - [ ] Alerting on SLO violations
223
+ - [ ] Latency percentiles (p50, p95, p99)
224
+ - [ ] Dependency health checks
225
+
226
+ ### Operations
227
+
228
+ - [ ] Incident response runbook
229
+ - [ ] On-call rotation
230
+ - [ ] Postmortem template
231
+ - [ ] Disaster recovery plan
232
+ - [ ] Chaos engineering tests
233
+
234
+ ## Incident Response Plan
235
+
236
+ ### Severity Levels
237
+
238
+ ```
239
+ SEV1 (Critical): Complete service outage, data loss
240
+ - Response time: <15 minutes
241
+ - Page on-call immediately
242
+
243
+ SEV2 (High): Partial outage, degraded performance
244
+ - Response time: <1 hour
245
+ - Alert on-call
246
+
247
+ SEV3 (Medium): Minor issues, workarounds available
248
+ - Response time: <4 hours
249
+ - Create ticket
250
+
251
+ SEV4 (Low): Cosmetic issues, no user impact
252
+ - Response time: Next business day
253
+ - Backlog
254
+ ```
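The severity levels above can be encoded as data so that tooling can flag a late response. This is a sketch under assumptions: the names are hypothetical, and SEV4 ("next business day") is modelled as having no fixed deadline in minutes.

```typescript
type Severity = "SEV1" | "SEV2" | "SEV3" | "SEV4";

// Response targets in minutes, taken from the list above.
// SEV4 has no minute-based target, so it is represented as null.
const RESPONSE_TARGET_MINUTES: Record<Severity, number | null> = {
  SEV1: 15,
  SEV2: 60,
  SEV3: 240,
  SEV4: null,
};

// True when the response target for this severity has already been missed.
function isResponseLate(severity: Severity, minutesSinceAlert: number): boolean {
  const target = RESPONSE_TARGET_MINUTES[severity];
  return target !== null && minutesSinceAlert >= target;
}

console.log(isResponseLate("SEV1", 20)); // true
console.log(isResponseLate("SEV3", 90)); // false
```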
255
+
256
+ ### Incident Response Steps
257
+
258
+ 1. **Acknowledge**: Confirm receipt within SLA
259
+ 2. **Assess**: Determine severity and impact
260
+ 3. **Communicate**: Update status page
261
+ 4. **Mitigate**: Stop the bleeding (rollback, scale, disable)
262
+ 5. **Resolve**: Fix root cause
263
+ 6. **Document**: Write postmortem
264
+
265
+ ## Best Practices
266
+
267
+ 1. **Design for failure**: Assume components will fail
268
+ 2. **Fail fast**: Don't let slow failures cascade
269
+ 3. **Isolate failures**: Bulkhead pattern
270
+ 4. **Graceful degradation**: Reduce functionality, don't crash
271
+ 5. **Monitor SLOs**: Track error budgets
272
+ 6. **Test failure modes**: Chaos engineering
273
+ 7. **Document runbooks**: Clear incident response
274
+
275
+ ## Output Checklist
276
+
277
+ - [ ] Circuit breakers implemented
278
+ - [ ] Retry logic with backoff
279
+ - [ ] Fallback mechanisms
280
+ - [ ] Bulkhead isolation
281
+ - [ ] SLOs defined (availability, latency, errors)
282
+ - [ ] Error budgets calculated
283
+ - [ ] Failure mode analysis
284
+ - [ ] Monitoring dashboard
285
+ - [ ] Incident response plan
286
+ - [ ] Runbooks documented
@@ -0,0 +1,362 @@
1
+ ---
2
+ name: rfc-generator
3
+ description: Generates Request for Comments documents for technical proposals including problem statement, solution design, alternatives, risks, and rollout plans. Use for "RFC", "technical proposals", "design docs", or "architecture proposals".
4
+ ---
5
+
6
+ # RFC Generator
7
+
8
+ Create comprehensive technical proposals with RFCs.
9
+
10
+ ## RFC Template
11
+
12
+ ```markdown
13
+ # RFC-042: Implement Read Replicas for Analytics
14
+
15
+ **Status:** Draft | In Review | Accepted | Rejected | Implemented
16
+ **Author:** Alice (alice@example.com)
17
+ **Reviewers:** Bob, Charlie, David
18
+ **Created:** 2024-01-15
19
+ **Updated:** 2024-01-20
20
+ **Target Date:** Q1 2024
21
+
22
+ ## Summary
23
+
24
+ Add PostgreSQL read replicas to separate analytical queries from transactional workload, improving database performance and enabling new analytics features.
25
+
26
+ ## Problem Statement
27
+
28
+ ### Current Situation
29
+
30
+ Our PostgreSQL database serves both transactional (OLTP) and analytical (OLAP) workloads:
31
+
32
+ - 1000 writes/min (checkout, orders, inventory)
33
+ - 5000 reads/min (user browsing, search)
34
+ - 500 analytics queries/min (dashboards, reports)
35
+
36
+ ### Issues
37
+
38
+ 1. **Performance degradation**: Analytics queries slow down transactions
39
+ 2. **Resource contention**: Complex reports consume CPU/memory
40
+ 3. **Blocking features**: Can't add more dashboards without impacting users
41
+ 4. **Peak hour problems**: Analytics scheduled during business hours
42
+
43
+ ### Impact
44
+
45
+ - Checkout p95 latency: 800ms (target: <300ms)
46
+ - Database CPU: 75% average, 95% peak
47
+ - Customer complaints about slow pages
48
+ - Product team blocked on analytics features
49
+
50
+ ### Success Criteria
51
+
52
+ - Checkout latency <300ms p95
53
+ - Database CPU <50%
54
+ - Support 2x more analytics queries
55
+ - Zero impact on transactional performance
56
+
57
+ ## Proposed Solution
58
+
59
+ ### High-Level Design
60
+ ```
61
+
62
+ ┌─────────────┐
63
+ │ Primary │────────────────┐
64
+ │ (Write) │ │
65
+ └─────────────┘ │
66
+
67
+ ┌─────────────┐
68
+ │ Replica 1 │
69
+ │ (Read) │
70
+ └─────────────┘
71
+
72
+ ┌─────────────┐
73
+ │ Replica 2 │
74
+ │ (Analytics)│
75
+ └─────────────┘
76
+
77
+ ````
78
+
79
+ ### Architecture
80
+ 1. **Primary database**: Handles all writes and critical reads
81
+ 2. **Read Replica 1**: Serves user-facing read queries
82
+ 3. **Read Replica 2**: Dedicated to analytics/reporting
83
+
84
+ ### Routing Strategy
85
+ ```typescript
86
+ const db = {
87
+ primary: primaryConnection,
88
+ read: replicaConnection,
89
+ analytics: analyticsConnection,
90
+ };
91
+
92
+ // Write
93
+ await db.primary.users.create(data);
94
+
95
+ // Critical read (always fresh)
96
+ await db.primary.users.findById(id);
97
+
98
+ // Non-critical read (can be slightly stale)
99
+ await db.read.products.search(query);
100
+
101
+ // Analytics
102
+ await db.analytics.orders.aggregate(pipeline);
103
+ ````
104
+
105
+ ### Replication
106
+
107
+ - **Type:** Streaming replication
108
+ - **Lag:** <1 second for read replica, <5 seconds acceptable for analytics
109
+ - **Monitoring:** Alert if lag >5 seconds
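The lag thresholds above can be expressed as a small classifier. This is a sketch only: the `degraded` state between the per-replica target and the 5-second alert line is an assumption, and the lag value is assumed to be sampled elsewhere.

```typescript
type ReplicaRole = "read" | "analytics";
type LagStatus = "ok" | "degraded" | "alert";

// Targets from the Replication section: <1s for the read replica, <5s for analytics.
const TARGET_LAG_SECONDS: Record<ReplicaRole, number> = { read: 1, analytics: 5 };
// Alert threshold from the Monitoring bullet: lag >5 seconds.
const ALERT_LAG_SECONDS = 5;

function classifyLag(role: ReplicaRole, lagSeconds: number): LagStatus {
  if (lagSeconds > ALERT_LAG_SECONDS) return "alert";
  if (lagSeconds >= TARGET_LAG_SECONDS[role]) return "degraded"; // assumed middle state
  return "ok";
}

console.log(classifyLag("read", 0.4)); // "ok"
console.log(classifyLag("read", 2)); // "degraded"
console.log(classifyLag("analytics", 6)); // "alert"
```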
110
+
111
+ ## Detailed Design
112
+
113
+ ### Database Configuration
114
+
115
+ ```yaml
116
+ # Primary
117
+ max_connections: 200
118
+ shared_buffers: 4GB
119
+ work_mem: 16MB
120
+
121
+ # Read Replica
122
+ max_connections: 100
123
+ shared_buffers: 8GB
124
+ work_mem: 32MB
125
+
126
+ # Analytics Replica
127
+ max_connections: 50
128
+ shared_buffers: 16GB
129
+ work_mem: 64MB
130
+ ```
131
+
132
+ ### Connection Pooling
133
+
134
+ ```typescript
135
+ const pools = {
136
+ primary: new Pool({ max: 20, min: 5 }),
137
+ read: new Pool({ max: 50, min: 10 }),
138
+ analytics: new Pool({ max: 10, min: 2 }),
139
+ };
140
+ ```
141
+
142
+ ### Query Classification
143
+
144
+ ```typescript
145
+ enum QueryType {
146
+ WRITE = "primary",
147
+ CRITICAL_READ = "primary",
148
+ READ = "read",
149
+ ANALYTICS = "analytics",
150
+ }
151
+
152
+ function route(queryType: QueryType) {
153
+ return pools[queryType];
154
+ }
155
+ ```
156
+
157
+ ## Alternatives Considered
158
+
159
+ ### Alternative 1: Vertical Scaling
160
+
161
+ **Approach:** Upgrade to larger database instance
162
+
163
+ - **Pros:** Simple, no code changes
164
+ - **Cons:** Expensive ($500 → $2000/month), doesn't separate workloads, still hits limits
165
+ - **Verdict:** Rejected - doesn't solve isolation problem
166
+
167
+ ### Alternative 2: Separate Analytics Database
168
+
169
+ **Approach:** Copy data to dedicated analytics DB (e.g., ClickHouse)
170
+
171
+ - **Pros:** Optimal for analytics, no impact on primary
172
+ - **Cons:** Complex ETL pipeline, eventual consistency, high maintenance
173
+ - **Verdict:** Defer - consider for future if replicas insufficient
174
+
175
+ ### Alternative 3: Materialized Views
176
+
177
+ **Approach:** Pre-compute analytics results
178
+
179
+ - **Pros:** Fast queries, no replicas needed
180
+ - **Cons:** Limited to known queries, maintenance overhead
181
+ - **Verdict:** Complement to replicas, not replacement
182
+
183
+ ## Tradeoffs
184
+
185
+ ### What We're Optimizing For
186
+
187
+ - Performance isolation
188
+ - Cost efficiency
189
+ - Quick implementation
190
+ - Operational simplicity
191
+
192
+ ### What We're Sacrificing
193
+
194
+ - Slight data staleness (acceptable for analytics)
195
+ - Additional infrastructure complexity
196
+ - Higher operational costs
197
+
198
+ ## Risks & Mitigations
199
+
200
+ ### Risk 1: Replication Lag
201
+
202
+ **Impact:** Analytics sees stale data
203
+ **Probability:** Medium
204
+ **Mitigation:**
205
+
206
+ - Monitor lag continuously
207
+ - Alert if >5 seconds
208
+ - Document expected lag for users
209
+
210
+ ### Risk 2: Configuration Complexity
211
+
212
+ **Impact:** Routing errors, performance issues
213
+ **Probability:** Low
214
+ **Mitigation:**
215
+
216
+ - Comprehensive testing
217
+ - Gradual rollout
218
+ - Easy rollback mechanism
219
+
220
+ ### Risk 3: Cost Overrun
221
+
222
+ **Impact:** Budget exceeded
223
+ **Probability:** Low
224
+ **Mitigation:**
225
+
226
+ - Use smaller instance for analytics ($300/month)
227
+ - Monitor usage
228
+ - Right-size after 1 month
229
+
230
+ ## Rollout Plan
231
+
232
+ ### Phase 1: Setup (Week 1-2)
233
+
234
+ - [ ] Provision read replica 1
235
+ - [ ] Provision analytics replica 2
236
+ - [ ] Configure replication
237
+ - [ ] Verify lag <1 second
238
+ - [ ] Load testing
239
+
240
+ ### Phase 2: Read Replica (Week 3)
241
+
242
+ - [ ] Deploy routing logic
243
+ - [ ] Route 10% of search queries to the replica
244
+ - [ ] Monitor errors and latency
245
+ - [ ] Ramp to 100%
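One way to implement the 10% to 100% ramp in Phase 2 is to bucket each request by a stable hash, so that a given key stays on the same side as the percentage grows. The sketch below is an assumption, not the RFC's routing logic: `useReplica` and the choice of FNV-1a (applied here to UTF-16 code units) are illustrative.

```typescript
// FNV-1a, 32-bit: a simple non-cryptographic hash, used here only for bucketing.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Returns true when this key falls inside the current rollout percentage.
function useReplica(requestKey: string, rolloutPercent: number): boolean {
  return fnv1a(requestKey) % 100 < rolloutPercent;
}

// Hypothetical call site: pick a connection name per request.
const target = useReplica("user-42", 10) ? "read" : "primary";
console.log(target);
```

Because the bucket depends only on the key, raising `rolloutPercent` only ever adds keys to the replica; no key moves back to the primary during the ramp.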
246
+
247
+ ### Phase 3: Analytics Migration (Week 4-5)
248
+
249
+ - [ ] Identify analytics queries
250
+ - [ ] Update dashboard queries to analytics replica
251
+ - [ ] Test reports
252
+ - [ ] Migrate all analytics
253
+
254
+ ### Phase 4: Validation (Week 6)
255
+
256
+ - [ ] Measure checkout latency improvement
257
+ - [ ] Verify CPU reduction
258
+ - [ ] User acceptance testing
259
+ - [ ] Mark as complete
260
+
261
+ ## Success Metrics
262
+
263
+ ### Primary Goals
264
+
265
+ - ✅ Checkout latency <300ms p95 (currently 800ms)
266
+ - ✅ Primary DB CPU <50% (currently 75%)
267
+ - ✅ Zero errors from replication lag
268
+
269
+ ### Secondary Goals
270
+
271
+ - Support 2x analytics queries
272
+ - Enable new dashboard features
273
+ - Team satisfaction survey >8/10
274
+
275
+ ## Cost Analysis
276
+
277
+ | Component | Current | Proposed | Delta |
278
+ | ----------------- | ----------- | ------------- | ------------ |
279
+ | Primary DB | $500/mo | $500/mo | $0 |
280
+ | Read Replica | - | $500/mo | +$500 |
281
+ | Analytics Replica | - | $300/mo | +$300 |
282
+ | **Total** | **$500/mo** | **$1,300/mo** | **+$800/mo** |
283
+
284
+ **ROI:** Better performance enables revenue growth; analytics unlocks product insights
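The totals in the table above can be re-derived from the per-component rows, which helps keep the table consistent when a row changes. A minimal sketch with illustrative names:

```typescript
interface CostLine {
  component: string;
  currentMonthly: number; // USD per month
  proposedMonthly: number; // USD per month
}

// Rows copied from the Cost Analysis table; "-" is treated as 0.
const costLines: CostLine[] = [
  { component: "Primary DB", currentMonthly: 500, proposedMonthly: 500 },
  { component: "Read Replica", currentMonthly: 0, proposedMonthly: 500 },
  { component: "Analytics Replica", currentMonthly: 0, proposedMonthly: 300 },
];

function summarize(lines: CostLine[]): { current: number; proposed: number; delta: number } {
  const current = lines.reduce((sum, line) => sum + line.currentMonthly, 0);
  const proposed = lines.reduce((sum, line) => sum + line.proposedMonthly, 0);
  return { current, proposed, delta: proposed - current };
}

console.log(summarize(costLines)); // current 500, proposed 1300, delta 800
```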
285
+
286
+ ## Open Questions
287
+
288
+ 1. What's acceptable replication lag for analytics? (Proposed: <5 sec)
289
+ 2. How do we handle replica failure? (Proposed: Fallback to primary)
290
+ 3. Should we add more replicas later? (Proposed: Monitor and decide in Q2)
291
+
292
+ ## Timeline
293
+
294
+ - Week 1-2: Provisioning and setup
295
+ - Week 3: Read replica migration
296
+ - Week 4-5: Analytics migration
297
+ - Week 6: Validation
298
+ - **Total: 6 weeks**
299
+
300
+ ## Appendix
301
+
302
+ ### References
303
+
304
+ - [PostgreSQL Replication Docs](https://postgresql.org/docs/replication)
305
+ - [Cost Analysis Spreadsheet](https://docs.google.com/)
306
+ - [Load Test Results](https://example.com)
307
+
308
+ ### Review History
309
+
310
+ - 2024-01-15: Initial draft (Alice)
311
+ - 2024-01-17: Added cost analysis (Bob)
312
+ - 2024-01-20: Addressed review comments
313
+
314
+ ```
315
+
316
+ ## RFC Process
317
+
318
+ ### 1. Draft (1 week)
319
+ - Author writes RFC
320
+ - Include problem, solution, alternatives
321
+ - Share with team for early feedback
322
+
323
+ ### 2. Review (1-2 weeks)
324
+ - Distribute to reviewers
325
+ - Collect comments
326
+ - Address feedback
327
+ - Iterate on design
328
+
329
+ ### 3. Approval (1 week)
330
+ - Present to architecture review
331
+ - Resolve remaining concerns
332
+ - Vote: Accept/Reject
333
+ - Update status
334
+
335
+ ### 4. Implementation
336
+ - Track progress
337
+ - Update RFC with learnings
338
+ - Mark as implemented
339
+
340
+ ## Best Practices
341
+
342
+ 1. **Clear problem**: Start with why
343
+ 2. **Concrete solution**: Be specific
344
+ 3. **Consider alternatives**: Show you explored options
345
+ 4. **Honest tradeoffs**: Every choice has costs
346
+ 5. **Measurable success**: Define done
347
+ 6. **Risk mitigation**: Plan for failure
348
+ 7. **Iterative**: Update based on feedback
349
+
350
+ ## Output Checklist
351
+
352
+ - [ ] Problem statement
353
+ - [ ] Proposed solution with architecture
354
+ - [ ] 2+ alternatives considered
355
+ - [ ] Tradeoffs documented
356
+ - [ ] Risks with mitigations
357
+ - [ ] Rollout plan with phases
358
+ - [ ] Success metrics defined
359
+ - [ ] Cost analysis
360
+ - [ ] Timeline estimated
361
+ - [ ] Reviewers assigned
362
+ ```