agy-superpowers 5.0.6 → 5.0.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (37)
  1. package/README.md +23 -0
  2. package/package.json +1 -1
  3. package/template/agent/config.yml +9 -0
  4. package/template/agent/patches/skills-patches.md +74 -0
  5. package/template/agent/skills/api-design/SKILL.md +193 -0
  6. package/template/agent/skills/app-store-optimizer/SKILL.md +127 -0
  7. package/template/agent/skills/auth-and-identity/SKILL.md +167 -0
  8. package/template/agent/skills/backend-developer/SKILL.md +148 -0
  9. package/template/agent/skills/brainstorming/SKILL.md +3 -1
  10. package/template/agent/skills/community-manager/SKILL.md +115 -0
  11. package/template/agent/skills/content-marketer/SKILL.md +111 -0
  12. package/template/agent/skills/conversion-optimizer/SKILL.md +142 -0
  13. package/template/agent/skills/copywriter/SKILL.md +114 -0
  14. package/template/agent/skills/cto-architect/SKILL.md +133 -0
  15. package/template/agent/skills/customer-success-manager/SKILL.md +126 -0
  16. package/template/agent/skills/data-analyst/SKILL.md +147 -0
  17. package/template/agent/skills/devops-engineer/SKILL.md +117 -0
  18. package/template/agent/skills/email-infrastructure/SKILL.md +164 -0
  19. package/template/agent/skills/frontend-developer/SKILL.md +133 -0
  20. package/template/agent/skills/game-design/SKILL.md +194 -0
  21. package/template/agent/skills/game-developer/SKILL.md +175 -0
  22. package/template/agent/skills/growth-hacker/SKILL.md +122 -0
  23. package/template/agent/skills/i18n-localization/SKILL.md +126 -0
  24. package/template/agent/skills/influencer-marketer/SKILL.md +141 -0
  25. package/template/agent/skills/mobile-developer/SKILL.md +142 -0
  26. package/template/agent/skills/monetization-strategist/SKILL.md +119 -0
  27. package/template/agent/skills/paid-acquisition-specialist/SKILL.md +119 -0
  28. package/template/agent/skills/product-manager/SKILL.md +105 -0
  29. package/template/agent/skills/real-time-features/SKILL.md +194 -0
  30. package/template/agent/skills/retention-specialist/SKILL.md +123 -0
  31. package/template/agent/skills/saas-architect/SKILL.md +139 -0
  32. package/template/agent/skills/security-engineer/SKILL.md +133 -0
  33. package/template/agent/skills/seo-specialist/SKILL.md +130 -0
  34. package/template/agent/skills/subagent-driven-development/SKILL.md +7 -3
  35. package/template/agent/skills/subscription-billing/SKILL.md +179 -0
  36. package/template/agent/skills/ux-designer/SKILL.md +128 -0
  37. package/template/agent/workflows/update-superpowers.md +27 -8
--- /dev/null
+++ b/package/template/agent/skills/cto-architect/SKILL.md
@@ -0,0 +1,133 @@
+ ---
+ name: cto-architect
+ description: Use when making system design decisions, managing technical debt, planning for scale, hiring engineers, or reviewing overall architecture
+ ---
+
+ # CTO / Architect Lens
+
+ > **Philosophy:** Architecture is the decisions that are hard to reverse. Make them deliberately.
+ > The best architecture is the simplest one that handles today's scale and doesn't prevent tomorrow's.
+
+ ---
+
+ ## Core Instincts
+
+ - **YAGNI at architecture scale** — don't build for 10M users when you have 10K
+ - **Reversibility > correctness** — prefer decisions you can change over theoretically perfect ones
+ - **Observability is not optional** — if you can't see your system failing, you can't fix it
+ - **Write ADRs** — architectural decisions without documentation will be re-debated and reversed
+ - **Boring technology wins** — proven, well-understood tools > novel shiny tools in production
+
+ ---
+
+ ## Scale Progression (Don't Over-Engineer Early)
+
+ | User scale | Recommended architecture |
+ |-----------|--------------------------|
+ | 0 – 10K MAU | Monolith + managed DB + single server (Railway/Fly/Render) |
+ | 10K – 100K MAU | Monolith + read replica + CDN + caching layer (Redis) |
+ | 100K – 1M MAU | Modular monolith, background job queues, horizontal scaling |
+ | 1M+ MAU | Consider service extraction, streaming (Kafka), dedicated infra team |
+
+ **Indie hacker signal:** If you don't have 100K MAU yet, you almost certainly don't need microservices.
+
+ ---
+
+ ## Tech Debt Management
+
+ | Category | Action |
+ |---------|--------|
+ | **Critical** (breaks things now or soon) | Fix in current sprint |
+ | **Important** (slowing team down) | Schedule within next 2 sprints |
+ | **Nice to fix** (code smell, not blocking) | Add to backlog; don't block on it |
+
+ **Healthy tech debt budget:** ≤ 20% of sprint capacity. > 30% = team velocity will degrade.
+
+ **ADR (Architecture Decision Record):** Every significant architectural choice should have: Context → Decision → Consequences. Store in `/docs/architecture/`.
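
A minimal ADR sketch in that Context → Decision → Consequences shape (the number and the decision below are illustrative):

```
# ADR-0007: Route reporting queries to a read replica

## Context
Analytics queries contend with OLTP writes on the primary Postgres during peak hours.

## Decision
Add a managed read replica and point all reporting reads at it.

## Consequences
+ Primary handles writes without contention
− Reports can lag by the replication delay (seconds)
− One more piece of infrastructure to monitor
```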
+
+ ---
+
+ ## System Design Principles
+
+ **For APIs:**
+ - Design for idempotence — same request = same result (safe to retry)
+ - Version from day 1: `/v1/` prefix
+ - Paginate everything that returns lists
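
A minimal TypeScript sketch of those three rules (Express; `db` and `cache` are hypothetical stand-ins for the data layer):

```ts
import express from "express";

// Hypothetical stand-ins for the application's data layer.
declare const db: {
  invoices: { list(opts: { after?: string; limit: number }): Promise<{ id: string }[]> };
  charges: { create(body: unknown): Promise<object> };
};
declare const cache: {
  get(key: string): Promise<object | null>;
  set(key: string, value: object, opts: { ttl: number }): Promise<void>;
};

const app = express();
app.use(express.json());

// Versioned + paginated: cursor-based, never an unbounded list.
app.get("/v1/invoices", async (req, res) => {
  const limit = Math.min(Number(req.query.limit ?? 50), 100);
  const items = await db.invoices.list({ after: req.query.cursor as string | undefined, limit });
  res.json({ items, next_cursor: items.at(-1)?.id ?? null });
});

// Idempotent create: replaying the same Idempotency-Key returns the stored
// result instead of creating a second charge.
app.post("/v1/charges", async (req, res) => {
  const key = req.header("Idempotency-Key");
  if (!key) return res.status(400).json({ error: "Idempotency-Key header required" });
  const replay = await cache.get(`idem:${key}`);
  if (replay) return res.status(200).json(replay);
  const charge = await db.charges.create(req.body);
  await cache.set(`idem:${key}`, charge, { ttl: 86_400 });
  return res.status(201).json(charge);
});
```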
+
+ **For databases:**
+ - Single writer, multiple readers (read replicas) before sharding
+ - Index on columns you query/sort/filter by
+ - Schema migrations: always backward-compatible + rollback script
+
+ **For reliability:**
+ - Circuit breakers around external API calls
+ - Bulkhead pattern — isolate failures so one component can't take down others
+ - Graceful degradation > hard failure
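
A sketch of the circuit breaker plus graceful degradation combo (thresholds are illustrative):

```ts
// Opens after `maxFailures` consecutive errors; while open, callers get the
// fallback immediately instead of waiting on a failing dependency.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  constructor(private maxFailures = 5, private resetMs = 30_000) {}

  async call<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    const open = this.failures >= this.maxFailures;
    if (open && Date.now() - this.openedAt < this.resetMs) return fallback();
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch {
      this.failures += 1;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      return fallback(); // degrade gracefully rather than throwing
    }
  }
}

const fxBreaker = new CircuitBreaker();
const rates = await fxBreaker.call(
  () => fetch("https://api.example.com/rates").then(r => r.json()),
  () => ({ stale: true, rates: {} }) // stale data beats a hard failure
);
```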
+
+ ---
+
+ ## ❌ Anti-Patterns to Avoid
+
+ | ❌ NEVER DO | Why | ✅ DO INSTEAD |
+ |------------|-----|--------------|
+ | Microservices at day 1 | Distributed systems complexity with no scale benefit | Monolith until pain forces it |
+ | No caching strategy | DB bottleneck at moderate scale | Cache at CDN, application, and DB levels |
+ | Shared mutable state between services | Impossible to reason about, cascading failures | Each service owns its data |
+ | Schema migration as afterthought | One deploy breaks prod for hours | Migration in deploy pipeline, tested in staging |
+ | Hire senior engineers only | Expensive, over-engineered, slow iteration | 1 senior for every 3–4 junior/mid |
+ | Rewrite instead of refactor | "We'll rewrite it right this time" → new mess | Strangler fig pattern for legacy rewrites |
+ | Single point of failure | One crash = all users down | Load balancer + multiple instances from early on |
+
+ ---
+
+ ## Hiring Benchmarks
+
+ | Ratio | Rule |
+ |-------|------|
+ | Senior : Mid/Junior | 1 : 3–4 (sustainable) |
+ | PM : engineers (B2B SaaS) | 0.5–1 PM per 3–5 engineers |
+ | Time to hire (senior) | 6–12 weeks |
+ | Engineering velocity signal | Feature cycle time (spec to production) < 2 weeks = healthy |
+
+ ---
+
+ ## Questions You Always Ask
+
+ **When reviewing architecture:**
+ - What's the biggest single point of failure right now?
+ - If traffic 10×'d tonight, what breaks first?
+ - Can we roll back the last deploy in under 5 minutes?
+ - Is there an ADR for this decision?
+
+ **When evaluating tech stack choices:**
+ - Is this technology boring and proven, or novel and risky?
+ - How well do we understand the failure modes?
+ - What does the hiring market look like for this technology?
+
+ ---
+
+ ## Red Flags
+
+ **Must fix:**
+ - [ ] No staging environment (production = first place bugs appear)
+ - [ ] No observability (no logs, metrics, or traces in production)
+ - [ ] Single point of failure with no redundancy
+ - [ ] Architectural decisions undocumented (will be re-debated)
+
+ **Should fix:**
+ - [ ] Tech debt consuming > 30% of sprint capacity
+ - [ ] Microservices with < 100K MAU (unnecessary complexity)
+ - [ ] No on-call rotation for production incidents
+ - [ ] No disaster recovery / restore-from-backup drill in last 6 months
+
+ ---
+
+ ## Who to Pair With
+ - `devops-engineer` — for infrastructure execution and reliability engineering
+ - `backend-developer` — for API and database architecture decisions
+ - `data-analyst` — for observability and metrics infrastructure
+
+ ---
+
+ ## Tools
+ draw.io / Miro / Excalidraw (architecture diagrams) · ADR tools (adr-tools) · SonarQube (code quality) · Snyk (security scanning) · Linear / Jira (tech debt tracking)
--- /dev/null
+++ b/package/template/agent/skills/customer-success-manager/SKILL.md
@@ -0,0 +1,126 @@
+ ---
+ name: customer-success-manager
+ description: Use when managing user support, building feedback loops, tracking NPS/CSAT, handling churn, or building a customer-centric culture
+ ---
+
+ # Customer Success Manager Lens
+
+ > **Philosophy:** Customer success is proactive, not reactive. Support is reactive.
+ > A user who succeeds doesn't churn. A user who churns was never fully successful.
+
+ ---
+
+ ## Core Instincts
+
+ - **Proactive > reactive** — reach out before users struggle, not after they cancel
+ - **Churn happens before cancellation** — disengagement is the real churn event (usually 2–4 weeks before cancel)
+ - **Every complaint is a gift** — unhappy users who complain are giving you free product research; silent churners aren't
+ - **Response time = trust signal** — slow responses signal that you don't care about users
+ - **Success = users achieving their goal** — not "user didn't cancel yet"
+
+ ---
+
+ ## Response Time SLAs
+
+ | Priority | Situation | SLA |
+ |----------|-----------|-----|
+ | **P1 — Critical** | App is down, data loss, payment issue | < 1 hour |
+ | **P2 — High** | Core feature broken, blocking user's work | < 4 hours |
+ | **P3 — Medium** | Non-blocking bug, confusion with feature | < 24 hours |
+ | **P4 — Low** | Feature request, general question | < 48 hours |
+
+ **First response time benchmarks:** < 5 minutes = exceptional; < 1 hour = good; > 4 hours = churn risk.
+
+ ---
+
+ ## NPS Interpretation
+
+ | Score | Interpretation | Action |
+ |-------|---------------|--------|
+ | > 70 | World-class | Leverage promoters for referrals and testimonials |
+ | 50–70 | Excellent | Double down on what promoters love |
+ | 30–50 | Good | Investigate and convert passives (7–8) |
+ | 0–30 | Needs work | Focus on detractors — what's the consistent complaint? |
+ | < 0 | Crisis | Deep qualitative research required immediately |
+
+ **NPS survey timing:** Send after the first success event, not on sign-up. Re-survey every 90 days.
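
For reference, the underlying calculation: promoters (9–10) minus detractors (0–6), as a percentage of all responses.

```ts
// NPS = % promoters (score 9–10) − % detractors (score 0–6); passives (7–8) only dilute.
function nps(scores: number[]): number {
  const promoters = scores.filter(s => s >= 9).length;
  const detractors = scores.filter(s => s <= 6).length;
  return Math.round(((promoters - detractors) / scores.length) * 100);
}

nps([10, 9, 9, 8, 7, 6, 3]); // 3 promoters, 2 detractors, 7 responses → +14
```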
+
+ ---
+
+ ## Churn Signal Detection
+
+ | Signal | Days before churn (avg) | Action |
+ |--------|------------------------|--------|
+ | No login for 7 days | 14–21 days | Automated re-engagement + personal email |
+ | Support ticket marked unresolved | 3–7 days | Escalate; follow up personally |
+ | Downgrade plan | 0–14 days | Check-in call or personalized offer |
+ | Opened cancellation page | 0–3 days | Trigger save flow immediately |
+ | Multiple failed payments | 0–7 days | Dunning email sequence (3 emails over 7 days) |
+
+ ---
+
+ ## ❌ Anti-Patterns to Avoid
+
+ | ❌ NEVER DO | Why | ✅ DO INSTEAD |
+ |------------|-----|--------------|
+ | Auto-close support tickets without resolution | Users re-open, feel dismissed | Confirmation before closing: "Did we solve this?" |
+ | Generic reply templates | Feel like a robot, destroy trust | Personalize every reply (use name, reference issue) |
+ | No cancel/churn flow | 20–40% of cancellers are saveable | Pause option, downgrade option, discount offer |
+ | Collect NPS without acting on feedback | Users stop responding ("useless surveys") | Close the loop: tell users what you changed |
+ | Reply only to 5-star reviews | 1-star respondents are the most valuable | Respond to every 1–3 star review publicly |
+ | Treat all churn equally | Different churn reasons need different solutions | Segment: voluntary vs involuntary, reason codes |
+
+ ---
+
+ ## Dunning (Failed Payment Recovery)
+
+ ```
+ Day 0: First failed charge → Email: friendly heads-up, update card CTA
+ Day 3: Second attempt + Email: "Is this the right card?"
+ Day 7: Third attempt + Email: "Your account access is at risk"
+ Day 14: Cancellation + Email: "We hate to see you go — here's how to reactivate"
+ ```
+
+ **Involuntary churn (failed payments) = typically 20–40% of all churn.** Always set up dunning.
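
A sketch of driving that sequence from Stripe's `invoice.payment_failed` webhook (retry timing itself is configured in Stripe Billing; `sendEmail` and `cancelSubscription` are hypothetical app helpers, and the invoice fields follow the classic API shape):

```ts
import Stripe from "stripe";

// Hypothetical application helpers.
declare function sendEmail(to: string, template: string): Promise<void>;
declare function cancelSubscription(subscriptionId: string): Promise<void>;

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

export async function handleStripeWebhook(payload: Buffer, signature: string) {
  const event = stripe.webhooks.constructEvent(payload, signature, process.env.STRIPE_WEBHOOK_SECRET!);
  if (event.type !== "invoice.payment_failed") return;

  const invoice = event.data.object as Stripe.Invoice;
  if (!invoice.customer_email) return;

  switch (invoice.attempt_count) {
    case 1: await sendEmail(invoice.customer_email, "payment-failed-heads-up"); break; // Day 0
    case 2: await sendEmail(invoice.customer_email, "is-this-the-right-card"); break;  // ~Day 3
    case 3: await sendEmail(invoice.customer_email, "account-access-at-risk"); break;  // ~Day 7
    default: // retries exhausted → cancel + reactivation email (~Day 14)
      await cancelSubscription(invoice.subscription as string);
      await sendEmail(invoice.customer_email, "how-to-reactivate");
  }
}
```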
+
+ ---
+
+ ## Questions You Always Ask
+
+ **When reviewing CS operations:**
+ - What's the current first response time? (P2 benchmark: < 4 hours)
+ - What's the most common support ticket category? (Pattern = product/UX issue)
+ - What's the NPS, and are we surveying at the right time?
+ - What % of churn is voluntary vs involuntary?
+
+ **When a user churns:**
+ - Did we get exit survey data? What was the stated reason?
+ - Were there warning signals we could have acted on earlier?
+ - Is this a one-off or a pattern we see across multiple users?
+
+ ---
+
+ ## Red Flags
+
+ **Must fix:**
+ - [ ] P1 response time > 4 hours
+ - [ ] No exit survey on cancellation
+ - [ ] No dunning email sequence for failed payments
+ - [ ] NPS < 0 with no active investigation
+
+ **Should fix:**
+ - [ ] No churn signal tracking (usage drop, login frequency)
+ - [ ] Support tickets closed without user confirmation
+ - [ ] NPS survey sent on day 1 (too early)
+
+ ---
+
+ ## Who to Pair With
+ - `retention-specialist` — for proactive retention and churn prevention
+ - `product-manager` — convert support patterns into product improvements
+ - `data-analyst` — for churn analysis and cohort health monitoring
+
+ ---
+
+ ## Tools
+ Intercom · Crisp · HelpScout · Zendesk (support) · Delighted · Typeform (NPS) · Stripe Billing Smart Retries / Chargebee (dunning)
--- /dev/null
+++ b/package/template/agent/skills/data-analyst/SKILL.md
@@ -0,0 +1,147 @@
+ ---
+ name: data-analyst
+ description: Use when setting up metrics frameworks, analyzing funnels, running cohort analysis, designing dashboards, or evaluating A/B test results
+ ---
+
+ # Data Analyst Lens
+
+ > **Philosophy:** If you can't measure it, you can't improve it — but measuring the wrong thing is worse than measuring nothing.
+ > Good data asks better questions. It rarely answers them alone.
+
+ ---
+
+ ## Core Instincts
+
+ - **North Star Metric first** — one metric that best captures value delivered to users
+ - **Correlation ≠ causation** — always ask "what else changed?" before attributing a result
+ - **Segment always** — averages hide everything; cohort and segment data reveals reality
+ - **Lagging vs leading indicators** — revenue is lagging (past); activation is leading (predicts future)
+ - **Statistical significance is a bar, not a target** — p < 0.05 means 1 in 20 tests will false-positive
+
+ ---
+
+ ## North Star Metric Selection
+
+ | Product Type | Example North Star Metric |
+ |-------------|--------------------------|
+ | Productivity / Utility | Tasks completed per week |
+ | Health / Fitness | Workouts logged per month |
+ | Social | Messages sent per DAU |
+ | E-commerce | Revenue per monthly visitor |
+ | SaaS / B2B | Weekly active seats |
+ | Mobile subscription | D30 retained paying users |
+
+ **NSM must:** correlate with revenue, be measurable weekly, be understandable by the whole team.
+
+ ---
+
+ ## Standard Metrics Framework
+
+ ```
+ Acquisition: CAC, installs, signups, traffic source breakdown
+ Activation: Activation rate, time-to-aha-moment, onboarding completion %
+ Retention: D1/D7/D30 retention, DAU/MAU ratio, session frequency
+ Revenue: MRR, ARR, ARPU, LTV, churn rate (voluntary + involuntary)
+ Referral: Viral coefficient K, NPS, referral program conversion
+ ```
+
+ **DAU/MAU ratio** = engagement quality indicator:
+ - 50%+ = highly engaging (social / gaming)
+ - 20–40% = good (productivity tools)
+ - < 10% = low engagement / retention problem
+
+ ---
+
+ ## A/B Test Significance
+
+ | Metric | Requirement |
+ |--------|-------------|
+ | Sample size per variant | ≥ 1,000 (for conversion rates) |
+ | Minimum test duration | 2 weeks (captures weekly patterns) |
+ | Statistical significance | p < 0.05 (95% confidence) |
+ | Practical significance | Δ > 5% (otherwise not actionable) |
+ | Type I error risk | 5% — 1 in 20 "significant" results is a false positive |
+ | Type II error | Run power analysis before the test (sample size calculator) |
+
+ **Never stop a test early** — stopping when significance is first reached inflates the Type I error rate.
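
A self-contained significance check for conversion tests: a two-proportion z-test sketch (assumes large samples; use a real stats library for anything load-bearing).

```ts
// Two-tailed p-value for "did variant B convert differently from A?"
function abTestPValue(convA: number, nA: number, convB: number, nB: number): number {
  const pooled = (convA + convB) / (nA + nB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  const z = Math.abs(convB / nB - convA / nA) / se;
  return 2 * (1 - 0.5 * (1 + erf(z / Math.SQRT2)));
}

// Abramowitz–Stegun erf approximation (max error ~1.5e-7).
function erf(x: number): number {
  const t = 1 / (1 + 0.3275911 * Math.abs(x));
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
  const y = 1 - poly * Math.exp(-x * x);
  return x >= 0 ? y : -y;
}

abTestPValue(100, 1000, 130, 1000); // 10% vs 13% conversion → p ≈ 0.035, significant at p < 0.05
```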
+
+ ---
+
+ ## Cohort Analysis Interpretation
+
+ ```
+ Week 0 cohort: users who signed up in week 0
+ Retention at Day 30 = % of week 0 cohort still active on day 30
+
+ Healthy retention curve: steep drop Day 0→7, then flattens (users who stay, stay)
+ Unhealthy curve: no flattening, continues declining → no core retained audience
+ ```
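
Computing that curve from a raw activity log (a sketch; `ActivityEvent` is a hypothetical shape keyed by days since each user's signup):

```ts
type ActivityEvent = { userId: string; daysSinceSignup: number };

// % of a signup cohort still active exactly N days after signing up.
function retentionAtDay(cohortUserIds: string[], events: ActivityEvent[], day: number): number {
  const activeOnDay = new Set(events.filter(e => e.daysSinceSignup === day).map(e => e.userId));
  const retained = cohortUserIds.filter(id => activeOnDay.has(id)).length;
  return (retained / cohortUserIds.length) * 100;
}

// Evaluate at day 1, 7, 30 per signup-week cohort; a healthy curve flattens after the early drop.
```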
+
+ ---
+
+ ## ❌ Anti-Patterns to Avoid
+
+ | ❌ NEVER DO | Why | ✅ DO INSTEAD |
+ |------------|-----|--------------|
+ | Report averages only | Averages hide bimodal distributions | Report medians + percentiles (p50, p90, p99) |
+ | Declare test winner before reaching significance | False positive — winner may be noise | Predetermined sample size + duration |
+ | Track everything, focus on nothing | Data overload → analysis paralysis | 3–5 top metrics per team |
+ | Compare dissimilar cohorts | Apples vs oranges | Cohort by signup date, not current period |
+ | Attribute all growth to last-click | Multi-touch attribution required | Use first-touch + last-touch + time-decay models |
+ | Ignore data quality | Garbage in, garbage out | Instrument → validate → trust |
+
+ ---
+
+ ## Questions You Always Ask
+
+ **When setting up metrics:**
+ - What is the North Star Metric, and how often can we measure it?
+ - Is this a leading or lagging indicator?
+ - How will data be collected — are there gaps in instrumentation?
+
+ **When analyzing a result:**
+ - Is the sample size large enough for significance?
+ - Could a confounding variable explain this change?
+ - Does the result hold when segmented by cohort/device/acquisition source?
+
+ ---
+
+ ## Red Flags
+
+ **Must fix:**
+ - [ ] No North Star Metric defined
+ - [ ] A/B tests declared significant before reaching 1,000 per variant
+ - [ ] No event tracking on key activation events
+ - [ ] Reporting only total signups / installs (not activated users)
+
+ **Should fix:**
+ - [ ] No cohort retention analysis (only aggregate retention)
+ - [ ] All metrics reported as averages (no percentiles)
+ - [ ] Dashboard not reviewed in weekly team ritual
+
+ ---
+
+ ## Who to Pair With
+ - `growth-hacker` — for AARRR funnel analysis and experiment design
+ - `product-manager` — for North Star Metric definition and outcome tracking
+ - `retention-specialist` — for retention curve and churn cohort analysis
+
+ ---
+
+ ## Key Formulas
+
+ ```
+ MRR = paying_users × ARPU
+ ARR = MRR × 12
+ LTV = ARPU / monthly_churn_rate
+ CAC = total_acquisition_spend / new_customers
+ LTV:CAC ratio ≥ 3:1
+ DAU/MAU ratio = (DAU / MAU) × 100%
+ Viral coefficient K = invites_per_user × invite_conversion_rate
+ Monthly churn = churned_this_month / users_start_of_month
+ ```
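
The same formulas as TypeScript, with a worked example (figures are illustrative; keep monetary units consistent):

```ts
const mrr = (payingUsers: number, arpu: number) => payingUsers * arpu;
const arr = (monthlyRecurring: number) => monthlyRecurring * 12;
const ltv = (arpu: number, monthlyChurnRate: number) => arpu / monthlyChurnRate;
const cac = (spend: number, newCustomers: number) => spend / newCustomers;
const dauMau = (dau: number, mau: number) => (dau / mau) * 100;
const viralK = (invitesPerUser: number, inviteConversion: number) => invitesPerUser * inviteConversion;

// $29 ARPU at 4% monthly churn, $15K spend for 100 new customers:
ltv(29, 0.04);                     // $725
cac(15_000, 100);                  // $150
ltv(29, 0.04) / cac(15_000, 100);  // LTV:CAC ≈ 4.8, clears the 3:1 bar
```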
+
+ ---
+
+ ## Tools
+ Mixpanel · Amplitude · PostHog (self-hosted) · Metabase · Google Looker Studio · Statsig / LaunchDarkly (experiment platform) · Segment (data pipeline) · BigQuery / Redshift (data warehouse)
--- /dev/null
+++ b/package/template/agent/skills/devops-engineer/SKILL.md
@@ -0,0 +1,117 @@
+ ---
+ name: devops-engineer
+ description: Use when working on CI/CD pipelines, infrastructure, deployment, monitoring, or reliability engineering — regardless of cloud provider
+ ---
+
+ # DevOps Engineer Lens
+
+ > **Philosophy:** Automate everything deployable. Observe everything running. Fail safely, recover fast.
+ > If it's not in version control, it doesn't exist. If it's not monitored, it will fail silently.
+
+ ---
+
+ ## ⚠️ ASK BEFORE ASSUMING
+
+ | What | Why it matters |
+ |------|----------------|
+ | **Cloud provider?** AWS / GCP / Azure / Fly / Railway | Determines services and tooling |
+ | **Team size?** Solo / small team | Determines complexity vs value trade-offs |
+ | **Current deploy process?** Manual / CI/CD | Determines where to start |
+ | **SLO requirements?** 99.9% / 99.99% | Drives infrastructure decisions |
+
+ When unspecified, assume small team + Docker + GitHub Actions + managed cloud (Railway/Fly/Render).
+
+ ---
+
+ ## Core Instincts
+
+ - **Immutable infrastructure** — never SSH to patch production; redeploy instead
+ - **Observability-first** — logs, metrics, traces before adding features
+ - **Fail fast, recover faster** — MTTR matters more than MTBF for indie hackers
+ - **Automate the deploy path** — every manual step is a future incident waiting to happen
+ - **Secrets are not config** — credentials never live in code or environment variables baked into images
+
+ ---
+
+ ## Reliability Thresholds
+
+ | SLO | Allowed downtime/month | Allowed downtime/year |
+ |-----|----------------------|----------------------|
+ | 99% | 7.3 hours | 3.65 days |
+ | 99.5% | 3.6 hours | 1.83 days |
+ | **99.9%** | **43 minutes** | **8.7 hours** |
+ | 99.95% | 21 minutes | 4.4 hours |
+ | 99.99% | 4.3 minutes | 52 minutes |
+
+ **For indie hackers:** 99.9% is the right target. 99.99% requires significant investment — only worth it when downtime costs > infra cost.
+
+ **Key metrics:**
+ - **MTTR** (Mean Time to Recovery): target < 15 min for P1 incidents
+ - **MTBF** (Mean Time Between Failures): track over a rolling 30 days
+ - **Deploy frequency**: healthy = multiple times/day; red flag = < once/week
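
The table above is just arithmetic on the error budget; a quick sketch for any SLO:

```ts
// Allowed downtime for a given SLO (30-day month, 365-day year).
function downtimeBudget(slo: number) {
  return {
    minutesPerMonth: (1 - slo) * 30 * 24 * 60,
    hoursPerYear: (1 - slo) * 365 * 24,
  };
}

downtimeBudget(0.999);  // { minutesPerMonth: 43.2, hoursPerYear: 8.76 }
downtimeBudget(0.9999); // { minutesPerMonth: 4.32, hoursPerYear: 0.876 } ≈ 52 min/year
```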
+
+ ---
+
+ ## ❌ Anti-Patterns to Avoid
+
+ | ❌ NEVER DO | Why | ✅ DO INSTEAD |
+ |------------|-----|--------------|
+ | Deploy directly from local machine | "Works on my machine" incidents, no audit trail | CI/CD pipeline always |
+ | No staging environment | Production = first time bugs are discovered | Staging that mirrors prod |
+ | Secrets in `.env` committed to git | One git history leak = all creds compromised | Doppler / AWS Secrets Manager / Vault |
+ | Long-lived feature branches | Merge conflicts, integration hell | Trunk-based dev + feature flags |
+ | No rollback plan | Bad deploy = extended outage | Blue-green or canary + 1-click rollback |
+ | Alerts on everything | Alert fatigue = ignored alerts | Page only on SLO breaches, not symptoms |
+ | Manual database migrations | Easy to forget, easy to run in the wrong order | Migration runner in deploy pipeline |
+
+ ---
+
+ ## Questions You Always Ask
+
+ **When designing infrastructure:**
+ - What's the rollback plan if this deploy goes wrong?
+ - What does a 10× traffic spike do to this setup?
+ - How long does a full restore from backup take?
+ - Who gets paged when this fails at 3am?
+
+ **When reviewing CI/CD:**
+ - Does every PR get tested before merge?
+ - Are secrets injected at runtime, not baked into images?
+ - Is the deploy pipeline idempotent (safe to re-run)?
+
+ ---
+
+ ## Red Flags in Code Review / Infrastructure Review
+
+ **Must fix:**
+ - [ ] Secrets in source code, Dockerfiles, or `.env` committed to repo
+ - [ ] No health check endpoint on services
+ - [ ] No automated tests in CI pipeline
+ - [ ] Manual production deploys with no audit trail
+
+ **Should fix:**
+ - [ ] No staging environment (or staging diverged from prod)
+ - [ ] Database backups untested (backup ≠ restore test)
+ - [ ] Alerts firing on every error (not SLO-based)
+ - [ ] Single point of failure with no redundancy
+
+ ---
+
+ ## Who to Pair With
+ - `backend-developer` — for deployment architecture of APIs
+ - `data-analyst` — for metrics pipeline and observability stack
+ - `cto-architect` — for scaling decisions and infrastructure design
+
+ ---
+
+ ## Tool Reference
+
+ | Category | Tools |
+ |----------|-------|
+ | CI/CD | GitHub Actions, GitLab CI, CircleCI |
+ | Container | Docker, Kubernetes (k8s when you have a team), Fly.io, Railway |
+ | Secrets management | Doppler, AWS Secrets Manager, 1Password Secrets |
+ | Monitoring | Datadog, Grafana + Prometheus, Better Uptime |
+ | Error tracking | Sentry, Bugsnag |
+ | Logging | Papertrail, Logtail, CloudWatch |
+ | IaC | Terraform, Pulumi (for teams), SST (for AWS serverless) |
--- /dev/null
+++ b/package/template/agent/skills/email-infrastructure/SKILL.md
@@ -0,0 +1,164 @@
+ ---
+ name: email-infrastructure
+ description: Use when setting up transactional email, managing deliverability, configuring SPF/DKIM/DMARC, building email templates, or debugging email delivery issues
+ ---
+
+ # Email Infrastructure Lens
+
+ > **Philosophy:** Deliverability is a reputation game. One spam complaint can blacklist your domain for weeks.
+ > Transactional email is infrastructure — it must be reliable, observable, and tenant-isolated.
+
+ ---
+
+ ## Core Instincts
+
+ - **Domain reputation is fragile** — separate transactional from marketing; don't let bulk mail ruin auth emails
+ - **Sending ≠ delivering** — always verify delivery via bounce/open tracking and suppression lists
+ - **Never send from your root domain** — use a subdomain (`mail.yourdomain.com`) to protect your primary domain's reputation
+ - **Warm up new IPs/domains** — cold domains go to spam; ramp gradually
+ - **Unsubscribes are legal obligations** — CAN-SPAM and GDPR require easy opt-out
+
+ ---
+
+ ## Email Type Separation
+
+ | Type | Examples | Volume | Sender domain | Provider pool |
+ |------|----------|--------|---------------|--------------|
+ | **Transactional** | Password reset, invoice, welcome | Low | `mail.yourdomain.com` | Dedicated / transactional |
+ | **Lifecycle / product** | Trial ending, usage nudges | Medium | `mail.yourdomain.com` | Dedicated / transactional |
+ | **Marketing / newsletters** | Product updates, promotions | High | `newsletter.yourdomain.com` | Separate / marketing |
+
+ ❗ **Critical:** Marketing and transactional must use separate sending pools. A spam complaint on a newsletter should never affect password reset delivery.
+
+ ---
+
+ ## DNS Authentication (Must Have)
+
+ ```
+ SPF (Sender Policy Framework)
+ → Declares which IPs are allowed to send email from your domain
+ → Add to DNS: TXT record on yourdomain.com
+ → Example: "v=spf1 include:sendgrid.net include:resend.com ~all"
+ → Max 10 DNS lookups (hard limit); use flattening tools if exceeded
+
+ DKIM (DomainKeys Identified Mail)
+ → Cryptographic signature proving email wasn't tampered with
+ → Your ESP generates CNAME records; add to DNS
+ → Check: "selector._domainkey.yourdomain.com"
+
+ DMARC (Domain-based Message Authentication, Reporting & Conformance)
+ → Policy: what to do with emails that fail SPF/DKIM
+ → Start: p=none (monitor) → move to p=quarantine → p=reject
+ → Add: _dmarc.yourdomain.com TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@yourdomain.com"
+ → DMARC aggregate reports tell you who's failing and why
+
+ Required order: SPF → DKIM → DMARC
+ Without all three: Google/Yahoo bulk sender requirements (2024) → emails rejected
+ ```
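
A quick SPF/DMARC sanity check from Node (DKIM selectors vary by ESP, so verify those in the ESP dashboard; `yourdomain.com` is the same placeholder as above, and `resolveTxt` throws if a name has no TXT records at all):

```ts
import { resolveTxt } from "node:dns/promises";

const domain = "yourdomain.com";

// TXT records arrive as chunk arrays; join each record before matching.
const rootTxt = (await resolveTxt(domain)).map(chunks => chunks.join(""));
const dmarcTxt = (await resolveTxt(`_dmarc.${domain}`)).map(chunks => chunks.join(""));

console.log({
  spf: rootTxt.filter(r => r.startsWith("v=spf1")),       // empty array → SPF record missing
  dmarc: dmarcTxt.filter(r => r.startsWith("v=DMARC1")),  // empty array → DMARC record missing
});
```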
+
+ ---
+
+ ## Deliverability Rules
+
+ | Rule | Why |
+ |------|-----|
+ | Spam complaint rate < **0.08%** | Google/Yahoo threshold; above = Gmail blocks |
+ | Hard bounce rate < **2%** | Remove bounced emails immediately |
+ | List hygiene: unverified emails | Never send to addresses that haven't confirmed |
+ | Unsubscribe link required | CAN-SPAM (US) + GDPR (EU) legal requirement |
+ | One-click unsubscribe (RFC 8058) | Gmail requires for bulk senders (> 5K/day) |
+ | Text version alongside HTML | Many spam filters penalize HTML-only |
+
+ ---
+
+ ## Email Queue Architecture
+
+ ```
+ ❌ NEVER send email synchronously in request handler:
+ POST /reset-password → send email → respond
+
+ ✅ Queue email jobs:
+ POST /reset-password → create job in queue → respond 200
+ ↓ (async)
+ Worker picks up job → send via ESP → log result
+
+ Why: Email sending can take 1–3 seconds; timeouts → duplicate sends → user frustration
+ Queue retry: 3 attempts with exponential backoff (1s, 5s, 30s)
+ ```
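
The same pattern with BullMQ (listed under Tools below). `sendViaEsp` is a hypothetical wrapper around your ESP client; note that BullMQ's exponential backoff doubles the delay per attempt (1s, 2s, 4s here) rather than matching the 1s/5s/30s schedule exactly.

```ts
import { Queue, Worker } from "bullmq";

// Hypothetical ESP wrapper (Resend/SendGrid/Postmark/SES underneath).
declare function sendViaEsp(msg: { to: string; template: string; vars: object }): Promise<void>;

const connection = { host: "localhost", port: 6379 };
const emailQueue = new Queue("email", { connection });

// Request-handler side: enqueue and respond immediately.
export async function queuePasswordReset(to: string, resetUrl: string) {
  await emailQueue.add(
    "password-reset",
    { to, resetUrl },
    { attempts: 3, backoff: { type: "exponential", delay: 1_000 } }
  );
}

// Worker process: actually sends; exhausted jobs land in the failed set for inspection.
new Worker<{ to: string; resetUrl: string }>(
  "email",
  async job => {
    await sendViaEsp({ to: job.data.to, template: job.name, vars: { resetUrl: job.data.resetUrl } });
  },
  { connection }
);
```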
+
+ ---
+
+ ## Template Best Practices
+
+ ```
+ Structure:
+ - Max width: 600px (renders correctly in all clients)
+ - Always include plaintext alternative
+ - Inline CSS only (Gmail strips <style> blocks)
+ - Images: always include alt text; assume images are blocked
+ - CTA button: use table-based HTML (VML for Outlook)
+
+ Testing:
+ - Litmus / Email on Acid for client rendering
+ - SpamAssassin score < 2 (most spam filters use SA)
+ - Check: mail-tester.com (free quick test)
+ ```
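
A minimal template sketch with React Email (under Tools below); recent versions of `render` return a promise, and the `plainText` option covers the plaintext-alternative rule:

```tsx
import { Html, Body, Container, Text, Button } from "@react-email/components";
import { render } from "@react-email/render";

// Styles are inline objects, so nothing depends on <style> blocks Gmail would strip.
function ResetPasswordEmail({ resetUrl }: { resetUrl: string }) {
  return (
    <Html>
      <Body style={{ fontFamily: "sans-serif" }}>
        <Container style={{ maxWidth: "600px" }}>
          <Text>Click the button below to reset your password. The link expires in one hour.</Text>
          <Button href={resetUrl} style={{ background: "#000", color: "#fff", padding: "12px 20px" }}>
            Reset password
          </Button>
        </Container>
      </Body>
    </Html>
  );
}

const email = <ResetPasswordEmail resetUrl="https://app.example.com/reset?token=..." />;
const html = await render(email);
const text = await render(email, { plainText: true });
```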
+
+ ---
+
+ ## ❌ Anti-Patterns to Avoid
+
+ | ❌ NEVER DO | Why | ✅ DO INSTEAD |
+ |------------|-----|--------------|
+ | Send from root domain | Spam complaints = root domain blacklisted | Use `mail.yourdomain.com` subdomain |
+ | Marketing + transactional same pool | Marketing spam rates kill auth email delivery | Separate sender pools |
+ | No SPF/DKIM/DMARC | Emails rejected by Gmail/Yahoo (2024 policy) | Configure all three before launch |
+ | Retry email without checking bounces | Sending to bounced emails = reputation damage | Remove hard bounces immediately |
+ | Suppress all email on one unsubscribe | User unsubscribes from marketing, loses auth emails | Separate marketing vs transactional opt-out lists |
+ | Send email synchronously in API handler | Timeouts → duplicate sends → user sees email twice | Job queue always |
+
+ ---
+
+ ## Questions You Always Ask
+
+ **When setting up email:**
+ - Are SPF, DKIM, and DMARC configured? (Check: `mxtoolbox.com`)
+ - Are transactional and marketing emails on separate sending pools?
+ - Is email sending queued (not synchronous in the request)?
+ - What happens when an email bounces? Is the address suppressed?
+
+ **When debugging delivery issues:**
+ - What does the ESP delivery log show? Was it accepted or rejected?
+ - Is the DMARC report showing authentication failures?
+ - What's the spam complaint rate this week?
+
+ ---
+
+ ## Red Flags
+
+ **Must fix:**
+ - [ ] No DKIM/SPF/DMARC configured (emails fail Gmail/Yahoo)
+ - [ ] Transactional and marketing sent from same pool
+ - [ ] Bounced addresses not being suppressed
+ - [ ] Email sent synchronously in request handler
+
+ **Should fix:**
+ - [ ] No plaintext version of HTML emails
+ - [ ] No DMARC report monitoring
+ - [ ] Unsubscribe not honored within 10 business days (CAN-SPAM requirement)
+
+ ---
+
+ ## Who to Pair With
+ - `backend-developer` — for queue implementation and webhook handling
+ - `security-engineer` — for email token security (reset links, magic links)
+ - `devops-engineer` — for DNS configuration and monitoring
+
+ ---
+
+ ## Tools
+ **ESP:** Resend · SendGrid · Postmark · AWS SES
+ **Testing:** mail-tester.com · Litmus · Email on Acid
+ **DNS check:** MXToolbox · DMARC Analyzer
+ **Templates:** React Email · MJML
+ **Queue:** BullMQ / Inngest / Trigger.dev