engsys 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (173)
  1. package/LICENSE +21 -0
  2. package/README.md +202 -0
  3. package/core/agents/aaron.md +152 -0
  4. package/core/agents/bert.md +115 -0
  5. package/core/agents/isabelle.md +136 -0
  6. package/core/agents/jody.md +150 -0
  7. package/core/agents/leith.md +111 -0
  8. package/core/agents/marcelo.md +282 -0
  9. package/core/agents/melvin.md +101 -0
  10. package/core/agents/nyx.md +152 -0
  11. package/core/agents/otto.md +168 -0
  12. package/core/agents/patricia.md +283 -0
  13. package/core/commands/design-audit-local.md +155 -0
  14. package/core/commands/design-audit.md +235 -0
  15. package/core/commands/design-critique.md +96 -0
  16. package/core/commands/file-issue.md +22 -0
  17. package/core/commands/generate-project.md +45 -0
  18. package/core/commands/implement-issue.md +37 -0
  19. package/core/commands/implement-project.md +40 -0
  20. package/core/commands/naturalize.md +61 -0
  21. package/core/commands/pre-push.md +29 -0
  22. package/core/commands/prep-review-collect.md +130 -0
  23. package/core/commands/prep-review-finalize.md +121 -0
  24. package/core/commands/prep-review-publish.md +113 -0
  25. package/core/commands/prep-review.md +65 -0
  26. package/core/commands/project-closeout.md +25 -0
  27. package/core/skills/agentic-eval/SKILL.md +195 -0
  28. package/core/skills/chrome-devtools/SKILL.md +97 -0
  29. package/core/skills/code-review/SKILL.md +26 -0
  30. package/core/skills/gh-cli/SKILL.md +2202 -0
  31. package/core/skills/git-commit/SKILL.md +124 -0
  32. package/core/skills/git-workflow-agents/SKILL.md +462 -0
  33. package/core/skills/git-workflow-agents/reference.md +220 -0
  34. package/core/skills/github-actions/SKILL.md +190 -0
  35. package/core/skills/github-issues/SKILL.md +154 -0
  36. package/core/skills/llm-structured-outputs/SKILL.md +323 -0
  37. package/core/skills/llm-structured-outputs/references/provider-details.md +392 -0
  38. package/core/skills/pre-push/SKILL.md +115 -0
  39. package/core/skills/refactor/SKILL.md +645 -0
  40. package/core/skills/web-design-reviewer/SKILL.md +371 -0
  41. package/core/skills/webapp-testing/SKILL.md +127 -0
  42. package/core/skills/webapp-testing/test-helper.js +56 -0
  43. package/core/templates/CLAUDE.md.tmpl +98 -0
  44. package/core/templates/adr-template.md +67 -0
  45. package/core/templates/gh-issue-templates/bug.md +39 -0
  46. package/core/templates/gh-issue-templates/content.md +42 -0
  47. package/core/templates/gh-issue-templates/enhancement.md +36 -0
  48. package/core/templates/gh-issue-templates/feature.md +39 -0
  49. package/core/templates/gh-issue-templates/infrastructure.md +41 -0
  50. package/core/templates/post-edit-reminders.sh.tmpl +19 -0
  51. package/core/templates/settings.json.tmpl +90 -0
  52. package/core/templates/settings.local.json.tmpl +3 -0
  53. package/core/workflows/agent-implementation-workflow.md +346 -0
  54. package/core/workflows/generate-project.md +258 -0
  55. package/core/workflows/implement-project-workflow.md +190 -0
  56. package/core/workflows/issue-tracking.md +89 -0
  57. package/core/workflows/project-closeout-ceremony.md +77 -0
  58. package/core/workflows/review-workflow.md +266 -0
  59. package/engsys.config.example.yaml +46 -0
  60. package/install +202 -0
  61. package/lessons-library/README.md +80 -0
  62. package/lessons-library/async-callbacks-verify-liveness.md +15 -0
  63. package/lessons-library/change-isnt-done-until-every-surface-updated.md +15 -0
  64. package/lessons-library/claim-then-act-for-irreversible-ops.md +16 -0
  65. package/lessons-library/co-commit-entangled-work.md +15 -0
  66. package/lessons-library/dependabot-triage-playbook.md +17 -0
  67. package/lessons-library/deploy-by-digest-and-verify-the-running-revision.md +15 -0
  68. package/lessons-library/enforce-your-guarantee-at-your-boundary.md +16 -0
  69. package/lessons-library/gate-changes-on-measurement-not-vibes.md +15 -0
  70. package/lessons-library/iac-first-no-console-changes.md +15 -0
  71. package/lessons-library/independent-objective-review-gate.md +15 -0
  72. package/lessons-library/keep-an-immutable-source-of-truth.md +15 -0
  73. package/lessons-library/long-agent-runs-checkpoint-not-poll.md +15 -0
  74. package/lessons-library/model-identity-with-stable-ids-and-provenance.md +15 -0
  75. package/lessons-library/operator-choices-are-first-class.md +15 -0
  76. package/lessons-library/prefer-tool-enforced-structured-output.md +15 -0
  77. package/lessons-library/prove-causation-before-acting.md +15 -0
  78. package/lessons-library/re-read-state-before-acting.md +14 -0
  79. package/lessons-library/read-layer-tolerates-unbackfilled-rows.md +15 -0
  80. package/lessons-library/shell-safety-pipefail-and-validate-before-teardown.md +14 -0
  81. package/lessons-library/shift-correctness-left-and-distrust-false-greens.md +15 -0
  82. package/lessons-library/stray-control-bytes-hide-changes.md +14 -0
  83. package/lessons-library/tests-can-assert-the-bug.md +15 -0
  84. package/lessons-library/verify-ground-truth-not-reports.md +15 -0
  85. package/lessons-library/worktrees-need-bootstrap-from-origin-main.md +15 -0
  86. package/lib/commands.js +356 -0
  87. package/lib/generate-team-avatars.mjs +251 -0
  88. package/lib/manifest.js +155 -0
  89. package/lib/render.js +135 -0
  90. package/lib/selftest.js +90 -0
  91. package/lib/util.js +89 -0
  92. package/lib/yaml.js +156 -0
  93. package/optional-agents/gary.md +86 -0
  94. package/optional-agents/jos.md +136 -0
  95. package/optional-agents/sandy.md +101 -0
  96. package/optional-agents/steve.md +161 -0
  97. package/package.json +43 -0
  98. package/stacks/cloud/aws/claude.fragment.md +17 -0
  99. package/stacks/cloud/aws/settings.fragment.json +39 -0
  100. package/stacks/cloud/aws/skills/aws-deployment-preflight/SKILL.md +165 -0
  101. package/stacks/cloud/aws/skills/cloud-architecture-aws/SKILL.md +265 -0
  102. package/stacks/cloud/azure/claude.fragment.md +17 -0
  103. package/stacks/cloud/azure/settings.fragment.json +45 -0
  104. package/stacks/cloud/azure/skills/azure-deployment-preflight/SKILL.md +175 -0
  105. package/stacks/cloud/azure/skills/cloud-architecture-azure/SKILL.md +211 -0
  106. package/stacks/cloud/cloudflare/claude.fragment.md +21 -0
  107. package/stacks/cloud/cloudflare/settings.fragment.json +31 -0
  108. package/stacks/cloud/cloudflare/skills/cloud-architecture-cloudflare/SKILL.md +294 -0
  109. package/stacks/cloud/cloudflare/skills/cloudflare-deployment-preflight/SKILL.md +175 -0
  110. package/stacks/cloud/gcp/claude.fragment.md +17 -0
  111. package/stacks/cloud/gcp/settings.fragment.json +40 -0
  112. package/stacks/cloud/gcp/skills/cloud-architecture-gcp/SKILL.md +208 -0
  113. package/stacks/cloud/gcp/skills/gcp-deployment-preflight/SKILL.md +137 -0
  114. package/stacks/db/mongo/skills/mongo-conventions/SKILL.md +96 -0
  115. package/stacks/db/prisma/claude.fragment.md +49 -0
  116. package/stacks/db/prisma/skills/docker-database-package-copy/SKILL.md +44 -0
  117. package/stacks/db/prisma/skills/prisma-conventions/SKILL.md +37 -0
  118. package/stacks/domain/mobile-growth/skills/apple-ads/SKILL.md +184 -0
  119. package/stacks/domain/mobile-growth/skills/apple-ads/references/benchmark-notes.md +47 -0
  120. package/stacks/domain/mobile-growth/skills/apple-ads/references/official-links.md +53 -0
  121. package/stacks/domain/mobile-growth/skills/google-play-growth/SKILL.md +197 -0
  122. package/stacks/domain/mobile-growth/skills/google-play-growth/references/benchmark-notes.md +47 -0
  123. package/stacks/domain/mobile-growth/skills/google-play-growth/references/official-links.md +45 -0
  124. package/stacks/iac/bicep/claude.fragment.md +14 -0
  125. package/stacks/iac/bicep/settings.fragment.json +20 -0
  126. package/stacks/iac/bicep/skills/iac-bicep/SKILL.md +113 -0
  127. package/stacks/iac/cdk/claude.fragment.md +14 -0
  128. package/stacks/iac/cdk/settings.fragment.json +23 -0
  129. package/stacks/iac/cdk/skills/iac-cdk/SKILL.md +104 -0
  130. package/stacks/iac/terraform/claude.fragment.md +13 -0
  131. package/stacks/iac/terraform/settings.fragment.json +25 -0
  132. package/stacks/iac/terraform/skills/iac-terraform/SKILL.md +93 -0
  133. package/stacks/iac/terraform/skills/terraform-conventions/SKILL.md +87 -0
  134. package/stacks/lang/kotlin/skills/android-testing/SKILL.md +263 -0
  135. package/stacks/lang/kotlin/skills/jetpack-compose/SKILL.md +264 -0
  136. package/stacks/lang/kotlin/skills/kotlin-coroutines/SKILL.md +329 -0
  137. package/stacks/lang/python/skills/python-conventions/SKILL.md +61 -0
  138. package/stacks/lang/shell/skills/shell-scripting/SKILL.md +110 -0
  139. package/stacks/lang/swift/skills/swift-concurrency/SKILL.md +423 -0
  140. package/stacks/lang/swift/skills/swift-concurrency/references/approachable-concurrency.md +80 -0
  141. package/stacks/lang/swift/skills/swift-concurrency/references/concurrency-patterns.md +233 -0
  142. package/stacks/lang/swift/skills/swift-concurrency/references/swiftui-concurrency.md +187 -0
  143. package/stacks/lang/swift/skills/swift-concurrency/references/synchronization-primitives.md +341 -0
  144. package/stacks/lang/swift/skills/swift-testing/SKILL.md +497 -0
  145. package/stacks/lang/swift/skills/swift-testing/references/testing-advanced.md +106 -0
  146. package/stacks/lang/swift/skills/swift-testing/references/testing-patterns.md +504 -0
  147. package/stacks/lang/swift/skills/swiftdata/SKILL.md +334 -0
  148. package/stacks/lang/swift/skills/swiftdata/references/core-data-coexistence.md +504 -0
  149. package/stacks/lang/swift/skills/swiftdata/references/swiftdata-advanced.md +975 -0
  150. package/stacks/lang/swift/skills/swiftdata/references/swiftdata-queries.md +675 -0
  151. package/stacks/lang/swift/skills/swiftui-patterns/SKILL.md +371 -0
  152. package/stacks/lang/swift/skills/swiftui-patterns/references/architecture-patterns.md +486 -0
  153. package/stacks/lang/swift/skills/swiftui-patterns/references/deprecated-migration.md +1097 -0
  154. package/stacks/lang/swift/skills/swiftui-patterns/references/design-polish.md +780 -0
  155. package/stacks/lang/swift/skills/swiftui-patterns/references/platform-and-sharing.md +696 -0
  156. package/stacks/lang/typescript/skills/typescript-conventions/SKILL.md +91 -0
  157. package/stacks/platform/android/claude.fragment.md +40 -0
  158. package/stacks/platform/android/hooks/pre-push-gradle.sh +70 -0
  159. package/stacks/platform/android/settings.fragment.json +13 -0
  160. package/stacks/platform/android/skills/android-build-conventions/SKILL.md +247 -0
  161. package/stacks/platform/ios/claude.fragment.md +24 -0
  162. package/stacks/platform/ios/hooks/pre-push-xcodebuild.sh +82 -0
  163. package/stacks/platform/ios/settings.fragment.json +21 -0
  164. package/stacks/platform/ios/skills/xcodebuildmcp-simulator-logs/SKILL.md +76 -0
  165. package/stacks/platform/web/skills/frontend-testing/SKILL.md +246 -0
  166. package/stacks/platform/web/skills/react-conventions/SKILL.md +261 -0
  167. package/stacks/platform/web/skills/web-platform-conventions/SKILL.md +55 -0
  168. package/stacks/tooling/issue-tracker-github/claude.fragment.md +10 -0
  169. package/stacks/tooling/issue-tracker-github/settings.fragment.json +24 -0
  170. package/stacks/tooling/issue-tracker-github/skills/issue-tracker-github/SKILL.md +278 -0
  171. package/stacks/tooling/issue-tracker-linear/claude.fragment.md +17 -0
  172. package/stacks/tooling/issue-tracker-linear/settings.fragment.json +9 -0
  173. package/stacks/tooling/issue-tracker-linear/skills/issue-tracker-linear/SKILL.md +183 -0
package/stacks/cloud/azure/skills/cloud-architecture-azure/SKILL.md
@@ -0,0 +1,211 @@
1
+ ---
2
+ name: cloud-architecture-azure
3
+ description: Azure service-level architecture knowledge — compute (Container Apps/AKS/Functions), data (PostgreSQL Flexible Server/Cosmos DB), messaging (Service Bus/Event Grid/Event Hubs), edge (Front Door/API Management), identity (Entra), storage + secrets (Blob/Key Vault/ACR), and Azure OpenAI. Cost models, service limits, failure modes, and cold-start gotchas. Activate when the active cloud is Azure and the work involves designing, scaling, costing, or diagnosing Azure architecture (Container Apps cold starts, PostgreSQL connection limits, Service Bus quotas, Front Door, NAT egress).
4
+ ---
5
+
6
+ # Azure Architecture Knowledge
7
+
8
+ Service-level detail for an Azure-backed project. Pairs with Melvin's cloud-agnostic
9
+ diagnostic checklist (traffic pattern, state location, SLAs, blast radius, cost
10
+ explosion, coordination, limits, observability) — this pack supplies the Azure-specific
11
+ answers. For concrete topology, cost tiers, and stack context, read the architecture
12
+ docs named in `CLAUDE.md`. For deep service docs use the `microsoft-docs` skill
13
+ (Microsoft Learn MCP).
14
+
15
+ ## Compute
16
+
17
+ ### Azure Container Apps (ACA)
18
+
19
+ - Serverless containers on managed Kubernetes (KEDA-based autoscaling). **Scale to
20
+ zero** is the cost win — and the latency trap: a scaled-to-zero app pays a **cold
21
+ start** (image pull + container start, seconds) on the next request. For latency-
22
+ sensitive services set **`minReplicas >= 1`** (a warm replica) — the ACA analogue of
23
+ provisioned concurrency.
24
+ - **Revisions:** each config/image change creates a revision; traffic-split between
25
+ them for blue/green and canary. The phantom "revision nobody recognizes" is usually a
26
+ stale active revision still taking traffic — check the revision list and traffic split.
27
+ - **Scaling:** KEDA scale rules on HTTP concurrency, CPU/memory, or queue length
28
+ (Service Bus, etc.). Per-app and per-environment replica ceilings apply — know them
29
+ before you bet a spike on autoscale.
30
+ - **Good for:** microservices, HTTP APIs, queue/event workers. Prefer it over AKS unless
31
+ you genuinely need Kubernetes primitives.
32
+
33
+ ### AKS
34
+
35
+ - Full Kubernetes — reach for it only when you need operators, complex scheduling,
36
+ service mesh, or a multi-tenant platform. Otherwise it's operational overhead you'll
37
+ regret; Container Apps or Functions usually suffice.
38
+
39
+ ### Azure Functions
40
+
41
+ - Event-driven serverless. **Consumption** plan scales to zero (cold starts) and bills
42
+ per execution+GB-s; **Premium** keeps pre-warmed instances (no cold start, VNet
43
+ integration); **Dedicated/App Service** for predictable steady load. Durable Functions
44
+ for orchestration/fan-out-fan-in workflows.
45
+
46
+ ## Data
47
+
48
+ ### PostgreSQL Flexible Server
49
+
50
+ - **Connection limits** scale with tier/size and are the classic bottleneck —
51
+ serverless/many-replica apps exhaust them. Use **PgBouncer** (built-in pooler) — but
52
+ it is **not available on the Burstable tier**; only GeneralPurpose+ supports it (see
53
+ the IaC lessons). On Burstable, pool in-app or upsize.
54
+ - **Tiers:** Burstable (`B`-series — dev/cheap, throttled baseline CPU, no PgBouncer) →
55
+ GeneralPurpose (`D`-series) → MemoryOptimized (`E`-series). HA = zone-redundant standby
56
+ (failover drops connections — apps must reconnect/retry). Read replicas scale reads.
57
+ - **The metric is `active_connections`**, not `connection_percent` (that's Azure SQL) —
58
+ matters for alerts (see IaC lessons).
59
+ - IOPS and storage scale together up to a point; provisioned IOPS available on higher
60
+ tiers. Watch IOPS on write-heavy workloads.
61
+
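A minimal sketch of the connection-limit arithmetic described in the PostgreSQL bullets above: worst-case demand is the autoscale ceiling times the in-app pool size, and it has to fit under the server's limit with some headroom. The `PoolPlan` shape, the helper names, and every number here are hypothetical; look up the real `max_connections` for your tier.

```typescript
// Hypothetical planning helper; none of these numbers are Azure defaults.
interface PoolPlan {
  maxReplicas: number;          // autoscale ceiling for the app
  poolSizePerReplica: number;   // in-app pool size on each replica
  serverMaxConnections: number; // look this up for your tier and size
  reservedConnections: number;  // assumed headroom for admin/monitoring sessions
}

// Worst case: every replica is running and every pool is full.
function connectionDemand(plan: PoolPlan): number {
  return plan.maxReplicas * plan.poolSizePerReplica;
}

function fitsConnectionBudget(plan: PoolPlan): boolean {
  return connectionDemand(plan) <= plan.serverMaxConnections - plan.reservedConnections;
}

const plan: PoolPlan = {
  maxReplicas: 10,
  poolSizePerReplica: 10,
  serverMaxConnections: 50,
  reservedConnections: 5,
};

console.log(connectionDemand(plan), fitsConnectionBudget(plan)); // 100 false
```

When the check fails, the options are the ones the bullets name: a pooler in front of the server, a smaller in-app pool, a lower replica ceiling, or a larger tier.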
62
+ ### Cosmos DB
63
+
64
+ - Globally-distributed multi-model. **Partition key design is everything** — a hot
65
+ partition throttles (429) regardless of total RU/s; each physical partition caps
66
+ ~10,000 RU/s. Pick a high-cardinality, evenly-accessed key.
67
+ - **Throughput:** provisioned RU/s (manual or autoscale) vs **serverless** (pay-per-
68
+ request, good for spiky/dev). Five tunable consistency levels (Strong → Eventual) —
69
+ stronger costs more RU and latency; Session is the usual default.
70
+ - Cost is RU-driven: large items, cross-partition queries, and indexing-everything
71
+ inflate it. Tune the indexing policy.
72
+
73
+ ## Messaging & Orchestration
74
+
75
+ ### Service Bus
76
+
77
+ - Enterprise broker: **queues** (point-to-point) and **topics/subscriptions** (pub/sub
78
+ with SQL/correlation filters). **Sessions** for ordered/grouped processing, **DLQ**
79
+ built-in (configure max delivery count), scheduled + deferred messages, duplicate
80
+ detection.
81
+ - **Tiers:** Basic (queues only) / Standard (topics, pay-per-op) / **Premium**
82
+ (dedicated capacity, predictable latency, VNet, larger messages, required for serious
83
+ throughput). Standard has per-namespace throughput limits — Premium for high volume.
84
+ - Lock duration must exceed processing time or you get redelivery — idempotent consumers
85
+ required.
86
+
87
+ ### Event Grid / Event Hubs
88
+
89
+ - **Event Grid:** lightweight reactive pub/sub for discrete events (resource changes,
90
+ custom topics) with filtering — fan-out, low latency, cheap.
91
+ - **Event Hubs:** high-throughput streaming/ingestion (Kafka-compatible), partitioned,
92
+ consumer groups — telemetry, log/event firehose. Use Event Hubs for *streams*, Event
93
+ Grid for *discrete events*, Service Bus for *transactional work queues*.
94
+
95
+ ### Durable Functions / Logic Apps
96
+
97
+ - Durable Functions for code-first orchestration; Logic Apps for low-code connector-
98
+ driven integration workflows.
99
+
100
+ ## Edge & Networking
101
+
102
+ ### Front Door
103
+
104
+ - Global L7 load balancer + CDN + WAF at the edge. Anycast routing, TLS termination,
105
+ caching, path/host routing to origins. Use it for global entry, edge caching, and WAF;
106
+ cache key + rules drive hit ratio.
107
+
108
+ ### API Management (APIM)
109
+
110
+ - Full API gateway: policies (rate limiting, transformation, auth), product/subscription
111
+ keys, developer portal. Heavier (and pricier — Consumption tier for serverless-style
112
+ billing, Developer/Standard/Premium otherwise) than Front Door routing; use when you
113
+ need real API-management features.
114
+
115
+ ### VNet / Networking — cost landmines
116
+
117
+ - **NAT Gateway** + outbound data processing, and **cross-zone / cross-region data
118
+ transfer** are the egress budget-eaters (same shape as every cloud). Use **Private
119
+ Endpoints / Private Link** to keep traffic off the public path and **Service
120
+ Endpoints** where applicable. Default to private; public only where required.
121
+
122
+ ## Identity
123
+
124
+ ### Microsoft Entra
125
+
126
+ - **Entra ID** (workforce) and **Entra External ID** (customers/CIAM — the successor to
127
+ Azure AD B2C) for app auth: OIDC/OAuth2, social + federated IdPs, MFA, conditional
128
+ access. **Managed identities** (system- or user-assigned) are the right way for
129
+ Azure resources to authenticate to each other — no secrets in code. Prefer managed
130
+ identity + Key Vault references over connection strings everywhere.
131
+
132
+ ## Storage, Secrets & Registry
133
+
134
+ ### Blob Storage
135
+
136
+ - Tiers: Hot → Cool → Cold → Archive (retrieval latency/cost). Lifecycle policies to
137
+ age data down. Encryption at rest by default. Cost = storage + transactions + egress.
138
+
139
+ ### Key Vault
140
+
141
+ - Secrets, keys, certs. Reference from Container Apps / App Service as secret refs and
142
+ from Bicep via `getSecret()` — keep secrets out of templates and env files. **Name is
143
+ globally unique** and **soft-deleted on delete** (purge protection can block name
144
+ reuse) — see IaC lessons. RBAC or access-policy authorization model; managed identity
145
+ for access.
146
+
147
+ ### Azure Container Registry (ACR)
148
+
149
+ - Private image registry. **SKU restrictions bite** (see IaC lessons): Basic may be
150
+ unavailable on some subscriptions; `retentionPolicy` requires **Premium** (remove it
151
+ from Standard dev/staging). Use managed identity / ACR tokens for pulls.
152
+
153
+ ## Azure OpenAI / AI
154
+
155
+ - Managed OpenAI + Azure-hosted models via deployments. **Capacity is in TPM (tokens-
156
+ per-minute) quota per deployment per region** — the throughput ceiling; a naive high-
157
+ volume pipeline hits 429s, so request quota early and add backoff. **Provisioned
158
+ Throughput Units (PTUs)** reserve guaranteed capacity/latency (committed spend) vs
159
+ pay-as-you-go. Model availability varies by region. Pair with **Azure AI Search** for
160
+ managed RAG. Cost is token-driven — right-size the model per task.
161
+
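The "add backoff" advice above can be sketched as exponential backoff with full jitter. This is a generic retry-delay calculation under assumed base and cap values, not an Azure OpenAI SDK feature; `backoffCeilingMs` and `backoffDelayMs` are hypothetical names.

```typescript
// Upper bound for the wait before retry number `attempt` (0-based).
// baseMs and capMs are illustrative defaults, not service recommendations.
function backoffCeilingMs(attempt: number, baseMs: number = 500, capMs: number = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Full jitter: pick a uniform delay in [0, ceiling) so throttled clients
// do not all retry at the same instant. `random` is injectable for tests.
function backoffDelayMs(attempt: number, random: () => number = Math.random): number {
  return Math.floor(random() * backoffCeilingMs(attempt));
}

// Ceilings for the first seven retries after a 429.
const ceilings: number[] = [];
for (let attempt = 0; attempt < 7; attempt++) {
  ceilings.push(backoffCeilingMs(attempt));
}
console.log(ceilings.join(", ")); // 500, 1000, 2000, 4000, 8000, 16000, 30000
```

A caller would sleep for `backoffDelayMs(attempt)` after each 429 and give up after a fixed number of attempts.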
162
+ ## Cost realism (where Azure bills explode)
163
+
164
+ 1. **NAT Gateway / outbound data processing** — per-GB egress. Use Private Endpoints.
165
+ 2. **Cross-zone / cross-region data transfer** — per-GB.
166
+ 3. **Container Apps / Functions over-provisioning** — vCPU-s on idle min-replicas.
167
+ 4. **PostgreSQL IOPS + over-sized tier**; Cosmos RU/s over-provisioning.
168
+ 5. **Service Bus Premium** dedicated capacity; APIM tier.
169
+ 6. **Azure OpenAI tokens / PTUs**.
170
+ 7. **Front Door + Blob egress**.
171
+
172
+ Levers: Reservations / Savings Plans (steady baseline), Spot VMs (fault-tolerant batch),
173
+ right-sizing from Azure Monitor, Private Endpoints, Blob lifecycle tiering, Cost
174
+ Management budgets + alerts.
175
+
176
+ ## Service limits (check before betting on them)
177
+
178
+ Container Apps replicas per app/environment, Functions scale limits, Service Bus per-
179
+ namespace throughput (Standard) + entity counts, PostgreSQL `max_connections` per tier,
180
+ Cosmos RU/s per partition, Front Door routes, APIM throughput per tier, Azure OpenAI
181
+ TPM/PTU per region, Key Vault transactions/sec, subscription-level vCPU quotas per
182
+ region/family. Raise soft quotas via support with lead time.
183
+
184
+ ## Observability
185
+
186
+ Azure Monitor (metrics + alerts), **Log Analytics workspace** (the store the
187
+ `AppRequests`/`ContainerAppConsoleLogs` tables live in — KQL queries scope to the
188
+ *workspace*, not App Insights), Application Insights (APM/distributed tracing). Alarm on
189
+ Container Apps replica count + restarts, Service Bus DLQ depth + active message count,
190
+ PostgreSQL `active_connections` + CPU, Cosmos 429 rate + RU consumption, Front Door
191
+ backend health.
192
+
193
+ ## Hard-won lessons
194
+
195
+ ### Container Apps VNet integration cannot be retrofitted
196
+ **Symptom:** You need Private Endpoints (e.g. a private DB path) but the Container
197
+ Apps environment was created without VNet integration; there's no in-place toggle.
198
+ **Cause:** A VNet-integrated ACA environment routes egress through the VNet — that's
199
+ a creation-time property of the *environment*, not a setting you can flip later.
200
+ **Fix:** Create the environment **VNet-integrated from the start** whenever Private
201
+ Endpoints/Link are on the roadmap (do it while the resource group is still empty).
202
+ Retrofitting means tearing down and recreating the environment.
203
+
204
+ ### Deploy by immutable digest + unique revision; verify the active revision
205
+ **Symptom:** A "green" deploy reports success but the running app is still the old
206
+ build — the new code never actually took traffic.
207
+ **Cause:** A mutable image tag (`:latest`) can resolve to a stale layer, or a config
208
+ change that produces no new revision leaves the prior revision active.
209
+ **Fix:** Deploy by **immutable image digest** with a **unique revision suffix**, then
210
+ **verify the active revision** after deploy (revision list + traffic split). Treat
211
+ the post-deploy revision check as part of the deploy, not an afterthought.
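A minimal sketch of the post-deploy check described in the lesson above. It assumes the revision list has already been fetched and flattened into the hypothetical `RevisionSummary` shape below; the revision names, registry host, and digest are placeholders, not output from a real tool.

```typescript
// Hypothetical, simplified view of one entry in a revision list.
interface RevisionSummary {
  name: string;
  active: boolean;
  trafficWeight: number; // percent of traffic, 0-100
  image: string;
}

// True when the image reference ends in an immutable sha256 digest.
function isDigestPinned(image: string): boolean {
  return /@sha256:[0-9a-f]{64}$/.test(image);
}

// Returns a list of problems; an empty list means the deploy really took traffic.
function verifyDeploy(revisions: RevisionSummary[], expectedRevision: string, expectedDigest: string): string[] {
  const matches = revisions.filter(function (r) { return r.name === expectedRevision; });
  if (matches.length === 0) {
    return ["revision " + expectedRevision + " not found"];
  }
  const target = matches[0];
  const problems: string[] = [];
  if (!target.active) {
    problems.push(target.name + " is not active");
  }
  if (target.trafficWeight !== 100) {
    problems.push(target.name + " takes " + target.trafficWeight + "% of traffic, expected 100%");
  }
  if (!isDigestPinned(target.image) || target.image.slice(-expectedDigest.length) !== expectedDigest) {
    problems.push(target.name + " does not run the expected image digest");
  }
  for (const r of revisions) {
    if (r.name !== expectedRevision && r.trafficWeight > 0) {
      problems.push("stale revision " + r.name + " still takes " + r.trafficWeight + "% of traffic");
    }
  }
  return problems;
}

const digest = "sha256:" + new Array(65).join("a"); // placeholder, not a real digest
const revisions: RevisionSummary[] = [
  { name: "api--v41", active: true, trafficWeight: 100, image: "example.azurecr.io/api:latest" },
  { name: "api--v42", active: true, trafficWeight: 0, image: "example.azurecr.io/api@" + digest },
];

// Reports two problems: v42 has no traffic, and v41 is still serving all of it.
console.log(verifyDeploy(revisions, "api--v42", digest));
```

Failing the pipeline whenever the returned list is non-empty makes the revision check part of the deploy itself.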
package/stacks/cloud/cloudflare/claude.fragment.md
@@ -0,0 +1,21 @@
1
+ ## Cloud stack
2
+
3
+ - **Active cloud: Cloudflare.** Architecture and deploys target Cloudflare's edge
4
+ (Workers/Pages + R2/D1/KV/Durable Objects/Queues); agents load the
5
+ `cloud-architecture-cloudflare` and `cloudflare-deployment-preflight` skill packs.
6
+ - **Tool preference order** (when investigating or validating cloud state):
7
+ 1. **Wrangler CLI, read-only** — `wrangler whoami`, `wrangler deploy --dry-run`,
8
+ `wrangler d1 list` / `wrangler d1 info`, `wrangler kv namespace list`,
9
+ `wrangler r2 bucket list`, `wrangler queues list`, `wrangler versions list`,
10
+ `wrangler secret list`, `wrangler tail` and similar inspection commands. Never
11
+ mutate state to answer a question.
12
+ 2. **Docs source** — official Cloudflare documentation (developers.cloudflare.com) for
13
+ service limits, plan tiers, pricing, and binding/API behavior. Limits are plan-tier
14
+ dependent and change — verify against docs rather than from memory.
15
+ - Mutating actions (`wrangler deploy`, `secret put`, `d1 execute`, `d1 migrations apply`,
16
+ `r2 object delete`, `delete`) go through the `cloudflare-deployment-preflight` gate and
17
+ a staged `versions upload` + gradual rollout, never an ad-hoc bare deploy.
18
+
19
+ <!-- naturalize: confirm the Cloudflare account, the plan tier (Free/Paid/Enterprise —
20
+ limits depend on it), the state primitives in use (KV/D1/DO/R2), and the path to the
21
+ architecture/cost docs Melvin and Aaron should read for concrete topology. -->
package/stacks/cloud/cloudflare/settings.fragment.json
@@ -0,0 +1,31 @@
1
+ {
2
+ "permissions": {
3
+ "allow": [
4
+ "Bash(wrangler whoami:*)",
5
+ "Bash(wrangler deploy --dry-run:*)",
6
+ "Bash(wrangler pages functions build:*)",
7
+ "Bash(wrangler d1 list:*)",
8
+ "Bash(wrangler d1 info:*)",
9
+ "Bash(wrangler d1 migrations list:*)",
10
+ "Bash(wrangler kv:key list:*)",
11
+ "Bash(wrangler kv:namespace list:*)",
12
+ "Bash(wrangler kv namespace list:*)",
13
+ "Bash(wrangler r2 bucket list:*)",
14
+ "Bash(wrangler queues list:*)",
15
+ "Bash(wrangler versions list:*)",
16
+ "Bash(wrangler versions view:*)",
17
+ "Bash(wrangler secret list:*)",
18
+ "Bash(wrangler tail:*)"
19
+ ],
20
+ "deny": [
21
+ "Bash(wrangler delete:*)",
22
+ "Bash(wrangler d1 execute:*)",
23
+ "Bash(wrangler d1 migrations apply:*)",
24
+ "Bash(wrangler secret put:*)",
25
+ "Bash(wrangler secret delete:*)",
26
+ "Bash(wrangler r2 object delete:*)",
27
+ "Bash(wrangler kv:key delete:*)"
28
+ ]
29
+ },
30
+ "mcpServers": {}
31
+ }
package/stacks/cloud/cloudflare/skills/cloud-architecture-cloudflare/SKILL.md
@@ -0,0 +1,294 @@
1
+ ---
2
+ name: cloud-architecture-cloudflare
3
+ description: Cloudflare edge/serverless architecture knowledge — compute (Workers isolates/Containers, smart placement), Pages, storage (R2/D1/KV/Durable Objects/Vectorize), messaging (Queues), AI (Workers AI/AI Gateway), and origin pooling (Hyperdrive/Cache API). CPU-time and subrequest limits, consistency tradeoffs, failure modes, and request+CPU pricing gotchas (no R2 egress). Activate when the active cloud is Cloudflare and the work involves designing, scaling, costing, or diagnosing Cloudflare architecture (Workers CPU limits, D1 write limits, KV eventual consistency, Durable Object single-threading, R2 egress savings, subrequest caps).
4
+ ---
5
+
6
+ # Cloudflare Architecture Knowledge
7
+
8
+ Service-level detail for a Cloudflare-backed project. Pairs with Melvin's cloud-agnostic
9
+ diagnostic checklist (traffic pattern, state location, SLAs, blast radius, cost
10
+ explosion, coordination, limits, observability) — this pack supplies the Cloudflare-specific
11
+ answers for each. Cloudflare's model is fundamentally different from AWS/Azure: there are
12
+ no regions you provision into and no VMs — code runs in **V8 isolates** at the edge POP
13
+ nearest the user, and state lives in purpose-built edge primitives. For concrete project
14
+ topology, cost tiers, and stack context, read the architecture docs named in `CLAUDE.md`.
15
+
16
+ ## Compute
17
+
18
+ ### Workers (isolates)
19
+
20
+ - **No cold starts.** Workers run as **V8 isolates**, not containers/VMs — a new isolate
21
+ spins up in <5ms (often "zero" perceived) because there's no OS/runtime boot. This is
22
+ the headline architectural difference from Lambda/Cloud Functions/Container Apps: the
23
+ "scale to zero costs you a cold start" tradeoff that dominates AWS/Azure design simply
24
+ doesn't exist here. Design for it — short-lived, stateless request handlers are ideal.
25
+ - **CPU-time limit, not wall-clock.** The cap is **CPU time**, default **30s** on paid,
26
+ configurable up to **5 minutes** (`limits.cpu_ms` in `wrangler.toml`, max 300000).
27
+ Free plan is **10ms** CPU/request. Crucially, **time spent awaiting I/O (fetch, KV, D1)
28
+ does not count** against CPU time — a Worker can wait minutes on a slow subrequest and
29
+ burn near-zero CPU. The killer is *compute* (crypto, JSON parsing huge payloads, image
30
+ work, regex backtracking), not waiting. The historical "50ms CPU" number was the old
31
+ default; the platform now allows far more, but **a CPU-bound hot path is still where you
32
+ get `Exceeded CPU` (Error 1102) errors** — profile and offload heavy compute.
33
+ - **Subrequest limit** is the other hard ceiling: **50 subrequests/request on Free, 1000
34
+ on paid** (`fetch` + binding calls to KV/D1/R2/service bindings each count). A Worker
35
+ that fans out per-item to an API or DB will hit this — batch, cache, or move the fan-out
36
+ into a Durable Object / Queue consumer. Simultaneous open connections cap at ~6.
37
+ - **Memory** is **128MB per isolate** — hard limit, `Exceeded Memory` (1102) on overflow.
38
+ No tuning knob like Lambda's memory→CPU slider. Stream large bodies; never buffer a big
39
+ R2 object fully into memory.
40
+ - **Bundle size:** 3MB (Free) / 10MB compressed (paid) per Worker script. Heavy npm deps
41
+ (especially Node built-ins needing `nodejs_compat`) bloat this — tree-shake.
42
+ - **Good for:** edge APIs, auth/routing/transform middleware, request shaping, glue.
43
+ **Bad for:** CPU-heavy batch, anything needing >128MB RAM, long-running stateful compute
44
+ (use Containers or push to a real backend via Hyperdrive).
45
+
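The "batch" advice for the subrequest cap above can be sketched as plain chunking: group items so that the number of outgoing calls stays within a budget. This assumes the upstream API or binding accepts several items per call, which is not true of every service; `chunk` and `planBatches` are hypothetical helpers.

```typescript
// Split items into consecutive groups of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) {
    throw new Error("size must be >= 1");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Choose the smallest batch size that keeps the batch count within the budget,
// so the handler issues one subrequest per batch instead of one per item.
function planBatches<T>(items: T[], subrequestBudget: number): T[][] {
  const size = Math.max(1, Math.ceil(items.length / subrequestBudget));
  return chunk(items, size);
}

const ids: number[] = [];
for (let i = 0; i < 5000; i++) {
  ids.push(i);
}

const batches = planBatches(ids, 50);
console.log(batches.length, batches[0].length); // 50 100
```

If the resulting batch size is larger than the upstream accepts, the fan-out belongs in a Queue consumer or Durable Object, as the bullet suggests.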
46
+ ### Workers Containers
47
+
48
+ - For workloads that don't fit the isolate model (full Linux, large memory, arbitrary
49
+ runtimes, heavy CPU, existing Docker images), **Workers Containers** run actual
50
+ containers, *orchestrated by a Durable Object* and programmatically started/stopped from
51
+ a Worker. Unlike isolates, containers **do have cold starts** (image pull + boot) and
52
+ bill for the time they're running — the AWS Fargate tradeoff reappears here. Use them as
53
+ the escape hatch, not the default; keep the hot path on isolates.
54
+
55
+ ### Smart Placement
56
+
57
+ - By default a Worker runs at the POP nearest the *user*. If the Worker makes several
58
+ round trips to a **centralized origin** (a DB in one region, a slow upstream API), edge
59
+ placement means each subrequest pays the full user→origin latency. **Smart Placement**
60
+ (`placement = { mode = "smart" }`) lets Cloudflare instead run the Worker near the
61
+ *origin*, collapsing N origin round-trips into one user round-trip. Win when the Worker
62
+ is back-end-chatty; no benefit (or slightly worse) for a Worker that mostly serves edge
63
+ data (KV/cache/static). Pair origin-DB Workers with Smart Placement + Hyperdrive.
64
+
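A rough latency model of the Smart Placement tradeoff above. All numbers are invented for illustration, and the model ignores connection setup, processing time, and Cloudflare's actual placement decisions; it only shows why chatty origin access favors running near the origin.

```typescript
// Illustrative round-trip times in milliseconds (assumptions, not measurements).
interface Latencies {
  userToEdgeMs: number;   // user to the nearest POP
  edgeToOriginMs: number; // that POP to the centralized origin
  nearOriginMs: number;   // a Worker placed near the origin to the origin
}

// Default placement: every origin round trip crosses the long edge-to-origin path.
function edgePlacementMs(l: Latencies, originRoundTrips: number): number {
  return l.userToEdgeMs + originRoundTrips * l.edgeToOriginMs;
}

// Smart Placement: the request crosses the long path once, then origin calls are short.
function smartPlacementMs(l: Latencies, originRoundTrips: number): number {
  return l.userToEdgeMs + l.edgeToOriginMs + originRoundTrips * l.nearOriginMs;
}

const latencies: Latencies = { userToEdgeMs: 10, edgeToOriginMs: 150, nearOriginMs: 5 };

console.log(edgePlacementMs(latencies, 4), smartPlacementMs(latencies, 4)); // 610 180
console.log(edgePlacementMs(latencies, 0), smartPlacementMs(latencies, 0)); // 10 160
```

The second line is the case the text warns about: a Worker that never touches the origin only pays the extra hop.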
65
+ ### Pages
66
+
67
+ - Git-integrated hosting for static sites + SPA/SSR frameworks, with **Pages Functions**
68
+ (Workers under the hood, file-based routing in `functions/`). Automatic preview
69
+ deployments per branch/PR. Increasingly converging with Workers (Workers now also serves
70
+ static assets via the `assets` binding) — for new projects prefer **Workers + static
71
+ assets** unless you specifically want Pages' Git/CI ergonomics. Same isolate runtime,
72
+ same limits apply to the Functions.
73
+
74
+ ## Storage & Data
75
+
76
+ The hardest part of Cloudflare design is **picking the right state primitive** — each has
77
+ a sharp consistency/latency/cost profile and they are *not* interchangeable.
78
+
79
+ ### R2 (object storage)
80
+
81
+ - S3-compatible object storage with the headline feature: **zero egress fees.** You pay
82
+ storage + per-operation (Class A writes/lists ~$4.50/M, Class B reads ~$0.36/M) but
83
+ **never** for data transferred out. This inverts the AWS cost calculus — for
84
+ egress-heavy workloads (media, model weights, large downloads, multi-cloud data sharing)
85
+ R2 can be dramatically cheaper than S3. The cost trap moves to **operation counts**: a
86
+ workload doing millions of tiny PUTs/LISTs pays on Class A ops even though bytes are
87
+ cheap.
88
+ - Strongly consistent (read-after-write) for new objects. Supports multipart upload,
89
+ presigned URLs, lifecycle rules, event notifications (→ Queues), and bucket-level
90
+ jurisdiction (EU). Access from a Worker via an R2 binding (no egress, no auth round
91
+ trip) or via the S3 API. **Stream** objects through the Worker — don't buffer (128MB
92
+ isolate cap).
93
+ - **Failure mode:** R2 ops still count as subrequests from a Worker and are subject to the
94
+ per-request subrequest cap.
95
+
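To see how the cost trap moves to operation counts, here is a small back-of-envelope sketch using the approximate per-million prices quoted above. The function names are hypothetical, free-tier allowances and storage charges are ignored, and the record counts are made up for illustration.

```typescript
// Per-million operation prices quoted in the text above; free-tier allowances are ignored.
const CLASS_A_USD_PER_MILLION = 4.5;  // writes and lists
const CLASS_B_USD_PER_MILLION = 0.36; // reads

function r2OperationCostUsd(classAOps: number, classBOps: number): number {
  return (
    (classAOps / 1_000_000) * CLASS_A_USD_PER_MILLION +
    (classBOps / 1_000_000) * CLASS_B_USD_PER_MILLION
  );
}

// Number of PUTs needed if small records are packed into larger objects before upload.
function packedPutCount(records: number, recordsPerObject: number): number {
  if (!Number.isInteger(recordsPerObject) || recordsPerObject < 1) {
    throw new RangeError("recordsPerObject must be a positive integer");
  }
  return Math.ceil(records / recordsPerObject);
}

const recordCount = 200_000_000;

// One PUT per tiny record: about 200 * 4.5 = 900 USD in Class A operations.
const onePutPerRecord = r2OperationCostUsd(recordCount, 0);

// Packing 1000 records per object: about 0.2 * 4.5 = 0.9 USD.
const packedByThousand = r2OperationCostUsd(packedPutCount(recordCount, 1000), 0);
```

The bytes stored are the same in both cases; only the operation count changes, which is why batching small writes is the usual lever.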
96
+ ### D1 (SQLite at the edge)
97
+
98
+ - Managed **SQLite**, exposed to Workers via a binding. Real SQL, transactions, and now
99
+ **read replicas** (Sessions API routes reads to a nearby replica and guarantees
100
+ read-your-writes for that session). The primary is single-region; replicas are eventually
101
+ consistent — writes always go to the primary, so **write latency from a far POP includes
102
+ the round trip to the primary region.**
103
+ - **Limits that bite:** **10GB max per database** (it's SQLite, not a sharded cluster;
104
+ this is a hard product ceiling, so plan sharding/partitioning early), a cap on D1 bindings
105
+ per Worker, per-row and per-statement size limits, and a cap on bound parameters per
106
+ statement (check the current D1 limits page for exact values). Billing is by **rows read +
107
+ rows written**, not queries: an unindexed query that scans the table bills every row scanned.
108
+ **Index aggressively**; watch `rows_read` in the query metadata.
109
+ - **Good for:** per-tenant/per-app relational data, config, low-to-moderate write volume.
110
+ **Bad for:** a single large multi-tenant OLTP database (10GB ceiling, single-writer),
111
+ high-write-fan-in. For those, use Hyperdrive → a real Postgres, or many D1s sharded by
112
+ tenant.
113
+
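One way to act on "watch `rows_read`" is to compare rows scanned with rows returned. The sketch below is a minimal illustration: `QueryResultLike` is a local stand-in that models only a `results` array and a `meta.rows_read` counter, and the helper names and the threshold of 100 are arbitrary assumptions.

```typescript
// Local stand-in for the shape of a D1 query result; only the fields used here are modeled.
interface QueryResultLike<Row> {
  results: Row[];
  meta: { rows_read: number };
}

// Rows scanned (and billed) per row actually returned. 1 means every read row was useful.
function scanRatio<Row>(result: QueryResultLike<Row>): number {
  return result.meta.rows_read / Math.max(1, result.results.length);
}

// Flags queries that read far more rows than they return; the threshold is arbitrary.
function looksUnindexed<Row>(result: QueryResultLike<Row>, threshold = 100): boolean {
  return scanRatio(result) >= threshold;
}

// Hypothetical results for the same lookup, before and after adding an index.
const fullScan: QueryResultLike<{ id: number }> = {
  results: [{ id: 7 }],
  meta: { rows_read: 250_000 },
};
const indexedLookup: QueryResultLike<{ id: number }> = {
  results: [{ id: 7 }],
  meta: { rows_read: 1 },
};
```

Aggregates such as `COUNT(*)` legitimately read many rows and return one, so a check like this is a prompt to look at the query plan rather than a verdict.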
114
+ ### KV (key-value)
115
+
116
+ - Global, **eventually consistent** key-value store optimized for **high-read, low-write,
117
+ read-from-everywhere** (config, feature flags, routing tables, cached tokens, session
118
+ lookups). Reads are fast at the edge (cached at the POP after first read); writes
119
+ propagate globally with **eventual consistency — up to ~60s** to be visible everywhere.
120
+ - **Do not use KV where you need read-after-write or coordination.** Last-write-wins, no
121
+ transactions, no conditional writes across the global view. A counter or a "did I already
122
+ process this" flag in KV will lose updates and read stale — that's a Durable Object job.
123
+ - **Limits:** value max 25MB, key max 512 bytes, and roughly **1 write/sec per key**
124
+ (writes to the *same* key are rate-limited, and slow propagation makes write-heavy
125
+ patterns wrong). Billing per read/write/delete/list op + storage.
126
+
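The lost-update problem with a KV counter can be reproduced without any Cloudflare runtime. The sketch below uses an in-memory last-write-wins map as a stand-in for KV (the class and function names are hypothetical) and compares writers that all read before anyone writes against writers that run one at a time, as a Durable Object would force them to.

```typescript
// In-memory stand-in for a last-write-wins store with no conditional writes.
class LastWriteWinsStore {
  private data = new Map<string, number>();

  get(key: string): number {
    return this.data.get(key) ?? 0;
  }

  put(key: string, value: number): void {
    this.data.set(key, value);
  }
}

// Every writer reads the counter before any writer has written: the racy interleaving.
function racingIncrements(store: LastWriteWinsStore, key: string, writers: number): number {
  const snapshots: number[] = [];
  for (let i = 0; i < writers; i++) snapshots.push(store.get(key));
  for (const seen of snapshots) store.put(key, seen + 1); // each overwrites the last
  return store.get(key);
}

// Writers run strictly one after another, each seeing the previous write.
function serializedIncrements(store: LastWriteWinsStore, key: string, writers: number): number {
  for (let i = 0; i < writers; i++) store.put(key, store.get(key) + 1);
  return store.get(key);
}

const racyTotal = racingIncrements(new LastWriteWinsStore(), "hits", 5);         // 1, four updates lost
const serialTotal = serializedIncrements(new LastWriteWinsStore(), "hits", 5);   // 5
```

Real KV adds propagation delay on top of this, so stale reads make the window for lost updates wider, not narrower.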
127
+ ### Durable Objects (single-threaded coordination)
128
+
129
+ - The coordination primitive — a **single-threaded, globally-unique, addressable** object
130
+ instance (one per ID, with **transactional, strongly-consistent storage** colocated with
131
+ the compute). Because exactly one instance handles all requests for a given ID,
132
+ **serially**, it's the right tool for anything KV/D1 can't do safely: counters,
133
+ rate limiters, locks, leader election, real-time **WebSocket** rooms/hubs, collaborative
134
+ state, per-entity state machines.
135
+ - **Single-threaded = the throughput ceiling.** All requests to one DO ID queue and run one
136
+ at a time. A "hot object" (one room/tenant taking all the traffic) becomes a bottleneck
137
+ no horizontal scaling fixes — **shard the keyspace** so load spreads across many DO IDs.
138
+ This is the DO analogue of DynamoDB's hot partition.
139
+ - **WebSockets:** DOs are *the* way to do stateful WebSockets at the edge. Use the
140
+ **Hibernation API** — a DO with idle WebSockets can be evicted from memory (you stop
141
+ paying for active duration) while keeping connections open, rehydrating on the next
142
+ message. Without hibernation, thousands of idle long-lived sockets pin the DO in memory
143
+ and bill continuously.
144
+ - **Alarms:** a DO can schedule itself to wake later (`storage.setAlarm`) — the primitive
145
+ for per-object timers, retries, scheduled flushes, debouncing, and reliable background
146
+ work without a cron. Survives eviction.
147
+ - **Placement:** a DO lives in one location (near first access, or pinned by jurisdiction).
148
+ Cross-region access to a DO pays that latency — colocate the DO with its primary traffic.
149
+ - **Cost:** billed on **active duration (GB-s) + requests**; SQLite-backed DOs add
150
+ rows-read/written billing. Hibernation is the key lever to avoid paying for idle.
151
+
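Sharding the keyspace usually comes down to deriving a stable shard name from a key. The following is a minimal sketch, assuming a simple FNV-1a hash and a hypothetical `shardName` helper; the naming scheme and shard count are illustrative choices, not something the platform prescribes.

```typescript
// 32-bit FNV-1a over UTF-16 code units; any stable hash would do for this purpose.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return hash;
}

// Maps a key to one of `shardCount` stable shard names, e.g. "room-shard-3".
function shardName(base: string, key: string, shardCount: number): string {
  if (!Number.isInteger(shardCount) || shardCount < 1) {
    throw new RangeError("shardCount must be a positive integer");
  }
  return `${base}-shard-${fnv1a32(key) % shardCount}`;
}

// Count how many distinct shards a batch of hypothetical user keys lands on.
const usedShards = new Set<string>();
for (let i = 0; i < 1000; i++) {
  usedShards.add(shardName("room", `user-${i}`, 8));
}
```

In a Worker, the resulting name would typically be passed to the namespace's `idFromName` to obtain the DO ID. Hashing spreads distinct keys; it does nothing for a single hot key, which needs its own split (for example, per-shard partial counters that are summed on read).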
152
+ ### Vectorize
153
+
154
+ - Managed **vector database** for embeddings — semantic search and RAG over Workers AI /
155
+ external embeddings. Create indexes with a fixed dimension + distance metric (cosine/
156
+ euclidean/dot). Supports metadata filtering and namespaces. **Limits** on vectors per
157
+ index, dimensions, and metadata size — check before betting a large corpus on it; for
158
+ very large/complex vector workloads a dedicated vector store (e.g. Postgres with pgvector, reached via Hyperdrive) may fit better.
159
+ Pairs naturally with Workers AI (embed) + an LLM (generate) for an all-edge RAG stack.
160
+
161
+ ### Cache API
162
+
163
+ - Programmatic access to Cloudflare's **CDN cache** from inside a Worker
164
+ (`caches.default.match/put`). Distinct from KV: it's **per-POP** (not global — a cache
165
+ put in one POP isn't visible in another), tied to the CDN, and ideal for caching
166
+ subrequest/compute results at the edge with normal HTTP cache semantics. Use it to
167
+ collapse repeated origin/compute work per-POP and stay under the subrequest cap. Respect
168
+ `Cache-Control`; a bad cache key (unique query param) tanks hit ratio just like CloudFront.
169
+
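A bad cache key is usually a URL carrying parameters that do not change the response. The sketch below normalizes a URL before it is used as a cache key; the list of ignored parameters and the `normalizeCacheKey` name are assumptions for illustration, and the right list depends on what the origin actually varies on.

```typescript
// Query parameters assumed not to affect the response; adjust per application.
const IGNORED_PARAMS = new Set(["utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"]);

// Produces a stable cache key: drops ignored params, sorts the rest, strips the fragment.
function normalizeCacheKey(rawUrl: string): string {
  const url = new URL(rawUrl);
  const kept: [string, string][] = [];
  url.searchParams.forEach((value, name) => {
    if (!IGNORED_PARAMS.has(name)) kept.push([name, value]);
  });
  // Sort by name, then value, so parameter order never splits the cache.
  kept.sort(([aName, aValue], [bName, bValue]) => {
    if (aName !== bName) return aName < bName ? -1 : 1;
    if (aValue !== bValue) return aValue < bValue ? -1 : 1;
    return 0;
  });
  url.search = new URLSearchParams(kept).toString();
  url.hash = "";
  return url.toString();
}

const keyA = normalizeCacheKey("https://example.com/p?b=2&utm_source=mail&a=1");
const keyB = normalizeCacheKey("https://example.com/p?a=1&b=2&fbclid=xyz#top");
```

Both calls yield the same key, so the two requests share one cache entry. Inside a Worker the normalized URL would be what gets passed to `caches.default.match` and `caches.default.put`.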
170
+ ## Messaging
171
+
172
+ ### Queues
173
+
174
+ - Managed **message queue** with Worker producers and consumers, **at-least-once** delivery
175
+ (idempotent consumers mandatory), **batching** (consumer gets a batch — tune
176
+ `max_batch_size` / `max_batch_timeout`), **retries** with configurable `max_retries`, and
177
+ a **dead-letter queue** for poison messages (always configure one, exactly as with SQS).
178
+ - The right tool to **decouple** work from the request path and to **get under the
179
+ subrequest limit**: instead of fanning out 500 API calls inside one Worker, enqueue 500
180
+ messages and let the consumer process them in batches across many invocations. Also
181
+ smooths spikes and isolates a slow downstream from user latency.
182
+ - Throughput/message-size limits apply (message ≤128KB, throughput quotas per queue) —
183
+ check before betting a very high-volume pipeline on it.
184
+
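The two mechanics above, splitting a fan-out into batches and tolerating at-least-once delivery, can be sketched as plain functions. `QueueMessageLike` is a local stand-in with only an `id` and a `body`; the in-memory `Set` of seen IDs is an assumption for the example, since a real consumer needs durable dedup state (a Durable Object or a unique key in a database).

```typescript
// Local stand-in for a delivered message; only the fields used here are modeled.
interface QueueMessageLike<Body> {
  id: string;
  body: Body;
}

// Splits a large fan-out into batches small enough to enqueue in one call each.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Idempotent batch handling: a redelivered message ID is skipped, not processed again.
function processBatchOnce<Body>(
  batch: readonly QueueMessageLike<Body>[],
  seenIds: Set<string>,
  handle: (body: Body) => void,
): number {
  let handled = 0;
  for (const message of batch) {
    if (seenIds.has(message.id)) continue; // redelivery: work was already done
    handle(message.body);
    seenIds.add(message.id);
    handled++;
  }
  return handled;
}

// 500 hypothetical work items become 5 batches of 100.
const workItems = Array.from({ length: 500 }, (_, i) => i);
const workBatches = chunk(workItems, 100);
```

Deduplicating on the message ID covers redelivery of the same message; if a producer may enqueue the same work twice, a business-level idempotency key in the body is the safer choice.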
185
+ ## AI
186
+
187
+ ### Workers AI
188
+
189
+ - Run inference (LLMs, embeddings, image, speech, classification) on Cloudflare's **GPU
190
+ edge network** via a binding — no infra, no GPU to provision. Billed on **Neurons** (a
191
+ normalized compute unit) with a daily free allocation. **Quotas/rate limits per model**
192
+ will throttle a naive high-volume pipeline — back off and batch. Model availability and
193
+ context limits vary by model; pick the smallest model that does the job (don't run a
194
+ large LLM for a classification). Pairs with Vectorize for edge RAG.
195
+
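"Back off" in practice means spacing retries out after a rate-limit response. Here is a minimal sketch of a capped exponential schedule with optional jitter; the function names, base delay, and cap are assumptions for illustration, and nothing here is specific to the Workers AI binding.

```typescript
// Capped exponential schedule: base, 2x base, 4x base, ... never above capMs.
function backoffDelaysMs(attempts: number, baseMs: number, capMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < attempts; attempt++) {
    delays.push(Math.min(capMs, baseMs * 2 ** attempt));
  }
  return delays;
}

// "Full jitter": pick a random delay in [0, delay) so retrying clients do not synchronize.
// The random source is injectable so the behavior can be tested deterministically.
function withJitter(delayMs: number, random: () => number = Math.random): number {
  return random() * delayMs;
}

const retrySchedule = backoffDelaysMs(5, 200, 2000); // [200, 400, 800, 1600, 2000]
```

A caller would wait `withJitter(retrySchedule[n])` milliseconds before retry `n`, and give up (or route the work to a queue) once the schedule is exhausted.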
196
+ ### AI Gateway
197
+
198
+ - A **proxy/control-plane in front of *any* model provider** (Workers AI, OpenAI,
199
+ Anthropic, etc.) that adds **caching** (dedup identical prompts → big cost saver),
200
+ **rate limiting**, **retries/fallbacks** across providers, request logging, and analytics
201
+ — without changing your application code beyond the endpoint. Use it as the single
202
+ chokepoint for all LLM traffic to control cost, observe spend, and add resilience. The
203
+ caching layer alone often pays for itself on repetitive prompts.
204
+
205
+ ## Origin connectivity
206
+
207
+ ### Hyperdrive
208
+
209
+ - **Connection pooling + query caching in front of an external origin database**
210
+ (Postgres/MySQL — e.g. RDS, Cloud SQL, Neon, Supabase). Solves the exact problem that
211
+ bites serverless + Postgres everywhere: each Worker invocation would otherwise open a new
212
+ DB connection and exhaust `max_connections` (the Lambda+RDS mismatch). Hyperdrive
213
+ **pools** connections at the edge so thousands of Workers share a small pool, and
214
+ **caches** read queries to cut round trips. It also keeps a warm connection to the
215
+ origin, hiding connection-setup latency.
216
+ - Use it whenever Workers talk to a traditional regional SQL database. Pair with **Smart
217
+ Placement** so the Worker runs near the origin. This is the Cloudflare answer to RDS
218
+ Proxy / PgBouncer — without it, a busy Worker fleet will knock over the origin DB on
219
+ connection count alone.
220
+
221
+ ## Consistency model cheat-sheet (pick the right primitive)
222
+
223
+ | Need | Use | Consistency |
224
+ | --- | --- | --- |
225
+ | High-read config / flags / global lookups | **KV** | Eventual (~60s) |
226
+ | Relational data, real SQL, transactions | **D1** | Strong on primary; replicas eventual |
227
+ | Coordination / counters / locks / WebSockets / per-entity state | **Durable Objects** | Strong, serialized, single-writer |
228
+ | Large blobs / media / egress-heavy | **R2** | Read-after-write (new objects) |
229
+ | Per-POP HTTP/compute caching | **Cache API** | Per-POP, TTL-based |
230
+ | Pool/cache to an external SQL DB | **Hyperdrive** | Inherits origin |
231
+ | Vector/embedding search | **Vectorize** | Index-level |
232
+
233
+ The classic mistake: reaching for KV because it's simple, then needing read-after-write or
233
+ a counter; that is a Durable Object job. The second: reaching for D1 for a single big
234
+ multi-tenant DB and hitting the 10GB / single-writer ceiling; shard D1 or use Hyperdrive + Postgres.
236
+
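The cheat-sheet can also be read as a lookup with one override for the classic KV mistake. The sketch below simply encodes the table rows; the `StateNeed` labels and the `pickPrimitive` helper are hypothetical names, and the single override rule is a simplification of the guidance above rather than a complete decision procedure.

```typescript
type StateNeed =
  | "global-config"
  | "relational-sql"
  | "coordination"
  | "large-blobs"
  | "per-pop-cache"
  | "external-sql"
  | "vector-search";

interface Recommendation {
  use: string;
  consistency: string;
}

// One entry per row of the cheat-sheet table.
const CHEAT_SHEET: Record<StateNeed, Recommendation> = {
  "global-config": { use: "KV", consistency: "Eventual (~60s)" },
  "relational-sql": { use: "D1", consistency: "Strong on primary; replicas eventual" },
  "coordination": { use: "Durable Objects", consistency: "Strong, serialized, single-writer" },
  "large-blobs": { use: "R2", consistency: "Read-after-write (new objects)" },
  "per-pop-cache": { use: "Cache API", consistency: "Per-POP, TTL-based" },
  "external-sql": { use: "Hyperdrive", consistency: "Inherits origin" },
  "vector-search": { use: "Vectorize", consistency: "Index-level" },
};

// KV cannot give read-after-write, so that requirement redirects to Durable Objects.
function pickPrimitive(need: StateNeed, needsReadAfterWrite = false): Recommendation {
  if (need === "global-config" && needsReadAfterWrite) {
    return CHEAT_SHEET["coordination"];
  }
  return CHEAT_SHEET[need];
}

const flagsStore = pickPrimitive("global-config");          // KV
const counterStore = pickPrimitive("global-config", true);  // Durable Objects
```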
237
+ ## Failure modes (what breaks and how it shows up)
238
+
239
+ - **`Exceeded CPU` / `Exceeded Memory` (Error 1102)** — CPU-bound or >128MB hot path.
240
+ Profile, offload heavy compute, stream large bodies, raise `limits.cpu_ms` if it's
241
+ legitimately compute-heavy and on paid.
242
+ - **Subrequest limit exceeded** — per-item fan-out inside one Worker. Batch, cache (Cache
243
+ API), or move fan-out to Queues / a Durable Object.
244
+ - **KV stale reads / lost writes** — using KV where read-after-write or coordination was
245
+ needed. Move to a Durable Object.
246
+ - **D1 `rows_read` blowup / slow queries** — unindexed scans (bills every row) or hitting
247
+ the 10GB / single-writer ceiling. Index, shard, or move to Hyperdrive+Postgres.
248
+ - **Durable Object hot-object bottleneck** — all traffic to one DO ID serializes. Shard
249
+ the keyspace across many IDs.
250
+ - **DB connections exhausted** — Workers opening direct connections to a regional Postgres.
251
+ Put **Hyperdrive** in front.
252
+ - **Idle WebSocket memory/billing** — DOs pinned by idle sockets. Use the Hibernation API.
253
+
254
+ ## Cost realism (where Cloudflare bills behave differently)
255
+
256
+ 1. **No R2 egress** — the headline saving; egress-heavy workloads are far cheaper than S3.
257
+ The cost moves to **Class A/B operation counts** — millions of tiny ops add up.
258
+ 2. **Workers = requests + CPU time**, not GB-seconds of wall-clock. Idle-waiting on I/O is
259
+ nearly free; the bill is driven by request count and *compute*. A handler that does
260
+ little CPU and waits on fetches is cheap even if slow.
261
+ 3. **D1 = rows read + rows written**, not queries — unindexed scans are the silent
262
+ multiplier. Index and watch `rows_read`.
263
+ 4. **Durable Objects = active duration (GB-s) + requests** — idle DOs pinned in memory
264
+ (esp. WebSockets without hibernation) bill continuously. Hibernate.
265
+ 5. **KV = per-op + storage** — read-heavy is its sweet spot; write-heavy is both wrong and
266
+ costly.
267
+ 6. **Workers AI = Neurons per inference** — model size and volume drive it; AI Gateway
268
+ caching cuts repetitive spend.
269
+ 7. **Queues = per-operation** on push/pull/retry. Fine in itself, but poison messages that
270
+ burn through every retry waste ops; cap `max_retries` and route them to a DLQ.
271
+
272
+ Levers: AI Gateway caching, Cache API + good cache keys, batching to stay under subrequest
273
+ limits, indexing D1, DO hibernation, sharding hot DOs, and Smart Placement to cut origin
274
+ round trips.
275
+
276
+ ## Limits to check before betting on them (request increases early)
277
+
278
+ Workers CPU-time (30s default / 300s max paid, 10ms Free), **subrequests 50 Free / 1000
279
+ paid**, 128MB isolate memory, bundle size (3MB/10MB), **D1 10GB/database** + rows-billing,
280
+ KV value 25MB / ~1 write/s per key / ~60s propagation, R2 operation classes, Durable Object
281
+ single-thread throughput, Queues message ≤128KB + throughput quotas, Workers AI per-model
282
+ rate limits, Vectorize index dimension/count limits. Many are plan-tier (Free vs Paid vs
283
+ Enterprise) — verify the current numbers against Cloudflare docs (they change), and confirm
284
+ the plan tier before designing around a limit.
285
+
286
+ ## Observability
287
+
288
+ `wrangler tail` for live request logs, **Workers Logs / Logpush** for persisted logs to a
289
+ destination, **Analytics Engine** for high-cardinality custom metrics written from a
290
+ Worker, **Workers Trace Events / Tail Workers** for structured per-invocation traces, and
291
+ the dashboard's per-binding analytics (D1 query metrics, KV/R2 op counts, DO duration,
292
+ Queue depth, AI Gateway logs). Alert on the things that predict pain: Worker error rate +
293
+ CPU-limit (1102) exceptions, subrequest-limit errors, D1 `rows_read` trends, DO duration +
294
+ queue-up, Queue backlog/DLQ depth, and Workers AI rate-limit (429) responses.