npm - engsys - Versions diffs - 1.0.0 - Mend

engsys 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (173)

package/LICENSE +21 -0
package/README.md +202 -0
package/core/agents/aaron.md +152 -0
package/core/agents/bert.md +115 -0
package/core/agents/isabelle.md +136 -0
package/core/agents/jody.md +150 -0
package/core/agents/leith.md +111 -0
package/core/agents/marcelo.md +282 -0
package/core/agents/melvin.md +101 -0
package/core/agents/nyx.md +152 -0
package/core/agents/otto.md +168 -0
package/core/agents/patricia.md +283 -0
package/core/commands/design-audit-local.md +155 -0
package/core/commands/design-audit.md +235 -0
package/core/commands/design-critique.md +96 -0
package/core/commands/file-issue.md +22 -0
package/core/commands/generate-project.md +45 -0
package/core/commands/implement-issue.md +37 -0
package/core/commands/implement-project.md +40 -0
package/core/commands/naturalize.md +61 -0
package/core/commands/pre-push.md +29 -0
package/core/commands/prep-review-collect.md +130 -0
package/core/commands/prep-review-finalize.md +121 -0
package/core/commands/prep-review-publish.md +113 -0
package/core/commands/prep-review.md +65 -0
package/core/commands/project-closeout.md +25 -0
package/core/skills/agentic-eval/SKILL.md +195 -0
package/core/skills/chrome-devtools/SKILL.md +97 -0
package/core/skills/code-review/SKILL.md +26 -0
package/core/skills/gh-cli/SKILL.md +2202 -0
package/core/skills/git-commit/SKILL.md +124 -0
package/core/skills/git-workflow-agents/SKILL.md +462 -0
package/core/skills/git-workflow-agents/reference.md +220 -0
package/core/skills/github-actions/SKILL.md +190 -0
package/core/skills/github-issues/SKILL.md +154 -0
package/core/skills/llm-structured-outputs/SKILL.md +323 -0
package/core/skills/llm-structured-outputs/references/provider-details.md +392 -0
package/core/skills/pre-push/SKILL.md +115 -0
package/core/skills/refactor/SKILL.md +645 -0
package/core/skills/web-design-reviewer/SKILL.md +371 -0
package/core/skills/webapp-testing/SKILL.md +127 -0
package/core/skills/webapp-testing/test-helper.js +56 -0
package/core/templates/CLAUDE.md.tmpl +98 -0
package/core/templates/adr-template.md +67 -0
package/core/templates/gh-issue-templates/bug.md +39 -0
package/core/templates/gh-issue-templates/content.md +42 -0
package/core/templates/gh-issue-templates/enhancement.md +36 -0
package/core/templates/gh-issue-templates/feature.md +39 -0
package/core/templates/gh-issue-templates/infrastructure.md +41 -0
package/core/templates/post-edit-reminders.sh.tmpl +19 -0
package/core/templates/settings.json.tmpl +90 -0
package/core/templates/settings.local.json.tmpl +3 -0
package/core/workflows/agent-implementation-workflow.md +346 -0
package/core/workflows/generate-project.md +258 -0
package/core/workflows/implement-project-workflow.md +190 -0
package/core/workflows/issue-tracking.md +89 -0
package/core/workflows/project-closeout-ceremony.md +77 -0
package/core/workflows/review-workflow.md +266 -0
package/engsys.config.example.yaml +46 -0
package/install +202 -0
package/lessons-library/README.md +80 -0
package/lessons-library/async-callbacks-verify-liveness.md +15 -0
package/lessons-library/change-isnt-done-until-every-surface-updated.md +15 -0
package/lessons-library/claim-then-act-for-irreversible-ops.md +16 -0
package/lessons-library/co-commit-entangled-work.md +15 -0
package/lessons-library/dependabot-triage-playbook.md +17 -0
package/lessons-library/deploy-by-digest-and-verify-the-running-revision.md +15 -0
package/lessons-library/enforce-your-guarantee-at-your-boundary.md +16 -0
package/lessons-library/gate-changes-on-measurement-not-vibes.md +15 -0
package/lessons-library/iac-first-no-console-changes.md +15 -0
package/lessons-library/independent-objective-review-gate.md +15 -0
package/lessons-library/keep-an-immutable-source-of-truth.md +15 -0
package/lessons-library/long-agent-runs-checkpoint-not-poll.md +15 -0
package/lessons-library/model-identity-with-stable-ids-and-provenance.md +15 -0
package/lessons-library/operator-choices-are-first-class.md +15 -0
package/lessons-library/prefer-tool-enforced-structured-output.md +15 -0
package/lessons-library/prove-causation-before-acting.md +15 -0
package/lessons-library/re-read-state-before-acting.md +14 -0
package/lessons-library/read-layer-tolerates-unbackfilled-rows.md +15 -0
package/lessons-library/shell-safety-pipefail-and-validate-before-teardown.md +14 -0
package/lessons-library/shift-correctness-left-and-distrust-false-greens.md +15 -0
package/lessons-library/stray-control-bytes-hide-changes.md +14 -0
package/lessons-library/tests-can-assert-the-bug.md +15 -0
package/lessons-library/verify-ground-truth-not-reports.md +15 -0
package/lessons-library/worktrees-need-bootstrap-from-origin-main.md +15 -0
package/lib/commands.js +356 -0
package/lib/generate-team-avatars.mjs +251 -0
package/lib/manifest.js +155 -0
package/lib/render.js +135 -0
package/lib/selftest.js +90 -0
package/lib/util.js +89 -0
package/lib/yaml.js +156 -0
package/optional-agents/gary.md +86 -0
package/optional-agents/jos.md +136 -0
package/optional-agents/sandy.md +101 -0
package/optional-agents/steve.md +161 -0
package/package.json +43 -0
package/stacks/cloud/aws/claude.fragment.md +17 -0
package/stacks/cloud/aws/settings.fragment.json +39 -0
package/stacks/cloud/aws/skills/aws-deployment-preflight/SKILL.md +165 -0
package/stacks/cloud/aws/skills/cloud-architecture-aws/SKILL.md +265 -0
package/stacks/cloud/azure/claude.fragment.md +17 -0
package/stacks/cloud/azure/settings.fragment.json +45 -0
package/stacks/cloud/azure/skills/azure-deployment-preflight/SKILL.md +175 -0
package/stacks/cloud/azure/skills/cloud-architecture-azure/SKILL.md +211 -0
package/stacks/cloud/cloudflare/claude.fragment.md +21 -0
package/stacks/cloud/cloudflare/settings.fragment.json +31 -0
package/stacks/cloud/cloudflare/skills/cloud-architecture-cloudflare/SKILL.md +294 -0
package/stacks/cloud/cloudflare/skills/cloudflare-deployment-preflight/SKILL.md +175 -0
package/stacks/cloud/gcp/claude.fragment.md +17 -0
package/stacks/cloud/gcp/settings.fragment.json +40 -0
package/stacks/cloud/gcp/skills/cloud-architecture-gcp/SKILL.md +208 -0
package/stacks/cloud/gcp/skills/gcp-deployment-preflight/SKILL.md +137 -0
package/stacks/db/mongo/skills/mongo-conventions/SKILL.md +96 -0
package/stacks/db/prisma/claude.fragment.md +49 -0
package/stacks/db/prisma/skills/docker-database-package-copy/SKILL.md +44 -0
package/stacks/db/prisma/skills/prisma-conventions/SKILL.md +37 -0
package/stacks/domain/mobile-growth/skills/apple-ads/SKILL.md +184 -0
package/stacks/domain/mobile-growth/skills/apple-ads/references/benchmark-notes.md +47 -0
package/stacks/domain/mobile-growth/skills/apple-ads/references/official-links.md +53 -0
package/stacks/domain/mobile-growth/skills/google-play-growth/SKILL.md +197 -0
package/stacks/domain/mobile-growth/skills/google-play-growth/references/benchmark-notes.md +47 -0
package/stacks/domain/mobile-growth/skills/google-play-growth/references/official-links.md +45 -0
package/stacks/iac/bicep/claude.fragment.md +14 -0
package/stacks/iac/bicep/settings.fragment.json +20 -0
package/stacks/iac/bicep/skills/iac-bicep/SKILL.md +113 -0
package/stacks/iac/cdk/claude.fragment.md +14 -0
package/stacks/iac/cdk/settings.fragment.json +23 -0
package/stacks/iac/cdk/skills/iac-cdk/SKILL.md +104 -0
package/stacks/iac/terraform/claude.fragment.md +13 -0
package/stacks/iac/terraform/settings.fragment.json +25 -0
package/stacks/iac/terraform/skills/iac-terraform/SKILL.md +93 -0
package/stacks/iac/terraform/skills/terraform-conventions/SKILL.md +87 -0
package/stacks/lang/kotlin/skills/android-testing/SKILL.md +263 -0
package/stacks/lang/kotlin/skills/jetpack-compose/SKILL.md +264 -0
package/stacks/lang/kotlin/skills/kotlin-coroutines/SKILL.md +329 -0
package/stacks/lang/python/skills/python-conventions/SKILL.md +61 -0
package/stacks/lang/shell/skills/shell-scripting/SKILL.md +110 -0
package/stacks/lang/swift/skills/swift-concurrency/SKILL.md +423 -0
package/stacks/lang/swift/skills/swift-concurrency/references/approachable-concurrency.md +80 -0
package/stacks/lang/swift/skills/swift-concurrency/references/concurrency-patterns.md +233 -0
package/stacks/lang/swift/skills/swift-concurrency/references/swiftui-concurrency.md +187 -0
package/stacks/lang/swift/skills/swift-concurrency/references/synchronization-primitives.md +341 -0
package/stacks/lang/swift/skills/swift-testing/SKILL.md +497 -0
package/stacks/lang/swift/skills/swift-testing/references/testing-advanced.md +106 -0
package/stacks/lang/swift/skills/swift-testing/references/testing-patterns.md +504 -0
package/stacks/lang/swift/skills/swiftdata/SKILL.md +334 -0
package/stacks/lang/swift/skills/swiftdata/references/core-data-coexistence.md +504 -0
package/stacks/lang/swift/skills/swiftdata/references/swiftdata-advanced.md +975 -0
package/stacks/lang/swift/skills/swiftdata/references/swiftdata-queries.md +675 -0
package/stacks/lang/swift/skills/swiftui-patterns/SKILL.md +371 -0
package/stacks/lang/swift/skills/swiftui-patterns/references/architecture-patterns.md +486 -0
package/stacks/lang/swift/skills/swiftui-patterns/references/deprecated-migration.md +1097 -0
package/stacks/lang/swift/skills/swiftui-patterns/references/design-polish.md +780 -0
package/stacks/lang/swift/skills/swiftui-patterns/references/platform-and-sharing.md +696 -0
package/stacks/lang/typescript/skills/typescript-conventions/SKILL.md +91 -0
package/stacks/platform/android/claude.fragment.md +40 -0
package/stacks/platform/android/hooks/pre-push-gradle.sh +70 -0
package/stacks/platform/android/settings.fragment.json +13 -0
package/stacks/platform/android/skills/android-build-conventions/SKILL.md +247 -0
package/stacks/platform/ios/claude.fragment.md +24 -0
package/stacks/platform/ios/hooks/pre-push-xcodebuild.sh +82 -0
package/stacks/platform/ios/settings.fragment.json +21 -0
package/stacks/platform/ios/skills/xcodebuildmcp-simulator-logs/SKILL.md +76 -0
package/stacks/platform/web/skills/frontend-testing/SKILL.md +246 -0
package/stacks/platform/web/skills/react-conventions/SKILL.md +261 -0
package/stacks/platform/web/skills/web-platform-conventions/SKILL.md +55 -0
package/stacks/tooling/issue-tracker-github/claude.fragment.md +10 -0
package/stacks/tooling/issue-tracker-github/settings.fragment.json +24 -0
package/stacks/tooling/issue-tracker-github/skills/issue-tracker-github/SKILL.md +278 -0
package/stacks/tooling/issue-tracker-linear/claude.fragment.md +17 -0
package/stacks/tooling/issue-tracker-linear/settings.fragment.json +9 -0
package/stacks/tooling/issue-tracker-linear/skills/issue-tracker-linear/SKILL.md +183 -0

package/stacks/cloud/aws/skills/cloud-architecture-aws/SKILL.md ADDED

@@ -0,0 +1,265 @@
+---
+name: cloud-architecture-aws
+description: AWS service-level architecture knowledge — compute (Lambda/Fargate/ECS/EC2), data (DynamoDB/Aurora/RDS), messaging (SQS/SNS/EventBridge/Step Functions), edge (CloudFront/API Gateway), storage (S3/KMS), and Bedrock. Cost models, service quotas, failure modes, and p99/cold-start gotchas. Activate when the active cloud is AWS and the work involves designing, scaling, costing, or diagnosing AWS architecture (Lambda cold starts, Aurora connection limits, DynamoDB hot partitions, NAT egress, Fargate task warm-up).
+---
+# AWS Architecture Knowledge
+Service-level detail for an AWS-backed project. Pairs with Melvin's cloud-agnostic
+diagnostic checklist (traffic pattern, state location, SLAs, blast radius, cost
+explosion, coordination, limits, observability) — this pack supplies the AWS-specific
+answers for each. For concrete project topology, cost tiers, and stack context, read
+the architecture docs named in `CLAUDE.md`.
+## Compute
+### Lambda
+- **Cold starts** are the p99 killer. Node/Python ~100-400ms; JVM/.NET worse;
+  VPC-attached used to add seconds (now ~sub-100ms with Hyperplane ENI, but still
+  non-zero). A function called rarely is *always* cold.
+- **Mitigations:** provisioned concurrency (you pay for warm instances), keeping the
+  bundle small (esbuild/tree-shake), avoiding heavy module-level init, SnapStart for
+  JVM. Provisioned concurrency defeats the "scale to zero" cost story — only buy it
+  where the latency SLA demands it.
+- **Concurrency limits:** default account limit is 1,000 concurrent executions (raise
+  via quota request). Scale-up is rate-limited: each function adds at most 1,000
+  concurrent executions every 10 seconds. A spike past that rate or the account limit = throttles (429), not infinite scale.
+- **Reserved concurrency** carves a function's share out of the account pool AND caps
+  it — useful to protect a downstream (e.g. a DB) from a stampede, but it can starve
+  other functions.
+- **15-minute max** execution; 10GB memory ceiling (CPU scales with memory). Payload
+  6MB sync / 256KB async. `/tmp` is 512MB default (up to 10GB).
+- **Good for:** spiky/event-driven work, glue, APIs with tolerant latency budgets.
+  **Bad for:** sustained high-throughput, long-running, or latency-critical-at-tail.
+### Fargate / ECS
+- Serverless containers — no node management, but **task warm-up** (image pull +
+  container start) is tens of seconds. Scale-out is *not* instant; keep a warm pool /
+  min task count for latency-sensitive services. This is the Fargate analogue of cold
+  starts.
+- **Cost:** billed per vCPU-second and GB-second. A perpetually-running over-provisioned
+  task (the proverbial 8-vCPU task nobody can explain) bleeds money. Right-size from
+  CloudWatch container metrics; don't guess.
+- **ECS on EC2** vs Fargate: EC2 is cheaper at steady high utilization and gives GPU /
+  special-instance access; Fargate wins on operational simplicity and bursty workloads.
+- **EKS** only when you genuinely need the Kubernetes ecosystem (operators, complex
+  scheduling, multi-tenant platform). For most app workloads it's operational overhead
+  you'll regret — prefer Fargate/ECS or Lambda.
+### EC2
+- The escape hatch: full control, any instance family (compute/memory/GPU-optimized),
+  Spot for fault-tolerant batch (up to ~90% off, can be reclaimed with 2-min warning).
+- Savings Plans / Reserved Instances for steady baseline; on-demand for the spiky top.
+- You own patching, AMIs, autoscaling groups, health checks. Reach for it when managed
+  compute can't meet a hardware, licensing, or latency requirement.
+## Data
+### DynamoDB
+- **Partition key design is everything.** A hot partition (uneven key distribution)
+  throttles even when table-level capacity looks fine — each partition has its own
+  throughput ceiling (~3,000 RCU / 1,000 WCU). Design keys for even spread; use
+  write-sharding for naturally-skewed keys.
+- **Capacity:** on-demand (pay-per-request, instant scaling, ~7x the per-request cost)
+  vs provisioned (cheaper at predictable load, autoscaling lags spikes). On-demand for
+  unknown/spiky; provisioned+autoscaling for steady, known traffic.
+- **Consistency:** eventually consistent reads by default (half the cost); strongly
+  consistent on request (single-region, not for GSIs). Global tables = multi-region,
+  last-writer-wins, eventually consistent across regions.
+- **Single-table design** maximizes performance but is a modeling discipline — get the
+  access patterns right *first*. Item size max 400KB. Use conditional writes for
+  optimistic concurrency / idempotency.
+- **Cost traps:** GSIs amplify write cost (a write also consumes write capacity on
+  every index it touches); scans are anti-patterns; large items inflate RCU/WCU.
+### Aurora / RDS PostgreSQL
+- **Connection limits** are the classic bottleneck. Postgres connections are expensive
+  (memory per connection); instance size caps `max_connections`. Lambda + RDS is a
+  notorious mismatch — N concurrent Lambdas = N connections. Use **RDS Proxy** (or
+  PgBouncer) to pool, or you'll exhaust connections under burst.
+- **Aurora** vs RDS: Aurora separates compute from a distributed storage layer — faster
+  failover (~30s), up to 15 read replicas sharing storage (no replication lag from
+  storage copy), auto-scaling storage. Costs more per hour but often wins on HA + read
+  scaling. **Aurora Serverless v2** scales ACUs with load for spiky workloads.
+- **Read replicas** scale reads, not writes. Route read traffic deliberately; beware
+  replica lag for read-after-write.
+- **IOPS cost:** Aurora bills per I/O on the standard config (can dominate the bill on
+  I/O-heavy workloads — consider Aurora I/O-Optimized). Provisioned IOPS on RDS gp3/io2
+  is a real line item.
+- **Failover:** Multi-AZ is a standby promotion (brief downtime, connections drop —
+  apps must reconnect/retry). Know what breaks when the primary fails over.
+## Messaging & Orchestration
+### SQS
+- Standard (at-least-once, best-effort ordering, near-unlimited throughput) vs **FIFO**
+  (exactly-once-processing, strict order, 300 msg/s without batching / 3,000 with).
+  Default to standard unless ordering/dedup is a hard requirement — FIFO throughput is
+  a real ceiling.
+- **Always configure a DLQ** with a sane `maxReceiveCount`. Poison messages without a
+  DLQ loop forever and burn money.
+- **Visibility timeout** must exceed worst-case processing time, or you get duplicate
+  delivery. Idempotent consumers are mandatory (at-least-once).
+- Lambda+SQS event source scales pollers automatically; watch the batch size and the
+  interaction with reserved concurrency (can throttle and silently back up the queue).
+### SNS / EventBridge
+- **SNS:** pub/sub fan-out (topic → many subscribers: SQS, Lambda, HTTP). Simple, high
+  throughput.
+- **EventBridge:** content-based routing with rule filtering, schema registry, many SaaS
+  + AWS-service event sources, scheduler. Higher latency than SNS but far richer routing.
+  Use EventBridge when you need to *route by content*; SNS when you just need fan-out.
+- EventBridge has per-rule and PutEvents throughput quotas — check before betting a
+  high-volume pipeline on it.
+### Step Functions
+- Managed orchestration for multi-step workflows — retries, error handling, parallel,
+  map, wait, human-approval. **Standard** workflows (long-running, up to 1 year, billed
+  per state transition — gets expensive at high volume) vs **Express** (≤5 min, billed
+  per request+duration, high throughput, cheaper for short high-volume flows).
+- Use it to make coordination explicit and observable instead of hiding a state machine
+  inside Lambda code. Don't use it for tight inner loops (per-transition cost).
+## Edge & Networking
+### CloudFront
+- CDN + edge caching. Fronts S3 (static), ALB/API Gateway (dynamic), Lambda@Edge /
+  CloudFront Functions for edge logic. Cuts origin load and egress cost (CloudFront
+  egress is cheaper than direct S3/EC2 egress and the origin fetch is free within AWS).
+- Cache key design and TTLs determine hit ratio — a bad cache key (e.g. including a
+  unique query param) tanks it. Honor `Cache-Control`; use cache policies deliberately.
+### API Gateway
+- REST (feature-rich, caching, usage plans, request validation — higher cost/latency)
+  vs **HTTP API** (cheaper, lower latency, fewer features — prefer it unless you need
+  REST-only features) vs WebSocket. Built-in throttling (account + per-method) protects
+  backends; tune it.
+### VPC / Networking — the cost landmines
+- **NAT Gateway** is the silent budget-eater: hourly charge **plus per-GB processing on
+  all egress through it**. High-egress workloads behind NAT bleed money. Mitigate with
+  **VPC Gateway/Interface Endpoints** (S3 + DynamoDB gateway endpoints are *free* and
+  bypass NAT entirely; PrivateLink interface endpoints have an hourly+per-GB cost but
+  beat NAT for AWS-service traffic).
+- **Cross-AZ data transfer** is charged per GB *each direction* — chatty cross-AZ
+  traffic adds up fast. Cross-region is more expensive still.
+- Default to private subnets; public only for ALBs / NAT / bastions.
+## Storage & Crypto
+### S3
+- Eleven 9s durability. Storage classes: Standard → Intelligent-Tiering (auto, good
+  default for unknown access) → Standard-IA / One Zone-IA → Glacier tiers (retrieval
+  latency + cost). Lifecycle policies to age data down.
+- **Cost:** storage + **per-request** (GET/PUT/LIST add up at high volume) + egress.
+  Globally-unique bucket names (see preflight). Enable encryption (SSE-S3/KMS),
+  versioning, and block public access by default.
+### KMS
+- Envelope encryption: KMS-managed CMK wraps per-object data keys (DEKs). Pattern for
+  crypto-shred (delete the key → data is unrecoverable). KMS request volume is cheap —
+  it's a rounding error on most bills; don't over-optimize it. Customer-managed keys
+  cost a small monthly fee + per-request; key rotation is built-in.
+## Bedrock
+- Managed access to foundation models (Anthropic Claude, Amazon Nova/Titan, Meta Llama,
+  Mistral, etc.) via a single API — no infra to run. On-demand (per-token) vs
+  **Provisioned Throughput** (reserved model units for guaranteed capacity/latency,
+  committed spend).
+- **Quotas matter:** per-model requests-per-minute and tokens-per-minute limits will
+  throttle a naive high-volume pipeline — request increases early and build in backoff.
+- Knobs: Guardrails (content filtering), Knowledge Bases (managed RAG), Agents.
+  Region availability varies by model, so confirm the model is offered in your region.
+- Cost is token-driven; long contexts and large outputs dominate. Cache/trim prompts,
+  pick the right model size per task (don't use a frontier model for a classification).
+## Cost realism (where AWS bills explode)
+1. **NAT Gateway egress** — per-GB on everything leaving via NAT. Use VPC endpoints.
+2. **Cross-AZ / cross-region data transfer** — per-GB each way.
+3. **Fargate/EC2 over-provisioning** — vCPU-seconds on idle headroom.
+4. **Aurora/RDS IOPS** — per-I/O can exceed compute cost on I/O-heavy workloads.
+5. **DynamoDB** — on-demand premium, GSI write amplification, scans.
+6. **Lambda provisioned concurrency** — pays for warm = no scale-to-zero savings.
+7. **S3 request volume** — per-request charges at high object counts.
+8. **Bedrock tokens** — context + output length.
+Levers: Savings Plans / Reserved Instances (steady baseline), Spot (fault-tolerant
+batch), right-sizing from CloudWatch, VPC endpoints, Intelligent-Tiering, Cost Explorer
++ budgets/alarms.
+## Service quotas (request increases *before* they bite)
+Lambda concurrent executions (1,000 default), VPC/EIP/ENI counts, Fargate task limits,
+DynamoDB table/account throughput, RDS instances + Postgres `max_connections`, API
+Gateway throttle rates, SQS FIFO throughput, Bedrock per-model RPM/TPM, S3 globally-
+unique bucket namespace, ECR repos. Many are soft — raise via Service Quotas console /
+support — but a few are hard. Check `aws service-quotas list-service-quotas` for the
+service and plan around the hard ones.
+## Observability
+CloudWatch (metrics, logs, alarms), CloudWatch Embedded Metric Format for high-cardinality
+custom metrics, X-Ray for distributed tracing, Cost Explorer + Budgets for spend. Alarm
+on the things that predict pain: Lambda throttles/errors/duration p99, SQS queue depth +
+age-of-oldest-message, DynamoDB throttled requests, Aurora connections + replica lag, NAT
+bytes processed.
+## Hard-won lessons
+### CloudFront VPC Origins preserve CloudFront's PUBLIC source IP
+**Symptom:** CloudFront → internal-ALB VPC Origin returns 504; ALB `RequestCount`
+is exactly 0; flow logs show the managed ENI's SYN with no SYN-ACK back.
+**Cause:** VPC Origins forward to the origin with **CloudFront's public origin-facing
+source IP preserved** (`130.176.x` ∈ prefix list `pl-3b927c52`), not the managed
+ENI's private VPC IP — so an ALB SG that admits only the VPC CIDR (`10.20.0.0/16`)
+REJECTs every CloudFront packet at the ALB-node ENI's ingress.
+**Fix:** Admit the service-managed `CloudFront-VPCOrigins-Service-SG` (an SG-to-SG
+rule, one per listener port) — or the origin-facing managed prefix list. A
+"looks-equivalent" VPC-CIDR rule is not a substitute.
+### Black-box 5xx: is the origin even hit? RequestCount + Flow Logs first
+**Symptom:** A fronted service (CloudFront/ELB) returns 5xx and you can't tell
+whether the request reached the origin, the targets, or the app.
+**Cause:** Reasoning from indirect proofs (Reachability Analyzer, intra-VPC curl)
+validates *adjacent* customer-controlled hops, not the actual managed path — they
+mislead you into "propagation" or "subnet" theories.
+**Fix:** Read ALB `RequestCount` first (0 = nothing completed → look upstream), then
+VPC Flow Logs filtered to the two ENI IPs. For an SG drop, read the **destination**
+ENI's *ingress* ACCEPT/REJECT — a source-side egress ACCEPT proves nothing about
+admission. Localize before hypothesizing.
+### Endpoint-only (NAT-less) VPC silently breaks public-internet egress
+**Symptom:** Server-side SSO token exchange (OAuth/JWKS) and SES email fail with a
+generic error, but only on the first *real* user — cloud smoke tests pass.
+**Cause:** VPC interface/gateway endpoints cover only **AWS** services (ECR, KMS,
+S3, Secrets Manager, Logs). Anything on the public internet that isn't an AWS
+PrivateLink service has no route out without NAT.
+**Fix:** Enumerate every public-internet dependency (OAuth/JWKS, SES/SMTP, webhooks,
+third-party APIs) before calling a private network "done"; each needs a NAT gateway
+or proxy. Add NAT additively (keep subnets isolated, attach NAT + `0.0.0.0/0`); a
+task whose SG denies egress stays isolated regardless.
+### One-off ECS task on a multi-container task def: judge by container NAME
+**Symptom:** A migration/job run on a shared multi-essential-container task def
+flakily fails the deploy with an exit-137 it didn't cause.
+**Cause:** When the overridden container (`host`) exits 0, ECS SIGKILLs the still-
+running sibling (`mcp` → 137). Reading `task.containers[0].exitCode` misattributes
+it because `DescribeTasks` container ordering is **not guaranteed**.
+**Fix:** Select the overridden container by **name** (`containers.find(c => c.name
+=== …)`), never by index. Better: give one-off jobs a dedicated single-container
+task definition so there's no sibling to kill.
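
Reviewer's sketch for the last lesson above (not part of the package): a minimal illustration of selecting the overridden container by name instead of by index. The helper name `exitCodeByName` and the sample task object are hypothetical; the object only mimics the `containers[]` shape (`name`, `exitCode`) of one task in an ECS `DescribeTasks` response, and no AWS SDK call is made.

```javascript
// Hypothetical helper: read the exit code of a named container from a task
// object shaped like one entry of an ECS DescribeTasks response.
function exitCodeByName(task, containerName) {
  const container = (task.containers || []).find((c) => c.name === containerName);
  if (!container) {
    throw new Error(`container "${containerName}" not found in task`);
  }
  // exitCode is absent while the container is still running.
  return container.exitCode ?? null;
}

// Sample data (made up): the sibling `mcp` is listed first and was SIGKILLed
// with 137 after the overridden `host` container exited 0.
const sampleTask = {
  containers: [
    { name: "mcp", exitCode: 137 },
    { name: "host", exitCode: 0 },
  ],
};

const byIndex = sampleTask.containers[0].exitCode; // blames the sibling
const byName = exitCodeByName(sampleTask, "host"); // the job's own result
console.log({ byIndex, byName });
```

With this sample ordering, the index lookup reports the sibling's 137 while the name lookup reports the job's 0, which is the misattribution the lesson describes.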

package/stacks/cloud/azure/claude.fragment.md ADDED

@@ -0,0 +1,17 @@
+## Cloud stack
+- **Active cloud: Azure.** Architecture and IaC target Azure; agents load the
+  `cloud-architecture-azure` and `azure-deployment-preflight` skill packs.
+- **Tool preference order** (when investigating or validating cloud state):
+  1. **Azure CLI, read-only** — `az account show`, `az resource list`,
+     `az deployment group list`, `az monitor`, `az keyvault list`,
+     `az postgres flexible-server show` and similar inspection commands. Never mutate
+     state to answer a question.
+  2. **Docs source** — the `microsoft-docs` skill (Microsoft Learn MCP:
+     `microsoft_docs_search` / `microsoft_docs_fetch`) for service limits, SKU/tier
+     behavior, and API details. Verify quotas/SKUs against docs rather than from memory.
+- Mutating actions (deploy/provision/delete) go through Bicep + the
+  `azure-deployment-preflight` gate, never ad-hoc CLI writes.
+<!-- naturalize: confirm the subscription, region(s), resource-group naming, and the
+path to the architecture/cost docs Melvin and Aaron should read for concrete topology. -->

package/stacks/cloud/azure/settings.fragment.json ADDED

@@ -0,0 +1,45 @@
+{
+  "permissions": {
+    "allow": [
+      "Bash(az account show:*)",
+      "Bash(az account list:*)",
+      "Bash(az group list:*)",
+      "Bash(az group show:*)",
+      "Bash(az resource list:*)",
+      "Bash(az resource show:*)",
+      "Bash(az deployment group list:*)",
+      "Bash(az deployment group show:*)",
+      "Bash(az deployment group validate:*)",
+      "Bash(az deployment group what-if:*)",
+      "Bash(az deployment sub what-if:*)",
+      "Bash(az bicep build:*)",
+      "Bash(bicep build:*)",
+      "Bash(az keyvault list:*)",
+      "Bash(az keyvault show:*)",
+      "Bash(az acr show:*)",
+      "Bash(az acr list:*)",
+      "Bash(az postgres flexible-server show:*)",
+      "Bash(az postgres flexible-server list:*)",
+      "Bash(az containerapp show:*)",
+      "Bash(az containerapp list:*)",
+      "Bash(az containerapp revision list:*)",
+      "Bash(az monitor metrics list:*)",
+      "Bash(az monitor log-analytics query:*)",
+      "Bash(azd provision --preview:*)",
+      "Bash(azd env list:*)"
+    ],
+    "deny": [
+      "Bash(az deployment group create:*)",
+      "Bash(az deployment sub create:*)",
+      "Bash(az group delete:*)",
+      "Bash(az resource delete:*)",
+      "Bash(azd up:*)"
+    ]
+  },
+  "mcpServers": {
+    "microsoft-docs": {
+      "type": "http",
+      "url": "https://learn.microsoft.com/api/mcp"
+    }
+  }
+}
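
Reviewer's sketch (not part of the package) of how an allow/deny list like the one above might be evaluated. It rests on two assumptions the diff does not state: that `Bash(<prefix>:*)` means "the command starts with `<prefix>`", and that a deny match overrides an allow match. The function names and the `"ask"` fallback are hypothetical.

```javascript
// ASSUMPTION: "Bash(<prefix>:*)" is a prefix rule. Returns the prefix or null.
function prefixOf(rule) {
  const match = /^Bash\((.*):\*\)$/.exec(rule);
  return match ? match[1] : null;
}

// A rule matches the exact prefix or the prefix followed by more arguments.
function matches(rule, command) {
  const prefix = prefixOf(rule);
  if (prefix === null) return false;
  return command === prefix || command.startsWith(prefix + " ");
}

// ASSUMPTION: deny wins over allow; unlisted commands fall back to "ask".
function decide(permissions, command) {
  if (permissions.deny.some((rule) => matches(rule, command))) return "deny";
  if (permissions.allow.some((rule) => matches(rule, command))) return "allow";
  return "ask";
}

// A small excerpt of the fragment's rules.
const permissions = {
  allow: ["Bash(az account show:*)", "Bash(az deployment group what-if:*)"],
  deny: ["Bash(az deployment group create:*)", "Bash(azd up:*)"],
};

console.log(decide(permissions, "az account show --output json"));
console.log(decide(permissions, "az deployment group create -g rg"));
console.log(decide(permissions, "az group create -n rg"));
```

Under these assumptions the read-only inspection command is allowed, the deployment create is denied, and a mutating command that appears in neither list is left to a prompt rather than silently permitted.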

package/stacks/cloud/azure/skills/azure-deployment-preflight/SKILL.md ADDED

@@ -0,0 +1,175 @@
+---
+name: azure-deployment-preflight
+description: Preflight validation for Azure infrastructure deployments (Bicep/ARM). Run before any az deployment / azd up. Validates templates (bicep build, az deployment group validate / what-if), cleans up stale failed ARM deployments that block re-deploy, catches globally-unique naming conflicts (Key Vault/ACR/etc.), and checks SKU/tier and service-limit restrictions. Carries the hard-won Bicep lessons (PgBouncer Burstable limit, alert module location, KQL interpolation, ACR SKU). Activate when the active cloud is Azure and the user mentions deploying, validating Bicep, what-if, preview, az deployment, azd provision, or deploy failures.
+---
+# Azure Deployment Preflight
+Validate Bicep/ARM deployments locally and clear blocking state *before* you deploy,
+so CI doesn't discover what you could have caught. Supports both `az` CLI and `azd`
+workflows. Continue through all steps even if one fails — capture every issue, then fix
+them in a batch.
+> Discipline: **batch your fixes.** Each push triggers a ~15-30 min CI run. Read the
+> entire failing module, reason about *all* potential issues, fix them all, push once.
+> One CI run per problem cluster, not one per error message.
+## When to use
+- Before deploying infrastructure to Azure (`az deployment`, `azd up`, `azd provision`).
+- When preparing or reviewing Bicep files.
+- To preview what a deployment will change (what-if).
+- To verify permissions are sufficient.
+- After a failed deployment left ARM in a blocking state.
+## Step 1 — Detect project type & locate templates
+- **azd project:** `azure.yaml` at root → use the **azd workflow**. Bicep usually under
+  `infra/`. Otherwise use the **`az` CLI workflow**.
+- **Locate `.bicep` files** (common: `infra/`, `infrastructure/`, `deploy/`, root) and
+  the matching parameter file per template: `<name>.bicepparam` (preferred) or
+  `<name>.parameters.json`.
+- Determine the **deployment scope** from `targetScope` in the template
+  (`resourceGroup` default / `subscription` / `managementGroup` / `tenant`) — it picks
+  the validate/what-if command in Step 3.
+- Confirm context: `az account show` (subscription), resource group, location. Prompt
+  for any missing required value before proceeding.
+## Step 2 — Validate Bicep syntax
+```bash
+bicep build <bicep-file> --stdout     # or: az bicep build --file <bicep-file>
+```
+> **`bicep build` / `az bicep build` only compiles the template** (syntax, types, linter
+> rules). It will **not** catch metric names, KQL scope, secret-ref mismatches, invalid
+> property combinations, or naming collisions. Treat it as the first gate, not the gate.
+If the Bicep CLI is missing, note it and continue — Azure validates syntax during
+validate/what-if anyway.
+## Step 3 — Full preflight validation (the real gate)
+### azd projects
+```bash
+azd provision --preview                       # optionally add: --environment <env>
+```
+### Standalone Bicep — `az deployment ... validate` then `what-if`
+`validate` catches deployment-time errors (property combos, secret refs, quota where
+checkable) that `bicep build` misses. Pass dummy values for required secure params so
+validation can run:
+```bash
+cd infrastructure
+az deployment group validate \
+  --resource-group <rg-name> \
+  --template-file main.bicep \
+  --parameters environments/<env>.bicepparam \
+  --parameters postgresAdminPassword="dummy" \
+  --parameters postgresAdminUsername="dummy"
+```
+**If it fails locally, fix it locally. Do not push and let CI discover it.**
+Then preview changes with what-if (command by scope):
+| targetScope | what-if command |
+| --- | --- |
+| `resourceGroup` (default) | `az deployment group what-if` |
+| `subscription` | `az deployment sub what-if --location <loc>` |
+| `managementGroup` | `az deployment mg what-if --management-group-id <id> --location <loc>` |
+| `tenant` | `az deployment tenant what-if --location <loc>` |
+```bash
+az deployment group what-if \
+  --resource-group <rg-name> \
+  --template-file main.bicep \
+  --parameters environments/<env>.bicepparam \
+  --validation-level Provider
+```
+**Fallback:** if `--validation-level Provider` fails on RBAC, retry with
+`ProviderNoRbac` and note in the report that the user may lack full deploy permissions.
+What-if change symbols: `+` create · `-` delete · `~` modify · `=` no change · `*`
+ignored · `!` deploy (redeployed; property changes can't be determined). Flag any
+**delete** or replacement of a stateful resource (PostgreSQL, Key Vault, storage).
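The scope table and the delete check above can be sketched as two POSIX shell helpers. Both names are hypothetical, and `flag_stateful_deletes` assumes the default human-readable what-if output, where each resource line starts with its change symbol followed by the resource type. It does not detect replacements, which still need a manual read of the output.

```shell
# Hypothetical helpers for illustration; not part of the az CLI.

# Maps a Bicep targetScope to the what-if subcommand from the table.
# Scope-specific flags (--resource-group, --location, --management-group-id)
# are left to the caller.
what_if_command() {
  case "${1:-resourceGroup}" in
    resourceGroup)   echo "az deployment group what-if" ;;
    subscription)    echo "az deployment sub what-if" ;;
    managementGroup) echo "az deployment mg what-if" ;;
    tenant)          echo "az deployment tenant what-if" ;;
    *) echo "unknown targetScope: $1" >&2; return 1 ;;
  esac
}

# Reads what-if text output on stdin and prints delete lines ("-") that
# mention a stateful resource provider. Exits non-zero when none are found.
# Assumption: resource lines look like "  - Microsoft.X/type/name [api-version]".
flag_stateful_deletes() {
  grep -E '^[[:space:]]*- ' \
    | grep -E 'Microsoft\.(DBforPostgreSQL|KeyVault|Storage)/'
}
```

Piping the what-if output through `flag_stateful_deletes` prints only the risky deletes, so a wrapper script can stop and ask for confirmation when the helper exits with status 0.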
+## Step 4 — Clean up stale failed ARM deployments
+ARM tracks deployment records **by name**. A failed sub-deployment (e.g.
+`<proj>-dev-alerts`) blocks a new run with **`DeploymentActive`** even while it sits in
+`Failed`. Clean up before each new attempt:
+```bash
+az deployment group list --resource-group <rg-name> \
+  --query "[?properties.provisioningState=='Failed'].name" -o tsv \
+  | grep -v "Failure-Anomalies" \
+  | xargs -I{} az deployment group delete --name {} --resource-group <rg-name> --no-wait
+```
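Before deleting anything, it can help to preview which records the pipeline above would remove. This is a minimal dry-run sketch in POSIX shell. `stale_delete_commands` is a hypothetical helper that only prints the delete commands and never calls `az` itself.

```shell
# Hypothetical dry-run helper: reads failed deployment names on stdin (one per
# line) and prints the delete command for each, skipping the same
# "Failure-Anomalies" entries the pipeline above excludes.
# $1 is the resource group name.
stale_delete_commands() {
  { grep -v "Failure-Anomalies" || true; } | while IFS= read -r name; do
    if [ -n "$name" ]; then
      printf 'az deployment group delete --name %s --resource-group %s --no-wait\n' "$name" "$1"
    fi
  done
}
```

Feed it the `-o tsv` output of the `az deployment group list` query from the block above, review the printed commands, then run the real cleanup.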
+## Step 5 — Globally-unique naming conflicts
+Several Azure resource types have **globally-unique** names, and names from a prior /
+deleted deployment are either reserved or **soft-deleted** (a soft-deleted Key Vault
+keeps its name reserved for the retention period, and purge protection blocks freeing
+it early). Parameterize the name and override on conflict:
+| Resource | Typical default (gets taken) | Override pattern |
+| --- | --- | --- |
+| Key Vault | `<proj>-<env>-kv` | `<proj>-<env>-kv-<suffix>` |
+| Storage account | `<proj><env>sa` (3-24 chars, lowercase letters and digits) | append short unique suffix |
+| ACR | `<proj><env>acr` | append short unique suffix |
+| Redis / Service Bus / Front Door / App | `<proj>-<env>-<x>` | `...-<suffix>` |
+**Pattern:** add `param keyVaultName` / `param <x>Name` to `main.bicep` and set the
+override in `environments/*.bicepparam` — don't hard-code the bare default. If a script
+(e.g. a Key Vault seeder) defaults the name internally, **pass the override explicitly**
+and make every workflow step that references it use the same resolved name.
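One way to produce the short unique suffix is to derive it from a stable seed, so every workflow step resolves the same name. This is a minimal sketch in POSIX shell. `short_suffix` and `is_valid_storage_name` are hypothetical helpers, and seeding from the subscription ID plus resource group is an assumption, not something this skill prescribes.

```shell
# Hypothetical helpers for illustration.

# Prints a deterministic numeric suffix (up to 5 digits) derived from the seed
# string via the POSIX cksum CRC. Same seed in, same suffix out.
short_suffix() {
  printf '%s' "$1" | cksum | cut -d ' ' -f 1 | cut -c 1-5
}

# Succeeds when the name fits the storage account rule from the table:
# 3-24 characters, lowercase letters and digits only.
is_valid_storage_name() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9]{3,24}$'
}
```

A script could compute `suffix=$(short_suffix "<subscription-id>/<rg-name>")` once and pass the resolved name to every step. Inside Bicep itself, the built-in `uniqueString()` function serves a similar purpose.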
+## Step 6 — SKU / tier & known Bicep gotchas (hard-won)
+These were discovered the expensive way — don't rediscover them.
+- **ACR SKU:** `Basic` may be unavailable on some subscriptions (e.g. Microsoft for
+  Startups). `Standard` works but **`retentionPolicy` requires `Premium`** — remove it
+  from dev/staging. If a failed deploy left a broken ACR, create it manually then let
+  Bicep treat it as no-change.
+- **PgBouncer not on Burstable:** dev/staging on `Standard_B*` (Burstable) cannot run
+  PgBouncer (needs GeneralPurpose+). Guard it:
+  `resource pgBouncer ... = if (currentSku.tier != 'Burstable') { ... }`.
+- **Alert module location:**
+  - `Microsoft.Insights/metricAlerts` → `location: 'global'` ✅
+  - `Microsoft.Insights/scheduledQueryRules` → `location: 'global'` ❌; use the real
+    region (e.g. `eastus2`).
+  - `scheduledQueryRules` must scope to the **Log Analytics workspace ID**, not the App
+    Insights ID — the `AppRequests` table lives in the workspace.
+- **KQL in Bicep:** queries in `'''` verbatim strings **do not interpolate** `${vars}`;
+  build the query with `format()` or `replace()`, or join single-quoted interpolated
+  strings in variables instead.
+- **Metric names:** PostgreSQL Flexible Server uses `active_connections`, not
+  `connection_percent` (that's Azure SQL).
+## Step 7 — Check for an in-flight deploy before triggering
+If CI auto-deploys on push to a branch, don't fire a manual deploy on top — the auto-run
+wins and yours sits queued (the user has to cancel it).
+```bash
+gh run list --workflow="<deploy workflow>" --limit 3
+```
+If a run is `queued`/`in_progress`, wait.
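The check can be scripted so a deploy wrapper waits on its own. This is a minimal sketch that assumes `gh` is authenticated for the repository. `deploy_in_flight` is a hypothetical helper name, and it relies on the `--json` and `--jq` output options of `gh run list`.

```shell
# Hypothetical helper: succeeds (exit 0) when any of the 3 most recent runs of
# the given workflow is queued or in progress. $1 is the workflow name or file.
deploy_in_flight() {
  gh run list --workflow="$1" --limit 3 --json status --jq '.[].status' \
    | grep -Eq '^(queued|in_progress)$'
}
```

A wrapper could loop with `while deploy_in_flight "<deploy workflow>"; do sleep 30; done` before triggering its own run.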
+## Step 8 — Report
+Summarize: validation + what-if results (creates / modifies / **deletes** /
+replacements), stale deployments cleaned, naming overrides applied, SKU/tier issues, and
+whether it's safe to deploy. Note any `ProviderNoRbac` fallback (permission gap).
+## Tool requirements
+`az` CLI 2.76+ (for `--validation-level`), `azd` (azd projects), `bicep` CLI, `gh` (if
+CI-driven). Verify auth: `az account show` / `azd auth login --check-status`. For deep
+Azure service docs, use the `microsoft-docs` skill (Microsoft Learn MCP).
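The version requirement can be checked up front. This is a minimal sketch in POSIX shell. `version_at_least` is a hypothetical helper that compares only the major and minor components, and the usage below assumes `az version` reports the CLI version under the `azure-cli` key.

```shell
# Hypothetical helper: version_at_least <have> <want>
# Succeeds when <have> is at least <want>, comparing major.minor numerically
# (patch components are ignored).
version_at_least() {
  have_major=${1%%.*}
  have_rest=${1#*.}
  have_minor=${have_rest%%.*}
  want_major=${2%%.*}
  want_rest=${2#*.}
  want_minor=${want_rest%%.*}
  if [ "$have_major" -ne "$want_major" ]; then
    [ "$have_major" -gt "$want_major" ]
  else
    [ "$have_minor" -ge "$want_minor" ]
  fi
}
```

For example, `version_at_least "$(az version --query '"azure-cli"' -o tsv)" 2.76 || echo "az CLI too old for --validation-level"`.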