@teckedd-code2save/b2dp 1.0.1 → 1.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +43 -119
- package/dist/index.js +29 -10
- package/dist/index.js.map +1 -1
- package/package.json +2 -1
- package/skills/api-test-generator/SKILL.md +72 -0
- package/skills/business-to-data-platform/SKILL.md +206 -0
- package/skills/cloud-solution-architect/SKILL.md +317 -0
- package/skills/cloud-solution-architect/references/acceptance-criteria.md +436 -0
- package/skills/cloud-solution-architect/references/architecture-styles.md +365 -0
- package/skills/cloud-solution-architect/references/best-practices.md +311 -0
- package/skills/cloud-solution-architect/references/design-patterns.md +873 -0
- package/skills/cloud-solution-architect/references/design-principles.md +328 -0
- package/skills/cloud-solution-architect/references/mission-critical.md +285 -0
- package/skills/cloud-solution-architect/references/performance-antipatterns.md +242 -0
- package/skills/cloud-solution-architect/references/technology-choices.md +159 -0
- package/skills/context7-mcp/SKILL.md +53 -0
- package/skills/frontend-data-consumer/SKILL.md +75 -0
- package/skills/frontend-design-review/SKILL.md +138 -0
- package/skills/frontend-design-review/references/pattern-examples.md +21 -0
- package/skills/frontend-design-review/references/quick-checklist.md +38 -0
- package/skills/frontend-design-review/references/review-output-format.md +68 -0
- package/skills/frontend-design-review/references/review-type-modifiers.md +31 -0
- package/skills/infrastructure-as-code-architect/SKILL.md +56 -0
@@ -0,0 +1,328 @@

# Azure Design Principles

Ten principles for building reliable, scalable, and manageable applications on Azure.

| # | Principle | Focus |
|---|-----------|-------|
| 1 | [Design for self-healing](#1-design-for-self-healing) | Resilience & automatic recovery |
| 2 | [Make all things redundant](#2-make-all-things-redundant) | Eliminate single points of failure |
| 3 | [Minimize coordination](#3-minimize-coordination) | Scalability through decoupling |
| 4 | [Design to scale out](#4-design-to-scale-out) | Horizontal scaling |
| 5 | [Partition around limits](#5-partition-around-limits) | Work around service limits |
| 6 | [Design for operations](#6-design-for-operations) | Observability & automation |
| 7 | [Use managed services](#7-use-managed-services) | Reduce operational burden |
| 8 | [Use an identity service](#8-use-an-identity-service) | Centralized identity & access |
| 9 | [Design for evolution](#9-design-for-evolution) | Change-friendly architecture |
| 10 | [Build for the needs of business](#10-build-for-the-needs-of-business) | Align tech to business goals |

---

## 1. Design for self-healing

Design the application to detect failures, respond gracefully, and recover automatically without manual intervention.

### Recommendations

- **Implement retry logic with backoff** for transient failures in network calls, database connections, and external service interactions.
- **Use health endpoint monitoring** to expose liveness and readiness probes so orchestrators and load balancers can route traffic away from unhealthy instances.
- **Apply circuit breaker patterns** to prevent cascading failures — stop calling a failing dependency and allow it time to recover.
- **Degrade gracefully** by serving reduced functionality (cached data, default responses) rather than failing entirely when a dependency is unavailable.
- **Adopt chaos engineering** with Azure Chaos Studio to proactively inject faults and validate recovery paths before real incidents occur.
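
The retry-with-backoff recommendation can be sketched in plain Python. No Azure SDK is involved; `retry_with_backoff` and the `flaky` dependency below are illustrative names, not a prescribed API:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.01, max_delay=5.0):
    """Retry a callable on exception, sleeping up to base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure
            # Full jitter keeps many retrying clients from synchronizing.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# Simulated transient failure: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = retry_with_backoff(flaky)
```

In real code the same shape is usually provided by a resilience library or the Azure SDKs' built-in retry policies; the point is the exponential delay plus jitter, not the wrapper itself.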

### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Retry | Handle transient faults by transparently retrying failed operations |
| Circuit Breaker | Prevent repeated calls to a failing service |
| Bulkhead | Isolate failures so one component doesn't take down others |
| Health Endpoint Monitoring | Expose health checks for load balancers and orchestrators |
| Leader Election | Coordinate distributed instances by electing a leader |
| Throttling | Control resource consumption by limiting request rates |

### Azure services

- **Azure Chaos Studio** — fault injection and chaos experiments
- **Azure Monitor / Application Insights** — health monitoring, alerting, diagnostics
- **Azure Traffic Manager / Front Door** — DNS and global failover
- **Availability Zones** — zonal redundancy within a region

---

## 2. Make all things redundant

Build redundancy into the application at every layer to avoid single points of failure. Composite availability formula: `1 - (1 - A)^N`, where A is the availability of a single instance and N is the number of independent instances.
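
The composite availability formula is easy to evaluate directly; a minimal sketch:

```python
def composite_availability(a: float, n: int) -> float:
    """Probability that at least one of n independent instances is up: 1 - (1 - A)^N."""
    return 1 - (1 - a) ** n

# Two instances at 99% availability each yield roughly 99.99% composite availability.
single = composite_availability(0.99, 1)
pair = composite_availability(0.99, 2)
```

Note the independence assumption: instances sharing a zone, host, or dependency fail together, which is why the recommendations below spread redundancy across zones and regions.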

### Recommendations

- **Place VMs behind a load balancer** and deploy multiple instances so requests can be served even if one instance fails.
- **Replicate databases** using read replicas, active geo-replication, or multi-region writes to protect data and maintain read performance during outages.
- **Use multi-zone and multi-region deployments** to survive datacenter and regional failures — define clear RTO (Recovery Time Objective) and RPO (Recovery Point Objective) targets.
- **Partition workloads for availability** so that a failure in one partition doesn't affect others.
- **Design for automatic failover** with health probes and traffic routing that redirect users without manual intervention.

### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Deployment Stamps | Deploy independent, identical copies of infrastructure |
| Geode | Deploy backend services across geographies |
| Health Endpoint Monitoring | Detect unhealthy instances for failover |
| Queue-Based Load Leveling | Buffer requests to smooth demand spikes |

### Azure services

- **Azure Load Balancer / Application Gateway** — distribute traffic across instances
- **Azure SQL geo-replication / Cosmos DB multi-region** — database redundancy
- **Availability Zones** — zonal redundancy within a region
- **Azure Site Recovery** — disaster recovery orchestration
- **Azure Front Door** — global load balancing with automatic failover

---

## 3. Minimize coordination

Minimize coordination between application services to achieve scalability. Tightly coupled services that require synchronous calls create bottlenecks and reduce availability.

### Recommendations

- **Embrace eventual consistency** instead of requiring strong consistency across services — accept that data may be temporarily out of sync.
- **Use domain events and asynchronous messaging** to decouple producers and consumers so they can operate independently.
- **Consider CQRS** (Command Query Responsibility Segregation) to separate read and write workloads with independently optimized stores.
- **Design idempotent operations** so messages can be safely retried or delivered more than once without unintended side effects.
- **Use optimistic concurrency** with version tokens or ETags instead of pessimistic locks that create coordination bottlenecks.
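
The ETag-based check can be sketched with an in-memory store. `EtagStore` is a hypothetical stand-in for a real service (Cosmos DB and Azure Storage enforce the same precondition via `If-Match` headers); the method names are illustrative:

```python
import uuid

class EtagStore:
    """In-memory key-value store with ETag-checked writes (optimistic concurrency)."""

    def __init__(self):
        self._data = {}  # key -> (value, etag)

    def get(self, key):
        return self._data[key]  # returns (value, etag)

    def put(self, key, value, etag=None):
        current = self._data.get(key)
        if current is not None and current[1] != etag:
            # Another writer updated the value since this caller read it.
            raise RuntimeError("precondition failed: stale etag")
        new_etag = uuid.uuid4().hex
        self._data[key] = (value, new_etag)
        return new_etag

store = EtagStore()
tag1 = store.put("cart:42", {"items": 1})
tag2 = store.put("cart:42", {"items": 2}, etag=tag1)  # succeeds: etag matches
try:
    store.put("cart:42", {"items": 99}, etag=tag1)    # stale writer loses
    conflict = False
except RuntimeError:
    conflict = True
```

The losing writer re-reads the current value and etag and retries, so no lock is ever held across the read-modify-write cycle.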

### Related design patterns

| Pattern | Purpose |
|---------|---------|
| CQRS | Separate reads from writes for independent scaling |
| Event Sourcing | Capture all changes as an immutable sequence of events |
| Saga | Manage distributed transactions without two-phase commit |
| Asynchronous Request-Reply | Decouple request and response across services |
| Competing Consumers | Process messages in parallel across multiple consumers |

### Azure services

- **Azure Service Bus** — reliable enterprise messaging with queues and topics
- **Azure Event Grid** — event-driven routing at scale
- **Azure Event Hubs** — high-throughput event streaming
- **Azure Cosmos DB** — tunable consistency levels (eventual to strong)

---

## 4. Design to scale out

Design the application so it can scale horizontally by adding or removing instances, rather than scaling up to larger hardware.

### Recommendations

- **Avoid instance stickiness and session affinity** — store session state externally (Redis, a database) so any instance can handle any request.
- **Identify and resolve bottlenecks** that prevent horizontal scaling, such as shared databases, monolithic components, or stateful in-memory caches.
- **Decompose workloads** into discrete services that can be scaled independently based on their specific demand profiles.
- **Use autoscaling based on live metrics** (CPU, queue depth, request latency) rather than fixed schedules to match capacity to real demand.
- **Design for scale-in** — handle instance removal gracefully with connection draining and proper shutdown hooks.
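
A metric-driven autoscale decision reduces to a small pure function. The target, bounds, and queue-depth metric here are illustrative assumptions, not Azure autoscale defaults:

```python
import math

def desired_instances(queue_depth, target_per_instance, min_n=2, max_n=20):
    """Pick an instance count so each instance handles about target_per_instance
    queued items, clamped to [min_n, max_n] to bound cost and keep redundancy."""
    needed = math.ceil(queue_depth / target_per_instance) if queue_depth > 0 else min_n
    return max(min_n, min(max_n, needed))

scale_out = desired_instances(queue_depth=1000, target_per_instance=100)  # 10
scale_in = desired_instances(queue_depth=50, target_per_instance=100)     # 2 (floor)
capped = desired_instances(queue_depth=5000, target_per_instance=100)     # 20 (ceiling)
```

The floor of two instances matters: scaling in to a single instance would trade cost for a single point of failure, undercutting principle 2.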

### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Competing Consumers | Distribute work across multiple consumers |
| Sharding | Distribute data across partitions for parallel processing |
| Deployment Stamps | Scale by deploying additional independent stamps |
| Static Content Hosting | Offload static assets to reduce compute load |
| Throttling | Protect the system from overload during scale events |

### Azure services

- **Azure Virtual Machine Scale Sets** — autoscale VM pools
- **Azure App Service / Azure Functions** — built-in autoscale
- **Azure Kubernetes Service (AKS)** — horizontal pod autoscaler and cluster autoscaler
- **Azure Cache for Redis** — externalize session state
- **Azure CDN / Front Door** — offload static content delivery

---

## 5. Partition around limits

Use partitioning to work around database, network, and compute limits. Every Azure service has limits — partitioning allows you to scale beyond them.

### Recommendations

- **Partition databases** horizontally (sharding), vertically (splitting columns), or functionally (by bounded context) to distribute load and storage.
- **Design partition keys to avoid hotspots** — choose keys that distribute data and traffic evenly across partitions.
- **Partition at different levels** — database, queue, network, and compute — to address bottlenecks wherever they occur.
- **Understand service-specific limits** for throughput, connections, storage, and request rates, and design partitioning strategies accordingly.
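
Hash-based partition assignment, the mechanism behind most partition-key schemes, shows why a high-cardinality key avoids hotspots. The partition count and key format below are arbitrary choices for illustration:

```python
import hashlib

def partition_for(key: str, partitions: int) -> int:
    """Stable hash-based partition assignment: same key always lands in the
    same partition, and distinct keys spread roughly evenly."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % partitions

# A high-cardinality key (user id) spreads 10,000 items across 8 partitions.
counts = [0] * 8
for user_id in range(10_000):
    counts[partition_for(f"user-{user_id}", 8)] += 1
```

Compare a low-cardinality key such as country code: most traffic would hash into a handful of partitions, hitting one partition's throughput limit while the others sit idle.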

### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Sharding | Distribute data across multiple databases or partitions |
| Priority Queue | Process high-priority work before lower-priority work |
| Queue-Based Load Leveling | Buffer writes to smooth spikes |
| Valet Key | Grant limited direct access to resources |

### Azure services

- **Azure Cosmos DB** — automatic partitioning with configurable partition keys
- **Azure SQL Elastic Pools** — manage and scale multiple databases
- **Azure Storage** — table, blob, and queue partitioning
- **Azure Service Bus** — partitioned queues and topics

---

## 6. Design for operations

Design the application so that the operations team has the tools they need to monitor, diagnose, and manage it in production.

### Recommendations

- **Instrument everything** with structured logging, distributed tracing, and metrics to make the system observable from day one.
- **Use distributed tracing** with correlation IDs that flow across service boundaries to diagnose issues in microservices architectures.
- **Automate operational tasks** — deployments, scaling, failover, and routine maintenance should require no manual steps.
- **Treat configuration as code** — store all environment configuration in version control and deploy it through the same CI/CD pipelines as application code.
- **Implement dashboards and alerts** that surface actionable information, not just raw data, so operators can respond quickly.
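
Correlation-ID propagation in structured logs amounts to threading one ID through every record of a request. A minimal sketch with illustrative field names (real systems typically carry the ID in a W3C `traceparent` header):

```python
import json
import uuid

def make_log(service, message, correlation_id=None):
    """Emit one structured log line; reuse the caller's correlation id so
    records from different services can be joined into a single trace."""
    record = {
        "service": service,
        "message": message,
        "correlation_id": correlation_id or uuid.uuid4().hex,
    }
    return json.dumps(record)

# The frontend starts a trace; the backend logs under the same id.
front = json.loads(make_log("frontend", "checkout started"))
back = json.loads(make_log("backend", "payment authorized", front["correlation_id"]))
```

A query over the shared `correlation_id` then reconstructs the whole request path, which is exactly what Application Insights does automatically via its operation IDs.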

### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Health Endpoint Monitoring | Expose operational health for monitoring tools |
| Ambassador | Offload cross-cutting concerns like logging and monitoring |
| Sidecar | Deploy monitoring agents alongside application containers |
| External Configuration Store | Centralize configuration management |

### Azure services

- **Azure Monitor** — metrics, logs, and alerts across all Azure resources
- **Application Insights** — application performance monitoring and distributed tracing
- **Azure Log Analytics** — centralized log querying with KQL
- **Azure Resource Manager (ARM) / Bicep** — infrastructure as code
- **Azure DevOps / GitHub Actions** — CI/CD pipelines

---

## 7. Use managed services

Prefer platform as a service (PaaS) over infrastructure as a service (IaaS) wherever possible to reduce operational overhead.

### Recommendations

- **Default to PaaS** for compute, databases, messaging, and storage — let Azure handle OS patching, scaling, and high availability.
- **Use IaaS only when you need fine-grained control** over the operating system, runtime, or network configuration that PaaS cannot provide.
- **Leverage built-in scaling and redundancy** features of managed services instead of building and maintaining them yourself.

### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Backends for Frontends | Use managed API gateways per client type |
| Gateway Aggregation | Aggregate calls through a managed gateway |
| Static Content Hosting | Use managed storage for static assets |

### Azure services

| IaaS | PaaS Alternative |
|------|------------------|
| VMs with IIS/Nginx | Azure App Service |
| VMs with SQL Server | Azure SQL Database |
| VMs with RabbitMQ | Azure Service Bus |
| VMs with Kubernetes | Azure Kubernetes Service (AKS) |
| VMs with custom functions | Azure Functions |
| VMs with Redis | Azure Cache for Redis |
| VMs with Elasticsearch | Azure AI Search |

---

## 8. Use an identity service

Use a centralized identity platform instead of building or managing your own authentication and authorization system.

### Recommendations

- **Use Microsoft Entra ID** (formerly Azure AD) as the single identity provider for users, applications, and service-to-service authentication.
- **Never store credentials in application code or configuration** — use managed identities, certificate-based auth, or federated credentials.
- **Implement federation protocols** (SAML, OIDC, OAuth 2.0) to integrate with external identity providers and enable single sign-on (SSO).
- **Adopt modern security features** — passwordless authentication (FIDO2, Windows Hello), conditional access policies, and multi-factor authentication (MFA).
- **Use managed identities for Azure resources** to eliminate credential management for service-to-service communication entirely.

### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Federated Identity | Delegate authentication to an external identity provider |
| Gatekeeper | Protect backends by validating identity at the edge |
| Valet Key | Grant scoped, time-limited access to resources |

### Azure services

- **Microsoft Entra ID** — cloud identity and access management
- **Azure Managed Identities** — credential-free service-to-service auth
- **Azure Key Vault** — secrets, certificates, and key management
- **Microsoft Entra External ID** — customer and partner identity (B2C/B2B)

---

## 9. Design for evolution

Design the architecture so it can evolve over time as requirements, technologies, and team understanding change.

### Recommendations

- **Enforce loose coupling and high cohesion** — services should expose well-defined interfaces and encapsulate their internal implementation details.
- **Encapsulate domain knowledge** within service boundaries so changes to business logic don't ripple across the system.
- **Use asynchronous messaging** between services to reduce temporal coupling — services don't need to be available at the same time.
- **Version APIs** from day one so clients can migrate at their own pace and you can evolve without breaking existing consumers.
- **Deploy services independently** with their own release cadence — avoid coordinated "big bang" deployments.
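
URL-path API versioning can be reduced to a small dispatch table. The paths and handler shapes below are illustrative; in practice the routing lives in a gateway such as Azure API Management:

```python
def route(path, handlers):
    """Dispatch /v{n}/... paths to version-specific handlers so old and new
    clients can coexist while each migrates at its own pace."""
    version, _, rest = path.lstrip("/").partition("/")
    handler = handlers.get(version)
    if handler is None:
        raise KeyError(f"unsupported API version: {version}")
    return handler(rest)

handlers = {
    "v1": lambda rest: {"version": 1, "resource": rest},
    "v2": lambda rest: {"version": 2, "resource": rest, "paging": True},
}
old_client = route("/v1/orders", handlers)  # unchanged contract
new_client = route("/v2/orders", handlers)  # new capability, opt-in
```

Retiring `v1` then becomes a deliberate lifecycle step (deprecation notice, sunset date) rather than a breaking deploy.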

### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Anti-Corruption Layer | Isolate new services from legacy systems |
| Strangler Fig | Incrementally migrate a monolith to microservices |
| Backends for Frontends | Evolve APIs independently per client type |
| Gateway Routing | Route requests to different service versions |

### Azure services

- **Azure API Management** — API versioning, routing, and lifecycle management
- **Azure Kubernetes Service (AKS)** — independent service deployments with rolling updates
- **Azure Service Bus** — asynchronous inter-service messaging
- **Azure Container Apps** — revision-based deployments with traffic splitting

---

## 10. Build for the needs of business

Every design decision must be justified by a business requirement. Align technical choices with business goals, constraints, and growth plans.

### Recommendations

- **Define RTO, RPO, and MTO** (Recovery Time Objective, Recovery Point Objective, Maximum Tolerable Outage) for each workload based on business impact analysis.
- **Document SLAs and SLOs** — understand the composite SLA of your architecture and set internal SLOs that provide an error budget for engineering work.
- **Model the system around the business domain** using domain-driven design to ensure the architecture reflects how the business operates.
- **Define functional and nonfunctional requirements explicitly** — capture performance targets, compliance needs, data residency constraints, and user experience expectations.
- **Plan for growth** — design capacity models that account for business projections, seasonal peaks, and market expansion.
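
The composite-SLA arithmetic behind the "Document SLAs and SLOs" recommendation: for services composed in series, every dependency must be up, so the availabilities multiply and each dependency lowers the whole. The SLA figures below are illustrative inputs, not quotes of current Azure SLAs:

```python
def composite_sla_serial(*slas):
    """Composite SLA of services in series: all must be up, so multiply."""
    result = 1.0
    for s in slas:
        result *= s
    return result

# A web tier, database, and cache chained in series (illustrative figures):
composite = composite_sla_serial(0.9995, 0.9999, 0.999)  # below the weakest link
```

This is why a chain of "three and a half nines" services cannot honestly promise three and a half nines end to end, and why internal SLOs should sit below the composite to leave an error budget.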

### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Priority Queue | Process business-critical work first |
| Throttling | Protect SLOs under heavy load |
| Deployment Stamps | Scale to new markets and regions |
| Bulkhead | Isolate critical workloads from non-critical ones |

### Azure services

- **Azure Advisor** — cost, performance, reliability, and security recommendations
- **Azure Cost Management** — budget tracking and cost optimization
- **Azure Service Health** — SLA tracking and incident awareness
- **Azure Well-Architected Framework Review** — assess architecture against best practices
- **Azure Monitor SLO/SLI dashboards** — measure and track service level objectives

---

> Source: [Azure Architecture Center](https://learn.microsoft.com/en-us/azure/architecture/)

@@ -0,0 +1,285 @@

# Mission-Critical Architecture on Azure

Guidance for designing mission-critical workloads on Azure that prioritize cloud-native capabilities to maximize reliability and operational effectiveness.

**Target SLO:** **99.99%** or higher — permitted annual downtime: **52 minutes 35 seconds**.

Every design decision in this guidance is made in service of that target SLO.

| SLO Target | Permitted Annual Downtime | Typical Use Case |
|---|---|---|
| 99.9% | 8 hours 45 minutes | Standard business apps |
| 99.95% | 4 hours 22 minutes | Important business apps |
| 99.99% | 52 minutes 35 seconds | Mission-critical workloads |
| 99.999% | 5 minutes 15 seconds | Safety-critical systems |
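
The downtime column follows directly from the SLO. A quick check (using a 365.25-day year, which is what the table's 52-minutes-35-seconds figure assumes; a flat 365-day year gives a slightly smaller number):

```python
def permitted_downtime_minutes(slo, days_per_year=365.25):
    """Annual downtime budget implied by an SLO: the unavailable fraction
    of the year, expressed in minutes."""
    return (1 - slo) * days_per_year * 24 * 60

four_nines = permitted_downtime_minutes(0.9999)   # ~52.6 minutes per year
three_nines = permitted_downtime_minutes(0.999)   # ~526 minutes (~8h 46m)
```

Each extra nine shrinks the budget tenfold, which is why 99.99% forces automated failover: 52 minutes a year leaves no room for a human-paced incident response.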

---

## Key Design Strategies

### 1. Redundancy in Layers

Deploy redundancy at every layer of the architecture to eliminate single points of failure.

- Deploy to multiple regions in an **active-active** model — the application is distributed across 2+ Azure regions, all handling active user traffic simultaneously
- Use **availability zones** for all supporting services, distributing components across physically separate datacenters within a region
- Choose resources that support **global distribution** natively
- Apply zone-redundant configurations for all stateful services
- Ensure data replication meets RPO requirements across regions

**Azure services:** Azure Front Door (global routing), Azure Traffic Manager (DNS failover), Azure Cosmos DB (multi-region writes), Azure SQL (geo-replication)

### 2. Deployment Stamps

Deploy regional stamps as scale units — a logical set of resources that can be independently provisioned to keep up with changes in demand.

- Each stamp is a **self-contained scale unit** with its own compute, caching, and local state
- A stamp can contain multiple nested scale units (e.g., frontend APIs and background processors that scale independently)
- **No dependencies between scale units** — they communicate only with shared services outside the stamp
- Scale units are **temporary/ephemeral** — store persistent system-of-record data only in the replicated database
- Use stamps for blue/green deployments by rolling out new units, validating them, and gradually shifting traffic

**Key benefit:** Compartmentalization enables independent scaling and fault isolation per region.

### 3. Reliable and Repeatable Deployments

Apply the principle of Infrastructure as Code (IaC) for version control and standardized operations.

- Use **Terraform** or **Bicep** for infrastructure definitions kept under version control
- Implement **zero-downtime blue/green deployment** pipelines — build and release pipelines fully automated
- Apply **environment consistency** — use the same deployment pipeline code across production and pre-production environments
- Integrate **continuous validation** — automated testing as part of DevOps processes
- Include synchronized **load and chaos testing** to validate both application code and underlying infrastructure
- Deploy stamps as a **single operational unit** — never partially deploy a stamp

### 4. Operational Insights

Build comprehensive observability without introducing single points of failure.

- Use **federated workspaces** for observability data — monitoring data for global and regional resources is stored independently
- A centralized observability store is **not recommended** — it becomes a single point of failure
- Use **cross-workspace querying** to achieve a unified data sink and a single pane of glass for operations
- Construct a **layered health model** that maps application health to a traffic light model for operational context
- Calculate health scores for each **individual component**, then **aggregate them at the user-flow level**
- Combine component scores with key non-functional requirements (such as performance) as coefficients to quantify overall application health

---

## Design Areas

Each design area must be addressed for a mission-critical architecture.

| Design Area | Description | Key Concerns |
|---|---|---|
| **Application platform** | Infrastructure choices and mitigations for potential failure cases | AKS vs App Service, availability zones, containerization |
| **Application design** | Design patterns that allow for scaling and error handling | Stateless services, async messaging, queue-based decoupling |
| **Networking and connectivity** | Network considerations for routing incoming traffic to stamps | Global load balancing, WAF, DDoS protection, private endpoints |
| **Data platform** | Choices in data store technologies | Volume, velocity, variety, veracity; active-active vs active-passive |
| **Deployment and testing** | Strategies for CI/CD pipelines and automation | Blue/green deployments, load testing, chaos testing |
| **Health modeling** | Observability through customer impact analysis | Correlated monitoring, traffic light model, health scores |
| **Security** | Mitigation of attack vectors | Microsoft Zero Trust model, identity-based access, encryption |
| **Operational procedures** | Processes related to runtime operations | Deployment SOPs, key management, patching, incident response |

---

## Active-Active Multi-Region Architecture

The core topology for mission-critical workloads distributes the application across multiple Azure regions.

### Architecture Characteristics

- Application distributed across **2+ Azure regions** handling active user traffic simultaneously
- Each region contains independent **deployment stamps** (scale units)
- **Azure Front Door** provides global routing, SSL termination, and WAF at the edge
- Scale units have **no cross-dependencies** — they communicate only with shared services (e.g., global database, DNS)
- Persistent data resides only in the **replicated database** — stamps store no durable local state
- When scale units are replaced or retired, applications reconnect transparently

### Data Replication Strategies

| Strategy | Writes | Reads | Consistency | Best For |
|---|---|---|---|---|
| Active-passive (Azure SQL) | Single primary region | All regions via read replicas | Strong | Relational data, ACID transactions |
| Active-active (Cosmos DB) | All regions | All regions | Tunable (5 levels) | Document/key-value data, global apps |
| Write-behind (Redis → SQL) | Redis first, async to SQL | Redis or SQL | Eventual | High-throughput writes, rate limiting |

### Regional Stamp Composition

Each stamp typically includes:

- **Compute tier** — App Service or AKS with multiple instances across availability zones
- **Caching tier** — Azure Managed Redis for session state, rate limiting, feature flags
- **Configuration** — Azure App Configuration for settings (capacity correlates with requests/second)
- **Secrets** — Azure Key Vault for certificates and secrets
- **Networking** — Virtual network with private endpoints, NSGs, and service endpoints

---

## Health Modeling and Traffic Light Approach

Health modeling provides the foundation for automated operational decisions.

### Building the Health Model

1. **Identify user flows** — map critical paths through the application (e.g., "user login", "checkout", "search")
2. **Decompose into components** — each flow depends on specific compute, data, and network components
3. **Assign health scores** — each component reports a health score based on metrics (latency, error rate, saturation)
4. **Aggregate per flow** — combine component scores weighted by criticality to produce a flow-level health score
5. **Apply traffic light** — map aggregate scores to **Green** (healthy), **Yellow** (degraded), **Red** (unhealthy)
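
Steps 3–5 above can be sketched as a weighted aggregation. The weights and traffic-light thresholds below are illustrative assumptions, not prescribed values:

```python
def flow_health(component_scores, weights):
    """Criticality-weighted average of component health scores (each in [0, 1])
    for a single user flow."""
    total = sum(weights.values())
    return sum(component_scores[c] * w for c, w in weights.items()) / total

def traffic_light(score):
    """Map an aggregate score to an operational state (thresholds assumed)."""
    if score >= 0.9:
        return "green"
    if score >= 0.7:
        return "yellow"
    return "red"

# A degraded database drags the checkout flow to yellow because it carries
# a high criticality weight, even though the other components are healthy.
checkout = flow_health(
    {"api": 1.0, "database": 0.6, "cache": 1.0},
    {"api": 0.5, "database": 0.4, "cache": 0.1},
)
state = traffic_light(checkout)
```

The same aggregation repeats one level up: flow scores roll into a workload-level score that drives the automated actions in the table below.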
|
|
129
|
+
|
|
130
|
+
### Health Score Coefficients

| Factor | Metric Examples | Weight Guidance |
|---|---|---|
| Availability | Error rate, HTTP 5xx ratio | High — directly impacts users |
| Performance | P95 latency, request duration | Medium — affects user experience |
| Saturation | CPU %, memory %, queue depth | Medium — indicates future problems |
| Freshness | Data replication lag, cache age | Lower — depends on consistency needs |

### Operational Actions by Health State

| State | Meaning | Automated Action |
|---|---|---|
| 🟢 Green | All components healthy | Normal operations |
| 🟡 Yellow | Degraded but functional | Alert on-call, increase monitoring frequency |
| 🔴 Red | Critical failure detected | Trigger failover, page on-call, block deployments |

---

## Zero-Downtime Deployment (Blue/Green)

Deployment must never cause downtime in a mission-critical system.

### Blue/Green Process

1. **Provision new stamp** — deploy a complete new scale unit ("green") alongside the existing one ("blue")
2. **Run validation** — execute automated smoke tests, integration tests, and synthetic transactions against the green stamp
3. **Canary traffic** — route a small percentage of production traffic (e.g., 5%) to the green stamp
4. **Monitor health** — compare health scores between blue and green stamps over a defined observation period
5. **Gradual shift** — increase traffic to green stamp in increments (5% → 25% → 50% → 100%)
6. **Decommission blue** — once green is fully validated, tear down the blue stamp

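The canary progression in steps 3–5 reduces to a loop with a health gate at each increment. In this sketch the callables `set_green_weight`, `green_is_healthy`, and `observe` are hypothetical hooks into the traffic manager and the health model; they are assumptions, not a specific API.

```python
# Gated traffic shift from the blue stamp to the green stamp.
# The schedule mirrors the 5% -> 25% -> 50% -> 100% progression above.
INCREMENTS = [5, 25, 50, 100]

def shift_traffic(set_green_weight, green_is_healthy, observe) -> bool:
    """Shift traffic in increments; roll back to blue on any unhealthy signal."""
    for pct in INCREMENTS:
        set_green_weight(pct)       # e.g., update weighted routing at the global LB
        observe()                   # wait out the defined observation period
        if not green_is_healthy():  # compare green's health score against blue's
            set_green_weight(0)     # rollback: route everything back to blue
            return False
    return True                     # green serves 100%; blue can be decommissioned
```

Because blue keeps running until the final increment succeeds, rollback is instantaneous: it is only a routing change, never a redeployment.
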
### Key Requirements

- Build and release pipelines must be **fully automated** — no manual deployment steps
- Use the **same pipeline code** for all environments (dev, staging, production)
- Each stamp is deployed as a **single operational unit** — never partially
- Rollback is achieved by **shifting traffic back** to the previous stamp (still running during validation)
- **Continuous validation** runs throughout the deployment, not just at the end

---

## Chaos Engineering and Continuous Validation

Proactive failure testing ensures recovery mechanisms work before real incidents occur.

### Chaos Engineering Practices

- Use **Azure Chaos Studio** to run controlled experiments against production or pre-production environments
- Test failure modes: availability zone outage, network partition, dependency failure, CPU/memory pressure
- Run chaos experiments as part of the **CI/CD pipeline** — every deployment is validated under fault conditions
- **Synchronized load and chaos testing** — inject faults while the system is under realistic load

### Validation Checklist

- [ ] Health model detects injected faults within SLO-defined time windows
- [ ] Automated failover completes within target RTO
- [ ] No data loss exceeding target RPO during regional failover
- [ ] Application degrades gracefully (reduced functionality, not total failure)
- [ ] Alerts fire correctly and reach the on-call team
- [ ] Runbooks and automated remediation execute successfully

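The RTO item in the checklist can be automated as a timing probe wrapped around a fault injection. This is a sketch under stated assumptions: `trigger_fault` and `probe_healthy` are hypothetical callables, where in practice the fault would come from a Chaos Studio experiment and the probe from a synthetic transaction.

```python
import time

def failover_within_rto(trigger_fault, probe_healthy,
                        rto_seconds: float, poll_seconds: float = 1.0) -> bool:
    """Inject a fault, then confirm the system is healthy before the RTO expires."""
    trigger_fault()
    deadline = time.monotonic() + rto_seconds
    while time.monotonic() < deadline:
        if probe_healthy():         # e.g., synthetic transaction against the flow
            return True
        time.sleep(poll_seconds)
    return False                    # RTO breached: fail the pipeline run
```

Wiring this into the CI/CD pipeline makes the RTO a hard gate rather than a documented aspiration.
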
---

## Application Platform Considerations

### Platform Options

| Platform | Best For | Availability Zone Support | Complexity |
|---|---|---|---|
| **Azure App Service** | Web apps, APIs, PaaS-first approach | Yes (zone-redundant) | Low-Medium |
| **AKS** | Complex microservices, full K8s control | Yes (zone-redundant node pools) | High |
| **Container Apps** | Serverless containers, event-driven | Yes | Medium |

### Recommendations

- **Prioritize availability zones** for all production workloads — spread across physically separate datacenters
- **Containerize workloads** for reliability and portability between platforms
- Ensure all services in a scale unit support availability zones — don't mix zonal and non-zonal services
- For latency-sensitive or chatty workloads, weigh the cost and latency tradeoffs of cross-zone traffic

---

## Data Platform Considerations

### Choosing a Primary Database

| Scenario | Recommended Service | Deployment Model |
|---|---|---|
| Relational data, ACID transactions | **Azure SQL** | Active-passive with geo-replication |
| Global distribution, multi-model | **Azure Cosmos DB** | Active-active with multi-region writes |
| Multiple microservice databases | **Mixed (polyglot)** | Per-service database with appropriate model |

### Azure SQL in Mission-Critical

- Azure SQL does **not** natively support active-active concurrent writes in multiple regions
- Use an **active-passive** strategy: a single primary region for writes, read replicas in secondary regions
- **Partial active-active** is possible at the application tier — route reads to local replicas, writes to the primary
- Configure **auto-failover groups** for automated regional failover

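The partial active-active pattern above can be sketched at the application tier as a simple statement router. The endpoint names are placeholders; with an auto-failover group, the primary would typically be the group's read-write listener and the replica its read-only listener. The `SELECT`-prefix check is a deliberate simplification (it would misroute, e.g., CTEs starting with `WITH`).

```python
from dataclasses import dataclass

@dataclass
class SqlRouter:
    """Route statements for an active-passive Azure SQL topology."""
    primary: str        # read-write listener of the auto-failover group
    local_replica: str  # nearest readable secondary in this region

    def endpoint_for(self, statement: str) -> str:
        """Reads go to the local replica; writes always go to the primary."""
        is_read = statement.lstrip().lower().startswith("select")
        return self.local_replica if is_read else self.primary

# Placeholder listener names for a hypothetical failover group "fog".
router = SqlRouter(primary="fog.database.windows.net",
                   local_replica="fog.secondary.database.windows.net")
print(router.endpoint_for("SELECT * FROM orders"))    # local replica
print(router.endpoint_for("UPDATE orders SET ..."))   # primary
```

This keeps read latency local in every region while preserving the single-writer guarantee that Azure SQL requires.
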
### Azure Managed Redis in Mission-Critical

- Use within or alongside each scale unit for:
  - **Cache data** — rebuildable, repopulated on demand
  - **Session state** — user sessions during the scale unit's lifetime
  - **Rate limit counters** — per-user and per-tenant throttling
  - **Feature flags** — dynamic configuration without redeployment
  - **Coordination metadata** — distributed locks, leader election
- **Active geo-replication** enables Redis data to replicate asynchronously across regions
- Design cached data as either **rebuildable** (repopulate without availability impact) or **durable auxiliary state** (protected by persistence and geo-replication)

---

## Security in Mission-Critical

### Zero Trust Principles

- **Verify explicitly** — authenticate and authorize based on all available data points (identity, location, device, service)
- **Use least-privilege access** — limit user access with Just-In-Time and Just-Enough-Access (JIT/JEA)
- **Assume breach** — minimize blast radius and segment access, verify end-to-end encryption, use analytics for threat detection

### Security Controls

| Layer | Control | Azure Service |
|---|---|---|
| Edge | DDoS protection, WAF | Azure Front Door, Azure DDoS Protection |
| Identity | Managed identities, RBAC | Microsoft Entra ID, Azure RBAC |
| Network | Private endpoints, NSGs | Azure Private Link, Virtual Network |
| Data | Encryption at rest and in transit | Azure Key Vault, TDE, TLS 1.2+ |
| Operations | Privileged access management | Microsoft Entra PIM, Azure Bastion |

---

## Operational Procedures

### Key Operational Processes

| Process | Description | Automation Level |
|---|---|---|
| **Deployment** | Blue/green with automated validation | Fully automated |
| **Scaling** | Stamp provisioning and decommissioning | Automated with manual approval gates |
| **Key rotation** | Certificate and secret rotation | Automated via Key Vault policies |
| **Patching** | OS and runtime updates | Automated via platform (PaaS) or pipeline (IaaS) |
| **Incident response** | Detection, triage, mitigation, resolution | Semi-automated (alert → runbook → human) |
| **Capacity planning** | Forecast demand, pre-provision stamps | Manual with data-driven analysis |

### Runbook Requirements

- All operational runbooks must be **tested in pre-production** with the same chaos/load scenarios as production
- **Automated remediation** is preferred over manual intervention for known failure modes
- Runbooks must include **rollback procedures** for every change type
- **Post-incident reviews** (blameless) must feed back into health model and chaos experiment improvements

---

> Source: [Azure Architecture Center](https://learn.microsoft.com/en-us/azure/architecture/)