@teckedd-code2save/b2dp 1.0.1 → 1.1.0

# Azure Design Principles

Ten principles for building reliable, scalable, and manageable applications on Azure.

| # | Principle | Focus |
|---|-----------|-------|
| 1 | [Design for self-healing](#1-design-for-self-healing) | Resilience & automatic recovery |
| 2 | [Make all things redundant](#2-make-all-things-redundant) | Eliminate single points of failure |
| 3 | [Minimize coordination](#3-minimize-coordination) | Scalability through decoupling |
| 4 | [Design to scale out](#4-design-to-scale-out) | Horizontal scaling |
| 5 | [Partition around limits](#5-partition-around-limits) | Overcome service boundaries |
| 6 | [Design for operations](#6-design-for-operations) | Observability & automation |
| 7 | [Use managed services](#7-use-managed-services) | Reduce operational burden |
| 8 | [Use an identity service](#8-use-an-identity-service) | Centralized identity & access |
| 9 | [Design for evolution](#9-design-for-evolution) | Change-friendly architecture |
| 10 | [Build for the needs of business](#10-build-for-the-needs-of-business) | Align tech to business goals |

---

## 1. Design for self-healing

Design the application to detect failures, respond gracefully, and recover automatically without manual intervention.

### Recommendations

- **Implement retry logic with backoff** for transient failures in network calls, database connections, and external service interactions.
- **Use health endpoint monitoring** to expose liveness and readiness probes so orchestrators and load balancers can route traffic away from unhealthy instances.
- **Apply circuit breaker patterns** to prevent cascading failures — stop calling a failing dependency and allow it time to recover.
- **Degrade gracefully** by serving reduced functionality (cached data, default responses) rather than failing entirely when a dependency is unavailable.
- **Adopt chaos engineering** with Azure Chaos Studio to proactively inject faults and validate recovery paths before real incidents occur.

### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Retry | Handle transient faults by transparently retrying failed operations |
| Circuit Breaker | Prevent repeated calls to a failing service |
| Bulkhead | Isolate failures so one component doesn't take down others |
| Health Endpoint Monitoring | Expose health checks for load balancers and orchestrators |
| Leader Election | Coordinate distributed instances by electing a leader |
| Throttling | Control resource consumption by limiting request rates |

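The Retry and Circuit Breaker patterns in the table compose naturally: retries absorb brief transient faults, while the breaker stops hammering a dependency that stays down. A minimal, illustrative breaker follows; the thresholds and in-memory state are simplifications of production implementations:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `failure_threshold` consecutive
    failures, rejects calls while open, and half-opens after `reset_timeout`."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def is_open(self):
        return (self.opened_at is not None
                and time.monotonic() - self.opened_at < self.reset_timeout)

    def call(self, operation):
        if self.opened_at is not None:
            if self.is_open():
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the circuit again
        return result

# Demo: two consecutive failures trip the breaker.
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)

def _always_down():
    raise ConnectionError("dependency unavailable")

for _ in range(2):
    try:
        breaker.call(_always_down)
    except ConnectionError:
        pass
```

While the breaker is open, callers fail fast instead of waiting on timeouts, which protects upstream thread pools and keeps failures from cascading.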
### Azure services

- **Azure Chaos Studio** — fault injection and chaos experiments
- **Azure Monitor / Application Insights** — health monitoring, alerting, diagnostics
- **Azure Traffic Manager / Front Door** — DNS and global failover
- **Availability Zones** — zonal redundancy within a region

---

## 2. Make all things redundant

Build redundancy into the application at every layer to avoid single points of failure. With N independent instances, each available a fraction A of the time, the composite availability is `1 - (1 - A)^N`.

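Plugging numbers into the formula shows why redundancy pays off; a quick sketch with illustrative values:

```python
def composite_availability(a: float, n: int) -> float:
    """Availability of n independent redundant instances, each available a."""
    return 1 - (1 - a) ** n

# Two 99% instances yield roughly four nines; three yield roughly six.
two = composite_availability(0.99, 2)    # ≈ 0.9999
three = composite_availability(0.99, 3)  # ≈ 0.999999
```

The formula assumes instance failures are independent, which is exactly why the recommendations below spread instances across zones and regions rather than stacking them on shared infrastructure.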
### Recommendations

- **Place VMs behind a load balancer** and deploy multiple instances to ensure requests can be served even if one instance fails.
- **Replicate databases** using read replicas, active geo-replication, or multi-region write to protect data and maintain read performance during outages.
- **Use multi-zone and multi-region deployments** to survive datacenter and regional failures — define clear RTO (Recovery Time Objective) and RPO (Recovery Point Objective) targets.
- **Partition workloads for availability** so that a failure in one partition doesn't affect others.
- **Design for automatic failover** with health probes and traffic routing that redirects users without manual intervention.

### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Deployment Stamps | Deploy independent, identical copies of infrastructure |
| Geode | Deploy backend services across geographies |
| Health Endpoint Monitoring | Detect unhealthy instances for failover |
| Queue-Based Load Leveling | Buffer requests to smooth demand spikes |

### Azure services

- **Azure Load Balancer / Application Gateway** — distribute traffic across instances
- **Azure SQL geo-replication / Cosmos DB multi-region** — database redundancy
- **Availability Zones** — zonal redundancy within a region
- **Azure Site Recovery** — disaster recovery orchestration
- **Azure Front Door** — global load balancing with automatic failover

---

## 3. Minimize coordination

Minimize coordination between application services to achieve scalability. Tightly coupled services that require synchronous calls create bottlenecks and reduce availability.

### Recommendations

- **Embrace eventual consistency** instead of requiring strong consistency across services — accept that data may be temporarily out of sync.
- **Use domain events and asynchronous messaging** to decouple producers and consumers so they can operate independently.
- **Consider CQRS** (Command Query Responsibility Segregation) to separate read and write workloads with independently optimized stores.
- **Design idempotent operations** so messages can be safely retried or delivered more than once without unintended side effects.
- **Use optimistic concurrency** with version tokens or ETags instead of pessimistic locks that create coordination bottlenecks.

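Idempotency can be as simple as recording which message IDs have already been applied. A sketch, where the in-memory set and the `balance` field are illustrative stand-ins for a durable store and real business state:

```python
class IdempotentConsumer:
    """Apply each message at most once by tracking processed message IDs.
    In production the ID set would live in a durable store, not memory."""

    def __init__(self):
        self.processed_ids = set()
        self.balance = 0  # example state mutated by messages

    def handle(self, message: dict) -> bool:
        msg_id = message["id"]
        if msg_id in self.processed_ids:
            return False  # duplicate delivery: safely ignored
        self.balance += message["amount"]
        self.processed_ids.add(msg_id)
        return True

# Demo: the same message delivered twice is applied only once.
consumer = IdempotentConsumer()
consumer.handle({"id": "msg-1", "amount": 50})
consumer.handle({"id": "msg-1", "amount": 50})  # at-least-once redelivery
```

Because redeliveries are harmless, the broker can use at-least-once delivery and retries freely, which removes the need for coordination between producer and consumer.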
### Related design patterns

| Pattern | Purpose |
|---------|---------|
| CQRS | Separate reads from writes for independent scaling |
| Event Sourcing | Capture all changes as an immutable sequence of events |
| Saga | Manage distributed transactions without two-phase commit |
| Asynchronous Request-Reply | Decouple request and response across services |
| Competing Consumers | Process messages in parallel across multiple consumers |

### Azure services

- **Azure Service Bus** — reliable enterprise messaging with queues and topics
- **Azure Event Grid** — event-driven routing at scale
- **Azure Event Hubs** — high-throughput event streaming
- **Azure Cosmos DB** — tunable consistency levels (eventual to strong)

---

## 4. Design to scale out

Design the application so it can scale horizontally by adding or removing instances, rather than scaling up to larger hardware.

### Recommendations

- **Avoid instance stickiness and session affinity** — store session state externally (Redis, database) so any instance can handle any request.
- **Identify and resolve bottlenecks** that prevent horizontal scaling, such as shared databases, monolithic components, or stateful in-memory caches.
- **Decompose workloads** into discrete services that can be scaled independently based on their specific demand profiles.
- **Use autoscaling based on live metrics** (CPU, queue depth, request latency) rather than fixed schedules to match capacity to real demand.
- **Design for scale-in** — handle instance removal gracefully with connection draining and proper shutdown hooks.

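Metric-driven autoscaling ultimately reduces to a control decision. A sketch of a queue-depth rule; the thresholds are illustrative, and real autoscalers (VM Scale Set rules, KEDA) express the same idea declaratively:

```python
def desired_instances(queue_depth: int, per_instance_capacity: int,
                      min_instances: int = 2, max_instances: int = 20) -> int:
    """Target instance count so each instance handles at most
    per_instance_capacity queued messages, clamped to a safe range."""
    needed = -(-queue_depth // per_instance_capacity)  # ceiling division
    return max(min_instances, min(max_instances, needed))

# Demo: 950 queued messages at 100 messages per instance -> 10 instances.
target = desired_instances(950, per_instance_capacity=100)
```

The floor of two instances preserves redundancy during quiet periods, and the cap bounds cost and protects downstream dependencies during spikes.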
### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Competing Consumers | Distribute work across multiple consumers |
| Sharding | Distribute data across partitions for parallel processing |
| Deployment Stamps | Scale by deploying additional independent stamps |
| Static Content Hosting | Offload static assets to reduce compute load |
| Throttling | Protect the system from overload during scale events |

### Azure services

- **Azure Virtual Machine Scale Sets** — autoscale VM pools
- **Azure App Service / Azure Functions** — built-in autoscale
- **Azure Kubernetes Service (AKS)** — horizontal pod autoscaler and cluster autoscaler
- **Azure Cache for Redis** — externalize session state
- **Azure CDN / Front Door** — offload static content delivery

---

## 5. Partition around limits

Use partitioning to work around database, network, and compute limits. Every Azure service has limits — partitioning allows you to scale beyond them.

### Recommendations

- **Partition databases** horizontally (sharding), vertically (splitting columns), or functionally (by bounded context) to distribute load and storage.
- **Design partition keys to avoid hotspots** — choose keys that distribute data and traffic evenly across partitions.
- **Partition at different levels** — database, queue, network, and compute — to address bottlenecks wherever they occur.
- **Understand service-specific limits** for throughput, connections, storage, and request rates and design partitioning strategies accordingly.

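A quick way to sanity-check a candidate partition key is to hash sample values into buckets and inspect the spread. This illustrative sketch uses a stable SHA-256 hash (the synthetic `user-N` keys are placeholders for real key samples):

```python
import hashlib

def partition_for(key: str, partition_count: int) -> int:
    """Map a partition key to a partition using a stable hash
    (consistent across processes, unlike Python's built-in hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count

# Demo: distribute 10,000 synthetic user IDs across 16 partitions.
counts = [0] * 16
for i in range(10_000):
    counts[partition_for(f"user-{i}", 16)] += 1
```

With a well-chosen key every bucket lands near the 625-item average; a skewed key (for example, a timestamp date or a single hot tenant ID) would show one bucket dominating, which is exactly the hotspot the recommendation warns about.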
### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Sharding | Distribute data across multiple databases or partitions |
| Priority Queue | Process high-priority work before lower-priority work |
| Queue-Based Load Leveling | Buffer writes to smooth spikes |
| Valet Key | Grant limited direct access to resources |

### Azure services

- **Azure Cosmos DB** — automatic partitioning with configurable partition keys
- **Azure SQL Elastic Pools** — manage and scale multiple databases
- **Azure Storage** — table, blob, and queue partitioning
- **Azure Service Bus** — partitioned queues and topics

---

## 6. Design for operations

Design the application so that the operations team has the tools they need to monitor, diagnose, and manage it in production.

### Recommendations

- **Instrument everything** with structured logging, distributed tracing, and metrics to make the system observable from day one.
- **Use distributed tracing** with correlation IDs that flow across service boundaries to diagnose issues in microservices architectures.
- **Automate operational tasks** — deployments, scaling, failover, and routine maintenance should require no manual steps.
- **Treat configuration as code** — store all environment configuration in version control and deploy it through the same CI/CD pipelines as application code.
- **Implement dashboards and alerts** that surface actionable information, not just raw data, so operators can respond quickly.

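Structured, correlated logs need no framework to get started. A minimal sketch; the field names (`correlation_id`, `event`) are a common convention rather than a required schema, and real services would propagate the ID via a request header:

```python
import json
import uuid
from datetime import datetime, timezone

def log_event(event: str, correlation_id: str, **fields) -> str:
    """Emit one structured log line as JSON so log pipelines can
    index records and join them by correlation_id across services."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "correlation_id": correlation_id,
        **fields,
    }
    line = json.dumps(record)
    print(line)
    return line

# Demo: one correlation ID flows through two components of a request.
cid = str(uuid.uuid4())
line1 = log_event("order.received", cid, order_id="ord-42")
line2 = log_event("payment.charged", cid, amount=19.99)
```

Because every line is self-describing JSON sharing the same `correlation_id`, a query tool such as Log Analytics can reassemble the full request path without any parsing heuristics.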
### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Health Endpoint Monitoring | Expose operational health for monitoring tools |
| Ambassador | Offload cross-cutting concerns like logging and monitoring |
| Sidecar | Deploy monitoring agents alongside application containers |
| External Configuration Store | Centralize configuration management |

### Azure services

- **Azure Monitor** — metrics, logs, and alerts across all Azure resources
- **Application Insights** — application performance monitoring and distributed tracing
- **Azure Log Analytics** — centralized log querying with KQL
- **Azure Resource Manager (ARM) / Bicep** — infrastructure as code
- **Azure DevOps / GitHub Actions** — CI/CD pipelines

---

## 7. Use managed services

Prefer platform as a service (PaaS) over infrastructure as a service (IaaS) wherever possible to reduce operational overhead.

### Recommendations

- **Default to PaaS** for compute, databases, messaging, and storage — let Azure handle OS patching, scaling, and high availability.
- **Use IaaS only when you need fine-grained control** over the operating system, runtime, or network configuration that PaaS cannot provide.
- **Leverage built-in scaling and redundancy** features of managed services instead of building and maintaining them yourself.

### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Backends for Frontends | Use managed API gateways per client type |
| Gateway Aggregation | Aggregate calls through a managed gateway |
| Static Content Hosting | Use managed storage for static assets |

### Azure services

| IaaS | PaaS Alternative |
|------|-------------------|
| VMs with IIS/Nginx | Azure App Service |
| VMs with SQL Server | Azure SQL Database |
| VMs with RabbitMQ | Azure Service Bus |
| VMs with Kubernetes | Azure Kubernetes Service (AKS) |
| VMs with custom functions | Azure Functions |
| VMs with Redis | Azure Cache for Redis |
| VMs with Elasticsearch | Azure AI Search |

---

## 8. Use an identity service

Use a centralized identity platform instead of building or managing your own authentication and authorization system.

### Recommendations

- **Use Microsoft Entra ID** (formerly Azure AD) as the single identity provider for users, applications, and service-to-service authentication.
- **Never store credentials in application code or configuration** — use managed identities, certificate-based auth, or federated credentials.
- **Implement federation protocols** (SAML, OIDC, OAuth 2.0) to integrate with external identity providers and enable single sign-on.
- **Adopt modern security features** — passwordless authentication (FIDO2, Windows Hello), conditional access policies, multi-factor authentication (MFA), and single sign-on (SSO).
- **Use managed identities for Azure resources** to eliminate credential management for service-to-service communication entirely.

### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Federated Identity | Delegate authentication to an external identity provider |
| Gatekeeper | Protect backends by validating identity at the edge |
| Valet Key | Grant scoped, time-limited access to resources |

### Azure services

- **Microsoft Entra ID** — cloud identity and access management
- **Azure Managed Identities** — credential-free service-to-service auth
- **Azure Key Vault** — secrets, certificates, and key management
- **Microsoft Entra External ID** — customer and partner identity (B2C/B2B)

---

## 9. Design for evolution

Design the architecture so it can evolve over time as requirements, technologies, and team understanding change.

### Recommendations

- **Enforce loose coupling and high cohesion** — services should expose well-defined interfaces and encapsulate their internal implementation details.
- **Encapsulate domain knowledge** within service boundaries so changes to business logic don't ripple across the system.
- **Use asynchronous messaging** between services to reduce temporal coupling — services don't need to be available at the same time.
- **Version APIs** from day one so clients can migrate at their own pace and you can evolve without breaking existing consumers.
- **Deploy services independently** with their own release cadence — avoid coordinated "big bang" deployments.

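API versioning can be sketched as a dispatch table keyed by the requested version; the handlers and response shapes below are hypothetical, showing how a v2 contract can change without breaking v1 clients:

```python
def get_user_v1(user_id):
    # Original contract: a single combined name field.
    return {"id": user_id, "name": "Ada Lovelace"}

def get_user_v2(user_id):
    # v2 splits the name field; v1 clients are unaffected.
    return {"id": user_id, "first_name": "Ada", "last_name": "Lovelace"}

HANDLERS = {"1": get_user_v1, "2": get_user_v2}
DEFAULT_VERSION = "1"  # unversioned clients keep their original contract

def route(user_id, api_version=None):
    handler = HANDLERS.get(api_version or DEFAULT_VERSION)
    if handler is None:
        raise ValueError(f"unsupported api-version: {api_version}")
    return handler(user_id)
```

In practice a gateway such as Azure API Management performs this routing from a header, query parameter, or URL segment, but the principle is the same: old versions stay callable until their consumers migrate.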
### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Anti-Corruption Layer | Isolate new services from legacy systems |
| Strangler Fig | Incrementally migrate a monolith to microservices |
| Backends for Frontends | Evolve APIs independently per client type |
| Gateway Routing | Route requests to different service versions |

### Azure services

- **Azure API Management** — API versioning, routing, and lifecycle management
- **Azure Kubernetes Service (AKS)** — independent service deployments with rolling updates
- **Azure Service Bus** — asynchronous inter-service messaging
- **Azure Container Apps** — revision-based deployments with traffic splitting

---

## 10. Build for the needs of business

Every design decision must be justified by a business requirement. Align technical choices with business goals, constraints, and growth plans.

### Recommendations

- **Define RTO, RPO, and MTO** (Recovery Time Objective, Recovery Point Objective, Maximum Tolerable Outage) for each workload based on business impact analysis.
- **Document SLAs and SLOs** — understand the composite SLA of your architecture and set internal SLOs that provide an error budget for engineering work.
- **Model the system around the business domain** using domain-driven design to ensure the architecture reflects how the business operates.
- **Define functional and nonfunctional requirements explicitly** — capture performance targets, compliance needs, data residency constraints, and user experience expectations.
- **Plan for growth** — design capacity models that account for business projections, seasonal peaks, and market expansion.

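The composite SLA mentioned above follows directly from multiplication: for services that must all be up (a serial dependency chain), the composite SLA is the product of the individual SLAs, and is therefore lower than any single figure in the chain. A sketch with illustrative numbers:

```python
from math import prod

def composite_sla(slas):
    """Composite SLA of services that must ALL be available (serial chain)."""
    return prod(slas)

# Illustrative chain: gateway 99.95%, app tier 99.95%, database 99.99%.
chain = composite_sla([0.9995, 0.9995, 0.9999])  # ≈ 0.99890, i.e. ~99.89%
```

This is why an internal SLO should sit below the weakest advertised SLA in the chain: the gap is the error budget the recommendation refers to.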
### Related design patterns

| Pattern | Purpose |
|---------|---------|
| Priority Queue | Process business-critical work first |
| Throttling | Protect SLOs under heavy load |
| Deployment Stamps | Scale to new markets and regions |
| Bulkhead | Isolate critical workloads from non-critical ones |

### Azure services

- **Azure Advisor** — cost, performance, reliability, and security recommendations
- **Azure Cost Management** — budget tracking and cost optimization
- **Azure Service Health** — SLA tracking and incident awareness
- **Azure Well-Architected Framework Review** — assess architecture against best practices
- **Azure Monitor SLO/SLI dashboards** — measure and track service level objectives

---

> Source: [Azure Architecture Center](https://learn.microsoft.com/en-us/azure/architecture/)

---

# Mission-Critical Architecture on Azure

Guidance for designing mission-critical workloads on Azure that prioritize cloud-native capabilities to maximize reliability and operational effectiveness.

**Target SLO:** **99.99%** or higher — permitted annual downtime: **52 minutes 35 seconds**.

Every design decision in this guidance is made in service of that target SLO.

| SLO Target | Permitted Annual Downtime | Typical Use Case |
|---|---|---|
| 99.9% | 8 hours 45 minutes | Standard business apps |
| 99.95% | 4 hours 22 minutes | Important business apps |
| 99.99% | 52 minutes 35 seconds | Mission-critical workloads |
| 99.999% | 5 minutes 15 seconds | Safety-critical systems |

---

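The downtime budgets in the table come straight from the SLO arithmetic; a quick sketch, using the Gregorian average year of 365.2425 days and truncating to whole seconds:

```python
def downtime_budget(slo: float):
    """Permitted annual downtime for an SLO, as (whole_minutes, whole_seconds).
    Uses the Gregorian average year of 365.2425 days, truncated to seconds."""
    total_minutes = (1 - slo) * 365.2425 * 24 * 60
    minutes = int(total_minutes)
    seconds = int((total_minutes - minutes) * 60)
    return minutes, seconds

four_nines = downtime_budget(0.9999)    # (52, 35)
five_nines = downtime_budget(0.99999)   # (5, 15)
```

Each extra nine divides the budget by ten, which is why 99.99% leaves under an hour per year: not even one leisurely manual failover fits inside it.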
## Key Design Strategies

### 1. Redundancy in Layers

Deploy redundancy at every layer of the architecture to eliminate single points of failure.

- Deploy to multiple regions in an **active-active** model — the application is distributed across 2+ Azure regions, all handling active user traffic simultaneously
- Use **availability zones** for every service that supports them — distributing components across physically separate datacenters inside a region
- Choose resources that support **global distribution** natively
- Apply zone-redundant configurations for all stateful services
- Ensure data replication meets RPO requirements across regions

**Azure services:** Azure Front Door (global routing), Azure Traffic Manager (DNS failover), Azure Cosmos DB (multi-region writes), Azure SQL (geo-replication)

### 2. Deployment Stamps

Deploy regional stamps as scale units — a logical set of resources that can be independently provisioned to keep up with demand changes.

- Each stamp is a **self-contained scale unit** with its own compute, caching, and local state
- A stamp can contain multiple nested scale units (e.g., frontend APIs and background processors scale independently)
- **No dependencies between scale units** — they communicate only with shared services outside the stamp
- Scale units are **temporary/ephemeral** — store persistent system-of-record data only in the replicated database
- Use stamps for blue-green deployments by rolling out new units, validating them, and gradually shifting traffic

**Key benefit:** Compartmentalization enables independent scaling and fault isolation per region.

### 3. Reliable and Repeatable Deployments

Apply the principle of Infrastructure as Code (IaC) for version control and standardized operations.

- Use **Terraform** or **Bicep** for infrastructure definitions kept under version control
- Implement **zero-downtime blue/green deployment** pipelines — build and release pipelines must be fully automated
- Enforce **environment consistency** — use the same deployment pipeline code across production and pre-production environments
- Integrate **continuous validation** — automated testing as part of DevOps processes
- Include synchronized **load and chaos testing** to validate both application code and underlying infrastructure
- Deploy stamps as a **single operational unit** — never partially deploy a stamp

### 4. Operational Insights

Build comprehensive observability without introducing single points of failure.

- Use **federated workspaces** for observability data — store monitoring data for global and regional resources independently
- A centralized observability store is **not recommended**, because it becomes a single point of failure
- Use **cross-workspace querying** to achieve a unified data sink and a single pane of glass for operations
- Construct a **layered health model** that maps application health onto a traffic light model, so operators can interpret status at a glance
- Calculate health scores for each **individual component**, then **aggregate them at the user-flow level**
- Combine health signals with key non-functional requirements (such as performance) as weighted coefficients to quantify overall application health

---

## Design Areas

Each design area must be addressed for a mission-critical architecture.

| Design Area | Description | Key Concerns |
|---|---|---|
| **Application platform** | Infrastructure choices and mitigations for potential failure cases | AKS vs App Service, availability zones, containerization |
| **Application design** | Design patterns that allow for scaling and error handling | Stateless services, async messaging, queue-based decoupling |
| **Networking and connectivity** | Network considerations for routing incoming traffic to stamps | Global load balancing, WAF, DDoS protection, private endpoints |
| **Data platform** | Choices in data store technologies | Volume, velocity, variety, veracity; active-active vs active-passive |
| **Deployment and testing** | Strategies for CI/CD pipelines and automation | Blue/green deployments, load testing, chaos testing |
| **Health modeling** | Observability through customer impact analysis | Correlated monitoring, traffic light model, health scores |
| **Security** | Mitigation of attack vectors | Microsoft Zero Trust model, identity-based access, encryption |
| **Operational procedures** | Processes related to runtime operations | Deployment SOPs, key management, patching, incident response |

---

## Active-Active Multi-Region Architecture

The core topology for mission-critical workloads distributes the application across multiple Azure regions.

### Architecture Characteristics

- Application distributed across **2+ Azure regions** handling active user traffic simultaneously
- Each region contains independent **deployment stamps** (scale units)
- **Azure Front Door** provides global routing, SSL termination, and WAF at the edge
- Scale units have **no cross-dependencies** — they communicate only with shared services (e.g., global database, DNS)
- Persistent data resides only in the **replicated database** — stamps store no durable local state
- When scale units are replaced or retired, applications reconnect transparently

### Data Replication Strategies

| Strategy | Writes | Reads | Consistency | Best For |
|---|---|---|---|---|
| Active-passive (Azure SQL) | Single primary region | All regions via read replicas | Strong | Relational data, ACID transactions |
| Active-active (Cosmos DB) | All regions | All regions | Tunable (5 levels) | Document/key-value data, global apps |
| Write-behind (Redis → SQL) | Redis first, async to SQL | Redis or SQL | Eventual | High-throughput writes, rate limiting |

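The write-behind row can be sketched in a few lines: writes land in the fast cache immediately and are flushed to the system of record asynchronously. Here the cache, queue, and database are plain in-memory stand-ins for Redis, a replication queue, and Azure SQL:

```python
from collections import deque

class WriteBehindStore:
    """Write-behind sketch: accept writes into a fast cache, queue them,
    and flush to the durable store later. Reads hit the cache first."""

    def __init__(self):
        self.cache = {}          # stands in for Redis
        self.pending = deque()   # stands in for an async replication queue
        self.database = {}       # stands in for Azure SQL

    def write(self, key, value):
        self.cache[key] = value
        self.pending.append((key, value))  # durable write happens later

    def read(self, key):
        return self.cache.get(key, self.database.get(key))

    def flush(self):
        """Drain queued writes to the durable store (normally a background job)."""
        while self.pending:
            key, value = self.pending.popleft()
            self.database[key] = value

# Demo: a read sees the write before the database does (eventual consistency).
store = WriteBehindStore()
store.write("rate:user-1", 7)
before_flush = store.database.get("rate:user-1")  # still None until flushed
store.flush()
```

The window between `write` and `flush` is exactly the eventual-consistency gap in the table: acceptable for rate-limit counters, but not for system-of-record data.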
### Regional Stamp Composition

Each stamp typically includes:

- **Compute tier** — App Service or AKS with multiple instances across availability zones
- **Caching tier** — Azure Managed Redis for session state, rate limiting, feature flags
- **Configuration** — Azure App Configuration for settings (capacity correlates with requests/second)
- **Secrets** — Azure Key Vault for certificates and secrets
- **Networking** — Virtual network with private endpoints, NSGs, and service endpoints

---

## Health Modeling and Traffic Light Approach

Health modeling provides the foundation for automated operational decisions.

### Building the Health Model

1. **Identify user flows** — map critical paths through the application (e.g., "user login", "checkout", "search")
2. **Decompose into components** — each flow depends on specific compute, data, and network components
3. **Assign health scores** — each component reports a health score based on metrics (latency, error rate, saturation)
4. **Aggregate per flow** — combine component scores weighted by criticality to produce a flow-level health score
5. **Apply traffic light** — map aggregate scores to **Green** (healthy), **Yellow** (degraded), **Red** (unhealthy)

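Steps 3 through 5 can be sketched as a weighted aggregation mapped onto the traffic light. The component names, weights, and thresholds are illustrative, not prescribed values:

```python
def flow_health(component_scores: dict, weights: dict) -> float:
    """Weighted average of per-component health scores (each 0.0-1.0)."""
    total_weight = sum(weights[name] for name in component_scores)
    return sum(score * weights[name]
               for name, score in component_scores.items()) / total_weight

def traffic_light(score: float, yellow_at: float = 0.9, red_at: float = 0.6) -> str:
    """Map an aggregate score onto the green/yellow/red model."""
    if score >= yellow_at:
        return "green"
    return "yellow" if score >= red_at else "red"

# Demo: a "checkout" flow where the (least critical) cache is degraded.
scores = {"api": 1.0, "database": 1.0, "cache": 0.4}
weights = {"api": 3, "database": 3, "cache": 1}
checkout = flow_health(scores, weights)  # ≈ 0.914: still green
```

Weighting by criticality is what keeps a degraded low-impact dependency from painting the whole flow red, while a failing database would drag the score down immediately.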
### Health Score Coefficients

| Factor | Metric Examples | Weight Guidance |
|---|---|---|
| Availability | Error rate, HTTP 5xx ratio | High — directly impacts users |
| Performance | P95 latency, request duration | Medium — affects user experience |
| Saturation | CPU %, memory %, queue depth | Medium — indicates future problems |
| Freshness | Data replication lag, cache age | Lower — depends on consistency needs |

### Operational Actions by Health State

| State | Meaning | Automated Action |
|---|---|---|
| 🟢 Green | All components healthy | Normal operations |
| 🟡 Yellow | Degraded but functional | Alert on-call, increase monitoring frequency |
| 🔴 Red | Critical failure detected | Trigger failover, page on-call, block deployments |

---

## Zero-Downtime Deployment (Blue/Green)

Deployment must never cause downtime in a mission-critical system.

### Blue/Green Process

1. **Provision new stamp** — deploy a complete new scale unit ("green") alongside the existing one ("blue")
2. **Run validation** — execute automated smoke tests, integration tests, and synthetic transactions against the green stamp
3. **Canary traffic** — route a small percentage of production traffic (e.g., 5%) to the green stamp
4. **Monitor health** — compare health scores between the blue and green stamps over a defined observation period
5. **Gradual shift** — increase traffic to the green stamp in increments (5% → 25% → 50% → 100%)
6. **Decommission blue** — once green is fully validated, tear down the blue stamp

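Steps 3 through 5 amount to a loop that advances traffic only while the green stamp stays healthy. In this sketch the `is_healthy` probe is a placeholder for real health-model checks, and the increments mirror the example above:

```python
def shift_traffic(is_healthy, increments=(5, 25, 50, 100)):
    """Advance the green stamp's traffic share step by step, rolling back
    to 0% (all traffic on the blue stamp) if any health check fails."""
    green_share = 0
    for target in increments:
        green_share = target
        if not is_healthy(green_share):
            return 0  # rollback: route everything back to the blue stamp
    return green_share  # 100: green fully live, blue can be decommissioned

# Demo: a healthy rollout, and one that degrades at 50% of the traffic.
promoted = shift_traffic(lambda share: True)
rolled_back = shift_traffic(lambda share: share < 50)
```

Rollback is cheap precisely because the blue stamp keeps running throughout the shift, which is the key requirement listed in the next section.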
162
+ ### Key Requirements
163
+
164
+ - Build and release pipelines must be **fully automated** — no manual deployment steps
165
+ - Use the **same pipeline code** for all environments (dev, staging, production)
166
+ - Each stamp deployed as a **single operational unit** — never partial
167
+ - Rollback is achieved by **shifting traffic back** to the previous stamp (still running during validation)
168
+ - **Continuous validation** runs throughout the deployment, not just at the end
169
+
170
+ ---
171
+
## Chaos Engineering and Continuous Validation

Proactive failure testing ensures recovery mechanisms work before real incidents occur.

### Chaos Engineering Practices

- Use **Azure Chaos Studio** to run controlled experiments against production or pre-production environments
- Test failure modes: availability zone outage, network partition, dependency failure, CPU/memory pressure
- Run chaos experiments as part of the **CI/CD pipeline** — every deployment is validated under fault conditions
- **Synchronized load and chaos testing** — inject faults while the system is under realistic load

### Validation Checklist

- [ ] Health model detects injected faults within SLO-defined time windows
- [ ] Automated failover completes within target RTO
- [ ] No data loss exceeding target RPO during regional failover
- [ ] Application degrades gracefully (reduced functionality, not total failure)
- [ ] Alerts fire correctly and reach the on-call team
- [ ] Runbooks and automated remediation execute successfully

---

## Application Platform Considerations

### Platform Options

| Platform | Best For | Availability Zone Support | Complexity |
|---|---|---|---|
| **Azure App Service** | Web apps, APIs, PaaS-first approach | Yes (zone-redundant) | Low-Medium |
| **AKS** | Complex microservices, full K8s control | Yes (zone-redundant node pools) | High |
| **Container Apps** | Serverless containers, event-driven | Yes | Medium |

### Recommendations

- **Prioritize availability zones** for all production workloads — spread across physically separate datacenters
- **Containerize workloads** for reliability and portability between platforms
- Ensure all services in a scale unit support availability zones — don't mix zonal and non-zonal services
- For latency-sensitive or chatty workloads, consider the cost and latency tradeoffs of cross-zone traffic

---

## Data Platform Considerations

### Choosing a Primary Database

| Scenario | Recommended Service | Deployment Model |
|---|---|---|
| Relational data, ACID transactions | **Azure SQL** | Active-passive with geo-replication |
| Global distribution, multi-model | **Azure Cosmos DB** | Active-active with multi-region writes |
| Multiple microservice databases | **Mixed (polyglot)** | Per-service database with appropriate model |

### Azure SQL in Mission-Critical

- Azure SQL does **not** natively support active-active concurrent writes in multiple regions
- Use an **active-passive** strategy: a single primary region for writes, read replicas in secondary regions
- **Partial active-active** is possible at the application tier — route reads to local replicas and writes to the primary
- Configure **auto-failover groups** for automated regional failover

### Azure Managed Redis in Mission-Critical

- Use within or alongside each scale unit for:
  - **Cache data** — rebuildable, repopulated on demand
  - **Session state** — user sessions during scale unit lifetime
  - **Rate limit counters** — per-user and per-tenant throttling
  - **Feature flags** — dynamic configuration without redeployment
  - **Coordination metadata** — distributed locks, leader election
- **Active geo-replication** enables Redis data to replicate asynchronously across regions
- Design cached data as either **rebuildable** (repopulated without availability impact) or **durable auxiliary state** (protected by persistence and geo-replication)

---

## Security in Mission-Critical

### Zero Trust Principles

- **Verify explicitly** — authenticate and authorize based on all available data points (identity, location, device, service)
- **Use least privilege access** — limit user access with Just-In-Time and Just-Enough-Access (JIT/JEA)
- **Assume breach** — minimize blast radius, segment access, verify end-to-end encryption, and use analytics for threat detection

### Security Controls

| Layer | Control | Azure Service |
|---|---|---|
| Edge | DDoS protection, WAF | Azure Front Door, Azure DDoS Protection |
| Identity | Managed identities, RBAC | Microsoft Entra ID, Azure RBAC |
| Network | Private endpoints, NSGs | Azure Private Link, Virtual Network |
| Data | Encryption at rest and in transit | Azure Key Vault, TDE, TLS 1.2+ |
| Operations | Privileged access management | Microsoft Entra PIM, Azure Bastion |

---

## Operational Procedures

### Key Operational Processes

| Process | Description | Automation Level |
|---|---|---|
| **Deployment** | Blue/green with automated validation | Fully automated |
| **Scaling** | Stamp provisioning and decommissioning | Automated with manual approval gates |
| **Key rotation** | Certificate and secret rotation | Automated via Key Vault policies |
| **Patching** | OS and runtime updates | Automated via platform (PaaS) or pipeline (IaaS) |
| **Incident response** | Detection, triage, mitigation, resolution | Semi-automated (alert → runbook → human) |
| **Capacity planning** | Forecast demand, pre-provision stamps | Manual with data-driven analysis |

### Runbook Requirements

- All operational runbooks must be **tested in pre-production** with the same chaos/load scenarios as production
- **Automated remediation** is preferred over manual intervention for known failure modes
- Runbooks must include **rollback procedures** for every change type
- **Post-incident reviews** (blameless) must feed back into health model and chaos experiment improvements

---

> Source: [Azure Architecture Center](https://learn.microsoft.com/en-us/azure/architecture/)