@teckedd-code2save/b2dp 1.0.1 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
# Cloud Solution Architect Acceptance Criteria

**Skill**: `cloud-solution-architect`
**Purpose**: Produce architecture reviews, design pattern recommendations, and technology decisions aligned with the Azure Well-Architected Framework
**Focus**: WAF pillar alignment, design patterns, technology justification, mission-critical design, performance antipatterns, ADRs

---

## 1. Architecture Review and WAF Pillar Alignment

### 1.1 ✅ CORRECT: Recommendations Mapped to All 5 WAF Pillars

```markdown
## Architecture Review: Order Processing System

### WAF Pillar: Reliability
- **Finding**: Single database instance without failover
- **Pattern**: Deploy Geode pattern for multi-region data
- **Recommendation**: Enable Azure SQL geo-replication
- **Impact**: Improves RTO from hours to minutes

### WAF Pillar: Security
- **Finding**: API keys stored in application config
- **Pattern**: External Configuration Store pattern
- **Recommendation**: Migrate secrets to Azure Key Vault with managed identity
- **Impact**: Eliminates credential exposure risk

### WAF Pillar: Cost Optimization
- **Finding**: Over-provisioned VMs running at 15% utilization
- **Pattern**: Queue-Based Load Leveling pattern
- **Recommendation**: Replace dedicated VMs with Azure Container Apps with scale-to-zero
- **Impact**: Estimated 60% cost reduction

### WAF Pillar: Operational Excellence
- **Finding**: No automated deployment pipeline
- **Pattern**: Deployment Stamps pattern
- **Recommendation**: Implement Azure DevOps pipelines with staged rollouts
- **Impact**: Reduces deployment errors and enables rollback

### WAF Pillar: Performance Efficiency
- **Finding**: Synchronous API calls to downstream services
- **Pattern**: Async Request-Reply pattern
- **Recommendation**: Introduce Azure Service Bus for async processing
- **Impact**: Reduces P99 latency from 5s to 200ms
```
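
Pillar coverage is mechanically checkable before review sign-off. A minimal Python sketch (the `missing_pillars` helper and the heading format it matches are assumptions of this example, not part of the skill):

```python
# Hypothetical helper: verify a review document touches every WAF pillar.
WAF_PILLARS = [
    "Reliability",
    "Security",
    "Cost Optimization",
    "Operational Excellence",
    "Performance Efficiency",
]

def missing_pillars(review_text: str) -> list[str]:
    """Return WAF pillars that never appear as a pillar heading in the review."""
    return [p for p in WAF_PILLARS if f"### WAF Pillar: {p}" not in review_text]

# A partial review that only covers two pillars:
review = "### WAF Pillar: Reliability\n### WAF Pillar: Security\n"
print(missing_pillars(review))  # the three pillars the partial review omits
```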

### 1.2 ✅ CORRECT: Recommendations Include Specific Azure Services and Design Patterns

```markdown
### Finding: Tight coupling between order and payment services

- **WAF Pillar**: Reliability
- **Pattern**: [Choreography pattern](https://learn.microsoft.com/azure/architecture/patterns/choreography)
- **Azure Services**: Azure Service Bus, Azure Event Grid
- **Design**: Publish OrderCreated event → Payment service subscribes and processes independently
- **Fallback**: Dead-letter queue with retry policy for failed payments
```

### 1.3 ❌ INCORRECT: Generic Advice Without WAF Pillar Mapping

```markdown
## Architecture Review

- Consider using caching
- Add monitoring
- Use managed services
- Improve security
```

### 1.4 ❌ INCORRECT: Recommendations Not Tied to Specific Design Patterns

```markdown
### Finding: Database is slow

- **Recommendation**: Make the database faster
- **Impact**: Better performance
```

---

## 2. Design Pattern Selection

### 2.1 ✅ CORRECT: Pattern Selected Matches Problem Context with Justification

```markdown
### Pattern Decision: Inter-Service Communication

**Problem Context**: 5 microservices need to coordinate order fulfillment
with varying processing times (100ms to 30s). Services must remain
independently deployable and failures must not cascade.

**Selected Pattern**: Choreography via Event-Driven Architecture

**Justification**:
- Services have different processing times → async decoupling required
- No single orchestrator needed → reduces single point of failure
- Teams own services independently → choreography respects team boundaries

**Rejected Alternative**: Orchestrator pattern
- Would create central dependency and bottleneck
- Harder to scale orchestrator for 10k+ orders/sec
```

### 2.2 ✅ CORRECT: Trade-offs Documented Between Alternative Patterns

```markdown
### Pattern Comparison: Data Consistency

| Criteria | Saga (Choreography) | Saga (Orchestration) | 2PC |
|----------|---------------------|----------------------|-----|
| Consistency | Eventual | Eventual | Strong |
| Coupling | Low | Medium | High |
| Complexity | Medium | Medium | Low (but rigid) |
| Failure handling | Compensating events | Central coordinator | Automatic rollback |
| Scalability | High | Medium | Low |
| **Fit for context** | ✅ Best fit | ⚠️ Acceptable | ❌ Poor fit |

**Decision**: Saga with Choreography — aligns with existing event-driven
architecture and team autonomy requirements.
```
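
The compensating-events row is what separates a saga from 2PC: failures are undone by explicit compensating actions rather than automatic rollback. A minimal choreography-flavored sketch in Python (all names illustrative; no real saga framework assumed):

```python
def run_saga(steps):
    """steps: list of (action, compensation) pairs; returns True on success.
    On failure, completed steps are compensated in reverse order."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):   # undo completed steps in reverse
                comp()
            return False
    return True

log = []

def reserve_stock(): log.append("reserve-stock")
def release_stock(): log.append("release-stock")
def charge_payment(): raise RuntimeError("payment declined")  # simulated failure
def refund_payment(): log.append("refund")

ok = run_saga([(reserve_stock, release_stock), (charge_payment, refund_payment)])
print(ok, log)  # False ['reserve-stock', 'release-stock']
```

Note that the failed step's own compensation never runs; only steps that completed are compensated.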

### 2.3 ❌ INCORRECT: Pattern Chosen Without Considering Problem Constraints

```markdown
### Pattern: CQRS

Use CQRS for the application.
```

### 2.4 ❌ INCORRECT: Applying Patterns That Don't Fit the Problem Domain

```markdown
### Pattern: Event Sourcing

Implement event sourcing for the static content website to track all
changes to HTML pages.
```

---

## 3. Technology Choice Justification

### 3.1 ✅ CORRECT: Technology Choices Justified with Comparison Table

```markdown
### Technology Decision: Message Broker

**Requirements**: 10k msgs/sec, at-least-once delivery, <100ms latency,
.NET SDK support, managed service preferred.

| Criteria | Azure Service Bus | Azure Event Hubs | Azure Queue Storage |
|----------|-------------------|------------------|---------------------|
| Throughput | 1M msgs/sec (premium) | Millions/sec | 20k msgs/sec |
| Ordering | FIFO (sessions) | Per-partition | None |
| Max message size | 256KB–100MB | 1MB | 64KB |
| Dead-letter support | ✅ Built-in | ❌ Manual | ❌ Manual |
| Cost (estimated/mo) | ~$670 | ~$220 | ~$5 |
| Team experience | Medium | Low | High |

**Decision**: Azure Service Bus Premium
- Dead-letter support critical for payment reliability
- FIFO sessions needed for order sequencing
- Cost justified by reduced operational complexity
```
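
One way to make a comparison table like this auditable is a weighted scoring matrix. The weights and 1-5 scores below are illustrative assumptions for the sketch, not official benchmarks:

```python
# Illustrative weighted scoring for the broker comparison above.
weights = {"dead_letter": 0.35, "ordering": 0.30, "cost": 0.15, "team_experience": 0.20}

scores = {
    "Azure Service Bus":   {"dead_letter": 5, "ordering": 5, "cost": 2, "team_experience": 3},
    "Azure Event Hubs":    {"dead_letter": 1, "ordering": 3, "cost": 3, "team_experience": 2},
    "Azure Queue Storage": {"dead_letter": 1, "ordering": 1, "cost": 5, "team_experience": 4},
}

def weighted_score(option_scores):
    """Sum of per-criterion score times criterion weight."""
    return sum(weights[c] * s for c, s in option_scores.items())

ranked = sorted(scores, key=lambda o: weighted_score(scores[o]), reverse=True)
print(ranked[0])  # Azure Service Bus
```

Publishing the weights alongside the decision makes the trade-off explicit and lets reviewers challenge the inputs rather than the conclusion.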

### 3.2 ✅ CORRECT: Decision Considers Scale, Cost, Complexity, and Team Skills

```markdown
### Technology Decision: Container Orchestration

**Scale**: 20 microservices, 3 environments, ~500 pods peak
**Team skills**: 2 engineers with Kubernetes experience, 6 with App Service
**Budget**: $15k/month compute

**Decision**: Azure Container Apps (not AKS)
- Team lacks deep K8s expertise → ACA reduces operational burden
- Scale requirements fit ACA limits (300 replicas per app)
- KEDA-based autoscaling meets event-driven needs
- Saves ~$3k/month vs equivalent AKS cluster
- Migration path to AKS exists if requirements grow
```

### 3.3 ❌ INCORRECT: Technology Selected Without Comparison to Alternatives

```markdown
### Technology Decision

Use Kubernetes for containers.
```

### 3.4 ❌ INCORRECT: Choosing Most Complex Option Without Justification

```markdown
### Technology Decision

Deploy AKS with Istio service mesh, Dapr sidecars, and custom operators
for a 3-service application with 100 requests per minute.
```

---

## 4. Mission-Critical Design

### 4.1 ✅ CORRECT: Design Addresses All 8 Design Areas

```markdown
## Mission-Critical Assessment: Payment Platform

**SLO Target**: 99.99% availability (≤4.32 min downtime/month)

### 1. Application Platform
- AKS multi-region (East US + West US) with availability zones
- Node auto-scaling: 3–20 nodes per region

### 2. Application Design
- Stateless services with external state in Cosmos DB
- Circuit breaker on all downstream calls (Polly)
- Bulkhead isolation between payment providers

### 3. Networking
- Azure Front Door with health probes per region
- Private endpoints for all data services
- DDoS Protection Standard enabled

### 4. Data Platform
- Cosmos DB multi-region write with strong consistency
- Automated backups every 4 hours, PITR enabled
- Read replicas in 3 regions

### 5. Deployment and Testing
- Blue-green deployment via Azure Front Door traffic shifting
- Canary releases: 5% → 25% → 100% over 2 hours
- Chaos engineering: monthly failure injection tests

### 6. Health Modeling
- Composite health score: infrastructure + dependency + application metrics
- Azure Monitor with custom health model dashboard
- Automated alerting at degraded/unhealthy thresholds

### 7. Security
- Zero Trust: verify explicitly, least privilege, assume breach
- Managed identities for all service-to-service auth
- WAF policies on Front Door

### 8. Operational Procedures
- Runbooks for top 10 failure scenarios
- Automated failover tested quarterly
- On-call rotation with 15-minute response SLA
```
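
Polly supplies the circuit breaker for .NET, but the state machine itself is language-agnostic. A minimal Python sketch of the pattern (threshold and reset values are illustrative defaults, not Polly's):

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; rejects calls until
    `reset_after` seconds pass, then permits a single trial call (half-open)."""

    def __init__(self, threshold: int = 3, reset_after: float = 30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            self.opened_at = None              # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0                      # any success closes the circuit
        return result
```

Rejecting calls fast while the downstream dependency is unhealthy is what prevents the cascading failures the application-design area calls out.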

### 4.2 ✅ CORRECT: SLO Target Explicitly Stated with Redundancy Strategy

```markdown
### Availability Design

| Component | SLA | Redundancy | Failover |
|-----------|-----|------------|----------|
| Azure Front Door | 99.99% | Global | Automatic |
| AKS | 99.95% | AZ-redundant | Pod rescheduling |
| Cosmos DB | 99.999% | Multi-region write | Automatic |
| Key Vault | 99.99% | AZ-redundant | Automatic |

**Composite SLO**: 99.95% × 99.99% × 99.999% × 99.99% = ~99.93%
**Target SLO**: 99.95% → Add regional AKS failover to close gap
```
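
For serially dependent components, composite availability is the product of the individual SLAs, which is where the ~99.93% figure comes from:

```python
from math import prod

# Serial dependencies multiply: the chain is only as available as the product.
slas = {"Front Door": 0.9999, "AKS": 0.9995, "Cosmos DB": 0.99999, "Key Vault": 0.9999}
composite = prod(slas.values())
minutes_down_per_month = (1 - composite) * 30 * 24 * 60

print(f"{composite:.4%}")                # ~99.93%, below the 99.95% target
print(f"{minutes_down_per_month:.1f}")   # ~30.7 minutes/month error budget
```

Translating the gap into minutes of error budget makes the case for the regional AKS failover concrete.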

### 4.3 ❌ INCORRECT: Missing Design Areas in Mission-Critical Review

```markdown
## Mission-Critical Assessment

- Use multi-region deployment
- Add monitoring
- Enable autoscaling
```

### 4.4 ❌ INCORRECT: No Health Modeling or Observability Strategy

```markdown
## Mission-Critical Assessment

### Compute
- Deploy to two regions

### Data
- Use Cosmos DB

(No health modeling, security, operational procedures, or deployment strategy)
```

---

## 5. Performance Antipattern Identification

### 5.1 ✅ CORRECT: Antipatterns Identified with Specific Remediation Steps

```markdown
## Performance Antipattern: Chatty I/O

**Detection**: Application Insights shows 47 SQL queries per API request
to `/api/orders/{id}` endpoint, averaging 1.2s total.

**Metrics**:
- Dependency calls per request: 47 (target: <5)
- P95 latency: 2.1s (target: <200ms)

**Root cause**: N+1 query pattern — loading order, then iterating line
items and loading each product individually.

**Remediation**:
1. Replace individual queries with batch query using `WHERE IN` clause
2. Add projection to return only needed columns
3. Implement response caching with 30s TTL for product data

**Expected improvement**: 47 queries → 2 queries, latency 2.1s → ~150ms
```
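
Remediation steps 1 and 2 can be sketched with sqlite3 standing in for Azure SQL (schema and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [(1, "widget", 9.99), (2, "gadget", 19.99), (3, "gizmo", 4.99)])

line_item_ids = [1, 3]

# Antipattern: one query per line item (N round trips).
n_plus_1 = [conn.execute("SELECT name, price FROM products WHERE id = ?", (i,)).fetchone()
            for i in line_item_ids]

# Remediation: one batched WHERE IN query, projected to only the needed columns.
placeholders = ",".join("?" * len(line_item_ids))
batched = conn.execute(
    f"SELECT name, price FROM products WHERE id IN ({placeholders})",
    line_item_ids).fetchall()

assert sorted(batched) == sorted(n_plus_1)  # same rows, one round trip
```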

### 5.2 ✅ CORRECT: Detection Method and Metrics Defined for Each Antipattern

```markdown
## Performance Antipattern: No Caching

**Detection method**: Azure Monitor → API response times + database DTU
**Key metrics**:
- Cache hit ratio: 0% (no cache exists)
- Database DTU: 85% sustained (threshold: 70%)
- Identical query ratio: 62% of queries return same data within 60s

**Remediation**:
1. Add Azure Cache for Redis (Standard C1)
2. Cache product catalog (TTL: 5 min)
3. Cache user sessions (TTL: 30 min)
4. Implement cache-aside pattern with fallback to database

**Monitoring after fix**:
- Track cache hit ratio (target: >80%)
- Monitor Redis memory usage and evictions
```

### 5.3 ❌ INCORRECT: Antipattern Identified Without Remediation Guidance

```markdown
## Performance Issue

The application has a chatty I/O problem. Too many database calls.
```

### 5.4 ❌ INCORRECT: Generic Performance Advice Without Specific Antipattern Analysis

```markdown
## Performance Review

- Add caching
- Use async
- Scale up the database
- Optimize queries
```

---

## 6. Architecture Decision Records

### 6.1 ✅ CORRECT: Decisions Documented with Context, Options, Rationale, and Consequences

```markdown
## ADR-003: Use Azure Cosmos DB for Order Data

**Status**: Accepted
**Date**: 2024-03-15
**Deciders**: Platform team, Product engineering

### Context
Order service needs a database supporting:
- Multi-region writes for <50ms latency globally
- Automatic scaling from 100 to 50,000 RU/s
- Document model for flexible order schemas
- 99.999% availability SLA for payment-critical data

### Options Considered

| Option | Pros | Cons |
|--------|------|------|
| Azure SQL | Strong consistency, familiar | Single-region write, fixed schema |
| Cosmos DB | Multi-region, flexible schema, SLA | Cost at scale, eventual consistency default |
| PostgreSQL Flexible | Open source, cost-effective | Manual geo-replication, no SLA match |

### Decision
Azure Cosmos DB with NoSQL API and session consistency.

### Rationale
- Only service offering 99.999% SLA with multi-region writes
- Session consistency balances performance and user experience
- Auto-scale RU/s matches unpredictable order volume patterns
- Document model accommodates evolving order schema without migrations

### Consequences
- **Positive**: Global low-latency reads/writes, managed scaling
- **Negative**: Higher cost (~$2k/month vs ~$500 for SQL), team needs Cosmos DB training
- **Risks**: Partition key design errors can cause hot partitions — mitigate with design review
```

### 6.2 ❌ INCORRECT: Architecture Decisions Without Documented Rationale

```markdown
## Decision

We will use Cosmos DB for orders.
```

---

## 7. Anti-Patterns Summary

| Anti-Pattern | Impact | Fix |
|--------------|--------|-----|
| No WAF mapping | Incomplete review, missed pillars | Map each recommendation to a WAF pillar |
| Wrong pattern | Mismatched solution to problem | Validate pattern against problem constraints |
| No tradeoff analysis | Uninformed decisions | Compare alternatives systematically |
| Missing design areas | Gaps in mission-critical review | Use 8-area checklist for coverage |
| No remediation | Unactionable findings | Include specific fix steps for each antipattern |
| No ADR rationale | Undocumented decisions erode over time | Record context, options, and consequences |

---

## 8. Checklist for Architecture Review

- [ ] Architecture review maps to all 5 WAF pillars
- [ ] Design patterns selected with problem context justification
- [ ] Technology choices include comparison and tradeoff analysis
- [ ] Mission-critical designs address all 8 design areas
- [ ] Performance antipatterns identified with specific remediation
- [ ] Architecture decisions documented with rationale
- [ ] SLO/SLA targets explicitly stated
- [ ] Health modeling strategy defined
- [ ] Deployment strategy includes zero-downtime approach
- [ ] Security follows Zero Trust model