@trohde/earos 1.0.0

Files changed (135)
  1. package/README.md +156 -0
  2. package/assets/init/.agents/skills/earos-artifact-gen/SKILL.md +106 -0
  3. package/assets/init/.agents/skills/earos-artifact-gen/references/interview-guide.md +313 -0
  4. package/assets/init/.agents/skills/earos-artifact-gen/references/output-guide.md +367 -0
  5. package/assets/init/.agents/skills/earos-assess/SKILL.md +212 -0
  6. package/assets/init/.agents/skills/earos-assess/references/calibration-benchmarks.md +160 -0
  7. package/assets/init/.agents/skills/earos-assess/references/output-templates.md +311 -0
  8. package/assets/init/.agents/skills/earos-assess/references/scoring-protocol.md +281 -0
  9. package/assets/init/.agents/skills/earos-calibrate/SKILL.md +153 -0
  10. package/assets/init/.agents/skills/earos-calibrate/references/agreement-metrics.md +188 -0
  11. package/assets/init/.agents/skills/earos-calibrate/references/calibration-protocol.md +263 -0
  12. package/assets/init/.agents/skills/earos-create/SKILL.md +257 -0
  13. package/assets/init/.agents/skills/earos-create/references/criterion-writing-guide.md +268 -0
  14. package/assets/init/.agents/skills/earos-create/references/dependency-rules.md +193 -0
  15. package/assets/init/.agents/skills/earos-create/references/rubric-interview-guide.md +123 -0
  16. package/assets/init/.agents/skills/earos-create/references/validation-checklist.md +238 -0
  17. package/assets/init/.agents/skills/earos-profile-author/SKILL.md +251 -0
  18. package/assets/init/.agents/skills/earos-profile-author/references/criterion-writing-guide.md +280 -0
  19. package/assets/init/.agents/skills/earos-profile-author/references/design-methods.md +158 -0
  20. package/assets/init/.agents/skills/earos-profile-author/references/profile-checklist.md +173 -0
  21. package/assets/init/.agents/skills/earos-remediate/SKILL.md +118 -0
  22. package/assets/init/.agents/skills/earos-remediate/references/output-template.md +199 -0
  23. package/assets/init/.agents/skills/earos-remediate/references/remediation-patterns.md +330 -0
  24. package/assets/init/.agents/skills/earos-report/SKILL.md +85 -0
  25. package/assets/init/.agents/skills/earos-report/references/portfolio-template.md +181 -0
  26. package/assets/init/.agents/skills/earos-report/references/single-artifact-template.md +168 -0
  27. package/assets/init/.agents/skills/earos-review/SKILL.md +130 -0
  28. package/assets/init/.agents/skills/earos-review/references/challenge-patterns.md +163 -0
  29. package/assets/init/.agents/skills/earos-review/references/output-template.md +180 -0
  30. package/assets/init/.agents/skills/earos-template-fill/SKILL.md +177 -0
  31. package/assets/init/.agents/skills/earos-template-fill/references/evidence-writing-guide.md +186 -0
  32. package/assets/init/.agents/skills/earos-template-fill/references/section-rubric-mapping.md +200 -0
  33. package/assets/init/.agents/skills/earos-validate/SKILL.md +113 -0
  34. package/assets/init/.agents/skills/earos-validate/references/fix-patterns.md +281 -0
  35. package/assets/init/.agents/skills/earos-validate/references/validation-checks.md +287 -0
  36. package/assets/init/.claude/CLAUDE.md +4 -0
  37. package/assets/init/AGENTS.md +293 -0
  38. package/assets/init/CLAUDE.md +635 -0
  39. package/assets/init/README.md +507 -0
  40. package/assets/init/calibration/gold-set/.gitkeep +0 -0
  41. package/assets/init/calibration/results/.gitkeep +0 -0
  42. package/assets/init/core/core-meta-rubric.yaml +643 -0
  43. package/assets/init/docs/consistency-report.md +325 -0
  44. package/assets/init/docs/getting-started.md +194 -0
  45. package/assets/init/docs/profile-authoring-guide.md +51 -0
  46. package/assets/init/docs/terminology.md +126 -0
  47. package/assets/init/earos.manifest.yaml +104 -0
  48. package/assets/init/evaluations/.gitkeep +0 -0
  49. package/assets/init/examples/aws-event-driven-order-processing/artifact.yaml +2056 -0
  50. package/assets/init/examples/aws-event-driven-order-processing/evaluation.yaml +973 -0
  51. package/assets/init/examples/aws-event-driven-order-processing/report.md +244 -0
  52. package/assets/init/examples/example-solution-architecture.evaluation.yaml +136 -0
  53. package/assets/init/examples/multi-cloud-data-analytics/artifact.yaml +715 -0
  54. package/assets/init/overlays/data-governance.yaml +94 -0
  55. package/assets/init/overlays/regulatory.yaml +154 -0
  56. package/assets/init/overlays/security.yaml +92 -0
  57. package/assets/init/profiles/adr.yaml +225 -0
  58. package/assets/init/profiles/capability-map.yaml +223 -0
  59. package/assets/init/profiles/reference-architecture.yaml +426 -0
  60. package/assets/init/profiles/roadmap.yaml +205 -0
  61. package/assets/init/profiles/solution-architecture.yaml +227 -0
  62. package/assets/init/research/architecture-assessment-rubrics-research.docx +0 -0
  63. package/assets/init/research/architecture-assessment-rubrics-research.md +566 -0
  64. package/assets/init/research/reference-architecture-research.md +751 -0
  65. package/assets/init/standard/EAROS.md +1426 -0
  66. package/assets/init/standard/schemas/artifact.schema.json +1295 -0
  67. package/assets/init/standard/schemas/artifact.uischema.json +65 -0
  68. package/assets/init/standard/schemas/evaluation.schema.json +284 -0
  69. package/assets/init/standard/schemas/rubric.schema.json +383 -0
  70. package/assets/init/templates/evaluation-record.template.yaml +58 -0
  71. package/assets/init/templates/new-profile.template.yaml +65 -0
  72. package/bin.js +188 -0
  73. package/dist/assets/_basePickBy-BVu6YmSW.js +1 -0
  74. package/dist/assets/_baseUniq-CWRzQDz_.js +1 -0
  75. package/dist/assets/arc-CyDBhtDM.js +1 -0
  76. package/dist/assets/architectureDiagram-2XIMDMQ5-BH6O4dvN.js +36 -0
  77. package/dist/assets/blockDiagram-WCTKOSBZ-2xmwdjpg.js +132 -0
  78. package/dist/assets/c4Diagram-IC4MRINW-BNmPRFJF.js +10 -0
  79. package/dist/assets/channel-CiySTNoJ.js +1 -0
  80. package/dist/assets/chunk-4BX2VUAB-DGQTvirp.js +1 -0
  81. package/dist/assets/chunk-55IACEB6-DNMAQAC_.js +1 -0
  82. package/dist/assets/chunk-FMBD7UC4-BJbVTQ5o.js +15 -0
  83. package/dist/assets/chunk-JSJVCQXG-BCxUL74A.js +1 -0
  84. package/dist/assets/chunk-KX2RTZJC-H7wWZOfz.js +1 -0
  85. package/dist/assets/chunk-NQ4KR5QH-BK4RlTQF.js +220 -0
  86. package/dist/assets/chunk-QZHKN3VN-0chxDV5g.js +1 -0
  87. package/dist/assets/chunk-WL4C6EOR-DexfQ-AV.js +189 -0
  88. package/dist/assets/classDiagram-VBA2DB6C-D7luWJQn.js +1 -0
  89. package/dist/assets/classDiagram-v2-RAHNMMFH-D7luWJQn.js +1 -0
  90. package/dist/assets/clone-ylgRbd3D.js +1 -0
  91. package/dist/assets/cose-bilkent-S5V4N54A-DS2IOCfZ.js +1 -0
  92. package/dist/assets/cytoscape.esm-CyJtwmzi.js +331 -0
  93. package/dist/assets/dagre-KLK3FWXG-BbSoTTa3.js +4 -0
  94. package/dist/assets/defaultLocale-DX6XiGOO.js +1 -0
  95. package/dist/assets/diagram-E7M64L7V-C9TvYgv0.js +24 -0
  96. package/dist/assets/diagram-IFDJBPK2-DowUMWrg.js +43 -0
  97. package/dist/assets/diagram-P4PSJMXO-BL6nrnQF.js +24 -0
  98. package/dist/assets/erDiagram-INFDFZHY-rXPRl8VM.js +70 -0
  99. package/dist/assets/flowDiagram-PKNHOUZH-DBRM99-W.js +162 -0
  100. package/dist/assets/ganttDiagram-A5KZAMGK-INcWFsBT.js +292 -0
  101. package/dist/assets/gitGraphDiagram-K3NZZRJ6-DMwpfE91.js +65 -0
  102. package/dist/assets/graph-DLQn37b-.js +1 -0
  103. package/dist/assets/index-BFFITMT8.js +650 -0
  104. package/dist/assets/index-H7f6VTz1.css +1 -0
  105. package/dist/assets/infoDiagram-LFFYTUFH-B0f4TWRM.js +2 -0
  106. package/dist/assets/init-Gi6I4Gst.js +1 -0
  107. package/dist/assets/ishikawaDiagram-PHBUUO56-CsU6XimZ.js +70 -0
  108. package/dist/assets/journeyDiagram-4ABVD52K-CQ7ibNib.js +139 -0
  109. package/dist/assets/kanban-definition-K7BYSVSG-DzEN7THt.js +89 -0
  110. package/dist/assets/katex-B1X10hvy.js +261 -0
  111. package/dist/assets/layout-C0dvb42R.js +1 -0
  112. package/dist/assets/linear-j4a8mGj7.js +1 -0
  113. package/dist/assets/mindmap-definition-YRQLILUH-DP8iEuCf.js +68 -0
  114. package/dist/assets/ordinal-Cboi1Yqb.js +1 -0
  115. package/dist/assets/pieDiagram-SKSYHLDU-BpIAXgAm.js +30 -0
  116. package/dist/assets/quadrantDiagram-337W2JSQ-DrpXn5Eg.js +7 -0
  117. package/dist/assets/requirementDiagram-Z7DCOOCP-Bg7EwHlG.js +73 -0
  118. package/dist/assets/sankeyDiagram-WA2Y5GQK-BWagRs1F.js +10 -0
  119. package/dist/assets/sequenceDiagram-2WXFIKYE-q5jwhivG.js +145 -0
  120. package/dist/assets/stateDiagram-RAJIS63D-B_J9pE-2.js +1 -0
  121. package/dist/assets/stateDiagram-v2-FVOUBMTO-Q_1GcybB.js +1 -0
  122. package/dist/assets/timeline-definition-YZTLITO2-dv0jgQ0z.js +61 -0
  123. package/dist/assets/treemap-KZPCXAKY-Dt1dkIE7.js +162 -0
  124. package/dist/assets/vennDiagram-LZ73GAT5-BdO5RgRZ.js +34 -0
  125. package/dist/assets/xychartDiagram-JWTSCODW-CpDVe-8v.js +7 -0
  126. package/dist/index.html +23 -0
  127. package/export-docx.js +1583 -0
  128. package/init.js +353 -0
  129. package/manifest-cli.mjs +207 -0
  130. package/package.json +83 -0
  131. package/schemas/artifact.schema.json +1295 -0
  132. package/schemas/artifact.uischema.json +65 -0
  133. package/schemas/evaluation.schema.json +284 -0
  134. package/schemas/rubric.schema.json +383 -0
  135. package/serve.js +238 -0
@@ -0,0 +1,2056 @@
1
+ kind: artifact
2
+ artifact_type: reference_architecture
3
+
4
+ metadata:
5
+ title: Event-Driven Order Processing Platform on AWS
6
+ version: 1.0.0
7
+ status: approved
8
+ author: Thomas Rohde
9
+ owner: Enterprise Architecture, E-Commerce Platform Domain
10
+ effective_date: "2026-03-20"
11
+ next_review_date: "2026-09-20"
12
+ last_updated: "2026-03-20"
13
+ purpose: >
14
+ This reference architecture defines the target pattern for all event-driven
15
+ microservices implementing order processing on AWS. It supports Architecture
16
+ Board approval as the mandatory golden path for new e-commerce order services
17
+ and serves as the calibration benchmark for EAROS reference architecture
18
+ assessments.
19
+ decision_context: >
20
+ Architecture Board review Q1 2026. The e-commerce platform is migrating from
21
+ a monolithic order management system to event-driven microservices. This
22
+ reference architecture governs all new services built as part of that
23
+ migration and all future order-processing workloads on AWS.
24
+ stakeholders:
25
+ - role: Executive Sponsor
26
+ name: Chief Technology Officer
27
+ concerns: >
28
+ Strategic alignment, cost profile, risk posture, compliance with PCI-DSS,
29
+ and ability to scale to 10× current order volume without re-platforming.
30
+ - role: Platform Architect
31
+ name: Enterprise Architecture
32
+ concerns: >
33
+ Architectural soundness, pattern reusability, alignment with cloud
34
+ strategy, ADR rationale, and prescriptiveness classification.
35
+ - role: Domain Architect
36
+ name: E-Commerce Domain Architecture
37
+ concerns: >
38
+ Service decomposition, data model, integration patterns with upstream
39
+ and downstream systems, and bounded context alignment.
40
+ - role: Development Team Lead
41
+ name: Order Platform Engineering
42
+ concerns: >
43
+ Implementation guidance, IaC templates, API specs, getting-started
44
+ guide, and time to first deployment.
45
+ - role: Site Reliability Engineer
46
+ name: Platform SRE
47
+ concerns: >
48
+ SLOs, observability, scaling policies, DR procedures, runbooks, and
49
+ on-call playbooks.
50
+ - role: Security Architect
51
+ name: Information Security
52
+ concerns: >
53
+ PCI-DSS compliance, authentication, authorisation, encryption, WAF
54
+ configuration, and IAM least-privilege posture.
55
+ - role: Compliance Officer
56
+ name: Risk and Compliance
57
+ concerns: >
58
+ PCI-DSS Level 1 compliance mapping, audit trail completeness, data
59
+ residency, and exception log.
60
+ change_log:
61
+ - version: "1.0.0"
62
+ date: "2026-03-20"
63
+ author: Thomas Rohde
64
+ changes:
65
+ - Initial approved version — gold-standard EAROS calibration artifact
66
+ - All 5 architecture views complete (context, functional, deployment, data flow, security)
67
+ - 5 full ADRs with alternatives, trade-offs, and revisit conditions
68
+ - Complete operational model with SLOs, dashboards, scaling, and DR
69
+ - PCI-DSS compliance mapping covering all 12 requirement areas
70
+ - 6-step getting-started guide tested against sandbox environment
71
+
72
+ sections:
73
+ reading_guide:
74
+ how_to_use: >
75
+ This document is structured so each audience can navigate directly to
76
+ their primary concerns. The section map below identifies the most relevant
77
+ sections for each stakeholder role. Cross-references between sections are
78
+ explicit — where a decision in one section depends on content from another,
79
+ a reference is provided.
80
+ section_map:
81
+ - section: Business Context
82
+ audience: CTO, Executive Sponsor
83
+ concern: Strategic alignment, business drivers, and use-case coverage
84
+ - section: Architecture Views — Context
85
+ audience: CTO, Domain Architect
86
+ concern: System boundary, external actors, and integration landscape
87
+ - section: Architecture Views — Functional
88
+ audience: Domain Architect, Development Team Lead
89
+ concern: Service decomposition, component responsibilities, and interfaces
90
+ - section: Architecture Views — Data Flow
91
+ audience: Domain Architect, Development Team Lead, Data Governance
92
+ concern: Runtime order-processing flow and event choreography
93
+ - section: Architecture Views — Deployment
94
+ audience: Platform SRE, Security Architect
95
+ concern: Infrastructure topology, multi-AZ resilience, and network design
96
+ - section: Architecture Views — Security
97
+ audience: Security Architect, Compliance Officer
98
+ concern: Trust boundaries, authentication, authorisation, and encryption
99
+ - section: Architecture Decisions (ADRs)
100
+ audience: Platform Architect, Domain Architect
101
+ concern: Decision rationale, alternatives considered, and trade-offs accepted
102
+ - section: Component Classification
103
+ audience: Development Team Lead
104
+ concern: Mandatory vs optional components and approved extension points
105
+ - section: Quality Attributes
106
+ audience: Platform SRE, Domain Architect
107
+ concern: Measurable SLO targets and validation strategies
108
+ - section: Operational Model
109
+ audience: Platform SRE
110
+ concern: Monitoring, alerting, scaling, and disaster recovery
111
+ - section: Implementation Artifacts
112
+ audience: Development Team Lead
113
+ concern: IaC templates, API specs, CI/CD pipeline, and scaffold template
114
+ - section: Getting Started
115
+ audience: Development Team Lead
116
+ concern: Step-by-step guide to first deployment
117
+ - section: RAID Log
118
+ audience: Platform Architect, Risk and Compliance
119
+ concern: Risks, design assumptions, constraints, and trade-offs
120
+ - section: Governance and Compliance
121
+ audience: Compliance Officer, Security Architect
122
+ concern: PCI-DSS control mapping and exception register
123
+ - section: Decisions and Actions
124
+ audience: Architecture Board, CTO
125
+ concern: Governance outcome, approval conditions, and next actions
126
+
127
+ scope:
128
+ statement: >
129
+ This reference architecture covers the event-driven order processing
130
+ platform on AWS: the full lifecycle from order submission through payment
131
+ authorisation, fulfilment dispatch, and customer notification. It defines
132
+ mandatory patterns for all new order-processing microservices in the
133
+ e-commerce domain.
134
+ in_scope:
135
+ - Order submission API (HTTP/REST via API Gateway)
136
+ - Order Service — create, validate, and persist orders
137
+ - Payment Service — authorise payments via external payment gateway
138
+ - Fulfilment Service — dispatch orders to warehouse management system
139
+ - Notification Service — send order status updates via email/SMS
140
+ - Event Bus (AWS EventBridge) — async inter-service choreography
141
+ - Work queues (AWS SQS) — reliable message delivery with DLQ
142
+ - Order data store (AWS DynamoDB) — primary persistence
143
+ - Identity and access (AWS Cognito) — customer and API authentication
144
+ - Observability stack (CloudWatch, X-Ray) — metrics, logs, and tracing
145
+ - Infrastructure-as-code (AWS CDK) — deployment automation
146
+ - CI/CD pipeline (AWS CodePipeline + CodeBuild) — delivery automation
147
+ out_of_scope:
148
+ - Customer mobile and web frontend — covered by separate UI reference architecture
149
+ - Payment gateway internals — third-party service; integration contract only
150
+ - Warehouse management system — external system; integration via adapter only
151
+ - Email and SMS providers — external services; notification adapter only
152
+ - Product catalogue and inventory — covered by separate catalogue reference architecture
153
+ - Analytics and reporting pipeline — covered by data platform reference architecture
154
+ - Corporate IAM and SSO — platform service; consumed, not owned
155
+ - Fraud detection scoring — consumed as a platform service via API
156
+ boundary_definition: >
157
+ The system boundary is defined at the API Gateway ingress (customer-facing)
158
+ and at the integration adapters for external systems (payment gateway,
159
+ warehouse management system, notification providers). All components inside
160
+ the boundary are owned and operated by the Order Platform Engineering team.
161
+ The C4 context diagram (Architecture Views — Context) is the authoritative
162
+ boundary representation.
163
+ assumptions:
164
+ - assumption: AWS is the mandated cloud provider for all new e-commerce workloads
165
+ consequence_if_violated: >
166
+ The entire IaC stack, managed services selection, and IAM model would
167
+ need to be replaced. Estimated re-platform effort: 6–9 months.
168
+ - assumption: >
169
+ The external payment gateway provides a REST API with SLA >= 99.9%
170
+ availability and supports idempotent payment authorisation
171
+ consequence_if_violated: >
172
+ Payment Service retry logic and DLQ strategy may be insufficient.
173
+ A synchronous fallback or alternative payment gateway would be required.
174
+ - assumption: Order volume is 100–1000 orders/minute sustained, peak 5000/minute
175
+ consequence_if_violated: >
176
+ Lambda concurrency limits and DynamoDB capacity units may need
177
+ re-provisioning. The architecture is elastic, but the cost model changes.
178
+ - assumption: PCI-DSS scope is limited to payment authorisation flow (no card data stored)
179
+ consequence_if_violated: >
180
+ Significantly broader PCI-DSS controls would apply, requiring dedicated
181
+ cardholder data environment (CDE) separation and additional audit scope.
182
+ - assumption: AWS CDK v2 is the approved IaC tool for this domain
183
+ consequence_if_violated: >
184
+ IaC templates would need to be rewritten in Terraform or CloudFormation.
185
+
186
+ drivers_and_principles:
187
+ drivers:
188
+ - id: DR-01
189
+ description: >
190
+ Scale order processing capacity 10× from 100 to 1000 sustained orders/minute
191
+ without architectural re-platforming, to support projected e-commerce growth
192
+ over the next 3 years.
193
+ architecture_response: >
194
+ Serverless compute (Lambda) and fully managed data services (DynamoDB,
195
+ SQS, EventBridge) provide elastic horizontal scaling. Scaling policies
196
+ (see Operational Model) target 1000 orders/minute sustained with 5000/minute
197
+ burst. ADR-003 documents the Lambda-vs-ECS trade-off for this requirement.
198
+ - id: DR-02
199
+ description: >
200
+ Achieve PCI-DSS Level 1 compliance for the payment authorisation flow
201
+ to satisfy card network requirements and enable enterprise payment volumes.
202
+ architecture_response: >
203
+ PCI-DSS compliance is achieved through: Cognito + JWT for customer authentication
204
+ (no credentials in order data), API Gateway WAF with OWASP ruleset, KMS
205
+ encryption at rest and TLS 1.2+ in transit, VPC isolation for Lambda
206
+ functions, CloudTrail audit logging, and the full compliance mapping in the
207
+ Governance section. No card data is stored (tokenisation via payment gateway).
208
+ - id: DR-03
209
+ description: >
210
+ Reduce order processing latency: P99 order submission < 500ms and
210
+ P99 order status query < 200ms, to support a real-time customer experience.
212
+ architecture_response: >
213
+ Synchronous API Gateway → Lambda path for order submission targets
214
+ < 500ms P99. DynamoDB single-table design (ADR-005) with access patterns
215
+ optimised for order-by-id and customer-order-list queries targets < 200ms
216
+ P99. Lambda provisioned concurrency eliminates cold-start latency for
217
+ the submission path. The Quality Attributes section defines the fitness functions.
218
+ - id: DR-04
219
+ description: >
220
+ Enable independent deployment of services (order, payment, fulfilment,
221
+ notification) so teams can release without coordinating across service
222
+ boundaries.
223
+ architecture_response: >
224
+ Event-driven choreography via EventBridge (ADR-001) decouples services
225
+ at runtime. Each service owns its schema and deployment pipeline (see CI/CD
226
+ templates). Schema evolution uses EventBridge schema registry with
227
+ compatibility checks in CI. Services do not share databases.
228
+ - id: DR-05
229
+ description: >
230
+ Provide complete, immutable audit trail of all order state transitions
231
+ for compliance, customer support, and fraud investigation.
232
+ architecture_response: >
233
+ Every state transition publishes an event to EventBridge with correlation
234
+ ID, timestamp, actor, and previous/new state. Events are persisted to
235
+ S3 via Kinesis Firehose for 7-year retention. CloudTrail captures all
236
+ API-level actions. Correlation IDs thread through all service logs.
237
+ principles:
238
+ - id: PRIN-01
239
+ name: Event-driven by default
240
+ how_applied: >
241
+ All inter-service communication uses EventBridge async events, not
242
+ synchronous REST calls. Synchronous REST is used only for the customer-
243
+ facing submission API and for external system adapters. See ADR-001.
244
+ - id: PRIN-02
245
+ name: API-first
246
+ how_applied: >
247
+ All service interfaces are defined as OpenAPI 3.1 specifications before
248
+ implementation. API specs are the contract; code is the implementation.
249
+ See Implementation Artifacts for spec locations.
250
+ - id: PRIN-03
251
+ name: Zero-trust networking
252
+ how_applied: >
253
+ All service-to-service calls require IAM role authentication.
254
+ No service has a blanket "allow all within VPC" rule. Each Lambda
255
+ function has a least-privilege IAM role permitting access only to the
255
+ specific resources it needs. See Security View and Governance.
257
+ - id: PRIN-04
258
+ name: Infrastructure-as-code only
259
+ how_applied: >
260
+ All AWS resources are provisioned via CDK stacks. No manual console
261
+ changes. CloudFormation drift detection runs nightly. See ADR-003
262
+ and Implementation Artifacts.
263
+ - id: PRIN-05
264
+ name: Observability-first
265
+ how_applied: >
266
+ All Lambda functions emit structured JSON logs with correlation IDs.
267
+ X-Ray tracing is active on all functions. CloudWatch dashboards are
268
+ provisioned by CDK alongside the service, not added after deployment.
269
+
270
+ architecture_views:
271
+ context:
272
+ description: >
273
+ The Event-Driven Order Processing Platform sits at the centre of the
274
+ e-commerce order lifecycle. External actors interact at the boundary:
275
+ customers submit orders and query status via API Gateway; the Payment
276
+ Gateway (Stripe) authorises charges; the Warehouse Management System
277
+ receives fulfilment instructions; and the Notification Provider (Amazon
278
+ SES/SNS) delivers email and SMS. The corporate IAM platform issues
279
+ tokens used by internal operator tooling. The Analytics Platform
280
+ consumes order events from S3 for reporting.
281
+
282
+ Key boundary characteristics:
283
+ - Customers interact only through the API Gateway (HTTPS); no direct
284
+ service access.
285
+ - Payment Gateway integration is outbound only; no inbound webhooks
286
+ (polling model for authorisation status).
287
+ - Warehouse Management System integration uses an SQS queue (decoupled;
288
+ WMS pulls at its own rate).
289
+ - All events are published to EventBridge for cross-domain consumption
290
+ by the Analytics Platform.
291
+ diagram_source: |
292
+ flowchart LR
293
+ classDef actor fill:#f8fafc,stroke:#334155,stroke-width:1.4px,color:#0f172a;
294
+ classDef external fill:#fff7ed,stroke:#c2410c,stroke-width:1.4px,color:#7c2d12;
295
+ classDef internal fill:#ecfeff,stroke:#0f766e,stroke-width:1.4px,color:#134e4a;
296
+
297
+ Customer@{ shape: stadium, label: "Customer" }
298
+ Operator@{ shape: rounded, label: "Internal Operator" }
299
+ PayGW@{ shape: cloud, label: "Payment Gateway\n(Stripe)" }
300
+ WMS@{ shape: cloud, label: "Warehouse Management\nSystem" }
301
+ Analytics@{ shape: doc, label: "Analytics Platform" }
302
+
303
+ APIGW@{ img: "/icons/aws/api-gateway.svg", label: "API Gateway", pos: "b", w: 56, h: 56, constraint: "on" }
304
+ Cognito@{ img: "/icons/aws/cognito.svg", label: "Corporate IAM\nvia Cognito", pos: "b", w: 56, h: 56, constraint: "on" }
305
+ Platform@{ img: "/icons/aws/aws-cloud.svg", label: "Event-Driven Order\nProcessing Platform", pos: "b", w: 72, h: 72, constraint: "on" }
306
+ EventBus@{ img: "/icons/aws/eventbridge.svg", label: "EventBridge", pos: "b", w: 56, h: 56, constraint: "on" }
307
+ SES@{ img: "/icons/aws/ses.svg", label: "Amazon SES", pos: "b", w: 56, h: 56, constraint: "on" }
308
+ SNS@{ img: "/icons/aws/sns.svg", label: "Amazon SNS", pos: "b", w: 56, h: 56, constraint: "on" }
309
+
310
+ Customer -->|Submit orders| APIGW
311
+ Operator -->|Query status| APIGW
312
+ Cognito -->|JWT tokens| APIGW
313
+ APIGW --> Platform
314
+ Platform -->|REST HTTPS| PayGW
315
+ Platform -->|Fulfilment requests| WMS
316
+ Platform --> SES
317
+ Platform --> SNS
318
+ Platform --> EventBus
319
+ EventBus -->|Order events| Analytics
320
+
321
+ class Customer,Operator actor
322
+ class PayGW,WMS external
323
+ class Analytics internal
324
+ source_type: mermaid
325
+
326
+ functional:
327
+ description: >
328
+ The platform decomposes into six primary containers plus the shared
329
+ event bus. Each container is a separately deployable Lambda function
330
+ group with its own IAM role, CloudWatch log group, and CDK stack.
331
+ Containers communicate exclusively through EventBridge events (async)
332
+ or SQS queues (work distribution). No container calls another container
333
+ synchronously.
334
+
335
+ Container responsibilities:
336
+ - API Gateway: TLS termination, Cognito JWT authorisation, request
337
+ routing, WAF enforcement, throttling (1000 req/s burst, 100 req/s
338
+ sustained per stage).
339
+ - Order Service: Validates order data, persists to DynamoDB, publishes
340
+ OrderCreated event. Exposes REST endpoints for order submission and
341
+ status query.
342
+ - Payment Service: Subscribes to OrderCreated, calls Stripe payment API,
343
+ publishes PaymentAuthorised or PaymentFailed event. Idempotent via
344
+ Stripe idempotency keys.
345
+ - Fulfilment Service: Subscribes to PaymentAuthorised, places fulfilment
346
+ message on SQS queue consumed by WMS adapter, publishes FulfilmentDispatched.
347
+ - Notification Service: Subscribes to OrderCreated, PaymentAuthorised,
348
+ PaymentFailed, FulfilmentDispatched. Sends customer-facing status
349
+ updates via SES (email) and SNS (SMS).
350
+ - Event Bus (EventBridge): Central event backbone. All inter-service
351
+ events pass through EventBridge with schema validation enforced.
352
+ - Event Archive (S3 + Firehose): All EventBridge events are forwarded to
353
+ Kinesis Firehose → S3 for audit retention (7 years) and analytics.
354
+ diagram_source: |
355
+ flowchart LR
356
+ classDef external fill:#fff7ed,stroke:#c2410c,stroke-width:1.4px,color:#7c2d12;
357
+
358
+ WAF@{ img: "/icons/aws/waf.svg", label: "AWS WAF", pos: "b", w: 52, h: 52, constraint: "on" }
359
+ APIGW@{ img: "/icons/aws/api-gateway.svg", label: "API Gateway", pos: "b", w: 52, h: 52, constraint: "on" }
360
+ Cognito@{ img: "/icons/aws/cognito.svg", label: "Cognito\nAuthorizer", pos: "b", w: 52, h: 52, constraint: "on" }
361
+ OrderSvc@{ img: "/icons/aws/lambda.svg", label: "Order Service\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
362
+ PaySvc@{ img: "/icons/aws/lambda.svg", label: "Payment Service\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
363
+ FulfilSvc@{ img: "/icons/aws/lambda.svg", label: "Fulfilment Service\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
364
+ NotifySvc@{ img: "/icons/aws/lambda.svg", label: "Notification Service\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
365
+ WMSAdapter@{ img: "/icons/aws/lambda.svg", label: "WMS Adapter\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
366
+ DDB@{ img: "/icons/aws/dynamodb.svg", label: "Orders Table\nDynamoDB", pos: "b", w: 52, h: 52, constraint: "on" }
367
+ EB@{ img: "/icons/aws/eventbridge.svg", label: "Order Event Bus", pos: "b", w: 52, h: 52, constraint: "on" }
368
+ SQS@{ img: "/icons/aws/sqs.svg", label: "Fulfilment Queue\n+ DLQ", pos: "b", w: 52, h: 52, constraint: "on" }
369
+ SES@{ img: "/icons/aws/ses.svg", label: "Amazon SES", pos: "b", w: 52, h: 52, constraint: "on" }
370
+ SNS@{ img: "/icons/aws/sns.svg", label: "Amazon SNS", pos: "b", w: 52, h: 52, constraint: "on" }
371
+ Firehose@{ img: "/icons/aws/data-firehose.svg", label: "Kinesis Data Firehose", pos: "b", w: 52, h: 52, constraint: "on" }
372
+ S3@{ img: "/icons/aws/s3.svg", label: "S3 Event Archive\n(7-year retention)", pos: "b", w: 52, h: 52, constraint: "on" }
373
+ Stripe@{ shape: cloud, label: "Stripe API" }
374
+ WMS@{ shape: cloud, label: "Warehouse Management\nSystem" }
375
+
376
+ WAF --> APIGW
377
+ Cognito --> APIGW
378
+ APIGW --> OrderSvc
379
+ OrderSvc --> DDB
380
+ OrderSvc --> EB
381
+ EB --> PaySvc
382
+ EB --> FulfilSvc
383
+ EB --> NotifySvc
384
+ PaySvc --> Stripe
385
+ PaySvc --> DDB
386
+ PaySvc --> EB
387
+ FulfilSvc --> DDB
388
+ FulfilSvc --> SQS
389
+ FulfilSvc --> EB
390
+ SQS --> WMSAdapter
391
+ WMSAdapter --> WMS
392
+ NotifySvc --> SES
393
+ NotifySvc --> SNS
394
+ EB --> Firehose
395
+ Firehose --> S3
396
+
397
+ class Stripe,WMS external
398
+ source_type: mermaid
399
+
400
+ deployment:
401
+ description: >
402
+ All active components are deployed in a single primary AWS region (eu-west-1) with
403
+ multi-AZ resilience. Lambda functions execute across three Availability
404
+ Zones (AZs) by default. DynamoDB is a regional service with automatic
405
+ multi-AZ replication. SQS and EventBridge are regional services with
406
+ 99.99% availability SLA.
407
+
408
+ Network topology:
409
+ - VPC with three private subnets (one per AZ) for Lambda functions
410
+ requiring VPC attachment (Payment Service and WMS Adapter).
411
+ - API Gateway is a managed regional endpoint outside the VPC; WAF
412
+ is attached at the CloudFront distribution level.
413
+ - DynamoDB, EventBridge, SQS, S3, SES, and SNS are accessed via
414
+ VPC endpoints (PrivateLink) where available; no public internet
415
+ egress for data traffic.
416
+ - NAT Gateway per AZ for Lambda functions requiring internet egress
417
+ (Stripe API calls from Payment Service).
418
+
419
+ Resilience model:
420
+ - Lambda: executes in all AZs; automatic AZ failover; no single-AZ
421
+ dependency.
422
+ - DynamoDB: multi-AZ; automatic failover; point-in-time recovery
423
+ (PITR) enabled; 35-day backup window.
424
+ - SQS: multi-AZ; messages durable across AZ failures.
425
+ - EventBridge: regional service; 99.99% SLA.
426
+
427
+ Disaster recovery (active-passive multi-region):
428
+ - Primary region: eu-west-1.
429
+ - DR region: eu-central-1 (warm standby).
430
+ - DynamoDB Global Tables replicate in near-real-time.
431
+ - Route 53 health checks; failover routing policy activates DR region
432
+ within RTO target (4 hours).
433
+ diagram_source: |
434
+ flowchart TB
435
+ Route53@{ img: "/icons/aws/route53.svg", label: "Route 53\nHealth checks + failover", pos: "b", w: 56, h: 56, constraint: "on" }
436
+
437
+ subgraph Primary["eu-west-1 (Primary)"]
438
+ direction TB
439
+ CF_Primary@{ img: "/icons/aws/cloudfront.svg", label: "CloudFront", pos: "b", w: 52, h: 52, constraint: "on" }
440
+ WAF_Primary@{ img: "/icons/aws/waf.svg", label: "AWS WAF", pos: "b", w: 52, h: 52, constraint: "on" }
441
+ APIGW_Primary@{ img: "/icons/aws/api-gateway.svg", label: "Regional API Gateway", pos: "b", w: 52, h: 52, constraint: "on" }
442
+
443
+ subgraph VPC_Primary["VPC"]
444
+ direction LR
445
+
446
+ subgraph AZ_A["eu-west-1a"]
447
+ direction TB
448
+ SubnetA@{ img: "/icons/aws/private-subnet.svg", label: "Private Subnet A\n10.0.1.0/24", pos: "b", w: 48, h: 48, constraint: "on" }
449
+ LambdaA@{ img: "/icons/aws/lambda.svg", label: "Lambda workload", pos: "b", w: 48, h: 48, constraint: "on" }
450
+ NATA@{ img: "/icons/aws/nat-gateway.svg", label: "NAT Gateway A", pos: "b", w: 48, h: 48, constraint: "on" }
451
+ SubnetA --- LambdaA
452
+ LambdaA -. egress .-> NATA
453
+ end
454
+
455
+ subgraph AZ_B["eu-west-1b"]
456
+ direction TB
457
+ SubnetB@{ img: "/icons/aws/private-subnet.svg", label: "Private Subnet B\n10.0.2.0/24", pos: "b", w: 48, h: 48, constraint: "on" }
458
+ LambdaB@{ img: "/icons/aws/lambda.svg", label: "Lambda workload", pos: "b", w: 48, h: 48, constraint: "on" }
459
+ NATB@{ img: "/icons/aws/nat-gateway.svg", label: "NAT Gateway B", pos: "b", w: 48, h: 48, constraint: "on" }
460
+ SubnetB --- LambdaB
461
+ LambdaB -. egress .-> NATB
462
+ end
463
+
464
+ subgraph AZ_C["eu-west-1c"]
465
+ direction TB
466
+ SubnetC@{ img: "/icons/aws/private-subnet.svg", label: "Private Subnet C\n10.0.3.0/24", pos: "b", w: 48, h: 48, constraint: "on" }
467
+ LambdaC@{ img: "/icons/aws/lambda.svg", label: "Lambda workload", pos: "b", w: 48, h: 48, constraint: "on" }
468
+ NATC@{ img: "/icons/aws/nat-gateway.svg", label: "NAT Gateway C", pos: "b", w: 48, h: 48, constraint: "on" }
469
+ SubnetC --- LambdaC
470
+ LambdaC -. egress .-> NATC
471
+ end
472
+ end
473
+
474
+ DDB_Primary@{ img: "/icons/aws/dynamodb.svg", label: "DynamoDB Global Table\nPrimary", pos: "b", w: 52, h: 52, constraint: "on" }
475
+ end
476
+
477
+ subgraph DR["eu-central-1 (Warm standby)"]
478
+ direction TB
479
+ CF_DR@{ img: "/icons/aws/cloudfront.svg", label: "CloudFront", pos: "b", w: 52, h: 52, constraint: "on" }
480
+ WAF_DR@{ img: "/icons/aws/waf.svg", label: "AWS WAF", pos: "b", w: 52, h: 52, constraint: "on" }
481
+ APIGW_DR@{ img: "/icons/aws/api-gateway.svg", label: "Regional API Gateway", pos: "b", w: 52, h: 52, constraint: "on" }
482
+ Lambda_DR@{ img: "/icons/aws/lambda.svg", label: "Lambda warm standby", pos: "b", w: 52, h: 52, constraint: "on" }
483
+ DDB_DR@{ img: "/icons/aws/dynamodb.svg", label: "DynamoDB Global Table\nReplica", pos: "b", w: 52, h: 52, constraint: "on" }
484
+ end
485
+
486
+ Route53 --> CF_Primary
487
+ Route53 -. failover .-> CF_DR
488
+ CF_Primary --> WAF_Primary --> APIGW_Primary
489
+ CF_DR --> WAF_DR --> APIGW_DR
490
+ APIGW_Primary --> LambdaA
491
+ APIGW_Primary --> LambdaB
492
+ APIGW_Primary --> LambdaC
493
+ APIGW_DR --> Lambda_DR
494
+ LambdaA -. private AWS SDK .-> DDB_Primary
495
+ LambdaB -. private AWS SDK .-> DDB_Primary
496
+ LambdaC -. private AWS SDK .-> DDB_Primary
497
+ Lambda_DR -. warm standby access .-> DDB_DR
498
+ DDB_Primary <-->|Global Tables\nreplication| DDB_DR
499
+ source_type: mermaid
500
+
501
+ data_flow:
502
+ description: >
503
+ The primary data flow is the order processing lifecycle: a customer
504
+ submits an order and receives a confirmation. The flow is predominantly
505
+ asynchronous after order persistence, with the customer receiving a
506
+ 202 Accepted response and subsequent status updates via notification.
507
+
508
+ Correlation ID (UUID v4) is generated at step 1 and threaded through
509
+ every subsequent step as an HTTP header, EventBridge detail field, SQS
510
+ message attribute, and DynamoDB attribute, enabling end-to-end tracing
511
+ via AWS X-Ray.
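A minimal sketch of how one correlation ID might be carried in the four places named above. The field names (`x-correlation-id`, `correlationId`) and the helper `build_carriers` are illustrative assumptions, not taken from the artifact; only the SQS message-attribute shape and the DynamoDB attribute-value shape follow the AWS wire formats.

```python
import uuid


def new_correlation_id() -> str:
    """Generate a UUID v4 correlation ID, as described for step 1."""
    return str(uuid.uuid4())


def build_carriers(correlation_id: str) -> dict:
    """Return the same ID shaped for each transport (hypothetical field names)."""
    return {
        # HTTP header on the request context
        "http_headers": {"x-correlation-id": correlation_id},
        # EventBridge: the ID travels inside the event `detail` payload
        "eventbridge_detail": {"correlationId": correlation_id},
        # SQS message attribute, in the shape the SQS API expects
        "sqs_message_attributes": {
            "correlationId": {"DataType": "String", "StringValue": correlation_id}
        },
        # DynamoDB item attribute in low-level attribute-value form
        "dynamodb_item": {"correlationId": {"S": correlation_id}},
    }


cid = new_correlation_id()
carriers = build_carriers(cid)
```

Generating the ID once and deriving every carrier from it keeps the value identical across hops, which is what makes a single trace query possible.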
512
+ narrative_steps:
513
+ - step: 1
514
+ description: >
515
+ Customer submits POST /orders via HTTPS to the API Gateway's CloudFront
516
+ distribution. WAF evaluates the request against OWASP Core Rule Set.
517
+ Cognito JWT authoriser validates the Bearer token. Throttle check
518
+ (100 req/s sustained per customer). A correlation ID is generated
519
+ and injected into the request context.
520
+ - step: 2
521
+ description: >
522
+ API Gateway invokes Order Service Lambda (synchronous). Request
523
+ payload is validated against OpenAPI schema (request validator
524
+ enabled). Lambda reads Cognito claims to extract customer ID.
525
+ - step: 3
526
+ description: >
527
+ Order Service performs business validation: checks item quantities
528
+ are positive, delivery address is valid, and order total is non-zero.
529
+ If validation fails, returns HTTP 422 with structured error response.
530
+ - step: 4
531
+ description: >
532
+ Order Service writes the order record to DynamoDB with status PENDING.
533
+ Partition key: ORDER#<orderId>. Sort key: METADATA. Includes
534
+ correlation ID, customer ID, items, totals, and timestamps.
535
+ Uses conditional write to ensure idempotency.
536
+ - step: 5
537
+ description: >
538
+ Order Service publishes OrderCreated event to EventBridge on the
539
+ order-processing bus. Event detail includes orderId, customerId,
540
+ totalAmount, correlationId, and timestamp. Schema is registered
541
+ in EventBridge Schema Registry and validated before publish.
542
+ - step: 6
543
+ description: >
544
+ Order Service returns HTTP 202 Accepted to API Gateway with
545
+ orderId and a polling URL (/orders/{orderId}/status). Customer
546
+ can poll this endpoint or await email/SMS notification.
547
+ - step: 7
548
+ description: >
549
+ EventBridge routes OrderCreated event to Payment Service Lambda
550
+ (event rule: source=order-service, detail-type=OrderCreated).
551
+ Payment Service Lambda extracts orderId and totalAmount.
552
+ - step: 8
553
+ description: >
554
+ Payment Service calls the Stripe PaymentIntents API (POST /v1/payment_intents)
555
+ with idempotency key = correlationId to prevent double-charge.
556
+ Stripe responds synchronously within ~2s. If Stripe returns an error,
557
+ Payment Service publishes PaymentFailed event and retries with
558
+ exponential backoff (max 3 attempts).
559
+ - step: 9
560
+ description: >
561
+ On Stripe success, Payment Service updates DynamoDB order record
562
+ (status → PAYMENT_AUTHORISED, stripeChargeId added). Publishes
563
+ PaymentAuthorised event to EventBridge. EventBridge routes the
564
+ event simultaneously to Fulfilment Service and Notification Service.
565
+ - step: 10
566
+ description: >
567
+ Fulfilment Service receives PaymentAuthorised event. Writes fulfilment
568
+ instruction message to SQS Fulfilment Queue (includes orderId, items,
569
+ delivery address). Updates DynamoDB order status to FULFILMENT_QUEUED.
570
+ Publishes FulfilmentDispatched event to EventBridge.
571
+ - step: 11
572
+ description: >
573
+ WMS Adapter Lambda polls SQS Fulfilment Queue (long polling, 20s).
574
+ Translates order instruction to WMS API format and calls WMS REST
575
+ API. On success, deletes SQS message. On WMS failure, message
576
+ returns to queue; after 3 attempts, routes to Dead Letter Queue.
577
+ A CloudWatch alarm alerts SRE when DLQ depth > 0.
578
+ - step: 12
579
+ description: >
580
+ Notification Service receives OrderCreated, PaymentAuthorised, and
581
+ FulfilmentDispatched events (separate EventBridge rules). For each,
582
+ constructs customer-facing message, sends email via SES and SMS via
583
+ SNS. Uses DynamoDB customer preferences table to determine channel.
584
+ All notifications carry correlationId for support tracing.
585
+
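The retry behaviour in step 8 can be sketched as follows. This is an illustration under assumptions, not the service's code: `charge_with_retry`, `PaymentError`, the 0.5s base delay, and the choice to surface the failure only after the last attempt are all hypothetical. The one point taken from the text is that the same idempotency key is reused on every attempt so a retry cannot double-charge.

```python
import time
from typing import Callable, List


class PaymentError(Exception):
    """Raised by the (hypothetical) payment call on failure."""


def charge_with_retry(
    call: Callable[[str], str],
    idempotency_key: str,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `call` up to `max_attempts` times, doubling the delay each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            # The key never changes between attempts.
            return call(idempotency_key)
        except PaymentError:
            if attempt == max_attempts:
                raise  # a caller could publish PaymentFailed at this point
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Demo with a fake payment call that fails twice, then succeeds.
seen_keys: List[str] = []
delays: List[float] = []


def flaky_call(key: str) -> str:
    seen_keys.append(key)
    if len(seen_keys) < 3:
        raise PaymentError("transient failure")
    return "pi_example"  # placeholder identifier


result = charge_with_retry(flaky_call, "corr-1", sleep=delays.append)
```

Injecting `sleep` keeps the demo instant and makes the backoff schedule observable.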
586
+ security:
587
+ description: >
588
+ The security architecture implements defence-in-depth across five
589
+ concentric trust zones. PCI-DSS controls are mapped to specific
590
+ architecture elements — see Governance section for full mapping.
591
+
592
+ Trust Zone 1 — Public (untrusted): CloudFront + WAF. All inbound
593
+ traffic from the internet enters here. WAF enforces OWASP Core Rule
594
+ Set v3.2, rate limiting (100 req/IP/s), and geo-blocking as required.
595
+ No backend service is reachable without passing WAF.
596
+
597
+ Trust Zone 2 — API perimeter: API Gateway with Cognito JWT authoriser.
598
+ Every request must carry a valid JWT issued by Cognito User Pool.
599
+ Cognito enforces MFA for operator access. Token expiry: 1 hour (access),
600
+ 24 hours (refresh). API Gateway request validators enforce schema
601
+ before Lambda invocation.
602
+
603
+ Trust Zone 3 — Service mesh (VPC private subnets): Lambda functions
604
+ executing in VPC. No inbound internet access. Inter-service communication
605
+ uses EventBridge (regional, no VPC required) and SQS (VPC endpoint).
606
+ Lambda IAM roles are least-privilege: each function has an IAM role
607
+ permitting only its specific DynamoDB actions, EventBridge PutEvents
608
+ to its source bus, and SQS actions on its specific queue.
609
+
610
+ Trust Zone 4 — Data layer: DynamoDB, SQS, EventBridge, S3. All
611
+ resources are encrypted at rest with AWS KMS customer-managed keys
612
+ (one CMK per service, key rotation enabled). All S3 buckets have
613
+ public access block enabled and versioning on. DynamoDB encryption
614
+ uses KMS CMK; Point-in-Time Recovery enabled.
615
+
616
+ Trust Zone 5 — Audit: CloudTrail (management and data events), S3
617
+ event archive, VPC Flow Logs. CloudTrail logs are written to a
618
+ dedicated S3 bucket in a separate account with write-once (Object
619
+ Lock) policy.
620
+
621
+ Encryption in transit: TLS 1.2 minimum enforced by API Gateway and
622
+ Cognito. All SDK calls to AWS services use HTTPS. Stripe integration
623
+ uses TLS 1.2+. Internal Lambda-to-SQS and Lambda-to-DynamoDB calls
624
+ use AWS SDK over HTTPS via VPC endpoints.
625
+ diagram_source: |
626
+ flowchart LR
627
+ classDef external fill:#fff7ed,stroke:#c2410c,stroke-width:1.4px,color:#7c2d12;
628
+
629
+ Internet@{ shape: cloud, label: "Internet" }
630
+ Stripe@{ shape: cloud, label: "Stripe\nTLS 1.2+" }
631
+
632
+ subgraph Zone1["Zone 1: Public"]
633
+ direction TB
634
+ CF@{ img: "/icons/aws/cloudfront.svg", label: "CloudFront", pos: "b", w: 52, h: 52, constraint: "on" }
635
+ WAF@{ img: "/icons/aws/waf.svg", label: "AWS WAF\nOWASP CRS", pos: "b", w: 52, h: 52, constraint: "on" }
636
+ CF --> WAF
637
+ end
638
+
639
+ subgraph Zone2["Zone 2: API Perimeter"]
640
+ direction TB
641
+ APIGW@{ img: "/icons/aws/api-gateway.svg", label: "API Gateway", pos: "b", w: 52, h: 52, constraint: "on" }
642
+ Cognito@{ img: "/icons/aws/cognito.svg", label: "Cognito JWT\nAuthoriser", pos: "b", w: 52, h: 52, constraint: "on" }
643
+ Cognito --> APIGW
644
+ end
645
+
646
+ subgraph Zone3["Zone 3: Service Mesh (VPC)"]
647
+ direction TB
648
+ OrderLambda@{ img: "/icons/aws/lambda.svg", label: "Order Service\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
649
+ PayLambda@{ img: "/icons/aws/lambda.svg", label: "Payment Service\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
650
+ FulfilLambda@{ img: "/icons/aws/lambda.svg", label: "Fulfilment Service\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
651
+ NotifyLambda@{ img: "/icons/aws/lambda.svg", label: "Notification Service\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
652
+ end
653
+
654
+ subgraph Zone4["Zone 4: Data Layer (KMS encrypted)"]
655
+ direction TB
656
+ EB@{ img: "/icons/aws/eventbridge.svg", label: "EventBridge\nSchema validated", pos: "b", w: 52, h: 52, constraint: "on" }
657
+ DDB@{ img: "/icons/aws/dynamodb.svg", label: "DynamoDB", pos: "b", w: 52, h: 52, constraint: "on" }
658
+ SQSq@{ img: "/icons/aws/sqs.svg", label: "SQS", pos: "b", w: 52, h: 52, constraint: "on" }
659
+ S3arch@{ img: "/icons/aws/s3.svg", label: "S3 Archive", pos: "b", w: 52, h: 52, constraint: "on" }
660
+ end
661
+
662
+ subgraph Zone5["Zone 5: Audit"]
663
+ direction TB
664
+ CloudTrail@{ img: "/icons/aws/cloudtrail.svg", label: "CloudTrail\nwrite-once S3", pos: "b", w: 52, h: 52, constraint: "on" }
665
+ XRAY@{ img: "/icons/aws/xray.svg", label: "AWS X-Ray", pos: "b", w: 52, h: 52, constraint: "on" }
666
+ end
667
+
668
+ Internet --> CF
669
+ WAF --> APIGW
670
+ APIGW --> OrderLambda
671
+ OrderLambda --> EB
672
+ EB --> PayLambda --> Stripe
673
+ EB --> FulfilLambda
674
+ EB --> NotifyLambda
675
+ OrderLambda --> DDB
676
+ PayLambda --> DDB
677
+ FulfilLambda --> SQSq
678
+ EB --> S3arch
679
+
680
+ DDB -. data events .-> CloudTrail
681
+ SQSq -. queue events .-> CloudTrail
682
+ S3arch -. archive logs .-> CloudTrail
683
+ OrderLambda -. traces .-> XRAY
684
+ PayLambda -. traces .-> XRAY
685
+ FulfilLambda -. traces .-> XRAY
686
+ NotifyLambda -. traces .-> XRAY
687
+
688
+ class Internet,Stripe external
689
+ source_type: mermaid
690
+
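Trust Zone 3 describes one least-privilege IAM role per function. A possible shape for such a policy document is sketched below for the Fulfilment Service Lambda. The ARNs, account ID, resource names, and the helper `least_privilege_policy` are placeholders chosen for illustration; the action names (`dynamodb:UpdateItem`, `events:PutEvents`, `sqs:SendMessage`) are real IAM actions, but which ones each function actually needs is an assumption.

```python
import json


def least_privilege_policy(table_arn, table_actions, bus_arn,
                           queue_arn=None, queue_actions=()):
    """Build an IAM policy document scoped to one function's resources."""
    statements = [
        {"Effect": "Allow", "Action": sorted(table_actions), "Resource": table_arn},
        {"Effect": "Allow", "Action": ["events:PutEvents"], "Resource": bus_arn},
    ]
    if queue_arn is not None:
        statements.append(
            {"Effect": "Allow", "Action": sorted(queue_actions), "Resource": queue_arn}
        )
    return {"Version": "2012-10-17", "Statement": statements}


# Hypothetical ARNs; the account ID is a placeholder.
policy = least_privilege_policy(
    table_arn="arn:aws:dynamodb:eu-west-1:111122223333:table/orders",
    table_actions={"dynamodb:UpdateItem"},
    bus_arn="arn:aws:events:eu-west-1:111122223333:event-bus/order-processing",
    queue_arn="arn:aws:sqs:eu-west-1:111122223333:fulfilment-queue",
    queue_actions={"sqs:SendMessage"},
)
print(json.dumps(policy, indent=2))
```

No statement uses a wildcard action or resource, which is the property the zone description asks for.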
691
+ element_catalog:
692
+ - name: API Gateway
693
+ type: gateway
694
+ technology: AWS API Gateway (Regional, REST API)
695
+ responsibility: >
696
+ TLS termination, Cognito JWT authorisation, WAF integration,
697
+ request schema validation, throttling, and routing to Order Service Lambda.
698
+ relationships:
699
+ - CloudFront + WAF (upstream)
700
+ - Cognito User Pool (authoriser)
701
+ - Order Service Lambda (downstream invoke)
702
+
703
+ - name: CloudFront + WAF
704
+ type: gateway
705
+ technology: AWS CloudFront with AWS WAF (OWASP CRS v3.2)
706
+ responsibility: >
707
+ Global CDN edge, DDoS mitigation, WAF rule enforcement (OWASP CRS,
708
+ rate limiting, geo-blocking).
709
+ relationships:
710
+ - API Gateway (origin)
711
+
712
+ - name: Cognito User Pool
713
+ type: service
714
+ technology: AWS Cognito User Pool
715
+ responsibility: >
716
+ Customer and operator identity management; JWT issuance; MFA for
717
+ operators; token validation for API Gateway authoriser.
718
+ relationships:
719
+ - API Gateway (authoriser)
720
+
721
+ - name: Order Service Lambda
722
+ type: function
723
+ technology: AWS Lambda, Node.js 20, ARM64
724
+ responsibility: >
725
+ Validates and persists order submissions. Exposes GET /orders/{id}
726
+ for status polling. Publishes OrderCreated to EventBridge.
727
+ relationships:
728
+ - API Gateway (invoked by)
729
+ - DynamoDB Orders Table (read/write)
730
+ - EventBridge Order Bus (publish OrderCreated)
731
+
732
+ - name: Payment Service Lambda
733
+ type: function
734
+ technology: AWS Lambda, Node.js 20, ARM64, VPC-attached
735
+ responsibility: >
736
+ Subscribes to OrderCreated events. Calls Stripe payment API.
737
+ Publishes PaymentAuthorised or PaymentFailed. Updates DynamoDB.
738
+ relationships:
739
+ - EventBridge Order Bus (subscribe OrderCreated)
740
+ - Stripe Payment Gateway (outbound HTTPS)
741
+ - DynamoDB Orders Table (write)
742
+ - EventBridge Order Bus (publish PaymentAuthorised, PaymentFailed)
743
+
744
+ - name: Fulfilment Service Lambda
745
+ type: function
746
+ technology: AWS Lambda, Node.js 20, ARM64
747
+ responsibility: >
748
+ Subscribes to PaymentAuthorised. Places fulfilment instruction on SQS.
749
+ Publishes FulfilmentDispatched. Updates DynamoDB.
750
+ relationships:
751
+ - EventBridge Order Bus (subscribe PaymentAuthorised)
752
+ - SQS Fulfilment Queue (send)
753
+ - DynamoDB Orders Table (write)
754
+ - EventBridge Order Bus (publish FulfilmentDispatched)
755
+
756
+ - name: WMS Adapter Lambda
757
+ type: function
758
+ technology: AWS Lambda, Node.js 20, ARM64, VPC-attached
759
+ responsibility: >
760
+ Polls SQS Fulfilment Queue; translates to WMS API format; calls WMS.
761
+ On WMS failure: message returns to queue (DLQ after 3 failures).
762
+ relationships:
763
+ - SQS Fulfilment Queue (consume)
764
+ - Warehouse Management System (outbound HTTPS)
765
+
766
+ - name: Notification Service Lambda
767
+ type: function
768
+ technology: AWS Lambda, Node.js 20, ARM64
769
+ responsibility: >
770
+ Subscribes to OrderCreated, PaymentAuthorised, PaymentFailed,
771
+ FulfilmentDispatched. Sends email (SES) and SMS (SNS) notifications.
772
+ relationships:
773
+ - EventBridge Order Bus (subscribe multiple events)
774
+ - Amazon SES (send email)
775
+ - Amazon SNS (send SMS)
776
+ - DynamoDB Customer Preferences Table (read)
777
+
778
+ - name: EventBridge Order Bus
779
+ type: other
780
+ technology: AWS EventBridge Custom Event Bus
781
+ responsibility: >
782
+ Central event backbone for all inter-service async communication.
783
+ Schema registry enforces event schema validation before routing.
784
+ relationships:
785
+ - All service Lambdas (publish and subscribe)
786
+ - Kinesis Firehose (event archive pipe)
787
+
788
+ - name: DynamoDB Orders Table
789
+ type: database
790
+ technology: AWS DynamoDB (single-table design, on-demand capacity)
791
+ responsibility: >
792
+ Primary data store for orders and customer preferences.
793
+ Single-table design supports order-by-id, customer-orders, and
794
+ status-based access patterns. PITR enabled.
795
+ relationships:
796
+ - Order Service Lambda (read/write)
797
+ - Payment Service Lambda (write)
798
+ - Fulfilment Service Lambda (write)
799
+
800
+ - name: SQS Fulfilment Queue
801
+ type: queue
802
+ technology: AWS SQS Standard Queue + Dead Letter Queue
803
+ responsibility: >
804
+ Decouples Fulfilment Service from WMS Adapter. Provides at-least-once
805
+ delivery with visibility timeout (30s). DLQ captures messages failing
806
+ after 3 receive attempts.
807
+ relationships:
808
+ - Fulfilment Service Lambda (send)
809
+ - WMS Adapter Lambda (consume)
810
+
811
+ - name: S3 Event Archive
812
+ type: storage
813
+ technology: AWS S3 (KMS encrypted, Object Lock, versioning)
814
+ responsibility: >
815
+ 7-year immutable event archive for audit, compliance, and analytics.
816
+ Receives all EventBridge events via Kinesis Firehose.
817
+ relationships:
818
+ - Kinesis Firehose (write)
819
+ - Analytics Platform (read)
820
+
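The SQS Fulfilment Queue and WMS Adapter entries above describe at-least-once delivery with a dead-letter queue after 3 failed receives. The in-memory simulation below illustrates that behaviour only; `drain` and the message names are hypothetical, and real SQS applies the limit through the queue's redrive policy rather than consumer code.

```python
from collections import deque


def drain(messages, handler, max_receive_count=3):
    """Process messages; return (processed, dead_lettered)."""
    queue = deque((m, 0) for m in messages)
    processed, dlq = [], []
    while queue:
        msg, receives = queue.popleft()
        receives += 1
        try:
            handler(msg)
            processed.append(msg)  # success: the message is deleted
        except Exception:
            if receives >= max_receive_count:
                dlq.append(msg)  # receive limit reached: dead-letter it
            else:
                queue.append((msg, receives))  # becomes visible again
    return processed, dlq


attempts = {}


def wms_handler(msg):
    """Fake WMS call that always fails for one order."""
    attempts[msg] = attempts.get(msg, 0) + 1
    if msg == "order-bad":
        raise RuntimeError("WMS unavailable")


processed, dlq = drain(["order-ok", "order-bad"], wms_handler)
```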
821
+ decisions:
822
+ - id: ADR-001
823
+ title: Event-driven choreography over synchronous request-response for inter-service communication
824
+ context: >
825
+ Six services (Order, Payment, Fulfilment, Notification, WMS Adapter,
826
+ Analytics) must coordinate to process an order. The services need to
827
+ be independently deployable by separate teams. DR-04 requires no
828
+ cross-service deployment coordination. DR-01 requires 10× scale without
829
+ re-platforming.
830
+ options:
831
+ - id: A
832
+ description: Synchronous REST — each service calls the next in the chain
833
+ pros:
834
+ - Simple to reason about; linear call trace
835
+ - Immediate consistency; error propagation is direct
836
+ cons:
837
+ - Tight coupling; callee changes break callers
838
+ - Cascading failures; one slow service blocks the chain
839
+ - Cannot independently deploy without coordinating all services
840
+ - Latency is additive across the chain
841
+ - id: B
842
+ description: Orchestration via step function (AWS Step Functions)
843
+ pros:
844
+ - Central visibility of workflow state
845
+ - Built-in retry and error handling
846
+ cons:
847
+ - Central orchestrator is a coupling point
848
+ - Teams must agree on orchestration contract changes
849
+ - Step Functions costs and throttle limits at high order volumes
850
+ - id: C
851
+ description: Choreography via EventBridge event bus (chosen)
852
+ pros:
853
+ - Services are fully decoupled; each publishes and subscribes independently
854
+ - Schema registry enforces contracts without runtime coupling
855
+ - EventBridge scales well beyond projected peak volume (see ADR-004 for per-bus limits)
856
+ - New subscribers (analytics, fraud) added without modifying publishers
857
+ cons:
858
+ - End-to-end flow harder to trace (mitigated by X-Ray correlation IDs)
859
+ - Eventual consistency model requires idempotent consumers
860
+ - Schema evolution requires backwards-compatible changes
861
+ decision: Option C — EventBridge choreography
862
+ rationale: >
863
+ DR-04 (independent deployability) and DR-01 (10× scale) are best
864
+ served by choreography. EventBridge provides native schema enforcement,
865
+ eliminating the primary risk of loose coupling. X-Ray with correlation
866
+ IDs mitigates the observability trade-off. PRIN-01 (event-driven by
867
+ default) aligns with this choice.
868
+ tradeoffs: >
869
+ Accepted: harder end-to-end flow tracing (mitigated by X-Ray + correlation
870
+ IDs); eventual consistency requiring idempotent consumers; schema
871
+ evolution discipline required. Rejected: tight coupling, cascading
872
+ failures, deployment coordination overhead.
873
+ consequences: >
874
+ All service teams must implement idempotent consumers (DynamoDB
875
+ conditional writes for deduplication). EventBridge Schema Registry
876
+ is mandatory. X-Ray tracing is mandatory for all Lambda functions.
877
+ A change to an event schema requires a compatibility check in CI
878
+ before merge.
879
+ revisit_conditions: >
880
+ Revisit if: order event throughput exceeds 50K events/second (Step
881
+ Functions throttle may become attractive); or if strong consistency
882
+ requirements are introduced (e.g. inventory reservation); or if the
883
+ team size shrinks such that independent deployment is no longer a goal.
884
+ driver_refs: [DR-01, DR-04, DR-05]
885
+
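The consequences of ADR-001 require idempotent consumers that deduplicate with DynamoDB conditional writes. The sketch below mimics that pattern with an in-memory table. `InMemoryTable`, `handle_event`, and the `EVENT#<detailType>` sort key are assumptions made for illustration; in DynamoDB the equivalent guard would be a put with an `attribute_not_exists` condition expression.

```python
class ConditionalCheckFailed(Exception):
    """Raised when the item already exists (mirrors a failed conditional write)."""


class InMemoryTable:
    """Stand-in for a table that supports put-if-absent."""

    def __init__(self):
        self.items = {}

    def put_if_absent(self, pk, sk, item):
        key = (pk, sk)
        if key in self.items:
            raise ConditionalCheckFailed(key)
        self.items[key] = item


def handle_event(table, event, side_effects):
    """Process an event at most once per (orderId, detailType)."""
    pk = f"ORDER#{event['orderId']}"
    sk = f"EVENT#{event['detailType']}"  # hypothetical dedupe sort key
    try:
        table.put_if_absent(pk, sk, {"correlationId": event["correlationId"]})
    except ConditionalCheckFailed:
        return False  # duplicate delivery: skip the side effect
    side_effects.append(event["orderId"])
    return True


table = InMemoryTable()
effects = []
event = {"orderId": "o-1", "detailType": "PaymentAuthorised", "correlationId": "c-1"}
first = handle_event(table, event, effects)
second = handle_event(table, event, effects)  # redelivered by the bus
```

Recording the marker before running the side effect favours never repeating the effect; a consumer that must not lose work would need to order or transact these steps differently.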
886
+ - id: ADR-002
887
+ title: DynamoDB over RDS PostgreSQL as the primary order data store
888
+ context: >
889
+ Order data must support: order-by-id lookup (< 5ms), customer-order-list
890
+ query (< 10ms), status-update writes (< 5ms). DR-01 requires 10× scale.
891
+ DR-05 requires immutable audit trail. The access patterns are
892
+ well-defined and documented. Complex ad-hoc queries are handled by
893
+ the Analytics Platform (not this service).
894
+ options:
895
+ - id: A
896
+ description: DynamoDB single-table design with on-demand capacity (chosen)
897
+ pros:
898
+ - Scales to any throughput without provisioning
899
+ - Single-digit millisecond latency at any scale
900
+ - No connection pool management; Lambda-friendly
901
+ - PITR + Global Tables for DR
902
+ cons:
903
+ - No SQL; complex queries require GSI design upfront
904
+ - Schema changes require careful migration
905
+ - Relational queries (multi-entity joins) not supported
906
+ - id: B
907
+ description: RDS PostgreSQL with read replicas
908
+ pros:
909
+ - Full SQL; flexible queries without upfront access pattern design
910
+ - Familiar to most developers
911
+ cons:
912
+ - Connection pool limits (RDS Proxy needed for Lambda)
913
+ - Manual scaling; vertical scale requires downtime
914
+ - Not serverless — always-on cost even at zero load
915
+ - PITR window limited to 35 days
916
+ - id: C
917
+ description: Aurora Serverless v2 PostgreSQL
918
+ pros:
919
+ - SQL + serverless scaling
920
+ cons:
921
+ - Cold-start latency for Aurora v2 is higher than DynamoDB
922
+ - Still requires connection pooling for Lambda at scale
923
+ - Higher cost at sustained high throughput vs DynamoDB
924
+ decision: Option A — DynamoDB single-table design
925
+ rationale: >
926
+ Access patterns are well-defined (order-by-id, customer-order-list,
927
+ status queries). DynamoDB's serverless model aligns with Lambda
928
+ (DR-01, DR-03). No connection pool management at scale is a
929
+ significant operational advantage. Analytics queries (complex joins)
930
+ are delegated to the Analytics Platform via S3 event archive.
931
+ PRIN-04 (IaC only) is satisfied by CDK DynamoDB construct.
932
+ tradeoffs: >
933
+ Accepted: no SQL for ad-hoc queries; access patterns must be
934
+ documented and designed upfront; schema evolution requires migration
935
+ scripts. Gained: sub-5ms latency at any scale; zero ops overhead
936
+ for capacity management; serverless cost model.
937
+ consequences: >
938
+ Access pattern document (DynamoDB design appendix) must be maintained.
939
+ Any new query pattern requires a GSI — raise as Architecture Review
940
+ before implementation. Analytics queries must use S3 event archive
941
+ or DynamoDB exports, not direct DynamoDB scans.
942
+ revisit_conditions: >
943
+ Revisit if: complex relational queries become a product requirement
944
+ for the order service itself (not analytics); or if DynamoDB pricing
945
+ changes materially; or if the team grows to have dedicated DBA capacity
946
+ with preference for SQL.
947
+ driver_refs: [DR-01, DR-03]
948
+
949
+ - id: ADR-003
950
+ title: AWS Lambda over ECS Fargate for compute
951
+ context: >
952
+ Six service functions need to run on AWS. DR-01 requires elastic scale
953
+ to 5000 orders/minute burst. PRIN-04 requires IaC-only provisioning.
954
+ The team of 4 engineers cannot operate a container platform as well as
955
+ build services.
956
+ options:
957
+ - id: A
958
+ description: AWS Lambda (serverless functions) — chosen
959
+ pros:
960
+ - Zero operational overhead — no cluster management
961
+ - Automatic scaling from 0 to thousands of concurrent executions
962
+ - Pay-per-invocation; zero cost at zero load
963
+ - Native EventBridge and SQS triggers; no polling code
964
+ cons:
965
+ - 15-minute maximum execution time (not a constraint here)
966
+ - Cold-start latency (mitigated by provisioned concurrency for submission path)
967
+ - Package size limit 250MB unzipped (not a constraint here)
968
+ - id: B
969
+ description: ECS Fargate (containerised services)
970
+ pros:
971
+ - No cold start; consistent latency
972
+ - Arbitrary execution time
973
+ - Familiar container model
974
+ cons:
975
+ - Always-on minimum cost (min 1 task per service × 6 services)
976
+ - Auto-scaling is slower (minutes, not seconds)
977
+ - Requires more operational expertise; cluster networking
978
+ - Team would spend 30–40% of time on container operations
979
+ - id: C
980
+ description: EKS (Kubernetes)
981
+ pros:
982
+ - Maximum flexibility; portable
983
+ cons:
984
+ - Highest operational overhead; not appropriate for 4-person team
985
+ - Overkill for defined event-driven workloads
986
+ decision: Option A — AWS Lambda
987
+ rationale: >
988
+ The workload (event-triggered, short-duration, variable load) is the
989
+ canonical Lambda use case. DR-01 (10× scale) is satisfied without
990
+ operational overhead. PRIN-04 (IaC only) is trivially satisfied by
991
+ CDK Lambda construct. Provisioned concurrency on the Order Service
992
+ submission path addresses cold-start for DR-03 (P99 < 500ms).
993
+ tradeoffs: >
994
+ Accepted: cold-start risk (mitigated by provisioned concurrency);
995
+ 15-minute execution limit (acceptable — order processing completes
996
+ in seconds); vendor lock-in to AWS Lambda. Gained: zero operational
997
+ overhead; automatic scaling; pay-per-use cost model.
998
+ consequences: >
999
+ Lambda package size must stay < 250MB. Lambda execution time must
1000
+ complete within 15 minutes (monitored in CloudWatch). Provisioned
1001
+ concurrency must be configured for Order Service (and reviewed
1002
+ quarterly for cost vs performance).
1003
+ revisit_conditions: >
1004
+ Revisit if: a service requires > 15-minute execution; or team grows
1005
+ to 10+ engineers with dedicated platform capacity; or Lambda pricing
1006
+ changes materially relative to ECS Fargate.
1007
+ driver_refs: [DR-01, DR-03]
1008
+
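ADR-003 leans on provisioned concurrency for the submission path. A rough sizing estimate uses the usual relationship between concurrency, request rate, and duration. The 200 ms average duration below is an assumption, not a figure from the ADR; the 500 ms case reuses the DR-03 P99 bound quoted in the rationale as a pessimistic duration.

```python
import math

PEAK_ORDERS_PER_MIN = 5000      # burst target from the ADR context
ASSUMED_AVG_DURATION_S = 0.2    # assumption: 200 ms per invocation


def required_concurrency(requests_per_min: int, avg_duration_s: float) -> int:
    """Concurrency is roughly requests per second times average duration."""
    return math.ceil(requests_per_min * avg_duration_s / 60)


typical = required_concurrency(PEAK_ORDERS_PER_MIN, ASSUMED_AVG_DURATION_S)
pessimistic = required_concurrency(PEAK_ORDERS_PER_MIN, 0.5)
```

Under these assumptions the burst needs on the order of tens of concurrent executions, which gives the quarterly cost review a starting number.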
1009
+ - id: ADR-004
1010
+ title: EventBridge over Amazon SNS for the event bus
1011
+ context: >
1012
+ An event bus is required for pub-sub between six services (ADR-001
1013
+ selected choreography). Two candidate AWS services: SNS (topics +
1014
+ subscriptions) and EventBridge (custom event bus with routing rules).
1015
+ options:
1016
+ - id: A
1017
+ description: Amazon SNS (topics + Lambda subscriptions)
1018
+ pros:
1019
+ - Simpler mental model; familiar
1020
+ - Lower per-event cost at high volume
1021
+ cons:
1022
+ - No content-based routing (filter by event body requires Lambda code)
1023
+ - No schema registry; contracts enforced only in application code
1024
+ - No built-in event replay for debugging/DR
1025
+ - No Archive and Replay feature
1026
+ - id: B
1027
+ description: AWS EventBridge custom event bus with schema registry — chosen
1028
+ pros:
1029
+ - Content-based routing rules (filter by event source, type, and body fields)
1030
+ - Schema registry with automatic discovery and compatibility checks
1031
+ - Archive and Replay for debugging and DR
1032
+ - Native integration with many AWS services
1033
+ cons:
1034
+ - Higher cost per event than SNS (approximately 5× at high volume)
1035
+ - "Throughput limit: 10K events/second per event bus (region)"
1036
+ - id: C
1037
+ description: Amazon MSK (Managed Kafka)
1038
+ pros:
1039
+ - Extremely high throughput; replay by default
1040
+ cons:
1041
+ - Significant operational overhead for 4-person team
1042
+ - Not appropriate for current order volume (< 5000/minute)
1043
+ decision: Option B — AWS EventBridge
1044
+ rationale: >
1045
+ Schema registry enforcement (DR-04, PRIN-01) and content-based routing
1046
+ provide the decoupling guarantees required. Archive and Replay
1047
+ directly supports DR-05 (audit trail). Cost premium over SNS is
1048
+ justified by reduced application-layer filtering code. EventBridge
1049
+ throughput limit (10K events/s) is 120× the peak order volume
1050
+ (5000 orders/min = ~83/s), providing ample headroom.
1051
+ tradeoffs: >
1052
+ Accepted: higher cost vs SNS (~5× per event); throughput ceiling of
1053
+ 10K events/s per bus (headroom: 120× current peak). Gained: schema
1054
+ enforcement; content-based routing; Archive and Replay; no custom
1055
+ routing logic in Lambda.
1056
+ consequences: >
1057
+ EventBridge Schema Registry is mandatory for all new events. Schema
1058
+ compatibility checks must run in CI. Archive must be enabled on the
1059
+ order-processing bus (7-day default; Kinesis Data Firehose to S3 for long-term storage).
1060
+ revisit_conditions: >
1061
+ Revisit if: event throughput approaches 8K events/second (80% of
1062
+ bus limit); or cost becomes a material concern (review quarterly).
1063
+ MSK would be the next step at sustained > 10K events/second.
1064
+ driver_refs: [DR-04, DR-05]
1065
+
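The headroom figures in ADR-004 can be checked with a few lines of arithmetic. The inputs come from the ADR; the final step, which assumes three bus events per order (OrderCreated, PaymentAuthorised, FulfilmentDispatched on the happy path), is an added assumption that shows how the margin shrinks when events rather than orders are counted.

```python
PEAK_ORDERS_PER_MIN = 5000
BUS_LIMIT_EVENTS_PER_S = 10_000
REVISIT_PERCENT = 80
ASSUMED_EVENTS_PER_ORDER = 3  # assumption: happy-path events per order

# Peak order rate per second (about 83.3).
peak_orders_per_s = PEAK_ORDERS_PER_MIN / 60

# Headroom if each order produced exactly one event.
headroom_per_order = BUS_LIMIT_EVENTS_PER_S * 60 / PEAK_ORDERS_PER_MIN

# Throughput at which the ADR says to revisit the decision.
revisit_threshold = BUS_LIMIT_EVENTS_PER_S * REVISIT_PERCENT // 100

# Headroom once every order is counted as several events.
headroom_per_event = headroom_per_order / ASSUMED_EVENTS_PER_ORDER
```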
1066
+ - id: ADR-005
1067
+ title: DynamoDB single-table design over multi-table design
1068
+ context: >
1069
+ DynamoDB was selected (ADR-002). DynamoDB supports two design approaches:
1070
+ single-table (all entities in one table, PK/SK values encode entity
1071
+ type) and multi-table (one table per entity type, simpler model).
1072
+ Access patterns: order-by-id, customer-orders list, orders-by-status.
1073
+ options:
1074
+ - id: A
1075
+ description: Single-table design (all entities in one table) — chosen
1076
+ pros:
1077
+ - Single-digit ms latency for all access patterns via GSI
1078
+ - Fewer DynamoDB tables to manage and monitor
1079
+ - Can fetch related entities in a single request (if co-located)
1080
+ cons:
1081
+ - PK/SK design is non-obvious; requires documentation
1082
+ - Harder for developers unfamiliar with DynamoDB patterns
1083
+ - Mistakes in PK/SK design are expensive to fix post-deployment
1084
+ - id: B
1085
+ description: Multi-table design (one table per entity type)
1086
+ pros:
1087
+ - Simpler mental model; each table maps to one entity
1088
+ - IAM policies per table are more granular
1089
+ cons:
1090
+ - Cross-entity queries require multiple requests or scatter-gather
1091
+ - More tables to monitor, back up, and configure
1092
+ decision: Option A — Single-table design
1093
+ rationale: >
1094
+ Access patterns are well-defined at design time. Single-table design
1095
+ achieves all required patterns with GSIs and avoids scatter-gather
1096
+ queries. The operational simplicity (fewer tables) outweighs the
1097
+ design complexity given the team has DynamoDB expertise. DR-03
1098
+ (< 200ms P99) is best served by single-table with optimised GSIs.
1099
+ tradeoffs: >
1100
+ Accepted: PK/SK design complexity; requires upfront access pattern
1101
+ documentation; harder for new team members to onboard. Gained:
1102
+ optimal query performance; all access patterns served with single-digit
1103
+ ms latency; fewer DynamoDB resources to manage.
1104
+ consequences: >
1105
+ Access pattern document must be written and maintained. All DynamoDB
1106
+ changes must be reviewed by an engineer with DynamoDB expertise.
1107
+ New access patterns must be raised as an Architecture Decision before
1108
+ adding GSIs.
1109
+ revisit_conditions: >
1110
+ Revisit if: a new access pattern cannot be served by existing GSIs
1111
+ and would require a table scan; or if the team no longer has
1112
+ DynamoDB expertise.
1113
+ driver_refs: [DR-03]
1114
+
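ADR-005 names three access patterns (order-by-id, customer-orders list, orders-by-status) and says the PK/SK values carry the entity type. The following is a minimal sketch of how such keys might be composed. The prefixes (`ORDER#`, `CUSTOMER#`, `STATUS#`), the `GSI1`/`GSI2` attribute names, and the `OrderItem` class are hypothetical and not taken from the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderItem:
    """One order row in a single-table layout (all key formats are hypothetical)."""

    order_id: str
    customer_id: str
    status: str
    created_at: str  # ISO-8601 UTC timestamp, so string order equals time order

    def keys(self) -> dict:
        return {
            # Base table key: serves order-by-id.
            "PK": f"ORDER#{self.order_id}",
            "SK": "METADATA",
            # Assumed first GSI: serves the customer-orders list, sorted by time.
            "GSI1PK": f"CUSTOMER#{self.customer_id}",
            "GSI1SK": f"ORDER#{self.created_at}",
            # Assumed second GSI: serves orders-by-status, sorted by time.
            "GSI2PK": f"STATUS#{self.status}",
            "GSI2SK": self.created_at,
        }


item = OrderItem("o-123", "c-9", "PAID", "2024-05-01T10:00:00Z")
for name, value in item.keys().items():
    print(name, value)
```

Writing the key formats down in one place like this is one way to keep the access pattern document that the ADR's consequences call for.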
1115
+ component_classification:
1116
+ components:
1117
+ - name: API Gateway (Regional, REST API)
1118
+ mandate_level: mandatory
1119
+ rationale: >
1120
+ Mandatory entry point for all order API traffic. Provides Cognito
1121
+ JWT authorisation, WAF integration, and request validation.
1122
+ Replacing with a custom reverse proxy would require duplicating
1123
+ security controls and is not permitted without Architecture Board
1124
+ exception.
1125
+ - name: Cognito User Pool
1126
+ mandate_level: mandatory
1127
+ rationale: >
1128
+ Corporate identity standard for customer and operator authentication.
1129
+ Provides JWT issuance, MFA, and token management. Replacing with
1130
+ alternative IdP requires Security Architecture review.
1131
+ - name: EventBridge Custom Event Bus
1132
+ mandate_level: mandatory
1133
+ rationale: >
1134
+ Mandatory inter-service communication channel. ADR-001 and ADR-004.
1135
+ Direct service-to-service calls are not permitted.
1136
+ - name: DynamoDB (single-table)
1137
+ mandate_level: mandatory
1138
+ rationale: >
1139
+ Mandatory for order state persistence. ADR-002. Alternative stores
1140
+ not permitted for order data without Architecture Board review.
1141
+ - name: Lambda (Node.js 20, ARM64)
1142
+ mandate_level: mandatory
1143
+ rationale: >
1144
+ Mandatory compute platform. ADR-003. Node.js 20 is the approved
1145
+ runtime. ARM64 selected for cost efficiency (~20% cheaper than x86).
1146
+ - name: AWS CDK v2 (IaC)
1147
+ mandate_level: mandatory
1148
+ rationale: PRIN-04 (IaC only). All infrastructure must be in CDK stacks.
1149
+ - name: CloudWatch + X-Ray (observability)
1150
+ mandate_level: mandatory
1151
+ rationale: >
1152
+ PRIN-05. Structured logging, metrics, and X-Ray tracing are mandatory
1153
+ for all Lambda functions. Correlation ID propagation is mandatory.
1154
+ - name: KMS Customer-Managed Keys
1155
+ mandate_level: mandatory
1156
+ rationale: >
1157
+ PCI-DSS requirement. All data stores must use KMS CMKs (not
1158
+ AWS-managed keys) to satisfy PCI-DSS key management controls.
1159
+ - name: SQS (work queues with DLQ)
1160
+ mandate_level: recommended
1161
+ rationale: >
1162
+ Recommended for all work-queue patterns (e.g. WMS adapter).
1163
+ Not mandatory for direct Lambda event triggers from EventBridge.
1164
+ - name: Kinesis Firehose → S3 (event archive)
1165
+ mandate_level: recommended
1166
+ rationale: >
1167
+ Recommended for all event buses where audit retention > 7 days is
1168
+ required. Mandatory only if DR-05 (7-year audit) applies to the
1169
+ specific domain.
1170
+ - name: Lambda Provisioned Concurrency
1171
+ mandate_level: recommended
1172
+ rationale: >
1173
+ Recommended for latency-sensitive synchronous paths (order submission).
1174
+ Optional for asynchronous event handlers where cold-start is acceptable.
1175
+ - name: CloudFront + WAF
1176
+ mandate_level: mandatory
1177
+ rationale: >
1178
+ All public-facing API traffic must traverse CloudFront + WAF.
1179
+ Mandatory for PCI-DSS (network security controls) and DDoS mitigation.
1180
+ - name: Notification channel (SES/SNS vs third-party)
1181
+ mandate_level: optional
1182
+ rationale: >
1183
+ Teams may substitute SES/SNS with an approved third-party notification
1184
+ provider (e.g. Twilio) if product requirements dictate. Must use
1185
+ the Notification Service adapter pattern; no direct notification
1186
+ calls from other services.
1187
+ extension_points:
1188
+ - name: Notification provider
1189
+ description: >
1190
+ The Notification Service uses an adapter pattern. Teams may substitute
1191
+ the SES/SNS implementation with any approved notification provider
1192
+ by implementing the NotificationAdapter interface.
1193
+ guidance: >
1194
+ Acceptable: Twilio, SendGrid, or any provider integrated via AWS
1195
+ Lambda. Provider must support at-least-once delivery guarantees
1196
+ and must not require storing customer contact data outside AWS.
1197
+ examples: Twilio SMS adapter, SendGrid email adapter
1198
+ - name: Payment gateway
1199
+ description: >
1200
+ The Payment Service uses a payment gateway adapter. Teams may
1201
+ substitute Stripe with another approved payment gateway.
1202
+ guidance: >
1203
+ New gateway must support idempotent charge requests (via idempotency
1204
+ key or equivalent). Must undergo Security Architecture review before
1205
+ substitution.
1206
+ examples: Adyen adapter, Braintree adapter
1207
+ - name: Programming language (Lambda runtime)
1208
+ description: >
1209
+ Node.js 20 is the approved default runtime. Teams may use Python 3.12
1210
+ or Java 21 if the team has demonstrably stronger expertise in that
1211
+ language and the performance characteristics are validated.
1212
+ guidance: >
1213
+ Must use the AWS Lambda Powertools library for the chosen runtime.
1214
+ Non-Node.js choices require Architecture Board approval and must
1215
+ demonstrate equivalent structured logging and X-Ray integration.
1216
+ examples: Python 3.12 with Lambda Powertools for Python
1217
+
1218
+ quality_attributes:
1219
+ - attribute: Availability
1220
+ target: 99.95% measured monthly
1221
+ measurement: Synthetic monitoring probes every 30 seconds via CloudWatch
1222
+ Synthetics Canary. Probe hits POST /orders with a test payload.
1223
+ SLO measured as proportion of successful probe responses over rolling
1224
+ 30-day window.
1225
+ validation_strategy: >
1226
+ Chaos engineering quarterly (AWS Fault Injection Simulator): simulate
1227
+ single-AZ failure, Lambda throttling, DynamoDB throttling. Verify
1228
+ automatic recovery within 5 minutes. Load test monthly in staging
1229
+ at 150% of peak load (7500 orders/minute).
1230
+ fitness_function: >
1231
+ CloudWatch alarm fires if availability drops below 99.95% in any
1232
+ rolling 24-hour window. Alarm triggers PagerDuty P1. CDK deploys
1233
+ the alarm alongside the service stack.
1234
+ quality_scenario: >
1235
+ Stimulus: AZ failure in eu-west-1. Source: AWS infrastructure.
1236
+ Environment: Production, peak load 1000 orders/minute.
1237
+ Response: Lambda automatically routes traffic to remaining two AZs.
1238
+ Response measure: Zero failed order submissions; recovery within 2
1239
+ minutes; monthly availability remains >= 99.95%.
1240
+
1241
+ - attribute: Latency P99 (order submission)
1242
+ target: < 500ms end-to-end from API Gateway receipt to 202 response
1243
+ measurement: CloudWatch API Gateway P99 of the Latency metric (IntegrationLatency covers only the backend portion).
1244
+ X-Ray service map shows breakdown per segment.
1245
+ validation_strategy: >
1246
+ Gatling load test at 1000 concurrent users, sustained 10 minutes.
1247
+ Test runs in staging CI before every production deployment.
1248
+ fitness_function: >
1249
+ Gatling test fails the CI build if P99 > 500ms at 1000 concurrent
1250
+ users. CloudWatch alarm on P99 > 400ms in production (warning);
1251
+ P99 > 500ms (critical, PagerDuty P2).
1252
+ quality_scenario: >
1253
+ Stimulus: 1000 concurrent order submissions. Source: Load test /
1254
+ production peak. Environment: Normal operating conditions.
1255
+ Response: API Gateway routes to Order Service Lambda (provisioned
1256
+ concurrency). Response measure: P99 <= 500ms (at least 99% of requests complete within 500ms).
1257
+
1258
+ - attribute: Latency P99 (order status query)
1259
+ target: < 200ms end-to-end from API Gateway receipt to 200 response
1260
+ measurement: CloudWatch API Gateway P99 latency (GetOrderStatus endpoint).
1261
+ validation_strategy: >
1262
+ Gatling load test at 5000 concurrent status queries (read-heavy
1263
+ production pattern). DynamoDB GSI query measured separately.
1264
+ fitness_function: >
1265
+ Gatling test fails CI if P99 > 200ms at 5000 concurrent reads.
1266
+ quality_scenario: >
1267
+ Stimulus: 5000 concurrent status queries. Response measure:
1268
+ P99 <= 200ms; DynamoDB read latency <= 5ms.
1269
+
1270
+ - attribute: Throughput (sustained)
1271
+ target: 1000 orders/minute sustained; 5000 orders/minute burst (5 minutes)
1272
+ measurement: >
1273
+ CloudWatch metric: OrderCreated events/minute on EventBridge.
1274
+ Lambda concurrency utilisation dashboard.
1275
+ validation_strategy: >
1276
+ Monthly load test at 1000 orders/minute for 30 minutes in staging.
1277
+ Quarterly burst test at 5000 orders/minute for 5 minutes.
1278
+ fitness_function: >
1279
+ Load test step in CodePipeline deployment pipeline. Fails if
1280
+ < 1000 orders/minute throughput sustained or > 5% error rate.
1281
+
1282
+ - attribute: Error rate
1283
+ target: < 0.1% of order submissions result in a 5xx error
1284
+ measurement: CloudWatch API Gateway 5xxError metric as percentage of total requests.
1285
+ validation_strategy: >
1286
+ Continuous production monitoring. Monthly chaos injection (Lambda
1287
+ function errors, DynamoDB errors) to validate error handling.
1288
+ fitness_function: >
1289
+ CloudWatch alarm on 5xx error rate > 0.1% over rolling 5-minute
1290
+ window. PagerDuty P2 alert.
1291
+
1292
+ - attribute: Recovery Time Objective (RTO)
1293
+ target: 4 hours (full service restoration after catastrophic failure)
1294
+ measurement: "DR test: time from incident declaration to full order submission capacity."
1295
+ validation_strategy: >
1296
+ Annual DR exercise: simulate eu-west-1 complete unavailability.
1297
+ Execute failover runbook to eu-central-1. Measure time to first
1298
+ successful order in DR region.
1299
+ fitness_function: N/A — manual DR exercise with timer.
1300
+ quality_scenario: >
1301
+ Stimulus: eu-west-1 region failure. Source: AWS infrastructure.
1302
+ Environment: Production. Response: Route 53 failover activates
1303
+ eu-central-1 warm standby. Response measure: Full order submission
1304
+ capability restored within 4 hours.
1305
+
1306
+ - attribute: Recovery Point Objective (RPO)
1307
+ target: 1 hour (maximum data loss in catastrophic failure scenario)
1308
+ measurement: "DynamoDB Global Tables replication lag monitoring. Target: < 1s typical."
1309
+ validation_strategy: >
1310
+ Annual DR exercise: confirm DynamoDB Global Tables replica in
1311
+ eu-central-1 has all orders from the 60 minutes preceding the
1312
+ simulated failure.
1313
+
1314
+ - attribute: Security (PCI-DSS compliance)
1315
+ target: PCI-DSS Level 1 compliant (annual QSA audit pass)
1316
+ measurement: Annual PCI-DSS QSA audit. Quarterly internal penetration test.
1317
+ validation_strategy: >
1318
+ Monthly automated compliance scan (AWS Security Hub, PCI-DSS standard).
1319
+ Quarterly penetration test by approved security vendor.
1320
+ Annual QSA audit.
1321
+ fitness_function: >
1322
+ AWS Security Hub PCI-DSS standard enabled; findings of HIGH or
1323
+ CRITICAL severity block deployment via CodePipeline gate.
1324
+
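The latency fitness functions above fail a build when P99 exceeds a limit (500ms for submission, 200ms for status queries). As a rough illustration of what such a gate computes, here is a sketch using the nearest-rank percentile definition. The source runs this check in Gatling; the function names below are hypothetical and the percentile method is an assumption.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # 1-based rank of the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_gate(samples_ms, limit_ms=500.0):
    """True when the P99 of the samples is within the limit."""
    return percentile_nearest_rank(samples_ms, 99) <= limit_ms


# One slow request in a hundred stays inside a P99 gate; two do not.
print(latency_gate([100.0] * 99 + [900.0]))
print(latency_gate([100.0] * 98 + [900.0, 900.0]))
```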
1325
+ operational_model:
1326
+ slos:
1327
+ - name: Order API availability
1328
+ target: "99.95%"
1329
+ measurement_window: rolling 30 days
1330
+ error_budget: "21.9 minutes/month (99.95% availability)"
1331
+ - name: Order submission P99 latency
1332
+ target: "< 500ms"
1333
+ measurement_window: rolling 24 hours
1334
+ error_budget: "< 1% of requests may exceed 500ms"
1335
+ - name: Order status query P99 latency
1336
+ target: "< 200ms"
1337
+ measurement_window: rolling 24 hours
1338
+ error_budget: "< 1% of requests may exceed 200ms"
1339
+ - name: Payment authorisation success rate
1340
+ target: "> 99.5% (excluding card declines)"
1341
+ measurement_window: rolling 24 hours
1342
+ error_budget: "< 0.5% of authorisation attempts may fail due to system error"
1343
+ - name: Fulfilment dispatch success rate
1344
+ target: "> 99.9%"
1345
+ measurement_window: rolling 24 hours
1346
+ error_budget: "< 0.1% of fulfilment instructions may end in DLQ"
1347
+
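The availability error budget follows directly from the SLO target and the window length. A small worked sketch (the function name is hypothetical): a 99.95% target allows 21.6 minutes of downtime over a strict 30-day window, and about 21.9 minutes over an average calendar month of 365.25 / 12 days.

```python
def error_budget_minutes(slo_percent, window_days):
    """Allowed downtime in minutes for an availability SLO over a window."""
    return (1 - slo_percent / 100) * window_days * 24 * 60


rolling_30_days = error_budget_minutes(99.95, 30)
average_month = error_budget_minutes(99.95, 365.25 / 12)
print(round(rolling_30_days, 1), round(average_month, 1))
```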
1348
+ monitoring:
1349
+ strategy: >
1350
+ Three-signal observability: metrics (CloudWatch), structured logs
1351
+ (CloudWatch Logs Insights), and distributed traces (X-Ray). All Lambda
1352
+ functions emit structured JSON logs with correlation ID on every
1353
+ invocation. X-Ray active tracing on all Lambda functions. CloudWatch
1354
+ Contributor Insights for DynamoDB; standard CloudWatch metrics for SQS. On-call via PagerDuty
1355
+ with escalation policy (P1: immediate, P2: 30-minute SLA).
1356
+ CloudWatch dashboards provisioned by CDK alongside each service stack.
1357
+ metrics:
1358
+ - Order submission rate (orders/minute) — EventBridge PutEvents count
1359
+ - Order submission P99 latency — API Gateway Latency P99
1360
+ - Payment authorisation success rate — Payment Service custom metric
1361
+ - Fulfilment DLQ depth — SQS ApproximateNumberOfMessagesVisible on DLQ
1362
+ - Lambda error rate (%) — Lambda Errors / Lambda Invocations per function
1363
+ - Lambda duration P99 — Lambda Duration P99 per function
1364
+ - Lambda concurrent executions — Lambda ConcurrentExecutions
1365
+ - DynamoDB read/write throttle events — DynamoDB ThrottledRequests
1366
+ - EventBridge throttled rules — EventBridge ThrottledRules
1367
+ - API Gateway 4xx and 5xx error rates
1368
+ - SQS message age (P99) — SQS ApproximateAgeOfOldestMessage
1369
+ dashboards: >
1370
+ CloudWatch dashboard templates provisioned by CDK in /infra/monitoring/.
1371
+ Dashboards: Order Platform Overview (SLO status), Service Health
1372
+ (per-Lambda metrics), Data Layer (DynamoDB + SQS), Security
1373
+ (WAF blocked requests, CloudTrail anomalies). Runbook links embedded
1374
+ in dashboard widgets.
1375
+ alerting_rules: >
1376
+ Alert configuration in /infra/monitoring/alerts.ts (CDK). Rules:
1377
+ P1 (immediate PagerDuty): availability < 99.9% (5-min window),
1378
+ DLQ depth > 10, payment success rate < 99%.
1379
+ P2 (30-min SLA): P99 latency > 400ms, Lambda error rate > 0.5%,
1380
+ DynamoDB throttle > 10 events/min.
1381
+ P3 (business hours): Lambda concurrent executions > 80% of limit,
1382
+ SQS message age > 5 minutes.
1383
+
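The alerting rules above form a small severity table. The sketch below restates the listed thresholds as data and returns the highest severity breached by a set of readings. It is illustrative only: the source keeps the real rules in CDK, and the metric names and the `highest_severity` function here are hypothetical.

```python
from typing import Optional

# (severity, metric name, breach predicate), ordered from most to least severe.
# Thresholds come from the alerting rules above; metric names are made up.
RULES = [
    ("P1", "availability_pct", lambda v: v < 99.9),
    ("P1", "dlq_depth", lambda v: v > 10),
    ("P1", "payment_success_pct", lambda v: v < 99.0),
    ("P2", "p99_latency_ms", lambda v: v > 400),
    ("P2", "lambda_error_pct", lambda v: v > 0.5),
    ("P2", "dynamodb_throttles_per_min", lambda v: v > 10),
    ("P3", "lambda_concurrency_pct_of_limit", lambda v: v > 80),
    ("P3", "sqs_message_age_s", lambda v: v > 300),
]


def highest_severity(metrics: dict) -> Optional[str]:
    """Return the most severe breached level, or None when nothing is breached."""
    for severity, name, breached in RULES:
        if name in metrics and breached(metrics[name]):
            return severity
    return None


print(highest_severity({"dlq_depth": 11, "p99_latency_ms": 450}))
```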
1384
+ scaling:
1385
+ policies:
1386
+ - >
1387
+ Lambda: default concurrency 100 per function (soft limit). Reserved
1388
+ concurrency on Order Service: 500 (prevents throttling on burst).
1389
+ Provisioned concurrency on Order Service: 20 (eliminates cold starts
1390
+ on submission path). Lambda auto-scales to account concurrency limit
1391
+ (1,000 by default per region; higher via a quota increase).
1392
+ - >
1393
+ DynamoDB: on-demand capacity mode (no provisioning required).
1394
+ Auto-scales to required throughput. Monitor for throttle events
1395
+ in CloudWatch; if sustained throttle > 5 minutes, review access
1396
+ patterns for hot partition.
1397
+ - >
1398
+ SQS: scales transparently. WMS Adapter Lambda concurrency set to
1399
+ 50 to match WMS API rate limit.
1400
+ - >
1401
+ API Gateway: regional endpoint; scales to 10K requests/second by
1402
+ default. Throttle limits: 1000 burst, 100 steady-state per stage.
1403
+ Increase by support request if DR-01 targets are approached.
1404
+ capacity_planning_notes: >
1405
+ Current production peak: ~200 orders/minute. Architecture is sized
1406
+ for 1000 orders/minute sustained without any provisioning changes.
1407
+ For 10× (10K orders/minute), Lambda concurrency limits and API Gateway
1408
+ throttle limits would require AWS support request. DynamoDB on-demand
1409
+ scales automatically. Review capacity quarterly using CloudWatch
1410
+ utilisation metrics.
1411
+
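The capacity notes can be sanity-checked with Little's law: concurrent executions are roughly arrival rate times average duration. The sketch below assumes each order costs one Order Service invocation lasting 0.5 seconds, which is the latency ceiling rather than a measured value. The function name is hypothetical.

```python
def required_concurrency(orders_per_minute, avg_duration_s):
    """Little's law: concurrency = arrival rate (per second) * time in system."""
    return orders_per_minute / 60 * avg_duration_s


# Sustained and burst targets from the throughput quality attribute.
for rate in (1000, 5000):
    print(rate, round(required_concurrency(rate, 0.5), 1))
```

Under that assumption the sustained target needs single-digit concurrency and the burst target stays well below the reserved concurrency of 500, so the limits in the scaling policies leave ample room on the submission path.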
1412
+ disaster_recovery:
1413
+ rto: 4 hours
1414
+ rpo: 1 hour
1415
+ failover_procedures: >
1416
+ 1. Incident declared by on-call SRE via PagerDuty.
1417
+ 2. SRE confirms eu-west-1 is unavailable (Route 53 health checks fail).
1418
+ 3. SRE executes DR runbook: update Route 53 weighted routing to
1419
+ 100% eu-central-1. Estimated time: 15 minutes.
1420
+ 4. CDK stacks in eu-central-1 are pre-deployed (warm standby).
1421
+ Activate by running cdk deploy --context env=dr.
1422
+ 5. DynamoDB Global Table in eu-central-1 is the new primary.
1423
+ Verify data freshness (CloudWatch replication lag metric).
1424
+ 6. Smoke test: submit test order in eu-central-1. Verify 202 response
1425
+ and event flow.
1426
+ 7. Notify stakeholders. Begin post-incident review.
1427
+ 8. Recovery to eu-west-1: reverse Route 53 after region confirmed stable.
1428
+ Re-sync DynamoDB Global Table. Estimated total recovery time: 4 hours.
1429
+ runbook_ref: ops/runbooks/dr-failover-eu-central-1.md
1430
+
1431
+ implementation_artifacts:
1432
+ iac_templates:
1433
+ - name: Order Platform CDK Stack
1434
+ type: cdk
1435
+ location: infra/stacks/order-platform-stack.ts
1436
+ description: >
1437
+ Main CDK stack deploying all Lambda functions, DynamoDB tables,
1438
+ EventBridge bus, SQS queues, API Gateway, Cognito User Pool,
1439
+ CloudWatch dashboards, and alarms. Single cdk deploy command
1440
+ provisions the full stack.
1441
+ - name: Networking CDK Stack
1442
+ type: cdk
1443
+ location: infra/stacks/networking-stack.ts
1444
+ description: >
1445
+ VPC with three private subnets, NAT Gateways (one per AZ),
1446
+ VPC endpoints for DynamoDB, SQS, EventBridge, S3, KMS.
1447
+ - name: Security CDK Stack
1448
+ type: cdk
1449
+ location: infra/stacks/security-stack.ts
1450
+ description: >
1451
+ KMS CMKs (one per service), IAM roles (least-privilege per Lambda),
1452
+ WAF WebACL with OWASP CRS, CloudTrail configuration.
1453
+ - name: DR CDK Stack (eu-central-1)
1454
+ type: cdk
1455
+ location: infra/stacks/dr-stack.ts
1456
+ description: >
1457
+ Warm standby stack for eu-central-1. Deploys all Lambda functions
1458
+ in standby mode; DynamoDB Global Tables replica; Route 53
1459
+ failover configuration.
1460
+ api_specifications:
1461
+ - name: Order API (OpenAPI 3.1)
1462
+ spec_type: openapi
1463
+ location: api/order-api.openapi.yaml
1464
+ - name: Order Events (AsyncAPI 2.6)
1465
+ spec_type: asyncapi
1466
+ location: api/order-events.asyncapi.yaml
1467
+ cicd_templates:
1468
+ - name: Order Platform Pipeline
1469
+ platform: other
1470
+ location: pipeline/order-platform-pipeline.ts
1471
+ description: >
1472
+ AWS CodePipeline: Source (CodeCommit) → Build (CodeBuild, unit tests,
1473
+ CDK synth) → Deploy to staging → Integration tests (Postman/Newman)
1474
+ → Load test (Gatling, fails if SLO targets missed) → Manual approval
1475
+ → Deploy to production (blue/green via CodeDeploy Lambda alias shift).
1476
+ scaffold_template:
1477
+ name: New Order Service Scaffold
1478
+ type: other
1479
+ location: scaffold/new-service/
1480
+ description: >
1481
+ Cookiecutter template generating a new Lambda function project with:
1482
+ pre-configured Lambda Powertools (structured logging, X-Ray tracing,
1483
+ correlation ID middleware), CDK construct stub, OpenAPI spec stub,
1484
+ AsyncAPI event stub, unit test harness, and Postman collection.
1485
+ Run: cookiecutter scaffold/new-service/ to generate a new service.
1486
+ sample_application:
1487
+ name: Order Processing Sample Application
1488
+ location: sample-app/
1489
+ description: >
1490
+ Fully functional reference implementation of the complete order processing
1491
+ flow. Deployable to a sandbox AWS account in under 1 hour using the
1492
+ getting-started guide. Includes all six Lambda functions, CDK stacks,
1493
+ OpenAPI and AsyncAPI specs, and a React-based test UI.
1494
+
1495
+ getting_started:
1496
+ estimated_time_to_first_deployment: >
1497
+ 4 hours for engineers with AWS CDK experience; 8 hours for engineers
1498
+ new to CDK. A sandbox deployment (sample-app/) is achievable in 1 hour
1499
+ using the scaffold template and pre-configured CDK stacks.
1500
+ prerequisites:
1501
+ - AWS account with AdministratorAccess (sandbox) or specific IAM role (see infra/iam/deployer-policy.json)
1502
+ - AWS CLI v2 configured with appropriate credentials
1503
+ - Node.js 20 and npm installed
1504
+ - AWS CDK v2 installed globally (npm install -g aws-cdk)
1505
+ - Docker Desktop (for CDK asset bundling)
1506
+ - Git access to the platform repository
1507
+ - Postman (for API testing, optional)
1508
+ steps:
1509
+ - step: 1
1510
+ title: Clone the repository and install dependencies
1511
+ description: Clone the platform repository and install Node.js dependencies.
1512
+ command: >
1513
+ git clone https://git.internal/platform/order-platform.git &&
1514
+ cd order-platform && npm ci
1515
+ - step: 2
1516
+ title: Bootstrap the CDK environment
1517
+ description: >
1518
+ Bootstrap CDK in your target AWS account and region. Only required
1519
+ once per account/region pair. Provisions S3 bucket and IAM roles
1520
+ for CDK deployment.
1521
+ command: >
1522
+ npx cdk bootstrap aws://ACCOUNT_ID/eu-west-1
1523
+ - step: 3
1524
+ title: Configure environment variables
1525
+ description: >
1526
+ Copy the example environment file and set your environment-specific
1527
+ values (account ID, VPC CIDR, Cognito domain prefix).
1528
+ command: cp infra/config/sandbox.env.example infra/config/sandbox.env
1529
+ - step: 4
1530
+ title: Deploy the sample application
1531
+ description: >
1532
+ Deploy all CDK stacks to your sandbox account. This provisions VPC,
1533
+ Lambda functions, DynamoDB, API Gateway, Cognito, EventBridge, SQS,
1534
+ CloudWatch dashboards, and alarms. Expect ~15 minutes.
1535
+ command: >
1536
+ npx cdk deploy --all --context env=sandbox --require-approval never
1537
+ - step: 5
1538
+ title: Smoke test the deployment
1539
+ description: >
1540
+ Run the provided Postman collection against the deployed API Gateway
1541
+ endpoint. Collection tests order submission, status query, and
1542
+ error responses.
1543
+ command: >
1544
+ newman run postman/order-platform-smoke-test.json
1545
+ --env-var baseUrl=$(aws cloudformation describe-stacks --stack-name OrderPlatformStack --query "Stacks[0].Outputs[?OutputKey=='ApiGatewayUrl'].OutputValue" --output text)
1546
+ - step: 6
1547
+ title: Review CloudWatch dashboard
1548
+ description: >
1549
+ Open the Order Platform Overview CloudWatch dashboard. Confirm all
1550
+ metrics are populating after the smoke test. Dashboard URL is output
1551
+ by cdk deploy.
1552
+ troubleshooting:
1553
+ - symptom: cdk deploy fails with "Unable to resolve AWS account"
1554
+ cause: AWS CLI not configured or credentials expired
1555
+ resolution: Run "aws configure" or refresh credentials. Run "aws sts get-caller-identity" to verify.
1556
+ - symptom: Lambda function deployment fails with "Package exceeds 250MB limit"
1557
+ cause: node_modules not pruned; dev dependencies included
1558
+ resolution: Ensure CDK bundling uses --omit=dev. Check infra/stacks/order-platform-stack.ts bundling config.
1559
+ - symptom: Postman smoke test fails on POST /orders with 401
1560
+ cause: Cognito User Pool not yet propagated or test user not created
1561
+ resolution: Wait 2 minutes after deploy. Run "npm run create-test-user" to create a Cognito test user.
1562
+ - symptom: DLQ depth alarm fires after smoke test
1563
+ cause: WMS Adapter cannot reach WMS (no WMS in sandbox)
1564
+ resolution: Expected in sandbox. WMS Adapter is configured with a mock WMS stub in sandbox environment.
1565
+
1566
+ raid:
1567
+ risks:
1568
+ - id: RISK-01
1569
+ description: >
1570
+ DynamoDB hot partition: if a single order ID prefix dominates writes
1571
+ (e.g. all orders from a single promotion), a hot partition could
1572
+ cause write throttling, degrading P99 latency above 500ms target.
1573
+ likelihood: low
1574
+ impact: high
1575
+ mitigation: >
1576
+ DynamoDB partition key uses UUID v4 (random, uniformly distributed).
1577
+ CloudWatch alarm on ThrottledRequests metric. Access pattern review
1578
+ required before any bulk import or promotional campaign.
1579
+ owner: Platform SRE
1580
+ residual_risk: low
1581
+
1582
+ - id: RISK-02
1583
+ description: >
1584
+ Lambda cold-start latency spike: during a burst after a quiet period
1585
+ (e.g. Monday morning), Lambda may experience cold starts on the
1586
+ Order Service submission path, causing P99 > 500ms.
1587
+ likelihood: medium
1588
+ impact: medium
1589
+ mitigation: >
1590
+ Provisioned concurrency of 20 on Order Service Lambda, pre-warming
1591
+ the function. CloudWatch Lambda cold-start metric monitored.
1592
+ Provisioned concurrency level reviewed quarterly (cost vs performance).
1593
+ owner: Platform SRE
1594
+ residual_risk: low
1595
+
1596
+ - id: RISK-03
1597
+ description: >
1598
+ Stripe payment gateway outage: Stripe is an external dependency.
1599
+ If Stripe is unavailable, PaymentFailed events will be published
1600
+ and orders will not be authorised, degrading fulfilment.
1601
+ likelihood: low
1602
+ impact: high
1603
+ mitigation: >
1604
+ Retry logic with exponential backoff (3 attempts, 1s/2s/4s) in
1605
+ Payment Service. SQS-backed retry queue for failed payment events.
1606
+ Stripe SLA 99.99% — incidents < 1 per year historically.
1607
+ Runbook for Stripe incident: ops/runbooks/stripe-incident.md.
1608
+ Customer notification sent on PaymentFailed to reduce support load.
1609
+ owner: Order Platform Engineering
1610
+ residual_risk: medium
1611
+
1612
+ - id: RISK-04
1613
+ description: >
1614
+ Schema evolution breaking change: a service publishes an updated
1615
+ event schema removing a field that a subscriber depends on,
1616
+ causing subscriber Lambda errors and DLQ accumulation.
1617
+ likelihood: medium
1618
+ impact: medium
1619
+ mitigation: >
1620
+ EventBridge Schema Registry with backward compatibility check in CI
1621
+ (fails build if breaking change detected). Schema versioning:
1622
+ major version bump required for breaking changes; new event type
1623
+ published in parallel until all subscribers migrated.
1624
+ owner: Enterprise Architecture
1625
+ residual_risk: low
1626
+
1627
+ - id: RISK-05
1628
+ description: >
1629
+ PCI-DSS scope creep: a developer inadvertently stores card data
1630
+ in DynamoDB or logs, widening PCI-DSS scope and triggering a
1631
+ compliance remediation.
1632
+ likelihood: low
1633
+ impact: high
1634
+ mitigation: >
1635
+ No card data accepted by the Order API — only payment tokens.
1636
+ API Gateway request validator rejects payloads containing card
1637
+ number patterns (WAF custom rule). Developer training on PCI-DSS
1638
+ scope in onboarding. Quarterly code review by Security Architecture.
1639
+ owner: Information Security
1640
+ residual_risk: low
1641
+
1642
+ - id: RISK-06
1643
+ description: >
1644
+ WMS integration failure accumulation: WMS system maintenance or
1645
+ API changes cause WMS Adapter Lambda failures to accumulate in
1646
+ DLQ, delaying fulfilment for many orders.
1647
+ likelihood: medium
1648
+ impact: medium
1649
+ mitigation: >
1650
+ DLQ depth alarm (P1 if > 10 messages). Ops runbook for DLQ
1651
+ reprocessing after WMS recovery. WMS Adapter uses circuit breaker
1652
+ pattern — if WMS fails > 50% of calls in 1 minute, stops calling
1653
+ WMS and sends alert to WMS team.
1654
+ owner: Platform SRE
1655
+ residual_risk: medium
1656
+
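RISK-06 describes a circuit breaker that stops calling WMS once more than 50% of calls fail within one minute. A minimal sketch of that rule follows. The class name, the `min_calls` guard (so a single early failure does not trip the breaker), and the absence of any recovery or half-open state are assumptions; the source does not specify them.

```python
from collections import deque


class FailureRateBreaker:
    """Opens when the failure ratio in a sliding time window exceeds a threshold."""

    def __init__(self, threshold=0.5, window_s=60.0, min_calls=3):
        self.threshold = threshold
        self.window_s = window_s
        self.min_calls = min_calls  # hypothetical guard, not from the source
        self.calls = deque()  # (timestamp, succeeded) pairs, oldest first
        self.open = False

    def record(self, now, succeeded):
        """Record one call outcome and re-evaluate the breaker."""
        self.calls.append((now, succeeded))
        # Drop outcomes that have fallen out of the window.
        while self.calls[0][0] <= now - self.window_s:
            self.calls.popleft()
        failures = sum(1 for _, ok in self.calls if not ok)
        if len(self.calls) >= self.min_calls and failures / len(self.calls) > self.threshold:
            self.open = True  # a real adapter would also alert the WMS team here

    def allow_call(self):
        return not self.open


breaker = FailureRateBreaker()
for t, ok in [(0.0, True), (1.0, False), (2.0, False)]:
    breaker.record(t, ok)
print(breaker.allow_call())
```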
1657
+ assumptions:
1658
+ - assumption: >
1659
+ EventBridge event delivery is at-least-once (AWS guarantee);
1660
+ consumers are designed to be idempotent.
1661
+ consequence_if_violated: >
1662
+ Duplicate order processing possible. All consumers use DynamoDB
1663
+ conditional writes to reject already-processed events.
1664
+ - assumption: >
1665
+ DynamoDB on-demand capacity can absorb 10× sustained throughput
1666
+ increase without pre-warming.
1667
+ consequence_if_violated: >
1668
+ DynamoDB may throttle on sudden 10× spike. Mitigation: monitor
1669
+ throttle events; switch to provisioned capacity with auto-scaling
1670
+ if on-demand throughput ramp-up is too slow.
1671
+ - assumption: >
1672
+ The WMS API is stable and does not change its request format
1673
+ without advance notice to the Platform team.
1674
+ consequence_if_violated: >
1675
+ WMS Adapter Lambda will fail and DLQ will accumulate. Runbook:
1676
+ ops/runbooks/wms-api-change.md.
1677
+
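The first assumption relies on consumers rejecting already-processed events through conditional writes. The sketch below shows the idea with an in-memory stand-in for the table, so it runs without AWS access. With DynamoDB the equivalent would be a put guarded by an `attribute_not_exists` condition on the key. The `ProcessedEvents` class and `handle` function are hypothetical.

```python
class ProcessedEvents:
    """In-memory stand-in for a table that supports a conditional put."""

    def __init__(self):
        self._seen = set()

    def put_if_absent(self, event_id):
        """Return True if the id was newly recorded, False if it already existed."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        return True


def handle(event, store, side_effects):
    """Apply the event once; a redelivered event is acknowledged but skipped."""
    if not store.put_if_absent(event["id"]):
        return False
    side_effects.append(event["detail"])
    return True


store, effects = ProcessedEvents(), []
event = {"id": "evt-1", "detail": {"orderId": "o-123"}}
print(handle(event, store, effects), handle(event, store, effects), len(effects))
```

Recording the event id before performing the side effect keeps duplicates out, at the cost of needing a separate recovery path if the side effect itself fails after the id is stored.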
1678
+ constraints:
1679
+ - type: regulatory
1680
+ description: PCI-DSS Level 1 compliance is mandatory for the payment authorisation flow.
1681
+ implication: >
1682
+ No card data may be stored or logged anywhere in the platform.
1683
+ All data at rest must be encrypted with KMS CMKs. All API traffic
1684
+ must use TLS 1.2+. Annual QSA audit required.
1685
+ - type: organisational
1686
+ description: AWS is the mandated cloud provider; no multi-cloud or on-premises compute.
1687
+ implication: >
1688
+ All services must use AWS managed services. No Kubernetes, no
1689
+ on-premises databases, no third-party compute.
1690
+ - type: technical
1691
+ description: >
1692
+ Maximum Lambda package size is 250MB unzipped. Maximum Lambda
1693
+ execution time is 15 minutes.
1694
+ implication: >
1695
+ Lambda functions must have lean dependency trees. Any processing
1696
+ requiring > 15 minutes must be decomposed into smaller functions
1697
+ chained via EventBridge or SQS.
1698
+ - type: organisational
1699
+ description: Team size is 4 engineers; no dedicated DBA or platform operator.
1700
+ implication: >
1701
+ Architecture must minimise operational overhead. Fully managed
1702
+ services preferred. No self-managed databases or container clusters.
1703
+ - type: financial
1704
+ description: "Monthly AWS spend budget for order platform: $8,000/month at steady state."
1705
+ implication: >
1706
+ Lambda (pay-per-use) and DynamoDB (on-demand) chosen to minimise
1707
+ cost at current volumes. Cost review quarterly via AWS Cost Explorer.
1708
+
1709
+ tradeoffs:
1710
+ - decision_ref: ADR-001
1711
+ description: Event choreography vs synchronous REST
1712
+ what_was_sacrificed: Immediate consistency and simple linear call tracing
1713
+ what_was_gained: >
1714
+ Service decoupling, independent deployability (DR-04), elastic scale (DR-01),
1715
+ immutable audit trail (DR-05)
1716
+ - decision_ref: ADR-002
1717
+ description: DynamoDB vs RDS PostgreSQL
1718
+ what_was_sacrificed: >
1719
+ SQL query flexibility; complex ad-hoc queries not possible
1720
+ in the order service
1721
+ what_was_gained: >
1722
+ Sub-5ms latency at any scale; zero ops overhead; serverless
1723
+ cost model (DR-01, DR-03)
1724
+ - decision_ref: ADR-003
1725
+ description: Lambda vs ECS Fargate
1726
+ what_was_sacrificed: >
1727
+ Always-on latency consistency; container model familiarity
1728
+ what_was_gained: >
1729
+ Zero operational overhead; automatic scale to 0; pay-per-use
1730
+ cost model; no cluster management
1731
+ - decision_ref: ADR-004
1732
+ description: EventBridge vs SNS
1733
+ what_was_sacrificed: Lower per-event cost (SNS is cheaper at very high volume)
1734
+ what_was_gained: >
1735
+ Schema registry enforcement; content-based routing; Archive
1736
+ and Replay (DR-05); no custom routing logic in application code
1737
+ - decision_ref: ADR-005
1738
+ description: DynamoDB single-table vs multi-table
1739
+ what_was_sacrificed: >
1740
+ Simpler mental model; easier developer onboarding for DynamoDB newcomers
1741
+ what_was_gained: >
1742
+ Optimal query performance for all defined access patterns;
1743
+ single-digit ms reads via GSI; fewer tables to manage
1744
+
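+ # Illustrative sketch only (not taken from this architecture's IaC): roughly how
+ # the ADR-002 and ADR-005 choices (on-demand DynamoDB, single-table design with
+ # composite keys and a GSI) might appear in a synthesised CloudFormation template.
+ # The resource name, attribute names (PK, SK, GSI1PK, GSI1SK), index name, and
+ # the key formats in the trailing comments are assumptions made for illustration.
+ #
+ # OrderTable:
+ #   Type: AWS::DynamoDB::Table
+ #   Properties:
+ #     BillingMode: PAY_PER_REQUEST              # on-demand capacity
+ #     AttributeDefinitions:
+ #       - { AttributeName: PK, AttributeType: S }
+ #       - { AttributeName: SK, AttributeType: S }
+ #       - { AttributeName: GSI1PK, AttributeType: S }
+ #       - { AttributeName: GSI1SK, AttributeType: S }
+ #     KeySchema:
+ #       - { AttributeName: PK, KeyType: HASH }    # hypothetical: ORDER#<orderId>
+ #       - { AttributeName: SK, KeyType: RANGE }   # hypothetical: ITEM#<lineNo>
+ #     GlobalSecondaryIndexes:
+ #       - IndexName: GSI1
+ #         KeySchema:
+ #           - { AttributeName: GSI1PK, KeyType: HASH }
+ #           - { AttributeName: GSI1SK, KeyType: RANGE }
+ #         Projection: { ProjectionType: ALL }
+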
1745
+ governance:
1746
+ applicable_standards:
1747
+ - id: PCI-DSS
1748
+ name: Payment Card Industry Data Security Standard Level 1
1749
+ relevance: >
1750
+ The payment authorisation flow processes payment tokens and interacts
1751
+ with Stripe. PCI-DSS Level 1 applies to all systems that transmit
1752
+ cardholder data or interact with payment processing systems.
1753
+ - id: ISO-27001
1754
+ name: ISO/IEC 27001:2022 Information Security Management
1755
+ relevance: >
1756
+ Enterprise security standard. Applies to all production systems.
1757
+ Controls implemented via IAM least-privilege, KMS encryption,
1758
+ CloudTrail audit logging, and WAF.
1759
+ - id: GDPR
1760
+ name: General Data Protection Regulation (EU) 2016/679
1761
+ relevance: >
1762
+ Customer order data includes PII (name, address, email). GDPR
1763
+ applies to all processing of EU customer data. Data residency
1764
+ in eu-west-1 (Ireland) keeps data in the EU, avoiding Chapter V transfers.
1765
+ - id: ESS-01
1766
+ name: Enterprise Security Standard 01 — Encryption in Transit
1767
+ relevance: Requires TLS 1.2+ for all API traffic. Satisfied by API Gateway.
1768
+ - id: ESS-03
1769
+ name: Enterprise Security Standard 03 — Encryption at Rest
1770
+ relevance: >
1771
+ Requires KMS CMK encryption for all data stores containing
1772
+ customer or payment data.
1773
+ compliance_mapping:
1774
+ - control: "PCI-DSS Requirement 1: Network security controls"
1775
+ design_element: >
1776
+ CloudFront + WAF (OWASP CRS v3.2); VPC with private subnets;
1777
+ no direct internet access to Lambda or DynamoDB; VPC endpoints
1778
+ for all AWS service traffic.
1779
+ evidence: Architecture Views — Deployment; Architecture Views — Security
1780
+ owner: Information Security
1781
+
1782
+ - control: "PCI-DSS Requirement 2: Secure configurations"
1783
+ design_element: >
1784
+ All infrastructure defined in CDK (PRIN-04); no manual console
1785
+ changes; CloudFormation drift detection nightly; Lambda functions
1786
+ have no inbound network access except via API Gateway.
1787
+ evidence: Implementation Artifacts — IaC templates; Architecture Decisions ADR-003
1788
+ owner: Platform SRE
1789
+
1790
+ - control: "PCI-DSS Requirement 3: Protect stored account data"
1791
+ design_element: >
1792
+ No card data stored anywhere in the platform. Order API accepts
1793
+ payment tokens only (Stripe tokenisation). WAF custom rule rejects
1794
+ payloads containing card number patterns. DynamoDB stores order
1795
+ reference and Stripe charge ID only.
1796
+ evidence: Architecture Views — Security; RAID Constraints (PCI-DSS); ADR-002
1797
+ owner: Information Security
1798
+
1799
+ - control: "PCI-DSS Requirement 4: Protect cardholder data in transit"
1800
+ design_element: >
1801
+ TLS 1.2+ enforced by API Gateway (minimum TLS policy). Stripe
1802
+ integration uses TLS 1.2+. All internal AWS SDK calls use HTTPS
1803
+ via VPC endpoints.
1804
+ evidence: Architecture Views — Security (Trust Zone 1–3)
1805
+ owner: Information Security
1806
+
1807
+ - control: "PCI-DSS Requirement 6: Develop and maintain secure systems"
1808
+ design_element: >
1809
+ AWS Security Hub PCI-DSS standard enabled; HIGH/CRITICAL findings
1810
+ block CodePipeline deployment. Quarterly penetration testing.
1811
+ OWASP CRS on WAF mitigates common OWASP Top 10 attack classes.
1812
+ evidence: Quality Attributes — Security; Implementation Artifacts — CI/CD
1813
+ owner: Information Security
1814
+
1815
+ - control: "PCI-DSS Requirement 7: Restrict access to system components"
1816
+ design_element: >
1817
+ Each Lambda function has a least-privilege IAM role permitting
1818
+ only its specific resource actions. Cognito JWT authorisation on
1819
+ all API endpoints. No shared service accounts.
1820
+ evidence: Architecture Views — Security (Trust Zone 3); Component Classification
1821
+ owner: Information Security
1822
+
1823
+ - control: "PCI-DSS Requirement 8: Identify users and authenticate access"
1824
+ design_element: >
1825
+ Cognito User Pool with MFA for operator access. JWT tokens with
1826
+ 1-hour expiry. Customer authentication via Cognito. No shared
1827
+ credentials.
1828
+ evidence: Architecture Views — Security (Trust Zone 2); element catalog (Cognito)
1829
+ owner: Information Security
1830
+
1831
+ - control: "PCI-DSS Requirement 10: Log and monitor all access"
1832
+ design_element: >
1833
+ CloudTrail (management and data events) with write-once S3 Object
1834
+ Lock. Structured JSON logs from all Lambda functions with correlation
1835
+ IDs. X-Ray distributed tracing. CloudWatch alarms for anomalous
1836
+ activity.
1837
+ evidence: Architecture Views — Security (Trust Zone 5); Operational Model — Monitoring
1838
+ owner: Platform SRE
1839
+
1840
+ - control: "PCI-DSS Requirement 12: Support information security with policies"
1841
+ design_element: >
1842
+ Architecture Board governance process. Annual QSA audit. Quarterly
1843
+ internal penetration testing. Exception register maintained (this document).
1844
+ evidence: Governance and Compliance (this section); Decisions and Actions
1845
+ owner: Risk and Compliance
1846
+
1847
+ - control: "GDPR Article 5: Data minimisation"
1848
+ design_element: >
1849
+ Order data schema includes only fields necessary for order processing.
1850
+ No free-text fields that could contain unexpected PII. Schema
1851
+ enforced by API Gateway request validator and EventBridge schema.
1852
+ evidence: API specification (api/order-api.openapi.yaml)
1853
+ owner: Risk and Compliance
1854
+
1855
+ - control: "GDPR Article 25: Data protection by design"
1856
+ design_element: >
1857
+ Customer data encrypted at rest (KMS CMK) and in transit (TLS 1.2+).
1858
+ Data retention minimised: orders deleted after 7 years per tax law
1859
+ (S3 lifecycle policy). Access to order data restricted by IAM.
1860
+ evidence: Architecture Views — Security; RAID Constraints
1861
+ owner: Risk and Compliance
1862
+
1863
+ - control: "ESS-01: Encryption in transit"
1864
+ design_element: >
1865
+ API Gateway minimum TLS 1.2 policy. VPC endpoints for internal
1866
+ traffic (no public internet for data plane). Stripe TLS 1.2+.
1867
+ evidence: Architecture Views — Security
1868
+ owner: Information Security
1869
+
1870
+ - control: "ESS-03: Encryption at rest"
1871
+ design_element: >
1872
+ DynamoDB KMS CMK. SQS KMS CMK. S3 KMS CMK. EventBridge does not
1873
+ store event data at rest beyond delivery. CloudTrail KMS CMK.
1874
+ evidence: Architecture Views — Security (Trust Zone 4); element catalog
1875
+ owner: Information Security
1876
+ exceptions: []
1877
+
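+ # Illustrative sketch only: one way the "minimum TLS policy" cited for ESS-01 and
+ # PCI-DSS Requirement 4 could be expressed for an API Gateway custom domain in
+ # CloudFormation. The resource name, domain name, and certificate reference are
+ # placeholders, and the use of a regional custom domain is an assumption.
+ #
+ # OrderApiDomain:
+ #   Type: AWS::ApiGateway::DomainName
+ #   Properties:
+ #     DomainName: orders.example.com
+ #     RegionalCertificateArn: !Ref OrderApiCertificate
+ #     EndpointConfiguration:
+ #       Types: [REGIONAL]
+ #     SecurityPolicy: TLS_1_2                   # clients below TLS 1.2 are refused
+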
1878
+ decisions_and_actions:
1879
+ governance_outcome: approved
1880
+ decision_statement: >
1881
+ This reference architecture is approved as the mandatory pattern for
1882
+ all event-driven order processing microservices in the e-commerce domain
1883
+ on AWS. All new order processing services must conform to this reference
1884
+ architecture or obtain an Architecture Board exception. Effective Q2 2026.
1885
+ conditions: []
1886
+ next_actions:
1887
+ - description: Publish architecture to internal developer portal (Backstage)
1888
+ owner: Enterprise Architecture
1889
+ target_date: "2026-04-05"
1890
+ - description: >
1891
+ Migrate Order Service v1 (monolith) to conform with this reference
1892
+ architecture (initial extraction of Order domain)
1893
+ owner: Order Platform Engineering
1894
+ target_date: "2026-06-30"
1895
+ - description: >
1896
+ Complete annual DR exercise to validate RTO 4h / RPO 1h targets
1897
+ owner: Platform SRE
1898
+ target_date: "2026-07-31"
1899
+ - description: >
1900
+ Conduct first annual PCI-DSS QSA audit against this architecture
1901
+ owner: Risk and Compliance
1902
+ target_date: "2026-09-30"
1903
+ - description: >
1904
+ Review and update this reference architecture at 6-month review
1905
+ (next_review_date: 2026-09-20)
1906
+ owner: Enterprise Architecture
1907
+ target_date: "2026-09-20"
1908
+
1909
+ evolution:
1910
+ version: "1.0.0"
1911
+ known_limitations:
1912
+ - >
1913
+ GraphQL API not yet supported — add as extension point in v1.1
1914
+ if product teams require GraphQL for mobile clients.
1915
+ - >
1916
+ Multi-region active-active not yet supported — active-passive DR
1917
+ only. Active-active requires conflict resolution strategy for
1918
+ DynamoDB Global Tables write conflicts.
1919
+ - >
1920
+ Observability does not yet include real-time business metrics
1921
+ (order conversion rate, cart abandonment). Add in v1.1 via
1922
+ EventBridge → Kinesis → real-time dashboard.
1923
+ roadmap:
1924
+ - version: "1.1.0"
1925
+ planned_date: "2026-09-20"
1926
+ planned_changes:
1927
+ - Add GraphQL API extension point (API Gateway + AppSync variant)
1928
+ - Add real-time business metrics (Kinesis + QuickSight)
1929
+ - Add fraud detection service integration (platform service adapter)
1930
+ - Update getting-started guide for Backstage Software Template
1931
+ - version: "2.0.0"
1932
+ planned_date: "2027-Q1"
1933
+ planned_changes:
1934
+ - Evaluate active-active multi-region (DynamoDB Global Tables v2)
1935
+ - Evaluate Lambda SnapStart for Java runtime variant
1936
+ - Incorporate Amazon Bedrock for order anomaly detection
1937
+ deprecation_strategy: >
1938
+ When a major version (2.0.0) is published, v1.x will remain supported
1939
+ for 12 months. Adopting teams will receive migration guidance
1940
+ (migration/v1-to-v2.md) and 6 months advance notice. After 12 months,
1941
+ v1.x support ends and services must have been migrated.
1942
+ feedback_channel: >
1943
+ Submit issues and feedback via the platform GitHub repository
1944
+ (Issues labelled "reference-architecture"). Architecture Board reviews
1945
+ feedback at quarterly governance meeting. SREs report operational
1946
+ issues via PagerDuty post-incident reviews.
1947
+
1948
+ glossary:
1949
+ - term: API Gateway
1950
+ definition: >
1951
+ AWS managed service providing REST API endpoint, TLS termination,
1952
+ Cognito JWT authorisation, and request validation.
1953
+ - term: CDK (Cloud Development Kit)
1954
+ definition: >
1955
+ AWS Cloud Development Kit v2 — TypeScript-based IaC framework that
1956
+ synthesises to CloudFormation templates.
1957
+ - term: Choreography
1958
+ definition: >
1959
+ Event-driven coordination pattern where services react to events
1960
+ published by other services without a central orchestrator.
1961
+ - term: Cognito
1962
+ definition: >
1963
+ AWS managed identity service providing user pools (authentication),
1964
+ JWT issuance, and MFA.
1965
+ - term: Correlation ID
1966
+ definition: >
1967
+ UUID v4 generated at order submission and threaded through all
1968
+ subsequent processing steps, logs, events, and database records
1969
+ for end-to-end tracing.
1970
+ - term: DLQ (Dead Letter Queue)
1971
+ definition: >
1972
+ SQS queue that receives messages failing processing after the
1973
+ maximum receive count (3 attempts). Used for error isolation
1974
+ and manual reprocessing.
1975
+ - term: DynamoDB
1976
+ definition: >
1977
+ AWS managed NoSQL key-value and document database with single-digit
1978
+ millisecond latency at any scale.
1979
+ - term: EventBridge
1980
+ definition: >
1981
+ AWS managed serverless event bus with content-based routing rules
1982
+ and schema registry.
1983
+ - term: Fitness Function
1984
+ definition: >
1985
+ Automated test or check in CI/CD that continuously validates a
1986
+ quality attribute target.
1987
+ - term: Golden Path
1988
+ definition: >
1989
+ Spotify-coined term for a well-lit, well-supported, easy-to-follow
1990
+ implementation path that reduces decision fatigue for development teams.
1991
+ - term: GSI (Global Secondary Index)
1992
+ definition: >
1993
+ DynamoDB index on non-primary-key attributes, enabling efficient
1994
+ queries by alternate access patterns.
1995
+ - term: IAM (Identity and Access Management)
1996
+ definition: >
1997
+ AWS service managing permissions. Each Lambda function has a
1998
+ least-privilege IAM role.
1999
+ - term: Idempotent
2000
+ definition: >
2001
+ An operation that produces the same result regardless of how many
2002
+ times it is executed with the same input. Required for all
2003
+ EventBridge subscribers to handle at-least-once delivery.
2004
+ - term: KMS (Key Management Service)
2005
+ definition: >
2006
+ AWS managed encryption key service. Customer-managed keys (CMKs)
2007
+ provide full key lifecycle control.
2008
+ - term: Lambda
2009
+ definition: >
2010
+ AWS serverless function-as-a-service compute. Executes code in
2011
+ response to events (API Gateway, EventBridge, SQS) with automatic
2012
+ scaling and pay-per-invocation billing.
2013
+ - term: PCI-DSS
2014
+ definition: >
2015
+ Payment Card Industry Data Security Standard. Level 1 applies to
2016
+ merchants processing > 6 million card transactions per year.
2017
+ - term: Provisioned Concurrency
2018
+ definition: >
2019
+ Lambda feature that pre-initialises a specified number of function
2020
+ instances, eliminating cold-start latency.
2021
+ - term: RPO (Recovery Point Objective)
2022
+ definition: "Maximum acceptable data loss measured in time (target: 1 hour)."
2023
+ - term: RTO (Recovery Time Objective)
2024
+ definition: "Maximum acceptable time to restore service after an incident (target: 4 hours)."
2025
+ - term: RULERS
2026
+ definition: >
2027
+ EAROS evidence-anchoring protocol: for each criterion, extract a
2028
+ direct quote or reference from the artifact before assigning a score.
2029
+ - term: Schema Registry
2030
+ definition: >
2031
+ EventBridge feature that stores and validates event schemas,
2032
+ ensuring publishers and subscribers agree on event structure.
2033
+ - term: Single-table design
2034
+ definition: >
2035
+ DynamoDB design pattern where multiple entity types coexist in
2036
+ one table, encoded via composite PK/SK patterns.
2037
+ - term: SLO (Service Level Objective)
2038
+ definition: >
2039
+ Internal target for service reliability (availability, latency).
2040
+ Distinct from SLA (contractual commitment with customers).
2041
+ - term: VPC (Virtual Private Cloud)
2042
+ definition: >
2043
+ AWS isolated network environment. Lambda functions in private
2044
+ VPC subnets have no direct internet access.
2045
+ - term: WAF (Web Application Firewall)
2046
+ definition: >
2047
+ AWS managed firewall enforcing OWASP Core Rule Set v3.2 on all
2048
+ inbound API traffic.
2049
+ - term: WMS (Warehouse Management System)
2050
+ definition: >
2051
+ External fulfilment system consuming order dispatch instructions
2052
+ from the SQS Fulfilment Queue via the WMS Adapter Lambda.
2053
+ - term: X-Ray
2054
+ definition: >
2055
+ AWS distributed tracing service. Active tracing on all Lambda
2056
+ functions provides end-to-end request traces with correlation IDs.
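+
+ # Illustrative sketch only: the behaviour in the DLQ glossary entry (messages
+ # moved after a maximum receive count of 3) corresponds to an SQS redrive policy.
+ # Shown as a CloudFormation fragment; the queue and key names are hypothetical.
+ #
+ # FulfilmentQueue:
+ #   Type: AWS::SQS::Queue
+ #   Properties:
+ #     KmsMasterKeyId: !Ref OrderPlatformKey
+ #     RedrivePolicy:
+ #       deadLetterTargetArn: !GetAtt FulfilmentDlq.Arn
+ #       maxReceiveCount: 3                      # after 3 failed receives, move to DLQ
+ # FulfilmentDlq:
+ #   Type: AWS::SQS::Queue
+ #   Properties:
+ #     KmsMasterKeyId: !Ref OrderPlatformKey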