@trohde/earos 1.0.0
- package/README.md +156 -0
- package/assets/init/.agents/skills/earos-artifact-gen/SKILL.md +106 -0
- package/assets/init/.agents/skills/earos-artifact-gen/references/interview-guide.md +313 -0
- package/assets/init/.agents/skills/earos-artifact-gen/references/output-guide.md +367 -0
- package/assets/init/.agents/skills/earos-assess/SKILL.md +212 -0
- package/assets/init/.agents/skills/earos-assess/references/calibration-benchmarks.md +160 -0
- package/assets/init/.agents/skills/earos-assess/references/output-templates.md +311 -0
- package/assets/init/.agents/skills/earos-assess/references/scoring-protocol.md +281 -0
- package/assets/init/.agents/skills/earos-calibrate/SKILL.md +153 -0
- package/assets/init/.agents/skills/earos-calibrate/references/agreement-metrics.md +188 -0
- package/assets/init/.agents/skills/earos-calibrate/references/calibration-protocol.md +263 -0
- package/assets/init/.agents/skills/earos-create/SKILL.md +257 -0
- package/assets/init/.agents/skills/earos-create/references/criterion-writing-guide.md +268 -0
- package/assets/init/.agents/skills/earos-create/references/dependency-rules.md +193 -0
- package/assets/init/.agents/skills/earos-create/references/rubric-interview-guide.md +123 -0
- package/assets/init/.agents/skills/earos-create/references/validation-checklist.md +238 -0
- package/assets/init/.agents/skills/earos-profile-author/SKILL.md +251 -0
- package/assets/init/.agents/skills/earos-profile-author/references/criterion-writing-guide.md +280 -0
- package/assets/init/.agents/skills/earos-profile-author/references/design-methods.md +158 -0
- package/assets/init/.agents/skills/earos-profile-author/references/profile-checklist.md +173 -0
- package/assets/init/.agents/skills/earos-remediate/SKILL.md +118 -0
- package/assets/init/.agents/skills/earos-remediate/references/output-template.md +199 -0
- package/assets/init/.agents/skills/earos-remediate/references/remediation-patterns.md +330 -0
- package/assets/init/.agents/skills/earos-report/SKILL.md +85 -0
- package/assets/init/.agents/skills/earos-report/references/portfolio-template.md +181 -0
- package/assets/init/.agents/skills/earos-report/references/single-artifact-template.md +168 -0
- package/assets/init/.agents/skills/earos-review/SKILL.md +130 -0
- package/assets/init/.agents/skills/earos-review/references/challenge-patterns.md +163 -0
- package/assets/init/.agents/skills/earos-review/references/output-template.md +180 -0
- package/assets/init/.agents/skills/earos-template-fill/SKILL.md +177 -0
- package/assets/init/.agents/skills/earos-template-fill/references/evidence-writing-guide.md +186 -0
- package/assets/init/.agents/skills/earos-template-fill/references/section-rubric-mapping.md +200 -0
- package/assets/init/.agents/skills/earos-validate/SKILL.md +113 -0
- package/assets/init/.agents/skills/earos-validate/references/fix-patterns.md +281 -0
- package/assets/init/.agents/skills/earos-validate/references/validation-checks.md +287 -0
- package/assets/init/.claude/CLAUDE.md +4 -0
- package/assets/init/AGENTS.md +293 -0
- package/assets/init/CLAUDE.md +635 -0
- package/assets/init/README.md +507 -0
- package/assets/init/calibration/gold-set/.gitkeep +0 -0
- package/assets/init/calibration/results/.gitkeep +0 -0
- package/assets/init/core/core-meta-rubric.yaml +643 -0
- package/assets/init/docs/consistency-report.md +325 -0
- package/assets/init/docs/getting-started.md +194 -0
- package/assets/init/docs/profile-authoring-guide.md +51 -0
- package/assets/init/docs/terminology.md +126 -0
- package/assets/init/earos.manifest.yaml +104 -0
- package/assets/init/evaluations/.gitkeep +0 -0
- package/assets/init/examples/aws-event-driven-order-processing/artifact.yaml +2056 -0
- package/assets/init/examples/aws-event-driven-order-processing/evaluation.yaml +973 -0
- package/assets/init/examples/aws-event-driven-order-processing/report.md +244 -0
- package/assets/init/examples/example-solution-architecture.evaluation.yaml +136 -0
- package/assets/init/examples/multi-cloud-data-analytics/artifact.yaml +715 -0
- package/assets/init/overlays/data-governance.yaml +94 -0
- package/assets/init/overlays/regulatory.yaml +154 -0
- package/assets/init/overlays/security.yaml +92 -0
- package/assets/init/profiles/adr.yaml +225 -0
- package/assets/init/profiles/capability-map.yaml +223 -0
- package/assets/init/profiles/reference-architecture.yaml +426 -0
- package/assets/init/profiles/roadmap.yaml +205 -0
- package/assets/init/profiles/solution-architecture.yaml +227 -0
- package/assets/init/research/architecture-assessment-rubrics-research.docx +0 -0
- package/assets/init/research/architecture-assessment-rubrics-research.md +566 -0
- package/assets/init/research/reference-architecture-research.md +751 -0
- package/assets/init/standard/EAROS.md +1426 -0
- package/assets/init/standard/schemas/artifact.schema.json +1295 -0
- package/assets/init/standard/schemas/artifact.uischema.json +65 -0
- package/assets/init/standard/schemas/evaluation.schema.json +284 -0
- package/assets/init/standard/schemas/rubric.schema.json +383 -0
- package/assets/init/templates/evaluation-record.template.yaml +58 -0
- package/assets/init/templates/new-profile.template.yaml +65 -0
- package/bin.js +188 -0
- package/dist/assets/_basePickBy-BVu6YmSW.js +1 -0
- package/dist/assets/_baseUniq-CWRzQDz_.js +1 -0
- package/dist/assets/arc-CyDBhtDM.js +1 -0
- package/dist/assets/architectureDiagram-2XIMDMQ5-BH6O4dvN.js +36 -0
- package/dist/assets/blockDiagram-WCTKOSBZ-2xmwdjpg.js +132 -0
- package/dist/assets/c4Diagram-IC4MRINW-BNmPRFJF.js +10 -0
- package/dist/assets/channel-CiySTNoJ.js +1 -0
- package/dist/assets/chunk-4BX2VUAB-DGQTvirp.js +1 -0
- package/dist/assets/chunk-55IACEB6-DNMAQAC_.js +1 -0
- package/dist/assets/chunk-FMBD7UC4-BJbVTQ5o.js +15 -0
- package/dist/assets/chunk-JSJVCQXG-BCxUL74A.js +1 -0
- package/dist/assets/chunk-KX2RTZJC-H7wWZOfz.js +1 -0
- package/dist/assets/chunk-NQ4KR5QH-BK4RlTQF.js +220 -0
- package/dist/assets/chunk-QZHKN3VN-0chxDV5g.js +1 -0
- package/dist/assets/chunk-WL4C6EOR-DexfQ-AV.js +189 -0
- package/dist/assets/classDiagram-VBA2DB6C-D7luWJQn.js +1 -0
- package/dist/assets/classDiagram-v2-RAHNMMFH-D7luWJQn.js +1 -0
- package/dist/assets/clone-ylgRbd3D.js +1 -0
- package/dist/assets/cose-bilkent-S5V4N54A-DS2IOCfZ.js +1 -0
- package/dist/assets/cytoscape.esm-CyJtwmzi.js +331 -0
- package/dist/assets/dagre-KLK3FWXG-BbSoTTa3.js +4 -0
- package/dist/assets/defaultLocale-DX6XiGOO.js +1 -0
- package/dist/assets/diagram-E7M64L7V-C9TvYgv0.js +24 -0
- package/dist/assets/diagram-IFDJBPK2-DowUMWrg.js +43 -0
- package/dist/assets/diagram-P4PSJMXO-BL6nrnQF.js +24 -0
- package/dist/assets/erDiagram-INFDFZHY-rXPRl8VM.js +70 -0
- package/dist/assets/flowDiagram-PKNHOUZH-DBRM99-W.js +162 -0
- package/dist/assets/ganttDiagram-A5KZAMGK-INcWFsBT.js +292 -0
- package/dist/assets/gitGraphDiagram-K3NZZRJ6-DMwpfE91.js +65 -0
- package/dist/assets/graph-DLQn37b-.js +1 -0
- package/dist/assets/index-BFFITMT8.js +650 -0
- package/dist/assets/index-H7f6VTz1.css +1 -0
- package/dist/assets/infoDiagram-LFFYTUFH-B0f4TWRM.js +2 -0
- package/dist/assets/init-Gi6I4Gst.js +1 -0
- package/dist/assets/ishikawaDiagram-PHBUUO56-CsU6XimZ.js +70 -0
- package/dist/assets/journeyDiagram-4ABVD52K-CQ7ibNib.js +139 -0
- package/dist/assets/kanban-definition-K7BYSVSG-DzEN7THt.js +89 -0
- package/dist/assets/katex-B1X10hvy.js +261 -0
- package/dist/assets/layout-C0dvb42R.js +1 -0
- package/dist/assets/linear-j4a8mGj7.js +1 -0
- package/dist/assets/mindmap-definition-YRQLILUH-DP8iEuCf.js +68 -0
- package/dist/assets/ordinal-Cboi1Yqb.js +1 -0
- package/dist/assets/pieDiagram-SKSYHLDU-BpIAXgAm.js +30 -0
- package/dist/assets/quadrantDiagram-337W2JSQ-DrpXn5Eg.js +7 -0
- package/dist/assets/requirementDiagram-Z7DCOOCP-Bg7EwHlG.js +73 -0
- package/dist/assets/sankeyDiagram-WA2Y5GQK-BWagRs1F.js +10 -0
- package/dist/assets/sequenceDiagram-2WXFIKYE-q5jwhivG.js +145 -0
- package/dist/assets/stateDiagram-RAJIS63D-B_J9pE-2.js +1 -0
- package/dist/assets/stateDiagram-v2-FVOUBMTO-Q_1GcybB.js +1 -0
- package/dist/assets/timeline-definition-YZTLITO2-dv0jgQ0z.js +61 -0
- package/dist/assets/treemap-KZPCXAKY-Dt1dkIE7.js +162 -0
- package/dist/assets/vennDiagram-LZ73GAT5-BdO5RgRZ.js +34 -0
- package/dist/assets/xychartDiagram-JWTSCODW-CpDVe-8v.js +7 -0
- package/dist/index.html +23 -0
- package/export-docx.js +1583 -0
- package/init.js +353 -0
- package/manifest-cli.mjs +207 -0
- package/package.json +83 -0
- package/schemas/artifact.schema.json +1295 -0
- package/schemas/artifact.uischema.json +65 -0
- package/schemas/evaluation.schema.json +284 -0
- package/schemas/rubric.schema.json +383 -0
- package/serve.js +238 -0
@@ -0,0 +1,2056 @@
+kind: artifact
+artifact_type: reference_architecture
+
+metadata:
+  title: Event-Driven Order Processing Platform on AWS
+  version: 1.0.0
+  status: approved
+  author: Thomas Rohde
+  owner: Enterprise Architecture, E-Commerce Platform Domain
+  effective_date: "2026-03-20"
+  next_review_date: "2026-09-20"
+  last_updated: "2026-03-20"
+  purpose: >
+    This reference architecture defines the target pattern for all event-driven
+    microservices implementing order processing on AWS. It supports Architecture
+    Board approval as the mandatory golden path for new e-commerce order services
+    and serves as the calibration benchmark for EAROS reference architecture
+    assessments.
+  decision_context: >
+    Architecture Board review Q1 2026. The e-commerce platform is migrating from
+    a monolithic order management system to event-driven microservices. This
+    reference architecture governs all new services built as part of that
+    migration and all future order-processing workloads on AWS.
+  stakeholders:
+    - role: Executive Sponsor
+      name: Chief Technology Officer
+      concerns: >
+        Strategic alignment, cost profile, risk posture, compliance with PCI-DSS,
+        and ability to scale to 10× current order volume without re-platforming.
+    - role: Platform Architect
+      name: Enterprise Architecture
+      concerns: >
+        Architectural soundness, pattern reusability, alignment with cloud
+        strategy, ADR rationale, and prescriptiveness classification.
+    - role: Domain Architect
+      name: E-Commerce Domain Architecture
+      concerns: >
+        Service decomposition, data model, integration patterns with upstream
+        and downstream systems, and bounded context alignment.
+    - role: Development Team Lead
+      name: Order Platform Engineering
+      concerns: >
+        Implementation guidance, IaC templates, API specs, getting-started
+        guide, and time to first deployment.
+    - role: Site Reliability Engineer
+      name: Platform SRE
+      concerns: >
+        SLOs, observability, scaling policies, DR procedures, runbooks, and
+        on-call playbooks.
+    - role: Security Architect
+      name: Information Security
+      concerns: >
+        PCI-DSS compliance, authentication, authorisation, encryption, WAF
+        configuration, and IAM least-privilege posture.
+    - role: Compliance Officer
+      name: Risk and Compliance
+      concerns: >
+        PCI-DSS Level 1 compliance mapping, audit trail completeness, data
+        residency, and exception log.
+  change_log:
+    - version: "1.0.0"
+      date: "2026-03-20"
+      author: Thomas Rohde
+      changes:
+        - Initial approved version — gold-standard EAROS calibration artifact
+        - All 5 architecture views complete (context, functional, deployment, data flow, security)
+        - 5 full ADRs with alternatives, trade-offs, and revisit conditions
+        - Complete operational model with SLOs, dashboards, scaling, and DR
+        - PCI-DSS compliance mapping covering all 12 requirements areas
+        - 6-step getting-started guide tested against sandbox environment
+
+sections:
+  reading_guide:
+    how_to_use: >
+      This document is structured so each audience can navigate directly to
+      their primary concerns. The section map below identifies the most relevant
+      sections for each stakeholder role. Cross-references between sections are
+      explicit — where a decision in one section depends on content from another,
+      a reference is provided.
+    section_map:
+      - section: Business Context
+        audience: CTO, Executive Sponsor
+        concern: Strategic alignment, business drivers, and use-case coverage
+      - section: Architecture Views — Context
+        audience: CTO, Domain Architect
+        concern: System boundary, external actors, and integration landscape
+      - section: Architecture Views — Functional
+        audience: Domain Architect, Development Team Lead
+        concern: Service decomposition, component responsibilities, and interfaces
+      - section: Architecture Views — Data Flow
+        audience: Domain Architect, Development Team Lead, Data Governance
+        concern: Runtime order-processing flow and event choreography
+      - section: Architecture Views — Deployment
+        audience: Platform SRE, Security Architect
+        concern: Infrastructure topology, multi-AZ resilience, and network design
+      - section: Architecture Views — Security
+        audience: Security Architect, Compliance Officer
+        concern: Trust boundaries, authentication, authorisation, and encryption
+      - section: Architecture Decisions (ADRs)
+        audience: Platform Architect, Domain Architect
+        concern: Decision rationale, alternatives considered, and trade-offs accepted
+      - section: Component Classification
+        audience: Development Team Lead
+        concern: Mandatory vs optional components and approved extension points
+      - section: Quality Attributes
+        audience: Platform SRE, Domain Architect
+        concern: Measurable SLO targets and validation strategies
+      - section: Operational Model
+        audience: Platform SRE
+        concern: Monitoring, alerting, scaling, and disaster recovery
+      - section: Implementation Artifacts
+        audience: Development Team Lead
+        concern: IaC templates, API specs, CI/CD pipeline, and scaffold template
+      - section: Getting Started
+        audience: Development Team Lead
+        concern: Step-by-step guide to first deployment
+      - section: RAID Log
+        audience: Platform Architect, Risk and Compliance
+        concern: Risks, design assumptions, constraints, and trade-offs
+      - section: Governance and Compliance
+        audience: Compliance Officer, Security Architect
+        concern: PCI-DSS control mapping and exception register
+      - section: Decisions and Actions
+        audience: Architecture Board, CTO
+        concern: Governance outcome, approval conditions, and next actions
+
+  scope:
+    statement: >
+      This reference architecture covers the event-driven order processing
+      platform on AWS: the full lifecycle from order submission through payment
+      authorisation, fulfilment dispatch, and customer notification. It defines
+      mandatory patterns for all new order-processing microservices in the
+      e-commerce domain.
+    in_scope:
+      - Order submission API (HTTP/REST via API Gateway)
+      - Order Service — create, validate, and persist orders
+      - Payment Service — authorise payments via external payment gateway
+      - Fulfilment Service — dispatch orders to warehouse management system
+      - Notification Service — send order status updates via email/SMS
+      - Event Bus (AWS EventBridge) — async inter-service choreography
+      - Work queues (AWS SQS) — reliable message delivery with DLQ
+      - Order data store (AWS DynamoDB) — primary persistence
+      - Identity and access (AWS Cognito) — customer and API authentication
+      - Observability stack (CloudWatch, X-Ray) — metrics, logs, and tracing
+      - Infrastructure-as-code (AWS CDK) — deployment automation
+      - CI/CD pipeline (AWS CodePipeline + CodeBuild) — delivery automation
+    out_of_scope:
+      - Customer mobile and web frontend — covered by separate UI reference architecture
+      - Payment gateway internals — third-party service; integration contract only
+      - Warehouse management system — external system; integration via adapter only
+      - Email and SMS providers — external services; notification adapter only
+      - Product catalogue and inventory — covered by separate catalogue reference architecture
+      - Analytics and reporting pipeline — covered by data platform reference architecture
+      - Corporate IAM and SSO — platform service; consumed, not owned
+      - Fraud detection scoring — consumed as a platform service via API
+    boundary_definition: >
+      The system boundary is defined at the API Gateway ingress (customer-facing)
+      and at the integration adapters for external systems (payment gateway,
+      warehouse management system, notification providers). All components inside
+      the boundary are owned and operated by the Order Platform Engineering team.
+      The C4 context diagram (Architecture Views — Context) is the authoritative
+      boundary representation.
+    assumptions:
+      - assumption: AWS is the mandated cloud provider for all new e-commerce workloads
+        consequence_if_violated: >
+          The entire IaC stack, managed services selection, and IAM model would
+          need to be replaced. Estimated re-platform effort: 6–9 months.
+      - assumption: >
+          The external payment gateway provides a REST API with SLA >= 99.9%
+          availability and supports idempotent payment authorisation
+        consequence_if_violated: >
+          Payment Service retry logic and DLQ strategy may be insufficient.
+          A synchronous fallback or alternative payment gateway would be required.
+      - assumption: Order volume is 100–1000 orders/minute sustained, peak 5000/minute
+        consequence_if_violated: >
+          Lambda concurrency limits and DynamoDB capacity units may need
+          re-provisioning. Architecture is elastic but cost model changes.
+      - assumption: PCI-DSS scope is limited to payment authorisation flow (no card data stored)
+        consequence_if_violated: >
+          Significantly broader PCI-DSS controls would apply, requiring dedicated
+          cardholder data environment (CDE) separation and additional audit scope.
+      - assumption: AWS CDK v2 is the approved IaC tool for this domain
+        consequence_if_violated: >
+          IaC templates would need to be rewritten in Terraform or CloudFormation.
+
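
Review note: the volume assumption above (up to 1000 orders/minute sustained, 5000/minute peak) can be turned into a rough per-function concurrency estimate with Little's law (in-flight requests = arrival rate x time in system). The sketch below is only back-of-envelope arithmetic. The 0.5 s average duration is an assumption borrowed from the 500 ms P99 submission target as a pessimistic bound; the artifact does not state it as a measured figure.

```python
# Back-of-envelope sizing from the stated order-volume assumption.
# ASSUMPTION: 0.5 s average invocation duration is illustrative only; it is
# not a figure taken from the artifact.
ASSUMED_DURATION_S = 0.5


def per_second(orders_per_minute: float) -> float:
    """Convert an orders/minute rate into orders/second."""
    return orders_per_minute / 60.0


def estimated_concurrency(orders_per_minute: float, avg_duration_s: float) -> float:
    """Little's law: concurrent invocations = arrival rate * average duration."""
    return per_second(orders_per_minute) * avg_duration_s


for label, rate in [("sustained", 1000), ("peak", 5000)]:
    rps = per_second(rate)
    conc = estimated_concurrency(rate, ASSUMED_DURATION_S)
    print(f"{label}: {rps:.1f} orders/s, ~{conc:.1f} concurrent invocations")
```

Under that assumed duration, the peak rate works out to roughly 83 orders per second and about 42 concurrent invocations for a single function. One order triggers several functions in the choreography, so the estimate applies per function, not to the platform as a whole.
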
+  drivers_and_principles:
+    drivers:
+      - id: DR-01
+        description: >
+          Scale order processing capacity 10× from 100 to 1000 sustained orders/minute
+          without architectural re-platforming, to support projected e-commerce growth
+          over the next 3 years.
+        architecture_response: >
+          Serverless compute (Lambda) and fully managed data services (DynamoDB,
+          SQS, EventBridge) provide elastic horizontal scaling. Scaling policies
+          (see Operational Model) target 1000 orders/minute sustained with 5000/minute
+          burst. ADR-003 documents the Lambda-vs-ECS trade-off for this requirement.
+      - id: DR-02
+        description: >
+          Achieve PCI-DSS Level 1 compliance for the payment authorisation flow
+          to satisfy card network requirements and enable enterprise payment volumes.
+        architecture_response: >
+          PCI-DSS compliance achieved through: Cognito + JWT for customer authentication
+          (no credentials in order data), API Gateway WAF with OWASP ruleset, KMS
+          encryption at rest and TLS 1.2+ in transit, VPC isolation for Lambda
+          functions, CloudTrail audit logging, and full compliance mapping in
+          Governance section. No card data is stored (tokenisation via payment gateway).
+      - id: DR-03
+        description: >
+          Reduce order processing latency: P99 order submission < 500ms,
+          order status query < 200ms, to support real-time customer experience.
+        architecture_response: >
+          Synchronous API Gateway → Lambda path for order submission targets
+          < 500ms P99. DynamoDB single-table design (ADR-005) with access patterns
+          optimised for order-by-id and customer-order-list queries targets < 200ms
+          P99. Lambda provisioned concurrency eliminates cold-start latency for
+          the submission path. Quality Attributes section defines fitness functions.
+      - id: DR-04
+        description: >
+          Enable independent deployment of services (order, payment, fulfilment,
+          notification) so teams can release without coordinating across service
+          boundaries.
+        architecture_response: >
+          Event-driven choreography via EventBridge (ADR-001) decouples services
+          at runtime. Each service owns its schema and deployment pipeline (see CI/CD
+          templates). Schema evolution uses EventBridge schema registry with
+          compatibility checks in CI. Services do not share databases.
+      - id: DR-05
+        description: >
+          Provide complete, immutable audit trail of all order state transitions
+          for compliance, customer support, and fraud investigation.
+        architecture_response: >
+          Every state transition publishes an event to EventBridge with correlation
+          ID, timestamp, actor, and previous/new state. Events are persisted to
+          S3 via Kinesis Firehose for 7-year retention. CloudTrail captures all
+          API-level actions. Correlation IDs thread through all service logs.
+    principles:
+      - id: PRIN-01
+        name: Event-driven by default
+        how_applied: >
+          All inter-service communication uses EventBridge async events, not
+          synchronous REST calls. Synchronous REST is used only for the customer-
+          facing submission API and for external system adapters. See ADR-001.
+      - id: PRIN-02
+        name: API-first
+        how_applied: >
+          All service interfaces are defined as OpenAPI 3.1 specifications before
+          implementation. API specs are the contract; code is the implementation.
+          See Implementation Artifacts for spec locations.
+      - id: PRIN-03
+        name: Zero-trust networking
+        how_applied: >
+          All service-to-service calls require IAM role authentication.
+          No service has a blanket "allow all within VPC" rule. Each Lambda
+          function has a least-privilege IAM role permitting only the specific
+          resources it needs. See Security View and Governance.
+      - id: PRIN-04
+        name: Infrastructure-as-code only
+        how_applied: >
+          All AWS resources are provisioned via CDK stacks. No manual console
+          changes. CloudFormation drift detection runs nightly. See ADR-003
+          and Implementation Artifacts.
+      - id: PRIN-05
+        name: Observability-first
+        how_applied: >
+          All Lambda functions emit structured JSON logs with correlation IDs.
+          X-Ray tracing is active on all functions. CloudWatch dashboards are
+          provisioned by CDK alongside the service, not added after deployment.
+
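
Review note: DR-05 says every state transition publishes an event carrying a correlation ID, timestamp, actor, and previous/new state. A minimal sketch of what such a record might look like when shaped as an EventBridge `PutEvents` entry follows. The entry keys (`Source`, `DetailType`, `Detail`, `EventBusName`) are the standard ones, but the field names inside `Detail`, the `order_id` field, the source and detail-type strings, and the bus name are all hypothetical and not taken from the package.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class OrderStateTransition:
    """Audit record for one state change (field names are illustrative)."""
    correlation_id: str
    timestamp: str
    actor: str
    order_id: str  # ASSUMPTION: not listed in DR-05, added so the record is usable
    previous_state: str
    new_state: str


def to_put_events_entry(transition: OrderStateTransition, bus_name: str) -> dict:
    """Shape the record as a single EventBridge PutEvents entry."""
    return {
        "Source": "orders.order-service",   # hypothetical source name
        "DetailType": "OrderStateChanged",  # hypothetical detail type
        "Detail": json.dumps(asdict(transition)),  # Detail must be a JSON string
        "EventBusName": bus_name,
    }


transition = OrderStateTransition(
    correlation_id=str(uuid.uuid4()),
    timestamp=datetime.now(timezone.utc).isoformat(),
    actor="order-service",
    order_id="ord-0001",
    previous_state="CREATED",
    new_state="PAYMENT_AUTHORISED",
)
entry = to_put_events_entry(transition, bus_name="order-event-bus")
print(json.dumps(entry, indent=2))
```

The sketch only builds the entry; sending it would be a separate call to the EventBridge client, and the real event schema would come from the schema registry mentioned in DR-04.
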
+  architecture_views:
+    context:
+      description: >
+        The Event-Driven Order Processing Platform sits at the centre of the
+        e-commerce order lifecycle. External actors interact at the boundary:
+        customers submit orders and query status via API Gateway; the Payment
+        Gateway (Stripe) authorises charges; the Warehouse Management System
+        receives fulfilment instructions; and the Notification Provider (Amazon
+        SES/SNS) delivers email and SMS. The corporate IAM platform issues
+        tokens used by internal operator tooling. The Analytics Platform
+        consumes order events from S3 for reporting.
+
+        Key boundary characteristics:
+        - Customers interact only through the API Gateway (HTTPS); no direct
+          service access.
+        - Payment Gateway integration is outbound only; no inbound webhooks
+          (polling model for authorisation status).
+        - Warehouse Management System integration uses SQS queue (decoupled;
+          WMS pulls at its own rate).
+        - All events are published to EventBridge for cross-domain consumption
+          by the Analytics Platform.
+      diagram_source: |
+        flowchart LR
+          classDef actor fill:#f8fafc,stroke:#334155,stroke-width:1.4px,color:#0f172a;
+          classDef external fill:#fff7ed,stroke:#c2410c,stroke-width:1.4px,color:#7c2d12;
+          classDef internal fill:#ecfeff,stroke:#0f766e,stroke-width:1.4px,color:#134e4a;
+
+          Customer@{ shape: stadium, label: "Customer" }
+          Operator@{ shape: rounded, label: "Internal Operator" }
+          PayGW@{ shape: cloud, label: "Payment Gateway\n(Stripe)" }
+          WMS@{ shape: cloud, label: "Warehouse Management\nSystem" }
+          Analytics@{ shape: doc, label: "Analytics Platform" }
+
+          APIGW@{ img: "/icons/aws/api-gateway.svg", label: "API Gateway", pos: "b", w: 56, h: 56, constraint: "on" }
+          Cognito@{ img: "/icons/aws/cognito.svg", label: "Corporate IAM\nvia Cognito", pos: "b", w: 56, h: 56, constraint: "on" }
+          Platform@{ img: "/icons/aws/aws-cloud.svg", label: "Event-Driven Order\nProcessing Platform", pos: "b", w: 72, h: 72, constraint: "on" }
+          EventBus@{ img: "/icons/aws/eventbridge.svg", label: "EventBridge", pos: "b", w: 56, h: 56, constraint: "on" }
+          SES@{ img: "/icons/aws/ses.svg", label: "Amazon SES", pos: "b", w: 56, h: 56, constraint: "on" }
+          SNS@{ img: "/icons/aws/sns.svg", label: "Amazon SNS", pos: "b", w: 56, h: 56, constraint: "on" }
+
+          Customer -->|Submit orders| APIGW
+          Operator -->|Query status| APIGW
+          Cognito -->|JWT tokens| APIGW
+          APIGW --> Platform
+          Platform -->|REST HTTPS| PayGW
+          Platform -->|Fulfilment requests| WMS
+          Platform --> SES
+          Platform --> SNS
+          Platform --> EventBus
+          EventBus -->|Order events| Analytics
+
+          class Customer,Operator actor
+          class PayGW,WMS external
+          class Analytics internal
+      source_type: mermaid
+
+    functional:
+      description: >
+        The platform decomposes into six primary containers plus the shared
+        event bus. Each container is a separately deployable Lambda function
+        group with its own IAM role, CloudWatch log group, and CDK stack.
+        Containers communicate exclusively through EventBridge events (async)
+        or SQS queues (work distribution). No container calls another container
+        synchronously.
+
+        Container responsibilities:
+        - API Gateway: TLS termination, Cognito JWT authorisation, request
+          routing, WAF enforcement, throttling (1000 req/s burst, 100 req/s
+          sustained per stage).
+        - Order Service: Validates order data, persists to DynamoDB, publishes
+          OrderCreated event. Exposes REST endpoints for order submission and
+          status query.
+        - Payment Service: Subscribes to OrderCreated, calls Stripe payment API,
+          publishes PaymentAuthorised or PaymentFailed event. Idempotent via
+          Stripe idempotency keys.
+        - Fulfilment Service: Subscribes to PaymentAuthorised, places fulfilment
+          message on SQS queue consumed by WMS adapter, publishes FulfilmentDispatched.
+        - Notification Service: Subscribes to OrderCreated, PaymentAuthorised,
+          PaymentFailed, FulfilmentDispatched. Sends customer-facing status
+          updates via SES (email) and SNS (SMS).
+        - Event Bus (EventBridge): Central event backbone. All inter-service
+          events pass through EventBridge with schema validation enforced.
+        - Event Archive (S3 + Firehose): All EventBridge events replayed to
+          Kinesis Firehose → S3 for audit retention (7 years) and analytics.
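
Review note: the API Gateway throttling figures in the description (1000 burst, 100 req/s sustained) map naturally onto a token bucket, which is the model API Gateway documents for its throttling. The sketch below is a generic token bucket for illustration, not AWS code. Reading "1000 req/s burst" as a bucket capacity of 1000 is an assumption about what the artifact means, and the class and parameter names are hypothetical.

```python
class TokenBucket:
    """Generic token bucket: `burst` is capacity, `rate_per_s` is the refill rate."""

    def __init__(self, rate_per_s: float, burst: int) -> None:
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full, so a full burst is available
        self.last = 0.0             # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then try to spend one token."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would answer with HTTP 429


# ASSUMPTION: burst=1000 and rate=100 mirror the figures in the description.
bucket = TokenBucket(rate_per_s=100, burst=1000)
accepted_at_t0 = sum(bucket.allow(now=0.0) for _ in range(1500))
accepted_at_t1 = sum(bucket.allow(now=1.0) for _ in range(1500))
print(accepted_at_t0, accepted_at_t1)  # prints "1000 100"
```

A spike of 1500 requests at the same instant drains the bucket after 1000 accepts; one second later only the 100 refilled tokens are available, which is the sustained rate.
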
+      diagram_source: |
+        flowchart LR
+          classDef external fill:#fff7ed,stroke:#c2410c,stroke-width:1.4px,color:#7c2d12;
+
+          WAF@{ img: "/icons/aws/waf.svg", label: "AWS WAF", pos: "b", w: 52, h: 52, constraint: "on" }
+          APIGW@{ img: "/icons/aws/api-gateway.svg", label: "API Gateway", pos: "b", w: 52, h: 52, constraint: "on" }
+          Cognito@{ img: "/icons/aws/cognito.svg", label: "Cognito\nAuthorizer", pos: "b", w: 52, h: 52, constraint: "on" }
+          OrderSvc@{ img: "/icons/aws/lambda.svg", label: "Order Service\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
+          PaySvc@{ img: "/icons/aws/lambda.svg", label: "Payment Service\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
+          FulfilSvc@{ img: "/icons/aws/lambda.svg", label: "Fulfilment Service\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
+          NotifySvc@{ img: "/icons/aws/lambda.svg", label: "Notification Service\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
+          WMSAdapter@{ img: "/icons/aws/lambda.svg", label: "WMS Adapter\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
+          DDB@{ img: "/icons/aws/dynamodb.svg", label: "Orders Table\nDynamoDB", pos: "b", w: 52, h: 52, constraint: "on" }
+          EB@{ img: "/icons/aws/eventbridge.svg", label: "Order Event Bus", pos: "b", w: 52, h: 52, constraint: "on" }
+          SQS@{ img: "/icons/aws/sqs.svg", label: "Fulfilment Queue\n+ DLQ", pos: "b", w: 52, h: 52, constraint: "on" }
+          SES@{ img: "/icons/aws/ses.svg", label: "Amazon SES", pos: "b", w: 52, h: 52, constraint: "on" }
+          SNS@{ img: "/icons/aws/sns.svg", label: "Amazon SNS", pos: "b", w: 52, h: 52, constraint: "on" }
+          Firehose@{ img: "/icons/aws/data-firehose.svg", label: "Kinesis Data Firehose", pos: "b", w: 52, h: 52, constraint: "on" }
+          S3@{ img: "/icons/aws/s3.svg", label: "S3 Event Archive\n(7-year retention)", pos: "b", w: 52, h: 52, constraint: "on" }
+          Stripe@{ shape: cloud, label: "Stripe API" }
+          WMS@{ shape: cloud, label: "Warehouse Management\nSystem" }
+
+          WAF --> APIGW
+          Cognito --> APIGW
+          APIGW --> OrderSvc
+          OrderSvc --> DDB
+          OrderSvc --> EB
+          EB --> PaySvc
+          EB --> FulfilSvc
+          EB --> NotifySvc
+          PaySvc --> Stripe
+          PaySvc --> DDB
+          PaySvc --> EB
+          FulfilSvc --> DDB
+          FulfilSvc --> SQS
+          FulfilSvc --> EB
+          SQS --> WMSAdapter
+          WMSAdapter --> WMS
+          NotifySvc --> SES
+          NotifySvc --> SNS
+          EB --> Firehose
+          Firehose --> S3
+
+          class Stripe,WMS external
+      source_type: mermaid
+
|
|
400
|
+
deployment:
|
|
401
|
+
description: >
|
|
402
|
+
All components are deployed in a single AWS region (eu-west-1) with
|
|
403
|
+
multi-AZ resilience. Lambda functions execute across three Availability
|
|
404
|
+
Zones (AZs) by default. DynamoDB is a regional service with automatic
|
|
405
|
+
multi-AZ replication. SQS and EventBridge are regional services with
|
|
406
|
+
99.99% availability SLA.
|
|
407
|
+
|
|
408
|
+
Network topology:
|
|
409
|
+
- VPC with three private subnets (one per AZ) for Lambda functions
|
|
410
|
+
requiring VPC attachment (Payment Service and WMS Adapter).
|
|
411
|
+
- API Gateway is a managed regional endpoint outside the VPC; WAF
|
|
412
|
+
is attached at the CloudFront distribution level.
|
|
413
|
+
- DynamoDB, EventBridge, SQS, S3, SES, and SNS are accessed via
|
|
414
|
+
VPC endpoints (PrivateLink) where available; no public internet
|
|
415
|
+
egress for data traffic.
|
|
416
|
+
- NAT Gateway per AZ for Lambda functions requiring internet egress
|
|
417
|
+
(Stripe API calls from Payment Service).
|
|
418
|
+
|
|
419
|
+
Resilience model:
|
|
420
|
+
- Lambda: executes in all AZs; automatic AZ failover; no single-AZ
|
|
421
|
+
dependency.
|
|
422
|
+
- DynamoDB: multi-AZ; automatic failover; point-in-time recovery
|
|
423
|
+
(PITR) enabled; 35-day backup window.
|
|
424
|
+
- SQS: multi-AZ; messages durable across AZ failures.
|
|
425
|
+
- EventBridge: regional service; 99.99% SLA.
|
|
426
|
+
|
|
427
|
+
Disaster recovery (active-passive multi-region):
|
|
428
|
+
- Primary region: eu-west-1.
|
|
429
|
+
- DR region: eu-central-1 (warm standby).
|
|
430
|
+
- DynamoDB Global Tables replicate in near-real-time.
|
|
431
|
+
- Route 53 health checks; failover routing policy activates DR region
|
|
432
|
+
within RTO target (4 hours).
|
|
433
|
+
diagram_source: |
|
|
434
|
+
flowchart TB
|
|
435
|
+
Route53@{ img: "/icons/aws/route53.svg", label: "Route 53\nHealth checks + failover", pos: "b", w: 56, h: 56, constraint: "on" }
|
|
436
|
+
|
|
437
|
+
subgraph Primary["eu-west-1 (Primary)"]
|
|
438
|
+
direction TB
|
|
439
|
+
CF_Primary@{ img: "/icons/aws/cloudfront.svg", label: "CloudFront", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
440
|
+
WAF_Primary@{ img: "/icons/aws/waf.svg", label: "AWS WAF", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
441
|
+
APIGW_Primary@{ img: "/icons/aws/api-gateway.svg", label: "Regional API Gateway", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
442
|
+
|
|
443
|
+
subgraph VPC_Primary["VPC"]
|
|
444
|
+
direction LR
|
|
445
|
+
|
|
446
|
+
subgraph AZ_A["eu-west-1a"]
|
|
447
|
+
direction TB
|
|
448
|
+
SubnetA@{ img: "/icons/aws/private-subnet.svg", label: "Private Subnet A\n10.0.1.0/24", pos: "b", w: 48, h: 48, constraint: "on" }
|
|
449
|
+
LambdaA@{ img: "/icons/aws/lambda.svg", label: "Lambda workload", pos: "b", w: 48, h: 48, constraint: "on" }
|
|
450
|
+
NATA@{ img: "/icons/aws/nat-gateway.svg", label: "NAT Gateway A", pos: "b", w: 48, h: 48, constraint: "on" }
|
|
451
|
+
SubnetA --- LambdaA
|
|
452
|
+
LambdaA -. egress .-> NATA
|
|
453
|
+
end
|
|
454
|
+
|
|
455
|
+
subgraph AZ_B["eu-west-1b"]
|
|
456
|
+
direction TB
|
|
457
|
+
SubnetB@{ img: "/icons/aws/private-subnet.svg", label: "Private Subnet B\n10.0.2.0/24", pos: "b", w: 48, h: 48, constraint: "on" }
|
|
458
|
+
LambdaB@{ img: "/icons/aws/lambda.svg", label: "Lambda workload", pos: "b", w: 48, h: 48, constraint: "on" }
|
|
459
|
+
NATB@{ img: "/icons/aws/nat-gateway.svg", label: "NAT Gateway B", pos: "b", w: 48, h: 48, constraint: "on" }
|
|
460
|
+
SubnetB --- LambdaB
|
|
461
|
+
LambdaB -. egress .-> NATB
|
|
462
|
+
end
|
|
463
|
+
|
|
464
|
+
subgraph AZ_C["eu-west-1c"]
|
|
465
|
+
direction TB
|
|
466
|
+
SubnetC@{ img: "/icons/aws/private-subnet.svg", label: "Private Subnet C\n10.0.3.0/24", pos: "b", w: 48, h: 48, constraint: "on" }
|
|
467
|
+
LambdaC@{ img: "/icons/aws/lambda.svg", label: "Lambda workload", pos: "b", w: 48, h: 48, constraint: "on" }
|
|
468
|
+
NATC@{ img: "/icons/aws/nat-gateway.svg", label: "NAT Gateway C", pos: "b", w: 48, h: 48, constraint: "on" }
|
|
469
|
+
SubnetC --- LambdaC
|
|
470
|
+
LambdaC -. egress .-> NATC
|
|
471
|
+
end
|
|
472
|
+
end
|
|
473
|
+
|
|
474
|
+
DDB_Primary@{ img: "/icons/aws/dynamodb.svg", label: "DynamoDB Global Table\nPrimary", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
475
|
+
end
|
|
476
|
+
|
|
477
|
+
subgraph DR["eu-central-1 (Warm standby)"]
|
|
478
|
+
direction TB
|
|
479
|
+
CF_DR@{ img: "/icons/aws/cloudfront.svg", label: "CloudFront", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
480
|
+
WAF_DR@{ img: "/icons/aws/waf.svg", label: "AWS WAF", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
481
|
+
APIGW_DR@{ img: "/icons/aws/api-gateway.svg", label: "Regional API Gateway", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
482
|
+
Lambda_DR@{ img: "/icons/aws/lambda.svg", label: "Lambda warm standby", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
483
|
+
DDB_DR@{ img: "/icons/aws/dynamodb.svg", label: "DynamoDB Global Table\nReplica", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
484
|
+
end
|
|
485
|
+
|
|
486
|
+
Route53 --> CF_Primary
|
|
487
|
+
Route53 -. failover .-> CF_DR
|
|
488
|
+
CF_Primary --> WAF_Primary --> APIGW_Primary
|
|
489
|
+
CF_DR --> WAF_DR --> APIGW_DR
|
|
490
|
+
APIGW_Primary --> LambdaA
|
|
491
|
+
APIGW_Primary --> LambdaB
|
|
492
|
+
APIGW_Primary --> LambdaC
|
|
493
|
+
APIGW_DR --> Lambda_DR
|
|
494
|
+
LambdaA -. private AWS SDK .-> DDB_Primary
|
|
495
|
+
LambdaB -. private AWS SDK .-> DDB_Primary
|
|
496
|
+
LambdaC -. private AWS SDK .-> DDB_Primary
|
|
497
|
+
Lambda_DR -. warm standby access .-> DDB_DR
|
|
498
|
+
DDB_Primary <-->|Global Tables\nreplication| DDB_DR
|
|
499
|
+
source_type: mermaid
|
|
500
|
+
|
|
501
|
+
data_flow:
|
|
502
|
+
description: >
|
|
503
|
+
The primary data flow is the order processing lifecycle: a customer
|
|
504
|
+
submits an order and receives a confirmation. The flow is predominantly
|
|
505
|
+
asynchronous after order persistence, with the customer receiving a
|
|
506
|
+
202 Accepted response and subsequent status updates via notification.
|
|
507
|
+
|
|
508
|
+
Correlation ID (UUID v4) is generated at step 1 and threaded through
|
|
509
|
+
every subsequent step as an HTTP header, EventBridge detail field, SQS
|
|
510
|
+
message attribute, and DynamoDB attribute, enabling end-to-end tracing
|
|
511
|
+
via AWS X-Ray.
|
|
512
|
+
narrative_steps:
|
|
513
|
+
- step: 1
|
|
514
|
+
description: >
|
|
515
|
+
Customer submits POST /orders via HTTPS to the API's CloudFront
|
|
516
|
+
distribution. WAF evaluates the request against OWASP Core Rule Set.
|
|
517
|
+
Cognito JWT authoriser validates the Bearer token. A throttle check applies
|
|
518
|
+
(100 req/s sustained per customer). A correlation ID is generated
|
|
519
|
+
and injected into the request context.
|
|
520
|
+
- step: 2
|
|
521
|
+
description: >
|
|
522
|
+
API Gateway invokes Order Service Lambda (synchronous). Request
|
|
523
|
+
payload is validated against OpenAPI schema (request validator
|
|
524
|
+
enabled). Lambda reads Cognito claims to extract customer ID.
|
|
525
|
+
- step: 3
|
|
526
|
+
description: >
|
|
527
|
+
Order Service performs business validation: checks item quantities
|
|
528
|
+
are positive, delivery address is valid, and order total is non-zero.
|
|
529
|
+
If validation fails, returns HTTP 422 with structured error response.
|
|
530
|
+
- step: 4
|
|
531
|
+
description: >
|
|
532
|
+
Order Service writes the order record to DynamoDB with status PENDING.
|
|
533
|
+
Partition key: ORDER#<orderId>. Sort key: METADATA. Includes
|
|
534
|
+
correlation ID, customer ID, items, totals, and timestamps.
|
|
535
|
+
Uses conditional write to ensure idempotency.
|
|
536
|
+
- step: 5
|
|
537
|
+
description: >
|
|
538
|
+
Order Service publishes OrderCreated event to EventBridge on the
|
|
539
|
+
order-processing bus. Event detail includes orderId, customerId,
|
|
540
|
+
totalAmount, correlationId, and timestamp. Schema is registered
|
|
541
|
+
in EventBridge Schema Registry and validated before publish.
|
|
542
|
+
- step: 6
|
|
543
|
+
description: >
|
|
544
|
+
Order Service returns HTTP 202 Accepted to API Gateway with
|
|
545
|
+
orderId and a polling URL (/orders/{orderId}/status). Customer
|
|
546
|
+
can poll this endpoint or await email/SMS notification.
|
|
547
|
+
- step: 7
|
|
548
|
+
description: >
|
|
549
|
+
EventBridge routes OrderCreated event to Payment Service Lambda
|
|
550
|
+
(event rule: source=order-service, detail-type=OrderCreated).
|
|
551
|
+
Payment Service Lambda extracts orderId and totalAmount.
|
|
552
|
+
- step: 8
|
|
553
|
+
description: >
|
|
554
|
+
Payment Service calls the Stripe Payment Intents API (POST /v1/payment_intents)
|
|
555
|
+
with idempotency key = correlationId to prevent double-charge.
|
|
556
|
+
Stripe responds synchronously within ~2s. If Stripe returns an error,
|
|
557
|
+
Payment Service publishes PaymentFailed event and retries with
|
|
558
|
+
exponential backoff (max 3 attempts).
|
|
559
|
+
- step: 9
|
|
560
|
+
description: >
|
|
561
|
+
On Stripe success, Payment Service updates DynamoDB order record
|
|
562
|
+
(status → PAYMENT_AUTHORISED, stripeChargeId added). Publishes
|
|
563
|
+
PaymentAuthorised event to EventBridge. EventBridge routes the
|
|
564
|
+
event simultaneously to Fulfilment Service and Notification Service.
|
|
565
|
+
- step: 10
|
|
566
|
+
description: >
|
|
567
|
+
Fulfilment Service receives PaymentAuthorised event. Writes fulfilment
|
|
568
|
+
instruction message to SQS Fulfilment Queue (includes orderId, items,
|
|
569
|
+
delivery address). Updates DynamoDB order status to FULFILMENT_QUEUED.
|
|
570
|
+
Publishes FulfilmentDispatched event to EventBridge.
|
|
571
|
+
- step: 11
|
|
572
|
+
description: >
|
|
573
|
+
WMS Adapter Lambda polls SQS Fulfilment Queue (long polling, 20s).
|
|
574
|
+
Translates order instruction to WMS API format and calls WMS REST
|
|
575
|
+
API. On success, deletes SQS message. On WMS failure, message
|
|
576
|
+
returns to queue; after 3 attempts, routes to Dead Letter Queue.
|
|
577
|
+
A CloudWatch alarm alerts SRE when DLQ depth > 0.
|
|
578
|
+
- step: 12
|
|
579
|
+
description: >
|
|
580
|
+
Notification Service receives OrderCreated, PaymentAuthorised, and
|
|
581
|
+
FulfilmentDispatched events (separate EventBridge rules). For each,
|
|
582
|
+
constructs customer-facing message, sends email via SES and SMS via
|
|
583
|
+
SNS. Uses DynamoDB customer preferences table to determine channel.
|
|
584
|
+
All notifications carry correlationId for support tracing.
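Steps 4 and 8 above rely on two small mechanisms: a conditional write that makes order creation idempotent, and a bounded exponential backoff schedule. The TypeScript sketch below illustrates both under stated assumptions. A plain object stands in for the DynamoDB table, the attribute name `PK` in the comment is assumed, and `putOrderIfAbsent`, `backoffDelaysMs`, and the 200 ms base delay are hypothetical names and values that do not come from the artifact.

```typescript
// Illustrative sketch only: a plain object stands in for the DynamoDB table.
// All type and function names here are hypothetical.

interface OrderRecord {
  pk: string; // "ORDER#<orderId>"
  sk: string; // "METADATA"
  status: string;
  correlationId: string;
}

type OrderTable = { [key: string]: OrderRecord };

// Mimics a conditional put that succeeds only when the key does not exist yet.
// With DynamoDB this would typically be a condition such as
// attribute_not_exists(PK), where PK is an assumed attribute name.
function putOrderIfAbsent(
  table: OrderTable,
  orderId: string,
  correlationId: string
): "created" | "duplicate" {
  const pk = "ORDER#" + orderId;
  const sk = "METADATA";
  const key = pk + "|" + sk;
  if (Object.prototype.hasOwnProperty.call(table, key)) {
    return "duplicate"; // a retried request must not overwrite the stored order
  }
  table[key] = { pk: pk, sk: sk, status: "PENDING", correlationId: correlationId };
  return "created";
}

// Delays to wait before each retry, doubling each time.
// maxAttempts counts the first call, so 3 attempts means 2 retries.
function backoffDelaysMs(maxAttempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < maxAttempts - 1; retry++) {
    delays.push(baseDelayMs * Math.pow(2, retry));
  }
  return delays;
}

const demoTable: OrderTable = {};
const firstPut = putOrderIfAbsent(demoTable, "o-123", "c-1");
const secondPut = putOrderIfAbsent(demoTable, "o-123", "c-1");
console.log(firstPut, secondPut, backoffDelaysMs(3, 200));
```

The second call returns `"duplicate"` and leaves the stored record unchanged, which is the behaviour a retried POST /orders would need. A real implementation would also distinguish a conditional-check failure from other write errors.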
|
|
585
|
+
|
|
586
|
+
security:
|
|
587
|
+
description: >
|
|
588
|
+
The security architecture implements defence-in-depth across five
|
|
589
|
+
concentric trust zones. PCI-DSS controls are mapped to specific
|
|
590
|
+
architecture elements — see Governance section for full mapping.
|
|
591
|
+
|
|
592
|
+
Trust Zone 1 — Public (untrusted): CloudFront + WAF. All inbound
|
|
593
|
+
traffic from the internet enters here. WAF enforces OWASP Core Rule
|
|
594
|
+
Set v3.2, rate limiting (100 req/IP/s), and geo-blocking as required.
|
|
595
|
+
No backend service is reachable without passing WAF.
|
|
596
|
+
|
|
597
|
+
Trust Zone 2 — API perimeter: API Gateway with Cognito JWT authoriser.
|
|
598
|
+
Every request must carry a valid JWT issued by Cognito User Pool.
|
|
599
|
+
Cognito enforces MFA for operator access. Token expiry: 1 hour (access),
|
|
600
|
+
24 hours (refresh). API Gateway request validators enforce schema
|
|
601
|
+
before Lambda invocation.
|
|
602
|
+
|
|
603
|
+
Trust Zone 3 — Service mesh (VPC private subnets): Lambda functions
|
|
604
|
+
executing in VPC. No inbound internet access. Inter-service communication
|
|
605
|
+
uses EventBridge (regional, no VPC required) and SQS (VPC endpoint).
|
|
606
|
+
Lambda IAM roles are least-privilege: each function has an IAM role
|
|
607
|
+
permitting only its specific DynamoDB actions, EventBridge PutEvents
|
|
608
|
+
to its source bus, and SQS actions on its specific queue.
|
|
609
|
+
|
|
610
|
+
Trust Zone 4 — Data layer: DynamoDB, SQS, EventBridge, S3. All
|
|
611
|
+
resources are encrypted at rest with AWS KMS customer-managed keys
|
|
612
|
+
(one CMK per service, key rotation enabled). All S3 buckets have
|
|
613
|
+
public access block enabled and versioning on. DynamoDB encryption
|
|
614
|
+
uses KMS CMK; Point-in-Time Recovery enabled.
|
|
615
|
+
|
|
616
|
+
Trust Zone 5 — Audit: CloudTrail (management and data events), S3
|
|
617
|
+
event archive, VPC Flow Logs. CloudTrail logs are written to a
|
|
618
|
+
dedicated S3 bucket in a separate account with write-once (Object
|
|
619
|
+
Lock) policy.
|
|
620
|
+
|
|
621
|
+
Encryption in transit: TLS 1.2 minimum enforced by API Gateway and
|
|
622
|
+
Cognito. All SDK calls to AWS services use HTTPS. Stripe integration
|
|
623
|
+
uses TLS 1.2+. Internal Lambda-to-SQS and Lambda-to-DynamoDB calls
|
|
624
|
+
use AWS SDK over HTTPS via VPC endpoints.
|
|
625
|
+
diagram_source: |
|
|
626
|
+
flowchart LR
|
|
627
|
+
classDef external fill:#fff7ed,stroke:#c2410c,stroke-width:1.4px,color:#7c2d12;
|
|
628
|
+
|
|
629
|
+
Internet@{ shape: cloud, label: "Internet" }
|
|
630
|
+
Stripe@{ shape: cloud, label: "Stripe\nTLS 1.2+" }
|
|
631
|
+
|
|
632
|
+
subgraph Zone1["Zone 1: Public"]
|
|
633
|
+
direction TB
|
|
634
|
+
CF@{ img: "/icons/aws/cloudfront.svg", label: "CloudFront", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
635
|
+
WAF@{ img: "/icons/aws/waf.svg", label: "AWS WAF\nOWASP CRS", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
636
|
+
CF --> WAF
|
|
637
|
+
end
|
|
638
|
+
|
|
639
|
+
subgraph Zone2["Zone 2: API Perimeter"]
|
|
640
|
+
direction TB
|
|
641
|
+
APIGW@{ img: "/icons/aws/api-gateway.svg", label: "API Gateway", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
642
|
+
Cognito@{ img: "/icons/aws/cognito.svg", label: "Cognito JWT\nAuthorizer", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
643
|
+
Cognito --> APIGW
|
|
644
|
+
end
|
|
645
|
+
|
|
646
|
+
subgraph Zone3["Zone 3: Service Mesh (VPC)"]
|
|
647
|
+
direction TB
|
|
648
|
+
OrderLambda@{ img: "/icons/aws/lambda.svg", label: "Order Service\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
649
|
+
PayLambda@{ img: "/icons/aws/lambda.svg", label: "Payment Service\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
650
|
+
FulfilLambda@{ img: "/icons/aws/lambda.svg", label: "Fulfilment Service\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
651
|
+
NotifyLambda@{ img: "/icons/aws/lambda.svg", label: "Notification Service\nLambda", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
652
|
+
end
|
|
653
|
+
|
|
654
|
+
subgraph Zone4["Zone 4: Data Layer (KMS encrypted)"]
|
|
655
|
+
direction TB
|
|
656
|
+
EB@{ img: "/icons/aws/eventbridge.svg", label: "EventBridge\nSchema validated", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
657
|
+
DDB@{ img: "/icons/aws/dynamodb.svg", label: "DynamoDB", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
658
|
+
SQSq@{ img: "/icons/aws/sqs.svg", label: "SQS", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
659
|
+
S3arch@{ img: "/icons/aws/s3.svg", label: "S3 Archive", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
660
|
+
end
|
|
661
|
+
|
|
662
|
+
subgraph Zone5["Zone 5: Audit"]
|
|
663
|
+
direction TB
|
|
664
|
+
CloudTrail@{ img: "/icons/aws/cloudtrail.svg", label: "CloudTrail\nwrite-once S3", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
665
|
+
XRAY@{ img: "/icons/aws/xray.svg", label: "AWS X-Ray", pos: "b", w: 52, h: 52, constraint: "on" }
|
|
666
|
+
end
|
|
667
|
+
|
|
668
|
+
Internet --> CF
|
|
669
|
+
WAF --> APIGW
|
|
670
|
+
APIGW --> OrderLambda
|
|
671
|
+
OrderLambda --> EB
|
|
672
|
+
EB --> PayLambda --> Stripe
|
|
673
|
+
EB --> FulfilLambda
|
|
674
|
+
EB --> NotifyLambda
|
|
675
|
+
OrderLambda --> DDB
|
|
676
|
+
PayLambda --> DDB
|
|
677
|
+
FulfilLambda --> SQSq
|
|
678
|
+
EB --> S3arch
|
|
679
|
+
|
|
680
|
+
DDB -. data events .-> CloudTrail
|
|
681
|
+
SQSq -. queue events .-> CloudTrail
|
|
682
|
+
S3arch -. archive logs .-> CloudTrail
|
|
683
|
+
OrderLambda -. traces .-> XRAY
|
|
684
|
+
PayLambda -. traces .-> XRAY
|
|
685
|
+
FulfilLambda -. traces .-> XRAY
|
|
686
|
+
NotifyLambda -. traces .-> XRAY
|
|
687
|
+
|
|
688
|
+
class Internet,Stripe external
|
|
689
|
+
source_type: mermaid
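Trust Zone 3 describes least-privilege IAM roles: each function may use only its specific DynamoDB actions, EventBridge PutEvents on its bus, and SQS actions on its queue. The TypeScript sketch below shows one way such a policy document could be assembled. The helper `leastPrivilegePolicy`, the `FunctionAccess` shape, the account ID, and the table and queue names in the ARNs are all placeholders assumed for illustration. Only the bus name `order-processing` comes from the data flow description.

```typescript
// Hypothetical helper: builds an IAM policy document limited to one table,
// one event bus, and optionally one queue. The ARNs below are placeholders.

interface Statement {
  Effect: "Allow";
  Action: string[];
  Resource: string;
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: Statement[];
}

interface FunctionAccess {
  tableArn: string;
  tableActions: string[]; // e.g. ["dynamodb:UpdateItem"]
  busArn: string;
  queueArn?: string; // only for functions that use a queue
  queueActions?: string[];
}

function leastPrivilegePolicy(access: FunctionAccess): PolicyDocument {
  const statements: Statement[] = [
    { Effect: "Allow", Action: access.tableActions, Resource: access.tableArn },
    { Effect: "Allow", Action: ["events:PutEvents"], Resource: access.busArn },
  ];
  // Queue permissions are added only when both the queue and its actions are given.
  if (access.queueArn && access.queueActions && access.queueActions.length > 0) {
    statements.push({
      Effect: "Allow",
      Action: access.queueActions,
      Resource: access.queueArn,
    });
  }
  return { Version: "2012-10-17", Statement: statements };
}

const fulfilmentPolicy = leastPrivilegePolicy({
  tableArn: "arn:aws:dynamodb:eu-west-1:111122223333:table/orders",
  tableActions: ["dynamodb:UpdateItem"],
  busArn: "arn:aws:events:eu-west-1:111122223333:event-bus/order-processing",
  queueArn: "arn:aws:sqs:eu-west-1:111122223333:fulfilment-queue",
  queueActions: ["sqs:SendMessage"],
});
console.log(JSON.stringify(fulfilmentPolicy, null, 2));
```

Each statement names explicit actions and a single resource ARN, so a review can confirm that no wildcard grants slipped in.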
|
|
690
|
+
|
|
691
|
+
element_catalog:
|
|
692
|
+
- name: API Gateway
|
|
693
|
+
type: gateway
|
|
694
|
+
technology: AWS API Gateway (Regional, REST API)
|
|
695
|
+
responsibility: >
|
|
696
|
+
TLS termination, Cognito JWT authorisation, WAF integration,
|
|
697
|
+
request schema validation, throttling, and routing to Order Service Lambda.
|
|
698
|
+
relationships:
|
|
699
|
+
- CloudFront + WAF (upstream)
|
|
700
|
+
- Cognito User Pool (authoriser)
|
|
701
|
+
- Order Service Lambda (downstream invoke)
|
|
702
|
+
|
|
703
|
+
- name: CloudFront + WAF
|
|
704
|
+
type: gateway
|
|
705
|
+
technology: AWS CloudFront with AWS WAF (OWASP CRS v3.2)
|
|
706
|
+
responsibility: >
|
|
707
|
+
Global CDN edge, DDoS mitigation, WAF rule enforcement (OWASP CRS,
|
|
708
|
+
rate limiting, geo-blocking).
|
|
709
|
+
relationships:
|
|
710
|
+
- API Gateway (origin)
|
|
711
|
+
|
|
712
|
+
- name: Cognito User Pool
|
|
713
|
+
type: service
|
|
714
|
+
technology: AWS Cognito User Pool
|
|
715
|
+
responsibility: >
|
|
716
|
+
Customer and operator identity management; JWT issuance; MFA for
|
|
717
|
+
operators; token validation for API Gateway authoriser.
|
|
718
|
+
relationships:
|
|
719
|
+
- API Gateway (authoriser)
|
|
720
|
+
|
|
721
|
+
- name: Order Service Lambda
|
|
722
|
+
type: function
|
|
723
|
+
technology: AWS Lambda, Node.js 20, ARM64
|
|
724
|
+
responsibility: >
|
|
725
|
+
Validates and persists order submissions. Exposes GET /orders/{id}
|
|
726
|
+
for status polling. Publishes OrderCreated to EventBridge.
|
|
727
|
+
relationships:
|
|
728
|
+
- API Gateway (invoked by)
|
|
729
|
+
- DynamoDB Orders Table (read/write)
|
|
730
|
+
- EventBridge Order Bus (publish OrderCreated)
|
|
731
|
+
|
|
732
|
+
- name: Payment Service Lambda
|
|
733
|
+
type: function
|
|
734
|
+
technology: AWS Lambda, Node.js 20, ARM64, VPC-attached
|
|
735
|
+
responsibility: >
|
|
736
|
+
Subscribes to OrderCreated events. Calls Stripe payment API.
|
|
737
|
+
Publishes PaymentAuthorised or PaymentFailed. Updates DynamoDB.
|
|
738
|
+
relationships:
|
|
739
|
+
- EventBridge Order Bus (subscribe OrderCreated)
|
|
740
|
+
- Stripe Payment Gateway (outbound HTTPS)
|
|
741
|
+
- DynamoDB Orders Table (write)
|
|
742
|
+
- EventBridge Order Bus (publish PaymentAuthorised, PaymentFailed)
|
|
743
|
+
|
|
744
|
+
- name: Fulfilment Service Lambda
|
|
745
|
+
type: function
|
|
746
|
+
technology: AWS Lambda, Node.js 20, ARM64
|
|
747
|
+
responsibility: >
|
|
748
|
+
Subscribes to PaymentAuthorised. Places fulfilment instruction on SQS.
|
|
749
|
+
Publishes FulfilmentDispatched. Updates DynamoDB.
|
|
750
|
+
relationships:
|
|
751
|
+
- EventBridge Order Bus (subscribe PaymentAuthorised)
|
|
752
|
+
- SQS Fulfilment Queue (send)
|
|
753
|
+
- DynamoDB Orders Table (write)
|
|
754
|
+
- EventBridge Order Bus (publish FulfilmentDispatched)
|
|
755
|
+
|
|
756
|
+
- name: WMS Adapter Lambda
|
|
757
|
+
type: function
|
|
758
|
+
technology: AWS Lambda, Node.js 20, ARM64, VPC-attached
|
|
759
|
+
responsibility: >
|
|
760
|
+
Polls SQS Fulfilment Queue; translates to WMS API format; calls WMS.
|
|
761
|
+
On WMS failure: message returns to queue (DLQ after 3 failures).
|
|
762
|
+
relationships:
|
|
763
|
+
- SQS Fulfilment Queue (consume)
|
|
764
|
+
- Warehouse Management System (outbound HTTPS)
|
|
765
|
+
|
|
766
|
+
- name: Notification Service Lambda
|
|
767
|
+
type: function
|
|
768
|
+
technology: AWS Lambda, Node.js 20, ARM64
|
|
769
|
+
responsibility: >
|
|
770
|
+
Subscribes to OrderCreated, PaymentAuthorised, PaymentFailed,
|
|
771
|
+
FulfilmentDispatched. Sends email (SES) and SMS (SNS) notifications.
|
|
772
|
+
relationships:
|
|
773
|
+
- EventBridge Order Bus (subscribe multiple events)
|
|
774
|
+
- Amazon SES (send email)
|
|
775
|
+
- Amazon SNS (send SMS)
|
|
776
|
+
- DynamoDB Customer Preferences Table (read)
|
|
777
|
+
|
|
778
|
+
- name: EventBridge Order Bus
|
|
779
|
+
type: other
|
|
780
|
+
technology: AWS EventBridge Custom Event Bus
|
|
781
|
+
responsibility: >
|
|
782
|
+
Central event backbone for all inter-service async communication.
|
|
783
|
+
Schema registry enforces event schema validation before routing.
|
|
784
|
+
relationships:
|
|
785
|
+
- All service Lambdas (publish and subscribe)
|
|
786
|
+
- Kinesis Firehose (event archive pipe)
|
|
787
|
+
|
|
788
|
+
- name: DynamoDB Orders Table
|
|
789
|
+
type: database
|
|
790
|
+
technology: AWS DynamoDB (single-table design, on-demand capacity)
|
|
791
|
+
responsibility: >
|
|
792
|
+
Primary data store for orders and customer preferences.
|
|
793
|
+
Single-table design supports order-by-id, customer-orders, and
|
|
794
|
+
status-based access patterns. PITR enabled.
|
|
795
|
+
relationships:
|
|
796
|
+
- Order Service Lambda (read/write)
|
|
797
|
+
- Payment Service Lambda (write)
|
|
798
|
+
- Fulfilment Service Lambda (write)
|
|
799
|
+
|
|
800
|
+
- name: SQS Fulfilment Queue
|
|
801
|
+
type: queue
|
|
802
|
+
technology: AWS SQS Standard Queue + Dead Letter Queue
|
|
803
|
+
responsibility: >
|
|
804
|
+
Decouples Fulfilment Service from WMS Adapter. Provides at-least-once
|
|
805
|
+
delivery with visibility timeout (30s). DLQ captures messages failing
|
|
806
|
+
after 3 receive attempts.
|
|
807
|
+
relationships:
|
|
808
|
+
- Fulfilment Service Lambda (send)
|
|
809
|
+
- WMS Adapter Lambda (consume)
|
|
810
|
+
|
|
811
|
+
- name: S3 Event Archive
|
|
812
|
+
type: storage
|
|
813
|
+
technology: AWS S3 (KMS encrypted, Object Lock, versioning)
|
|
814
|
+
responsibility: >
|
|
815
|
+
7-year immutable event archive for audit, compliance, and analytics.
|
|
816
|
+
Receives all EventBridge events via Kinesis Firehose.
|
|
817
|
+
relationships:
|
|
818
|
+
- Kinesis Firehose (write)
|
|
819
|
+
- Analytics Platform (read)
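The SQS Fulfilment Queue entry above says the DLQ captures messages that fail after 3 receive attempts. The TypeScript sketch below models that redrive behaviour in memory so the counting is easy to follow. It is not the SQS API: `QueueWithDlq`, `drain`, and the message shape are hypothetical, and the 30s visibility timeout is deliberately left out.

```typescript
// In-memory model of a queue with a redrive policy. Not the SQS API.
interface QueuedMessage {
  id: string;
  body: string;
  receiveCount: number;
}

class QueueWithDlq {
  readonly main: QueuedMessage[] = [];
  readonly deadLetter: QueuedMessage[] = [];
  private maxReceiveCount: number;

  constructor(maxReceiveCount: number) {
    this.maxReceiveCount = maxReceiveCount;
  }

  send(id: string, body: string): void {
    this.main.push({ id: id, body: body, receiveCount: 0 });
  }

  // Hands out the next message, or moves it to the DLQ when it has already
  // been received maxReceiveCount times without being deleted.
  receive(): QueuedMessage | undefined {
    while (this.main.length > 0) {
      const message = this.main[0];
      if (message === undefined) break;
      if (message.receiveCount >= this.maxReceiveCount) {
        this.main.shift();
        this.deadLetter.push(message);
        continue;
      }
      message.receiveCount += 1;
      return message;
    }
    return undefined;
  }

  // Called by the consumer after successful processing.
  deleteMessage(id: string): void {
    for (let i = 0; i < this.main.length; i++) {
      const candidate = this.main[i];
      if (candidate !== undefined && candidate.id === id) {
        this.main.splice(i, 1);
        return;
      }
    }
  }
}

// Processes messages until the main queue is empty. The handler returns true
// on success; failed messages stay queued and are retried on the next receive.
function drain(queue: QueueWithDlq, handler: (body: string) => boolean): void {
  let message = queue.receive();
  while (message !== undefined) {
    if (handler(message.body)) {
      queue.deleteMessage(message.id);
    }
    message = queue.receive();
  }
}

const demoQueue = new QueueWithDlq(3);
demoQueue.send("m-1", "fulfil order o-123");
drain(demoQueue, () => false); // simulate a WMS outage: every attempt fails
console.log(demoQueue.deadLetter.length);
```

With a handler that always fails, the message is handed out three times and then lands in the dead-letter list, which is the condition the DLQ depth alarm is meant to catch.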
|
|
820
|
+
|
|
821
|
+
decisions:
|
|
822
|
+
- id: ADR-001
|
|
823
|
+
title: Event-driven choreography over synchronous request-response for inter-service communication
|
|
824
|
+
context: >
|
|
825
|
+
Six services (Order, Payment, Fulfilment, Notification, WMS Adapter,
|
|
826
|
+
Analytics) must coordinate to process an order. The services need to
|
|
827
|
+
be independently deployable by separate teams. DR-04 requires no
|
|
828
|
+
cross-service deployment coordination. DR-01 requires 10× scale without
|
|
829
|
+
re-platforming.
|
|
830
|
+
options:
|
|
831
|
+
- id: A
|
|
832
|
+
description: Synchronous REST — each service calls the next in the chain
|
|
833
|
+
pros:
|
|
834
|
+
- Simple to reason about; linear call trace
|
|
835
|
+
- Immediate consistency; error propagation is direct
|
|
836
|
+
cons:
|
|
837
|
+
- Tight coupling; callee changes break callers
|
|
838
|
+
- Cascading failures; one slow service blocks the chain
|
|
839
|
+
- Cannot independently deploy without coordinating all services
|
|
840
|
+
- Latency is additive across the chain
|
|
841
|
+
- id: B
|
|
842
|
+
description: Orchestration via a central state machine (AWS Step Functions)
|
|
843
|
+
pros:
|
|
844
|
+
- Central visibility of workflow state
|
|
845
|
+
- Built-in retry and error handling
|
|
846
|
+
cons:
|
|
847
|
+
- Central orchestrator is a coupling point
|
|
848
|
+
- Teams must agree on orchestration contract changes
|
|
849
|
+
- Step Functions costs and throttle limits at high order volumes
|
|
850
|
+
- id: C
|
|
851
|
+
description: Choreography via EventBridge event bus (chosen)
|
|
852
|
+
pros:
|
|
853
|
+
- Services are fully decoupled; each publishes and subscribes independently
|
|
854
|
+
- Schema registry enforces contracts without runtime coupling
|
|
855
|
+
- EventBridge scales to millions of events/second
|
|
856
|
+
- New subscribers (analytics, fraud) added without modifying publishers
|
|
857
|
+
cons:
|
|
858
|
+
- End-to-end flow harder to trace (mitigated by X-Ray correlation IDs)
|
|
859
|
+
- Eventual consistency model requires idempotent consumers
|
|
860
|
+
- Schema evolution requires backwards-compatible changes
|
|
861
|
+
decision: Option C — EventBridge choreography
|
|
862
|
+
rationale: >
|
|
863
|
+
DR-04 (independent deployability) and DR-01 (10× scale) are best
|
|
864
|
+
served by choreography. EventBridge provides native schema enforcement,
|
|
865
|
+
eliminating the primary risk of loose coupling. X-Ray with correlation
|
|
866
|
+
IDs mitigates the observability trade-off. PRIN-01 (event-driven by
|
|
867
|
+
default) aligns with this choice.
|
|
868
|
+
tradeoffs: >
|
|
869
|
+
Accepted: harder end-to-end flow tracing (mitigated by X-Ray + correlation
|
|
870
|
+
IDs); eventual consistency requiring idempotent consumers; schema
|
|
871
|
+
evolution discipline required. Rejected: tight coupling, cascading
|
|
872
|
+
failures, deployment coordination overhead.
|
|
873
|
+
consequences: >
|
|
874
|
+
All service teams must implement idempotent consumers (DynamoDB
|
|
875
|
+
conditional writes for deduplication). EventBridge Schema Registry
|
|
876
|
+
is mandatory. X-Ray tracing is mandatory for all Lambda functions.
|
|
877
|
+
A change to an event schema requires a compatibility check in CI
|
|
878
|
+
before merge.
|
|
879
|
+
revisit_conditions: >
|
|
880
|
+
Revisit if: order event throughput exceeds 50K events/second (Step
|
|
881
|
+
Functions throttle may become attractive); or if strong consistency
|
|
882
|
+
requirements are introduced (e.g. inventory reservation); or if the
|
|
883
|
+
team size shrinks such that independent deployment is no longer a goal.
|
|
884
|
+
driver_refs: [DR-01, DR-04, DR-05]
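ADR-001 requires a compatibility check in CI before an event schema change is merged. The TypeScript sketch below shows what a minimal check of that kind might look like. The `EventSchema` shape and the rule set are assumptions made for illustration (existing fields may not be removed or change type, and required fields must stay required). They are not the EventBridge Schema Registry format or the project's actual CI rule.

```typescript
// Simplified, hypothetical schema shape: field name -> type name, plus the
// list of required field names.
interface EventSchema {
  fields: { [name: string]: string };
  required: string[];
}

// Returns the reasons why `next` could break consumers written against `prev`.
// An empty array means the change is treated as backwards compatible.
function findBreakingChanges(prev: EventSchema, next: EventSchema): string[] {
  const problems: string[] = [];
  const prevNames = Object.keys(prev.fields);
  for (let i = 0; i < prevNames.length; i++) {
    const name = prevNames[i];
    const prevType = prev.fields[name];
    const nextType = next.fields[name];
    if (nextType === undefined) {
      problems.push("field removed: " + name);
    } else if (nextType !== prevType) {
      problems.push("field type changed: " + name);
    }
  }
  for (let i = 0; i < prev.required.length; i++) {
    const name = prev.required[i];
    if (next.required.indexOf(name) === -1) {
      problems.push("field no longer required: " + name);
    }
  }
  return problems;
}

// Field names follow the OrderCreated event detail described in the data flow.
const orderCreatedV1: EventSchema = {
  fields: {
    orderId: "string",
    customerId: "string",
    totalAmount: "number",
    correlationId: "string",
    timestamp: "string",
  },
  required: ["orderId", "customerId", "totalAmount", "correlationId", "timestamp"],
};

// Adding an optional field ("currency" is a made-up example) is accepted.
const orderCreatedV2: EventSchema = {
  fields: {
    orderId: "string",
    customerId: "string",
    totalAmount: "number",
    correlationId: "string",
    timestamp: "string",
    currency: "string",
  },
  required: orderCreatedV1.required,
};

console.log(findBreakingChanges(orderCreatedV1, orderCreatedV2));
```

A CI job could fail the build whenever the returned list is non-empty, leaving additive changes free to merge.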
|
|
885
|
+
|
|
886
|
+
- id: ADR-002
|
|
887
|
+
title: DynamoDB over RDS PostgreSQL as the primary order data store
|
|
888
|
+
context: >
|
|
889
|
+
Order data must support: order-by-id lookup (< 5ms), customer-order-list
|
|
890
|
+
query (< 10ms), status-update writes (< 5ms). DR-01 requires 10× scale.
|
|
891
|
+
DR-05 requires immutable audit trail. The access patterns are
|
|
892
|
+
well-defined and documented. Complex ad-hoc queries are handled by
|
|
893
|
+
the Analytics Platform (not this service).
|
|
894
|
+
options:
|
|
895
|
+
- id: A
|
|
896
|
+
description: DynamoDB single-table design with on-demand capacity (chosen)
|
|
897
|
+
pros:
|
|
898
|
+
- Scales to any throughput without provisioning
|
|
899
|
+
- Single-digit millisecond latency at any scale
|
|
900
|
+
- No connection pool management; Lambda-friendly
|
|
901
|
+
- PITR + Global Tables for DR
|
|
902
|
+
cons:
|
|
903
|
+
- No SQL; complex queries require GSI design upfront
|
|
904
|
+
- Schema changes require careful migration
|
|
905
|
+
- Relational queries (multi-entity joins) not supported
|
|
906
|
+
- id: B
|
|
907
|
+
description: RDS PostgreSQL with read replicas
|
|
908
|
+
pros:
|
|
909
|
+
- Full SQL; flexible queries without upfront access pattern design
|
|
910
|
+
- Familiar to most developers
|
|
911
|
+
cons:
|
|
912
|
+
- Connection pool limits (RDS Proxy needed for Lambda)
|
|
913
|
+
- Manual scaling; vertical scale requires downtime
|
|
914
|
+
- Not serverless — always-on cost even at zero load
|
|
915
|
+
- PITR window limited to 35 days
|
|
916
|
+
- id: C
|
|
917
|
+
description: Aurora Serverless v2 PostgreSQL
|
|
918
|
+
pros:
|
|
919
|
+
- SQL + serverless scaling
|
|
920
|
+
cons:
|
|
921
|
+
- Cold-start latency for Aurora Serverless v2 is higher than for DynamoDB
|
|
922
|
+
- Still requires connection pooling for Lambda at scale
|
|
923
|
+
- Higher cost at sustained high throughput vs DynamoDB
|
|
924
|
+
decision: Option A — DynamoDB single-table design
|
|
925
|
+
rationale: >
|
|
926
|
+
Access patterns are well-defined (order-by-id, customer-order-list,
|
|
927
|
+
status queries). DynamoDB's serverless model aligns with Lambda
|
|
928
|
+
(DR-01, DR-03). No connection pool management at scale is a
|
|
929
|
+
significant operational advantage. Analytics queries (complex joins)
|
|
930
|
+
are delegated to the Analytics Platform via S3 event archive.
|
|
931
|
+
PRIN-04 (IaC only) is satisfied by CDK DynamoDB construct.
|
|
932
|
+
tradeoffs: >
|
|
933
|
+
Accepted: no SQL for ad-hoc queries; access patterns must be
|
|
934
|
+
documented and designed upfront; schema evolution requires migration
|
|
935
|
+
scripts. Gained: sub-5ms latency at any scale; zero ops overhead
|
|
936
|
+
for capacity management; serverless cost model.
|
|
937
|
+
consequences: >
|
|
938
|
+
Access pattern document (DynamoDB design appendix) must be maintained.
|
|
939
|
+
Any new query pattern requires a GSI — raise as Architecture Review
|
|
940
|
+
before implementation. Analytics queries must use S3 event archive
|
|
941
|
+
or DynamoDB exports, not direct DynamoDB scans.
|
|
942
|
+
revisit_conditions: >
|
|
943
|
+
Revisit if: complex relational queries become a product requirement
|
|
944
|
+
for the order service itself (not analytics); or if DynamoDB pricing
|
|
945
|
+
changes materially; or if the team grows to have dedicated DBA capacity
|
|
946
|
+
with preference for SQL.
|
|
947
|
+
driver_refs: [DR-01, DR-03]
|
|
948
|
+
|
|
949
|
+
- id: ADR-003
|
|
950
|
+
title: AWS Lambda over ECS Fargate for compute
|
|
951
|
+
context: >
|
|
952
|
+
Six service functions need to run on AWS. DR-01 requires elastic scale
|
|
953
|
+
to 5000 orders/minute burst. PRIN-04 requires IaC-only provisioning.
|
|
954
|
+
The team of 4 engineers cannot both operate a container platform and
|
|
955
|
+
build services.
|
|
956
|
+
options:
|
|
957
|
+
- id: A
|
|
958
|
+
description: AWS Lambda (serverless functions) — chosen
|
|
959
|
+
pros:
|
|
960
|
+
- Zero operational overhead — no cluster management
|
|
961
|
+
- Automatic scaling from 0 to thousands of concurrent executions
|
|
962
|
+
- Pay-per-invocation; zero cost at zero load
|
|
963
|
+
- Native EventBridge and SQS triggers; no polling code
|
|
964
|
+
cons:
|
|
965
|
+
- 15-minute maximum execution time (not a constraint here)
|
|
966
|
+
- Cold-start latency (mitigated by provisioned concurrency for submission path)
|
|
967
|
+
- Package size limit 250MB unzipped (not a constraint here)
|
|
968
|
+
- id: B
|
|
969
|
+
description: ECS Fargate (containerised services)
|
|
970
|
+
pros:
|
|
971
|
+
- No cold start; consistent latency
|
|
972
|
+
- Arbitrary execution time
|
|
973
|
+
- Familiar container model
|
|
974
|
+
cons:
|
|
975
|
+
- Always-on minimum cost (min 1 task per service × 6 services)
|
|
976
|
+
- Auto-scaling is slower (minutes, not seconds)
|
|
977
|
+
- Requires more operational expertise; cluster networking
|
|
978
|
+
- Team would spend 30–40% of time on container operations
|
|
979
|
+
- id: C
|
|
980
|
+
description: EKS (Kubernetes)
|
|
981
|
+
pros:
|
|
982
|
+
- Maximum flexibility; portable
|
|
983
|
+
cons:
|
|
984
|
+
- Highest operational overhead; not appropriate for 4-person team
|
|
985
|
+
- Overkill for defined event-driven workloads
|
|
986
|
+
decision: Option A — AWS Lambda
|
|
987
|
+
rationale: >
|
|
988
|
+
The workload (event-triggered, short-duration, variable load) is the
|
|
989
|
+
canonical Lambda use case. DR-01 (10× scale) is satisfied without
|
|
990
|
+
operational overhead. PRIN-04 (IaC only) is trivially satisfied by
|
|
991
|
+
CDK Lambda construct. Provisioned concurrency on the Order Service
|
|
992
|
+
submission path addresses cold-start for DR-03 (P99 < 500ms).
|
|
993
|
+
tradeoffs: >
|
|
994
|
+
Accepted: cold-start risk (mitigated by provisioned concurrency);
|
|
995
|
+
15-minute execution limit (acceptable — order processing completes
|
|
996
|
+
in seconds); vendor lock-in to AWS Lambda. Gained: zero operational
|
|
997
|
+
overhead; automatic scaling; pay-per-use cost model.
|
|
998
|
+
consequences: >
|
|
999
|
+
Lambda package size must stay < 250MB. Lambda execution time must
|
|
1000
|
+
complete within 15 minutes (monitored in CloudWatch). Provisioned
|
|
1001
|
+
concurrency must be configured for Order Service (and reviewed
|
|
1002
|
+
quarterly for cost vs performance).
|
|
1003
|
+
revisit_conditions: >
|
|
1004
|
+
Revisit if: a service requires > 15-minute execution; or team grows
|
|
1005
|
+
to 10+ engineers with dedicated platform capacity; or Lambda pricing
|
|
1006
|
+
changes materially relative to ECS Fargate.
|
|
1007
|
+
driver_refs: [DR-01, DR-03]
|
|
1008
|
+
|
|
1009
|
+
- id: ADR-004
|
|
1010
|
+
title: EventBridge over Amazon SNS for the event bus
|
|
1011
|
+
context: >
|
|
1012
|
+
An event bus is required for pub-sub between six services (ADR-001
|
|
1013
|
+
selected choreography). Two candidate AWS services: SNS (topics +
|
|
1014
|
+
subscriptions) and EventBridge (custom event bus with routing rules).
|
|
1015
|
+
options:
|
|
1016
|
+
- id: A
|
|
1017
|
+
description: Amazon SNS (topics + Lambda subscriptions)
|
|
1018
|
+
pros:
|
|
1019
|
+
- Simpler mental model; familiar
|
|
1020
|
+
- Lower per-event cost at high volume
|
|
1021
|
+
cons:
|
|
1022
|
+
- No content-based routing (filter by event body requires Lambda code)
|
|
1023
|
+
- No schema registry; contracts enforced only in application code
|
|
1024
|
+
- No built-in event replay for debugging/DR
|
|
1025
|
+
- No Archive and Replay feature
|
|
1026
|
+
- id: B
|
|
1027
|
+
description: AWS EventBridge custom event bus with schema registry — chosen
|
|
1028
|
+
pros:
|
|
1029
|
+
- Content-based routing rules (filter by event source, type, and body fields)
|
|
1030
|
+
- Schema registry with automatic discovery and compatibility checks
|
|
1031
|
+
- Archive and Replay for debugging and DR
|
|
1032
|
+
- Native integration with many AWS services
|
|
1033
|
+
cons:
|
|
1034
|
+
- Higher cost per event than SNS (approximately 5× at high volume)
|
|
1035
|
+
- "Throughput limit: 10K events/second per event bus (region)"
|
|
1036
|
+
- id: C
|
|
1037
|
+
description: Amazon MSK (Managed Kafka)
|
|
1038
|
+
pros:
|
|
1039
|
+
- Extremely high throughput; replay by default
|
|
1040
|
+
cons:
|
|
1041
|
+
- Significant operational overhead for 4-person team
|
|
1042
|
+
- Not appropriate for current order volume (< 5000/minute)
|
|
1043
|
+
decision: Option B — AWS EventBridge
|
|
1044
|
+
rationale: >
|
|
1045
|
+
Schema registry enforcement (DR-04, PRIN-01) and content-based routing
|
|
1046
|
+
provide the decoupling guarantees required. Archive and Replay
|
|
1047
|
+
directly supports DR-05 (audit trail). Cost premium over SNS is
|
|
1048
|
+
justified by reduced application-layer filtering code. EventBridge
|
|
1049
|
+
throughput limit (10K events/s) is 120× the peak order volume
|
|
1050
|
+
(5000 orders/min = ~83/s), providing ample headroom.
|
|
1051
|
+
tradeoffs: >
|
|
1052
|
+
Accepted: higher cost vs SNS (~5× per event); throughput ceiling of
|
|
1053
|
+
10K events/s per bus (headroom: 120× current peak). Gained: schema
|
|
1054
|
+
enforcement; content-based routing; Archive and Replay; no custom
|
|
1055
|
+
routing logic in Lambda.
|
|
1056
|
+
consequences: >
|
|
1057
|
+
EventBridge Schema Registry is mandatory for all new events. Schema
|
|
1058
|
+
compatibility checks must run in CI. Archive must be enabled on the
|
|
1059
|
+
order-processing bus (7-day default; S3 Firehose for long-term storage).
|
|
1060
|
+
revisit_conditions: >
|
|
1061
|
+
Revisit if: event throughput approaches 8K events/second (80% of
|
|
1062
|
+
bus limit); or cost becomes a material concern (review quarterly).
|
|
1063
|
+
MSK would be the next step at sustained > 10K events/second.
|
|
1064
|
+
driver_refs: [DR-04, DR-05]
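The headroom figure quoted in ADR-004 can be reproduced with a few lines of arithmetic. This is a minimal sketch, not part of the reference architecture: the helper name `busHeadroom` is hypothetical, and the `eventsPerOrder` parameter is an assumption, since the ADR implicitly counts one bus event per order.

```typescript
// Hypothetical helper: checks the EventBridge headroom quoted in ADR-004.
// Assumption: each order produces `eventsPerOrder` events on the bus.
function busHeadroom(
  ordersPerMinute: number,
  busLimitPerSecond: number,
  eventsPerOrder: number = 1,
): { eventsPerSecond: number; headroom: number; utilisation: number } {
  const eventsPerSecond = (ordersPerMinute / 60) * eventsPerOrder;
  return {
    eventsPerSecond,
    headroom: busLimitPerSecond / eventsPerSecond, // multiple of load the bus can absorb
    utilisation: eventsPerSecond / busLimitPerSecond, // fraction of the bus limit in use
  };
}

// Burst peak from DR-01 against the 10K events/s bus limit.
const peak = busHeadroom(5000, 10000);
console.log(peak.eventsPerSecond.toFixed(1), peak.headroom.toFixed(0));
```

If a choreographed order emits several events rather than one, the headroom shrinks proportionally (four events per order would give 30× rather than 120×), so the 8K events/second revisit threshold is probably best tracked against measured PutEvents volume rather than order count.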
|
|
1065
|
+
|
|
1066
|
+
- id: ADR-005
|
|
1067
|
+
title: DynamoDB single-table design over multi-table design
|
|
1068
|
+
context: >
|
|
1069
|
+
DynamoDB was selected (ADR-002). DynamoDB supports two design approaches:
|
|
1070
|
+
single-table (all entities in one table; the PK/SK values encode the entity
|
|
1071
|
+
type) and multi-table (one table per entity type, simpler model).
|
|
1072
|
+
Access patterns: order-by-id, customer-orders list, orders-by-status.
|
|
1073
|
+
options:
|
|
1074
|
+
- id: A
|
|
1075
|
+
description: Single-table design (all entities in one table) — chosen
|
|
1076
|
+
pros:
|
|
1077
|
+
- Single-digit ms latency for all access patterns via GSI
|
|
1078
|
+
- Fewer DynamoDB tables to manage and monitor
|
|
1079
|
+
- Can fetch related entities in a single request (if co-located)
|
|
1080
|
+
cons:
|
|
1081
|
+
- PK/SK design is non-obvious; requires documentation
|
|
1082
|
+
- Harder for developers unfamiliar with DynamoDB patterns
|
|
1083
|
+
- Mistakes in PK/SK design are expensive to fix post-deployment
|
|
1084
|
+
- id: B
|
|
1085
|
+
description: Multi-table design (one table per entity type)
|
|
1086
|
+
pros:
|
|
1087
|
+
- Simpler mental model; each table maps to one entity
|
|
1088
|
+
- IAM policies per table are more granular
|
|
1089
|
+
cons:
|
|
1090
|
+
- Cross-entity queries require multiple requests or scatter-gather
|
|
1091
|
+
- More tables to monitor, back up, and configure
|
|
1092
|
+
decision: Option A — Single-table design
|
|
1093
|
+
rationale: >
|
|
1094
|
+
Access patterns are well-defined at design time. Single-table design
|
|
1095
|
+
achieves all required patterns with GSIs and avoids scatter-gather
|
|
1096
|
+
queries. The operational simplicity (fewer tables) outweighs the
|
|
1097
|
+
design complexity given the team has DynamoDB expertise. DR-03
|
|
1098
|
+
(< 200ms P99) is best served by single-table with optimised GSIs.
|
|
1099
|
+
tradeoffs: >
|
|
1100
|
+
Accepted: PK/SK design complexity; requires upfront access pattern
|
|
1101
|
+
documentation; harder for new team members to onboard. Gained:
|
|
1102
|
+
optimal query performance; all access patterns served with single-digit
|
|
1103
|
+
ms latency; fewer DynamoDB resources to manage.
|
|
1104
|
+
consequences: >
|
|
1105
|
+
Access pattern document must be written and maintained. All DynamoDB
|
|
1106
|
+
changes must be reviewed by an engineer with DynamoDB expertise.
|
|
1107
|
+
New access patterns must be raised as an Architecture Decision before
|
|
1108
|
+
adding GSIs.
|
|
1109
|
+
revisit_conditions: >
|
|
1110
|
+
Revisit if: a new access pattern cannot be served by existing GSIs
|
|
1111
|
+
and would require a table scan; or if the team changes to have no
|
|
1112
|
+
DynamoDB expertise.
|
|
1113
|
+
driver_refs: [DR-03]
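To make the PK/SK discussion in ADR-005 concrete, one possible key layout for the three named access patterns (order-by-id, customer-orders list, orders-by-status) is sketched here. Everything in it is an assumption for illustration: the attribute names, the `ORDER#` / `CUSTOMER#` / `STATUS#` prefixes, the two GSIs, and the status values do not come from the source.

```typescript
// Hypothetical key layout for the three access patterns named in ADR-005.
// None of these attribute names or prefixes come from the source document.
interface Order {
  orderId: string;
  customerId: string;
  status: "PENDING" | "PAID" | "SHIPPED"; // illustrative status values
  createdAt: string; // ISO-8601 UTC, so lexicographic order equals time order
}

interface OrderItemKeys {
  PK: string;     // base table partition key: order-by-id
  SK: string;     // base table sort key
  GSI1PK: string; // GSI1: customer-orders list
  GSI1SK: string;
  GSI2PK: string; // GSI2: orders-by-status
  GSI2SK: string;
}

function orderKeys(o: Order): OrderItemKeys {
  return {
    PK: `ORDER#${o.orderId}`,
    SK: "META",
    GSI1PK: `CUSTOMER#${o.customerId}`,
    GSI1SK: `ORDER#${o.createdAt}#${o.orderId}`,
    GSI2PK: `STATUS#${o.status}`,
    GSI2SK: `${o.createdAt}#${o.orderId}`,
  };
}
```

A status-only partition key concentrates traffic on a handful of values, so a real design might shard it (for example by appending a bucket suffix). That is the kind of decision the access pattern document required under consequences would record.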
|
|
1114
|
+
|
|
1115
|
+
component_classification:
|
|
1116
|
+
components:
|
|
1117
|
+
- name: API Gateway (Regional, REST API)
|
|
1118
|
+
mandate_level: mandatory
|
|
1119
|
+
rationale: >
|
|
1120
|
+
Mandatory entry point for all order API traffic. Provides Cognito
|
|
1121
|
+
JWT authorisation, WAF integration, and request validation.
|
|
1122
|
+
Replacing with a custom reverse proxy would require duplicating
|
|
1123
|
+
security controls and is not permitted without Architecture Board
|
|
1124
|
+
exception.
|
|
1125
|
+
- name: Cognito User Pool
|
|
1126
|
+
mandate_level: mandatory
|
|
1127
|
+
rationale: >
|
|
1128
|
+
Corporate identity standard for customer and operator authentication.
|
|
1129
|
+
Provides JWT issuance, MFA, and token management. Replacing with
|
|
1130
|
+
an alternative IdP requires Security Architecture review.
|
|
1131
|
+
- name: EventBridge Custom Event Bus
|
|
1132
|
+
mandate_level: mandatory
|
|
1133
|
+
rationale: >
|
|
1134
|
+
Mandatory inter-service communication channel. ADR-001 and ADR-004.
|
|
1135
|
+
Direct service-to-service calls are not permitted.
|
|
1136
|
+
- name: DynamoDB (single-table)
|
|
1137
|
+
mandate_level: mandatory
|
|
1138
|
+
rationale: >
|
|
1139
|
+
Mandatory for order state persistence. ADR-002. Alternative stores
|
|
1140
|
+
not permitted for order data without Architecture Board review.
|
|
1141
|
+
- name: Lambda (Node.js 20, ARM64)
|
|
1142
|
+
mandate_level: mandatory
|
|
1143
|
+
rationale: >
|
|
1144
|
+
Mandatory compute platform. ADR-003. Node.js 20 is the approved
|
|
1145
|
+
runtime. ARM64 selected for cost efficiency (~20% cheaper than x86).
|
|
1146
|
+
- name: AWS CDK v2 (IaC)
|
|
1147
|
+
mandate_level: mandatory
|
|
1148
|
+
rationale: PRIN-04 (IaC only). All infrastructure must be in CDK stacks.
|
|
1149
|
+
- name: CloudWatch + X-Ray (observability)
|
|
1150
|
+
mandate_level: mandatory
|
|
1151
|
+
rationale: >
|
|
1152
|
+
PRIN-05. Structured logging, metrics, and X-Ray tracing are mandatory
|
|
1153
|
+
for all Lambda functions. Correlation ID propagation is mandatory.
|
|
1154
|
+
- name: KMS Customer-Managed Keys
|
|
1155
|
+
mandate_level: mandatory
|
|
1156
|
+
rationale: >
|
|
1157
|
+
PCI-DSS requirement. All data stores must use KMS CMKs (not
|
|
1158
|
+
AWS-managed keys) to satisfy PCI-DSS key management controls.
|
|
1159
|
+
- name: SQS (work queues with DLQ)
|
|
1160
|
+
mandate_level: recommended
|
|
1161
|
+
rationale: >
|
|
1162
|
+
Recommended for all work-queue patterns (e.g. WMS adapter).
|
|
1163
|
+
Not mandatory for direct Lambda event triggers from EventBridge.
|
|
1164
|
+
- name: Kinesis Firehose → S3 (event archive)
|
|
1165
|
+
mandate_level: recommended
|
|
1166
|
+
rationale: >
|
|
1167
|
+
Recommended for all event buses where audit retention > 7 days is
|
|
1168
|
+
required. Mandatory only if DR-05 (7-year audit) applies to the
|
|
1169
|
+
specific domain.
|
|
1170
|
+
- name: Lambda Provisioned Concurrency
|
|
1171
|
+
mandate_level: recommended
|
|
1172
|
+
rationale: >
|
|
1173
|
+
Recommended for latency-sensitive synchronous paths (order submission).
|
|
1174
|
+
Optional for asynchronous event handlers where cold-start is acceptable.
|
|
1175
|
+
- name: CloudFront + WAF
|
|
1176
|
+
mandate_level: mandatory
|
|
1177
|
+
rationale: >
|
|
1178
|
+
All public-facing API traffic must traverse CloudFront + WAF.
|
|
1179
|
+
Mandatory for PCI-DSS (network security controls) and DDoS mitigation.
|
|
1180
|
+
- name: Notification channel (SES/SNS vs third-party)
|
|
1181
|
+
mandate_level: optional
|
|
1182
|
+
rationale: >
|
|
1183
|
+
Teams may substitute SES/SNS with an approved third-party notification
|
|
1184
|
+
provider (e.g. Twilio) if product requirements dictate. Must use
|
|
1185
|
+
the Notification Service adapter pattern; no direct notification
|
|
1186
|
+
calls from other services.
|
|
1187
|
+
extension_points:
|
|
1188
|
+
- name: Notification provider
|
|
1189
|
+
description: >
|
|
1190
|
+
The Notification Service uses an adapter pattern. Teams may substitute
|
|
1191
|
+
the SES/SNS implementation with any approved notification provider
|
|
1192
|
+
by implementing the NotificationAdapter interface.
|
|
1193
|
+
guidance: >
|
|
1194
|
+
Acceptable: Twilio, SendGrid, or any provider integrated via AWS
|
|
1195
|
+
Lambda. Provider must support at-least-once delivery guarantees
|
|
1196
|
+
and must not require storing customer contact data outside AWS.
|
|
1197
|
+
examples: Twilio SMS adapter, SendGrid email adapter
|
|
1198
|
+
- name: Payment gateway
|
|
1199
|
+
description: >
|
|
1200
|
+
The Payment Service uses a payment gateway adapter. Teams may
|
|
1201
|
+
substitute Stripe with another approved payment gateway.
|
|
1202
|
+
guidance: >
|
|
1203
|
+
New gateway must support idempotent charge requests (via idempotency
|
|
1204
|
+
key or equivalent). Must undergo Security Architecture review before
|
|
1205
|
+
substitution.
|
|
1206
|
+
examples: Adyen adapter, Braintree adapter
|
|
1207
|
+
- name: Programming language (Lambda runtime)
|
|
1208
|
+
description: >
|
|
1209
|
+
Node.js 20 is the approved default runtime. Teams may use Python 3.12
|
|
1210
|
+
or Java 21 if the team has demonstrably stronger expertise in that
|
|
1211
|
+
language and the performance characteristics are validated.
|
|
1212
|
+
guidance: >
|
|
1213
|
+
Must use the AWS Lambda Powertools library for the chosen runtime.
|
|
1214
|
+
Non-Node.js choices require Architecture Board approval and must
|
|
1215
|
+
demonstrate equivalent structured logging and X-Ray integration.
|
|
1216
|
+
examples: Python 3.12 with Lambda Powertools for Python
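The Notification provider extension point names a `NotificationAdapter` interface without showing it. The sketch below is one plausible shape, offered only as an illustration: the member names, the `NotificationMessage` fields, and the `InMemoryAdapter` test double are all hypothetical. The idempotency key is included because the guidance asks for at-least-once delivery, which implies that duplicates must be tolerated.

```typescript
// Hypothetical shape: the source names NotificationAdapter but not its members.
interface NotificationMessage {
  recipientRef: string;   // opaque reference; contact data itself stays inside AWS
  channel: "email" | "sms";
  body: string;
  idempotencyKey: string; // lets at-least-once delivery be de-duplicated
}

interface NotificationAdapter {
  send(msg: NotificationMessage): Promise<{ accepted: boolean }>;
}

// In-memory adapter for unit tests. A Twilio or SendGrid adapter would
// implement the same interface and call the provider from Lambda.
class InMemoryAdapter implements NotificationAdapter {
  readonly sent = new Map<string, NotificationMessage>();

  async send(msg: NotificationMessage): Promise<{ accepted: boolean }> {
    // A retried message with the same key overwrites rather than appends.
    this.sent.set(msg.idempotencyKey, msg);
    return { accepted: true };
  }
}
```

Other services would depend only on `NotificationAdapter`, which keeps the rule that no service calls a notification provider directly.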
|
|
1217
|
+
|
|
1218
|
+
quality_attributes:
|
|
1219
|
+
- attribute: Availability
|
|
1220
|
+
target: 99.95% measured monthly
|
|
1221
|
+
measurement: Synthetic monitoring probes every 30 seconds via CloudWatch
|
|
1222
|
+
Synthetics Canary. Probe hits POST /orders with a test payload.
|
|
1223
|
+
SLO measured as proportion of successful probe responses over rolling
|
|
1224
|
+
30-day window.
|
|
1225
|
+
validation_strategy: >
|
|
1226
|
+
Chaos engineering quarterly (AWS Fault Injection Simulator): simulate
|
|
1227
|
+
single-AZ failure, Lambda throttling, DynamoDB throttling. Verify
|
|
1228
|
+
automatic recovery within 5 minutes. Load test monthly in staging
|
|
1229
|
+
at 150% of peak load (7500 orders/minute).
|
|
1230
|
+
fitness_function: >
|
|
1231
|
+
CloudWatch alarm fires if availability drops below 99.95% in any
|
|
1232
|
+
rolling 24-hour window. Alarm triggers PagerDuty P1. CDK deploys
|
|
1233
|
+
the alarm alongside the service stack.
|
|
1234
|
+
quality_scenario: >
|
|
1235
|
+
Stimulus: AZ failure in eu-west-1. Source: AWS infrastructure.
|
|
1236
|
+
Environment: Production, peak load 1000 orders/minute.
|
|
1237
|
+
Response: Lambda automatically routes traffic to remaining two AZs.
|
|
1238
|
+
Response measure: Zero failed order submissions; recovery within 2
|
|
1239
|
+
minutes; monthly availability remains >= 99.95%.
|
|
1240
|
+
|
|
1241
|
+
- attribute: Latency P99 (order submission)
|
|
1242
|
+
target: < 500ms end-to-end from API Gateway receipt to 202 response
|
|
1243
|
+
measurement: CloudWatch API Gateway P99 of the Latency metric (IntegrationLatency covers only the backend call).
|
|
1244
|
+
X-Ray service map shows breakdown per segment.
|
|
1245
|
+
validation_strategy: >
|
|
1246
|
+
Gatling load test at 1000 concurrent users, sustained 10 minutes.
|
|
1247
|
+
Test runs in staging CI before every production deployment.
|
|
1248
|
+
fitness_function: >
|
|
1249
|
+
Gatling test fails the CI build if P99 > 500ms at 1000 concurrent
|
|
1250
|
+
users. CloudWatch alarm on P99 > 400ms in production (warning);
|
|
1251
|
+
P99 > 500ms (critical, PagerDuty P2).
|
|
1252
|
+
quality_scenario: >
|
|
1253
|
+
Stimulus: 1000 concurrent order submissions. Source: Load test /
|
|
1254
|
+
production peak. Environment: Normal operating conditions.
|
|
1255
|
+
Response: API Gateway routes to Order Service Lambda (provisioned
|
|
1256
|
+
concurrency). Response measure: P99 <= 500ms for >= 99% of requests.
|
|
1257
|
+
|
|
1258
|
+
- attribute: Latency P99 (order status query)
|
|
1259
|
+
target: < 200ms end-to-end from API Gateway receipt to 200 response
|
|
1260
|
+
measurement: CloudWatch API Gateway P99 latency (GetOrderStatus endpoint).
|
|
1261
|
+
validation_strategy: >
|
|
1262
|
+
Gatling load test at 5000 concurrent status queries (read-heavy
|
|
1263
|
+
production pattern). DynamoDB GSI query measured separately.
|
|
1264
|
+
fitness_function: >
|
|
1265
|
+
Gatling test fails CI if P99 > 200ms at 5000 concurrent reads.
|
|
1266
|
+
quality_scenario: >
|
|
1267
|
+
Stimulus: 5000 concurrent status queries. Response measure:
|
|
1268
|
+
P99 <= 200ms; DynamoDB read latency <= 5ms.
|
|
1269
|
+
|
|
1270
|
+
- attribute: Throughput (sustained)
|
|
1271
|
+
target: 1000 orders/minute sustained; 5000 orders/minute burst (5 minutes)
|
|
1272
|
+
measurement: >
|
|
1273
|
+
CloudWatch metric: OrderCreated events/minute on EventBridge.
|
|
1274
|
+
Lambda concurrency utilisation dashboard.
|
|
1275
|
+
validation_strategy: >
|
|
1276
|
+
Monthly load test at 1000 orders/minute for 30 minutes in staging.
|
|
1277
|
+
Quarterly burst test at 5000 orders/minute for 5 minutes.
|
|
1278
|
+
fitness_function: >
|
|
1279
|
+
Load test step in CodePipeline deployment pipeline. Fails if
|
|
1280
|
+
< 1000 orders/minute throughput sustained or > 5% error rate.
|
|
1281
|
+
|
|
1282
|
+
- attribute: Error rate
|
|
1283
|
+
target: < 0.1% of order submissions result in a 5xx error
|
|
1284
|
+
measurement: CloudWatch API Gateway 5xxError metric as percentage of total requests.
|
|
1285
|
+
validation_strategy: >
|
|
1286
|
+
Continuous production monitoring. Monthly chaos injection (Lambda
|
|
1287
|
+
function errors, DynamoDB errors) to validate error handling.
|
|
1288
|
+
fitness_function: >
|
|
1289
|
+
CloudWatch alarm on 5xx error rate > 0.1% over rolling 5-minute
|
|
1290
|
+
window. PagerDuty P2 alert.
|
|
1291
|
+
|
|
1292
|
+
- attribute: Recovery Time Objective (RTO)
|
|
1293
|
+
target: 4 hours (full service restoration after catastrophic failure)
|
|
1294
|
+
measurement: "DR test: time from incident declaration to full order submission capacity."
|
|
1295
|
+
validation_strategy: >
|
|
1296
|
+
Annual DR exercise: simulate eu-west-1 complete unavailability.
|
|
1297
|
+
Execute failover runbook to eu-central-1. Measure time to first
|
|
1298
|
+
successful order in DR region.
|
|
1299
|
+
fitness_function: N/A — manual DR exercise with timer.
|
|
1300
|
+
quality_scenario: >
|
|
1301
|
+
Stimulus: eu-west-1 region failure. Source: AWS infrastructure.
|
|
1302
|
+
Environment: Production. Response: Route 53 failover activates
|
|
1303
|
+
eu-central-1 warm standby. Response measure: Full order submission
|
|
1304
|
+
capability restored within 4 hours.
|
|
1305
|
+
|
|
1306
|
+
- attribute: Recovery Point Objective (RPO)
|
|
1307
|
+
target: 1 hour (maximum data loss in catastrophic failure scenario)
|
|
1308
|
+
measurement: "DynamoDB Global Tables replication lag monitoring. Target: < 1s typical."
|
|
1309
|
+
validation_strategy: >
|
|
1310
|
+
Annual DR exercise: confirm DynamoDB Global Tables replica in
|
|
1311
|
+
eu-central-1 has all orders from the 60 minutes preceding the
|
|
1312
|
+
simulated failure.
|
|
1313
|
+
|
|
1314
|
+
- attribute: Security (PCI-DSS compliance)
|
|
1315
|
+
target: PCI-DSS Level 1 compliant (annual QSA audit pass)
|
|
1316
|
+
measurement: Annual PCI-DSS QSA audit. Quarterly internal penetration test.
|
|
1317
|
+
validation_strategy: >
|
|
1318
|
+
Monthly automated compliance scan (AWS Security Hub, PCI-DSS standard).
|
|
1319
|
+
Quarterly penetration test by approved security vendor.
|
|
1320
|
+
Annual QSA audit.
|
|
1321
|
+
fitness_function: >
|
|
1322
|
+
AWS Security Hub PCI-DSS standard enabled; findings of HIGH or
|
|
1323
|
+
CRITICAL severity block deployment via CodePipeline gate.
|
|
1324
|
+
|
|
1325
|
+
operational_model:
|
|
1326
|
+
slos:
|
|
1327
|
+
- name: Order API availability
|
|
1328
|
+
target: "99.95%"
|
|
1329
|
+
measurement_window: rolling 30 days
|
|
1330
|
+
error_budget: "21.9 minutes/month (99.95% availability)"
|
|
1331
|
+
- name: Order submission P99 latency
|
|
1332
|
+
target: "< 500ms"
|
|
1333
|
+
measurement_window: rolling 24 hours
|
|
1334
|
+
error_budget: "< 0.1% of requests may exceed 500ms"
|
|
1335
|
+
- name: Order status query P99 latency
|
|
1336
|
+
target: "< 200ms"
|
|
1337
|
+
measurement_window: rolling 24 hours
|
|
1338
|
+
error_budget: "< 0.1% of requests may exceed 200ms"
|
|
1339
|
+
- name: Payment authorisation success rate
|
|
1340
|
+
target: "> 99.5% (excluding card declines)"
|
|
1341
|
+
measurement_window: rolling 24 hours
|
|
1342
|
+
error_budget: "< 0.5% of authorisation attempts may fail due to system error"
|
|
1343
|
+
- name: Fulfilment dispatch success rate
|
|
1344
|
+
target: "> 99.9%"
|
|
1345
|
+
measurement_window: rolling 24 hours
|
|
1346
|
+
error_budget: "< 0.1% of fulfilment instructions may end in DLQ"
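The availability error budget follows directly from the target and the window length. As a minimal sketch (the function name `errorBudgetMinutes` is illustrative), the calculation below shows that 99.95% allows about 21.6 minutes over a strict 30-day window, and about 21.9 minutes over an average calendar month of 365.25 / 12 days, which is where the quoted 21.9 figure appears to come from.

```typescript
// Downtime allowed by an availability target over a window of `days` days.
function errorBudgetMinutes(targetPercent: number, days: number): number {
  const windowMinutes = days * 24 * 60;
  return windowMinutes * (1 - targetPercent / 100);
}

const rolling30 = errorBudgetMinutes(99.95, 30);             // strict 30-day window
const averageMonth = errorBudgetMinutes(99.95, 365.25 / 12); // mean calendar month
console.log(rolling30.toFixed(1), averageMonth.toFixed(1));
```

Because the SLO's measurement window is a rolling 30 days, the smaller figure is the stricter one to alert on.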
|
|
1347
|
+
|
|
1348
|
+
monitoring:
|
|
1349
|
+
strategy: >
|
|
1350
|
+
Three-signal observability: metrics (CloudWatch), structured logs
|
|
1351
|
+
(CloudWatch Logs Insights), and distributed traces (X-Ray). All Lambda
|
|
1352
|
+
functions emit structured JSON logs with correlation ID on every
|
|
1353
|
+
invocation. X-Ray active tracing on all Lambda functions. CloudWatch
|
|
1354
|
+
service metrics for DynamoDB and SQS. On-call via PagerDuty
|
|
1355
|
+
with escalation policy (P1: immediate, P2: 30-minute SLA).
|
|
1356
|
+
CloudWatch dashboards provisioned by CDK alongside each service stack.
|
|
1357
|
+
metrics:
|
|
1358
|
+
- Order submission rate (orders/minute) — EventBridge PutEvents count
|
|
1359
|
+
- Order submission P99 latency — API Gateway Latency P99
|
|
1360
|
+
- Payment authorisation success rate — Payment Service custom metric
|
|
1361
|
+
- Fulfilment DLQ depth — SQS ApproximateNumberOfMessagesVisible on DLQ
|
|
1362
|
+
- Lambda error rate (%) — Lambda Errors / Lambda Invocations per function
|
|
1363
|
+
- Lambda duration P99 — Lambda Duration P99 per function
|
|
1364
|
+
- Lambda concurrent executions — Lambda ConcurrentExecutions
|
|
1365
|
+
- DynamoDB read/write throttle events — DynamoDB ThrottledRequests
|
|
1366
|
+
- EventBridge throttled rules — EventBridge ThrottledRules
|
|
1367
|
+
- API Gateway 4xx and 5xx error rates
|
|
1368
|
+
- SQS message age (P99) — SQS ApproximateAgeOfOldestMessage
|
|
1369
|
+
dashboards: >
|
|
1370
|
+
CloudWatch dashboard templates provisioned by CDK in /infra/monitoring/.
|
|
1371
|
+
Dashboards: Order Platform Overview (SLO status), Service Health
|
|
1372
|
+
(per-Lambda metrics), Data Layer (DynamoDB + SQS), Security
|
|
1373
|
+
(WAF blocked requests, CloudTrail anomalies). Runbook links embedded
|
|
1374
|
+
in dashboard widgets.
|
|
1375
|
+
alerting_rules: >
|
|
1376
|
+
Alert configuration in /infra/monitoring/alerts.ts (CDK). Rules:
|
|
1377
|
+
P1 (immediate PagerDuty): availability < 99.9% (5-min window),
|
|
1378
|
+
DLQ depth > 10, payment success rate < 99%.
|
|
1379
|
+
P2 (30-min SLA): P99 latency > 400ms, Lambda error rate > 0.5%,
|
|
1380
|
+
DynamoDB throttle > 10 events/min.
|
|
1381
|
+
P3 (business hours): Lambda concurrent executions > 80% of limit,
|
|
1382
|
+
SQS message age > 5 minutes.
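The alerting rules translate naturally into a small classification function, which can be handy for unit-testing thresholds before they are encoded as CloudWatch alarms. This is a sketch under stated assumptions: the `MetricSnapshot` field names are hypothetical and are not taken from `/infra/monitoring/alerts.ts`; only the thresholds come from the rules listed.

```typescript
// Hypothetical snapshot shape; field names are illustrative, not from alerts.ts.
interface MetricSnapshot {
  availabilityPct: number;          // over a 5-minute window
  dlqDepth: number;
  paymentSuccessPct: number;
  p99LatencyMs: number;
  lambdaErrorRatePct: number;
  dynamoThrottlesPerMin: number;
  lambdaConcurrencyUtilPct: number; // percentage of the concurrency limit in use
  sqsMessageAgeSec: number;
}

type Severity = "P1" | "P2" | "P3" | "OK";

// Returns the highest severity whose rule matches, checking P1 first.
function classify(m: MetricSnapshot): Severity {
  if (m.availabilityPct < 99.9 || m.dlqDepth > 10 || m.paymentSuccessPct < 99) return "P1";
  if (m.p99LatencyMs > 400 || m.lambdaErrorRatePct > 0.5 || m.dynamoThrottlesPerMin > 10) return "P2";
  if (m.lambdaConcurrencyUtilPct > 80 || m.sqsMessageAgeSec > 300) return "P3";
  return "OK";
}
```

Evaluating P1 before P2 and P3 means a snapshot that breaches several rules pages at the most urgent level only.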
|
|
1383
|
+
|
|
1384
|
+
scaling:
|
|
1385
|
+
policies:
|
|
1386
|
+
- >
|
|
1387
|
+
Lambda: default concurrency 100 per function (soft limit). Reserved
|
|
1388
|
+
concurrency on Order Service: 500 (prevents throttling on burst).
|
|
1389
|
+
Provisioned concurrency on Order Service: 20 (eliminates cold starts
|
|
1390
|
+
on submission path). Lambda auto-scales to account concurrency limit
|
|
1391
|
+
(default quota 1000 per region; can be raised via Service Quotas).
|
|
1392
|
+
- >
|
|
1393
|
+
DynamoDB: on-demand capacity mode (no provisioning required).
|
|
1394
|
+
Auto-scales to required throughput. Monitor for throttle events
|
|
1395
|
+
in CloudWatch; if sustained throttle > 5 minutes, review access
|
|
1396
|
+
patterns for hot partition.
|
|
1397
|
+
- >
|
|
1398
|
+
SQS: scales transparently. WMS Adapter Lambda concurrency set to
|
|
1399
|
+
50 to match WMS API rate limit.
|
|
1400
|
+
- >
|
|
1401
|
+
API Gateway: regional endpoint; scales to 10K requests/second by
|
|
1402
|
+
default. Throttle limits: 1000 burst, 100 steady-state per stage.
|
|
1403
|
+
Increase by support request if DR-01 targets are approached.
|
|
1404
|
+
capacity_planning_notes: >
|
|
1405
|
+
Current production peak: ~200 orders/minute. Architecture is sized
|
|
1406
|
+
for 1000 orders/minute sustained without any provisioning changes.
|
|
1407
|
+
For 10× (10K orders/minute), Lambda concurrency limits and API Gateway
|
|
1408
|
+
throttle limits would require AWS support request. DynamoDB on-demand
|
|
1409
|
+
scales automatically. Review capacity quarterly using CloudWatch
|
|
1410
|
+
utilisation metrics.
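A rough way to sanity-check the concurrency settings against the throughput targets is Little's law: concurrent executions are approximately the arrival rate multiplied by the mean execution time. The sketch below is illustrative only; `requiredConcurrency` is a hypothetical helper, and the 200 ms mean duration is an assumed figure, since the source gives no Lambda duration.

```typescript
// Little's law: concurrent executions ~= arrival rate x mean duration.
// meanDurationMs is an assumed input; the source gives no Lambda duration.
function requiredConcurrency(ordersPerMinute: number, meanDurationMs: number): number {
  const perSecond = ordersPerMinute / 60;
  return Math.ceil(perSecond * (meanDurationMs / 1000));
}

const assumedDurationMs = 200; // hypothetical mean duration of the submission handler
console.log(requiredConcurrency(1000, assumedDurationMs));  // sustained target
console.log(requiredConcurrency(5000, assumedDurationMs));  // DR-01 burst target
console.log(requiredConcurrency(10000, assumedDurationMs)); // 10x scenario
```

Under that assumption the submission path alone stays within the provisioned concurrency of 20 at the burst target. The estimate covers a single function, so downstream event handlers triggered per order need the same calculation with their own measured durations.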
|
|
1411
|
+
|
|
1412
|
+
disaster_recovery:
|
|
1413
|
+
rto: 4 hours
|
|
1414
|
+
rpo: 1 hour
|
|
1415
|
+
failover_procedures: >
|
|
1416
|
+
1. Incident declared by on-call SRE via PagerDuty.
|
|
1417
|
+
2. SRE confirms eu-west-1 is unavailable (Route 53 health checks fail).
|
|
1418
|
+
3. SRE executes DR runbook: update Route 53 weighted routing to
|
|
1419
|
+
100% eu-central-1. Estimated time: 15 minutes.
|
|
1420
|
+
4. CDK stacks in eu-central-1 are pre-deployed (warm standby).
|
|
1421
|
+
Activate by running cdk deploy --context env=dr.
|
|
1422
|
+
5. DynamoDB Global Table in eu-central-1 is the new primary.
|
|
1423
|
+
Verify data freshness (CloudWatch replication lag metric).
|
|
1424
|
+
6. Smoke test: submit test order in eu-central-1. Verify 202 response
|
|
1425
|
+
and event flow.
|
|
1426
|
+
7. Notify stakeholders. Begin post-incident review.
|
|
1427
|
+
8. Recovery to eu-west-1: reverse Route 53 after region confirmed stable.
|
|
1428
|
+
Re-sync DynamoDB Global Table. Estimated total recovery time: 4 hours.
|
|
1429
|
+
runbook_ref: ops/runbooks/dr-failover-eu-central-1.md
|
|
1430
|
+
|
|
1431
|
+
implementation_artifacts:
|
|
1432
|
+
iac_templates:
|
|
1433
|
+
- name: Order Platform CDK Stack
|
|
1434
|
+
type: cdk
|
|
1435
|
+
location: infra/stacks/order-platform-stack.ts
|
|
1436
|
+
description: >
|
|
1437
|
+
Main CDK stack deploying all Lambda functions, DynamoDB tables,
|
|
1438
|
+
EventBridge bus, SQS queues, API Gateway, Cognito User Pool,
|
|
1439
|
+
CloudWatch dashboards, and alarms. Single cdk deploy command
|
|
1440
|
+
provisions the full stack.
|
|
1441
|
+
- name: Networking CDK Stack
|
|
1442
|
+
type: cdk
|
|
1443
|
+
location: infra/stacks/networking-stack.ts
|
|
1444
|
+
description: >
|
|
1445
|
+
VPC with three private subnets, NAT Gateways (one per AZ),
|
|
1446
|
+
VPC endpoints for DynamoDB, SQS, EventBridge, S3, KMS.
|
|
1447
|
+
- name: Security CDK Stack
|
|
1448
|
+
type: cdk
|
|
1449
|
+
location: infra/stacks/security-stack.ts
|
|
1450
|
+
description: >
|
|
1451
|
+
KMS CMKs (one per service), IAM roles (least-privilege per Lambda),
|
|
1452
|
+
WAF WebACL with OWASP CRS, CloudTrail configuration.
|
|
1453
|
+
- name: DR CDK Stack (eu-central-1)
|
|
1454
|
+
type: cdk
|
|
1455
|
+
location: infra/stacks/dr-stack.ts
|
|
1456
|
+
description: >
|
|
1457
|
+
Warm standby stack for eu-central-1. Deploys all Lambda functions
|
|
1458
|
+
in standby mode; DynamoDB Global Tables replica; Route 53
|
|
1459
|
+
failover configuration.
|
|
1460
|
+
api_specifications:
|
|
1461
|
+
- name: Order API (OpenAPI 3.1)
|
|
1462
|
+
spec_type: openapi
|
|
1463
|
+
location: api/order-api.openapi.yaml
|
|
1464
|
+
- name: Order Events (AsyncAPI 2.6)
|
|
1465
|
+
spec_type: asyncapi
|
|
1466
|
+
location: api/order-events.asyncapi.yaml
|
|
1467
|
+
cicd_templates:
|
|
1468
|
+
- name: Order Platform Pipeline
|
|
1469
|
+
platform: other
|
|
1470
|
+
location: pipeline/order-platform-pipeline.ts
|
|
1471
|
+
description: >
|
|
1472
|
+
AWS CodePipeline: Source (CodeCommit) → Build (CodeBuild, unit tests,
|
|
1473
|
+
CDK synth) → Deploy to staging → Integration tests (Postman/Newman)
|
|
1474
|
+
→ Load test (Gatling, fails if SLO targets missed) → Manual approval
|
|
1475
|
+
→ Deploy to production (blue/green via CodeDeploy Lambda alias shift).
|
|
1476
|
+
scaffold_template:
|
|
1477
|
+
name: New Order Service Scaffold
|
|
1478
|
+
type: other
|
|
1479
|
+
location: scaffold/new-service/
|
|
1480
|
+
description: >
|
|
1481
|
+
Cookiecutter template generating a new Lambda function project with:
|
|
1482
|
+
pre-configured Lambda Powertools (structured logging, X-Ray tracing,
|
|
1483
|
+
correlation ID middleware), CDK construct stub, OpenAPI spec stub,
|
|
1484
|
+
AsyncAPI event stub, unit test harness, and Postman collection.
|
|
1485
|
+
Run: cookiecutter scaffold/new-service/ to generate a new service.
|
|
1486
|
+
sample_application:
|
|
1487
|
+
name: Order Processing Sample Application
|
|
1488
|
+
location: sample-app/
|
|
1489
|
+
description: >
|
|
1490
|
+
Fully functional reference implementation of the complete order processing
|
|
1491
|
+
flow. Deployable to a sandbox AWS account in under 1 hour using the
|
|
1492
|
+
getting-started guide. Includes all six Lambda functions, CDK stacks,
|
|
1493
|
+
OpenAPI and AsyncAPI specs, and a React-based test UI.
|
|
1494
|
+
|
|
1495
|
+
getting_started:
|
|
1496
|
+
estimated_time_to_first_deployment: >
|
|
1497
|
+
4 hours for engineers with AWS CDK experience; 8 hours for engineers
|
|
1498
|
+
new to CDK. A sandbox deployment (sample-app/) is achievable in 1 hour
|
|
1499
|
+
using the scaffold template and pre-configured CDK stacks.
|
|
1500
|
+
prerequisites:
|
|
1501
|
+
- AWS account with AdministratorAccess (sandbox) or specific IAM role (see infra/iam/deployer-policy.json)
|
|
1502
|
+
- AWS CLI v2 configured with appropriate credentials
|
|
1503
|
+
- Node.js 20 and npm installed
|
|
1504
|
+
- AWS CDK v2 installed globally (npm install -g aws-cdk)
|
|
1505
|
+
- Docker Desktop (for CDK asset bundling)
|
|
1506
|
+
- Git access to the platform repository
|
|
1507
|
+
- Postman (for API testing, optional)
|
|
1508
|
+
steps:
|
|
1509
|
+
- step: 1
|
|
1510
|
+
title: Clone the repository and install dependencies
|
|
1511
|
+
description: Clone the platform repository and install Node.js dependencies.
|
|
1512
|
+
command: >
|
|
1513
|
+
git clone https://git.internal/platform/order-platform.git &&
|
|
1514
|
+
cd order-platform && npm ci
|
|
1515
|
+
- step: 2
|
|
1516
|
+
title: Bootstrap the CDK environment
|
|
1517
|
+
description: >
|
|
1518
|
+
Bootstrap CDK in your target AWS account and region. Only required
|
|
1519
|
+
once per account/region pair. Provisions S3 bucket and IAM roles
|
|
1520
|
+
for CDK deployment.
|
|
1521
|
+
command: >
|
|
1522
|
+
npx cdk bootstrap aws://ACCOUNT_ID/eu-west-1
|
|
1523
|
+
- step: 3
|
|
1524
|
+
title: Configure environment variables
|
|
1525
|
+
description: >
|
|
1526
|
+
Copy the example environment file and set your environment-specific
|
|
1527
|
+
values (account ID, VPC CIDR, Cognito domain prefix).
|
|
1528
|
+
command: cp infra/config/sandbox.env.example infra/config/sandbox.env
|
|
1529
|
+
- step: 4
|
|
1530
|
+
title: Deploy the sample application
|
|
1531
|
+
description: >
|
|
1532
|
+
Deploy all CDK stacks to your sandbox account. This provisions VPC,
|
|
1533
|
+
Lambda functions, DynamoDB, API Gateway, Cognito, EventBridge, SQS,
|
|
1534
|
+
CloudWatch dashboards, and alarms. Expect ~15 minutes.
|
|
1535
|
+
command: >
|
|
1536
|
+
npx cdk deploy --all --context env=sandbox --require-approval never
|
|
1537
|
+
- step: 5
|
|
1538
|
+
title: Smoke test the deployment
|
|
1539
|
+
description: >
|
|
1540
|
+
Run the provided Postman collection against the deployed API Gateway
|
|
1541
|
+
endpoint. Collection tests order submission, status query, and
|
|
1542
|
+
error responses.
|
|
1543
|
+
command: >
|
|
1544
|
+
newman run postman/order-platform-smoke-test.json
|
|
1545
|
+
--env-var baseUrl=$(aws cloudformation describe-stacks --stack-name OrderPlatformStack --query "Stacks[0].Outputs[?OutputKey=='ApiGatewayUrl'].OutputValue" --output text)
|
|
1546
|
+
- step: 6
|
|
1547
|
+
title: Review CloudWatch dashboard
|
|
1548
|
+
description: >
|
|
1549
|
+
Open the Order Platform Overview CloudWatch dashboard. Confirm all
|
|
1550
|
+
metrics are populating after the smoke test. Dashboard URL is output
|
|
1551
|
+
by cdk deploy.
|
|
1552
|
+
  troubleshooting:
    - symptom: cdk deploy fails with "Unable to resolve AWS account"
      cause: AWS CLI not configured or credentials expired
      resolution: Run "aws configure" or refresh credentials. Run "aws sts get-caller-identity" to verify.
    - symptom: Lambda function deployment fails with "Package exceeds 250MB limit"
      cause: node_modules not pruned; dev dependencies included
      resolution: Ensure CDK bundling uses --omit=dev. Check infra/stacks/order-platform-stack.ts bundling config.
    - symptom: Postman smoke test fails on POST /orders with 401
      cause: Cognito User Pool not yet propagated or test user not created
      resolution: Wait 2 minutes after deploy. Run "npm run create-test-user" to create a Cognito test user.
    - symptom: DLQ depth alarm fires after smoke test
      cause: WMS Adapter cannot reach WMS (no WMS in sandbox)
      resolution: Expected in sandbox. WMS Adapter is configured with a mock WMS stub in sandbox environment.

raid:
  risks:
    - id: RISK-01
      description: >
        DynamoDB hot partition: if a single order ID prefix dominates writes
        (e.g. all orders from a single promotion), a hot partition could
        cause write throttling, degrading P99 latency above 500ms target.
      likelihood: low
      impact: high
      mitigation: >
        DynamoDB partition key uses UUID v4 (random, uniformly distributed).
        CloudWatch alarm on ThrottledRequests metric. Access pattern review
        required before any bulk import or promotional campaign.
      owner: Platform SRE
      residual_risk: low

    - id: RISK-02
      description: >
        Lambda cold-start latency spike: during a burst after a quiet period
        (e.g. Monday morning), Lambda may experience cold starts on the
        Order Service submission path, causing P99 > 500ms.
      likelihood: medium
      impact: medium
      mitigation: >
        Provisioned concurrency of 20 on Order Service Lambda, pre-warming
        the function. CloudWatch Lambda cold-start metric monitored.
        Provisioned concurrency level reviewed quarterly (cost vs performance).
      owner: Platform SRE
      residual_risk: low

    - id: RISK-03
      description: >
        Stripe payment gateway outage: Stripe is an external dependency.
        If Stripe is unavailable, PaymentFailed events will be published
        and orders will not be authorised, degrading fulfilment.
      likelihood: low
      impact: high
      mitigation: >
        Retry logic with exponential backoff (3 attempts, 1s/2s/4s) in
        Payment Service. SQS-backed retry queue for failed payment events.
        Stripe SLA 99.99% — incidents < 1 per year historically.
        Runbook for Stripe incident: ops/runbooks/stripe-incident.md.
        Customer notification sent on PaymentFailed to reduce support load.
      owner: Order Platform Engineering
      residual_risk: medium

    - id: RISK-04
      description: >
        Schema evolution breaking change: a service publishes an updated
        event schema removing a field that a subscriber depends on,
        causing subscriber Lambda errors and DLQ accumulation.
      likelihood: medium
      impact: medium
      mitigation: >
        EventBridge Schema Registry with backward compatibility check in CI
        (fails build if breaking change detected). Schema versioning:
        major version bump required for breaking changes; new event type
        published in parallel until all subscribers migrated.
      owner: Enterprise Architecture
      residual_risk: low

    - id: RISK-05
      description: >
        PCI-DSS scope creep: a developer inadvertently stores card data
        in DynamoDB or logs, widening PCI-DSS scope and triggering a
        compliance remediation.
      likelihood: low
      impact: high
      mitigation: >
        No card data accepted by the Order API — only payment tokens.
        API Gateway request validator rejects payloads containing card
        number patterns (WAF custom rule). Developer training on PCI-DSS
        scope in onboarding. Quarterly code review by Security Architecture.
      owner: Information Security
      residual_risk: low

    - id: RISK-06
      description: >
        WMS integration failure accumulation: WMS system maintenance or
        API changes cause WMS Adapter Lambda failures to accumulate in
        DLQ, delaying fulfilment for many orders.
      likelihood: medium
      impact: medium
      mitigation: >
        DLQ depth alarm (P1 if > 10 messages). Ops runbook for DLQ
        reprocessing after WMS recovery. WMS Adapter uses circuit breaker
        pattern — if WMS fails > 50% of calls in 1 minute, stops calling
        WMS and sends alert to WMS team.
      owner: Platform SRE
      residual_risk: medium
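
The circuit breaker rule in the RISK-06 mitigation (open when more than 50% of WMS calls fail within 1 minute) can be illustrated with a minimal sketch. This is not the WMS Adapter's actual code: the class `CircuitBreaker`, its methods, and the injectable `clock` are hypothetical names chosen for illustration, and recovery behaviour (how the breaker closes again) is left out because the artifact does not specify it.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class CircuitBreaker:
    """Hypothetical sliding-window breaker: open when the failure ratio
    over the last `window_s` seconds exceeds `threshold`."""

    def __init__(self, threshold: float = 0.5, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self.clock = clock  # injectable so the window can be tested without sleeping
        self.calls: Deque[Tuple[float, bool]] = deque()  # (timestamp, succeeded)

    def record(self, succeeded: bool) -> None:
        """Record the outcome of one downstream call."""
        self.calls.append((self.clock(), succeeded))

    def is_open(self) -> bool:
        """Return True when calls to the downstream system should stop."""
        cutoff = self.clock() - self.window_s
        # Drop outcomes that have fallen out of the time window.
        while self.calls and self.calls[0][0] < cutoff:
            self.calls.popleft()
        if not self.calls:
            return False
        failures = sum(1 for _, ok in self.calls if not ok)
        return failures / len(self.calls) > self.threshold
```

A caller would check `is_open()` before each WMS request and raise the alert to the WMS team when it first returns True; both of those steps are outside this sketch.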

  assumptions:
    - assumption: >
        EventBridge event delivery is at-least-once (AWS guarantee);
        consumers are designed to be idempotent.
      consequence_if_violated: >
        Duplicate order processing possible. All consumers use DynamoDB
        conditional writes to reject already-processed events.
    - assumption: >
        DynamoDB on-demand capacity can absorb 10× sustained throughput
        increase without pre-warming.
      consequence_if_violated: >
        DynamoDB may throttle on sudden 10× spike. Mitigation: monitor
        throttle events; switch to provisioned capacity with auto-scaling
        if on-demand throughput ramp-up is too slow.
    - assumption: >
        The WMS API is stable and does not change its request format
        without advance notice to the Platform team.
      consequence_if_violated: >
        WMS Adapter Lambda will fail and DLQ will accumulate. Runbook:
        ops/runbooks/wms-api-change.md.
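
The first assumption relies on consumers rejecting already-processed events through conditional writes. The sketch below shows that idea with an in-memory stand-in rather than a real DynamoDB table; `ProcessedEvents`, `put_if_absent`, and `handle_event` are hypothetical names, and the event shape (`id`, `detail`) is an assumption. Against DynamoDB, the equivalent of `put_if_absent` would be a put with an `attribute_not_exists` condition on the key.

```python
from typing import Dict, List


class ProcessedEvents:
    """In-memory stand-in for a table written with an 'only if absent' condition."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}

    def put_if_absent(self, event_id: str, status: str) -> bool:
        """Write the record and return True, or return False if it already exists."""
        if event_id in self._seen:
            return False
        self._seen[event_id] = status
        return True


def handle_event(store: ProcessedEvents, event: dict, side_effects: List[str]) -> bool:
    """Apply the event's effect once; a duplicate delivery is skipped."""
    if not store.put_if_absent(event["id"], "processed"):
        return False  # already processed, so at-least-once redelivery is harmless
    side_effects.append(event["detail"])
    return True
```

Delivering the same event twice leaves exactly one entry in `side_effects`, which is the property the assumption depends on.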

  constraints:
    - type: regulatory
      description: PCI-DSS Level 1 compliance is mandatory for the payment authorisation flow.
      implication: >
        No card data may be stored or logged anywhere in the platform.
        All data at rest must be encrypted with KMS CMKs. All API traffic
        must use TLS 1.2+. Annual QSA audit required.
    - type: organisational
      description: AWS is the mandated cloud provider; no multi-cloud or on-premises compute.
      implication: >
        All services must use AWS managed services. No Kubernetes, no
        on-premises databases, no third-party compute.
    - type: technical
      description: >
        Maximum Lambda package size is 250MB unzipped. Maximum Lambda
        execution time is 15 minutes.
      implication: >
        Lambda functions must have lean dependency trees. Any processing
        requiring > 15 minutes must be decomposed into smaller functions
        chained via EventBridge or SQS.
    - type: organisational
      description: Team size is 4 engineers; no dedicated DBA or platform operator.
      implication: >
        Architecture must minimise operational overhead. Fully managed
        services preferred. No self-managed databases or container clusters.
    - type: financial
      description: "Monthly AWS spend budget for order platform: $8,000/month at steady state."
      implication: >
        Lambda (pay-per-use) and DynamoDB (on-demand) chosen to minimise
        cost at current volumes. Cost review quarterly via AWS Cost Explorer.

  tradeoffs:
    - decision_ref: ADR-001
      description: Event choreography vs synchronous REST
      what_was_sacrificed: Immediate consistency and simple linear call tracing
      what_was_gained: >
        Service decoupling, independent deployability (DR-04), elastic scale (DR-01),
        immutable audit trail (DR-05)
    - decision_ref: ADR-002
      description: DynamoDB vs RDS PostgreSQL
      what_was_sacrificed: >
        SQL query flexibility; complex ad-hoc queries not possible
        in the order service
      what_was_gained: >
        Sub-5ms latency at any scale; zero ops overhead; serverless
        cost model (DR-01, DR-03)
    - decision_ref: ADR-003
      description: Lambda vs ECS Fargate
      what_was_sacrificed: >
        Always-on latency consistency; container model familiarity
      what_was_gained: >
        Zero operational overhead; automatic scale to 0; pay-per-use
        cost model; no cluster management
    - decision_ref: ADR-004
      description: EventBridge vs SNS
      what_was_sacrificed: Lower per-event cost (SNS is cheaper at very high volume)
      what_was_gained: >
        Schema registry enforcement; content-based routing; Archive
        and Replay (DR-05); no custom routing logic in application code
    - decision_ref: ADR-005
      description: DynamoDB single-table vs multi-table
      what_was_sacrificed: >
        Simpler mental model; easier developer onboarding for DynamoDB newcomers
      what_was_gained: >
        Optimal query performance for all defined access patterns;
        single-digit ms reads via GSI; fewer tables to manage

governance:
  applicable_standards:
    - id: PCI-DSS
      name: Payment Card Industry Data Security Standard Level 1
      relevance: >
        The payment authorisation flow processes payment tokens and interacts
        with Stripe. PCI-DSS Level 1 applies to all systems that transmit
        cardholder data or interact with payment processing systems.
    - id: ISO-27001
      name: ISO/IEC 27001:2022 Information Security Management
      relevance: >
        Enterprise security standard. Applies to all production systems.
        Controls implemented via IAM least-privilege, KMS encryption,
        CloudTrail audit logging, and WAF.
    - id: GDPR
      name: General Data Protection Regulation (EU) 2016/679
      relevance: >
        Customer order data includes PII (name, address, email). GDPR
        applies to all processing of EU customer data. Data residency
        in eu-west-1 (Ireland) satisfies Chapter V transfers.
    - id: ESS-01
      name: Enterprise Security Standard 01 — Encryption in Transit
      relevance: Requires TLS 1.2+ for all API traffic. Satisfied by API Gateway.
    - id: ESS-03
      name: Enterprise Security Standard 03 — Encryption at Rest
      relevance: >
        Requires KMS CMK encryption for all data stores containing
        customer or payment data.
  compliance_mapping:
    - control: "PCI-DSS Requirement 1: Network security controls"
      design_element: >
        CloudFront + WAF (OWASP CRS v3.2); VPC with private subnets;
        no direct internet access to Lambda or DynamoDB; VPC endpoints
        for all AWS service traffic.
      evidence: Architecture Views — Deployment; Architecture Views — Security
      owner: Information Security

    - control: "PCI-DSS Requirement 2: Secure configurations"
      design_element: >
        All infrastructure defined in CDK (PRIN-04); no manual console
        changes; CloudFormation drift detection nightly; Lambda functions
        have no inbound network access except via API Gateway.
      evidence: Implementation Artifacts — IaC templates; Architecture Decisions ADR-003
      owner: Platform SRE

    - control: "PCI-DSS Requirement 3: Protect stored account data"
      design_element: >
        No card data stored anywhere in the platform. Order API accepts
        payment tokens only (Stripe tokenisation). WAF custom rule rejects
        payloads containing card number patterns. DynamoDB stores order
        reference and Stripe charge ID only.
      evidence: Architecture Views — Security; RAID Constraints (PCI-DSS); ADR-002
      owner: Information Security

    - control: "PCI-DSS Requirement 4: Protect cardholder data in transit"
      design_element: >
        TLS 1.2+ enforced by API Gateway (minimum TLS policy). Stripe
        integration uses TLS 1.2+. All internal AWS SDK calls use HTTPS
        via VPC endpoints.
      evidence: Architecture Views — Security (Trust Zone 1–3)
      owner: Information Security

    - control: "PCI-DSS Requirement 6: Develop and maintain secure systems"
      design_element: >
        AWS Security Hub PCI-DSS standard enabled; HIGH/CRITICAL findings
        block CodePipeline deployment. Quarterly penetration testing.
        OWASP CRS on WAF addresses OWASP Top 10.
      evidence: Quality Attributes — Security; Implementation Artifacts — CI/CD
      owner: Information Security

    - control: "PCI-DSS Requirement 7: Restrict access to system components"
      design_element: >
        Each Lambda function has a least-privilege IAM role permitting
        only its specific resource actions. Cognito JWT authorisation on
        all API endpoints. No shared service accounts.
      evidence: Architecture Views — Security (Trust Zone 3); Component Classification
      owner: Information Security

    - control: "PCI-DSS Requirement 8: Identify users and authenticate access"
      design_element: >
        Cognito User Pool with MFA for operator access. JWT tokens with
        1-hour expiry. Customer authentication via Cognito. No shared
        credentials.
      evidence: Architecture Views — Security (Trust Zone 2); element catalog (Cognito)
      owner: Information Security

    - control: "PCI-DSS Requirement 10: Log and monitor all access"
      design_element: >
        CloudTrail (management and data events) with write-once S3 Object
        Lock. Structured JSON logs from all Lambda functions with correlation
        IDs. X-Ray distributed tracing. CloudWatch alarms for anomalous
        activity.
      evidence: Architecture Views — Security (Trust Zone 5); Operational Model — Monitoring
      owner: Platform SRE

    - control: "PCI-DSS Requirement 12: Support information security with policies"
      design_element: >
        Architecture Board governance process. Annual QSA audit. Quarterly
        internal penetration testing. Exception register maintained (this document).
      evidence: Governance and Compliance (this section); Decisions and Actions
      owner: Risk and Compliance

    - control: "GDPR Article 5: Data minimisation"
      design_element: >
        Order data schema includes only fields necessary for order processing.
        No free-text fields that could contain unexpected PII. Schema
        enforced by API Gateway request validator and EventBridge schema.
      evidence: API specification (api/order-api.openapi.yaml)
      owner: Risk and Compliance

    - control: "GDPR Article 25: Data protection by design"
      design_element: >
        Customer data encrypted at rest (KMS CMK) and in transit (TLS 1.2+).
        Data retention is limited: orders are deleted after 7 years per tax law
        (S3 lifecycle policy). Access to order data restricted by IAM.
      evidence: Architecture Views — Security; RAID Constraints
      owner: Risk and Compliance

    - control: "ESS-01: Encryption in transit"
      design_element: >
        API Gateway minimum TLS 1.2 policy. VPC endpoints for internal
        traffic (no public internet for data plane). Stripe TLS 1.2+.
      evidence: Architecture Views — Security
      owner: Information Security

    - control: "ESS-03: Encryption at rest"
      design_element: >
        DynamoDB KMS CMK. SQS KMS CMK. S3 KMS CMK. EventBridge does not
        store event data at rest beyond delivery. CloudTrail KMS CMK.
      evidence: Architecture Views — Security (Trust Zone 4); element catalog
      owner: Information Security
  exceptions: []
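
The PCI-DSS Requirement 3 mapping mentions rejecting payloads that contain card number patterns. The artifact does not show the actual WAF rule, so the following is only a sketch of one common way to detect candidate card numbers: a digit-run pattern followed by a Luhn checksum. The names `PAN_CANDIDATE`, `luhn_valid`, and `contains_card_number` are hypothetical, and the 13 to 19 digit range is an assumption.

```python
import re

# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
PAN_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def contains_card_number(payload: str) -> bool:
    """Return True if the payload contains a digit run that looks like a card number."""
    for match in PAN_CANDIDATE.finditer(payload):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            return True
    return False
```

The Luhn step reduces false positives from long numeric identifiers, at the cost of missing malformed numbers; whether that trade-off suits the platform is a decision for Information Security.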

decisions_and_actions:
  governance_outcome: approved
  decision_statement: >
    This reference architecture is approved as the mandatory pattern for
    all event-driven order processing microservices in the e-commerce domain
    on AWS. All new order processing services must conform to this reference
    architecture or obtain an Architecture Board exception. Effective Q2 2026.
  conditions: []
  next_actions:
    - description: Publish architecture to internal developer portal (Backstage)
      owner: Enterprise Architecture
      target_date: "2026-04-05"
    - description: >
        Migrate Order Service v1 (monolith) to conform with this reference
        architecture (initial extraction of Order domain)
      owner: Order Platform Engineering
      target_date: "2026-06-30"
    - description: >
        Complete annual DR exercise to validate RTO 4h / RPO 1h targets
      owner: Platform SRE
      target_date: "2026-07-31"
    - description: >
        Conduct first annual PCI-DSS QSA audit against this architecture
      owner: Risk and Compliance
      target_date: "2026-09-30"
    - description: >
        Review and update this reference architecture at 6-month review
        (next_review_date: 2026-09-20)
      owner: Enterprise Architecture
      target_date: "2026-09-20"

evolution:
  version: "1.0.0"
  known_limitations:
    - >
      GraphQL API not yet supported — add as extension point in v1.1
      if product teams require GraphQL for mobile clients.
    - >
      Multi-region active-active not yet supported — active-passive DR
      only. Active-active requires conflict resolution strategy for
      DynamoDB Global Tables write conflicts.
    - >
      Observability does not yet include real-time business metrics
      (order conversion rate, cart abandonment). Add in v1.1 via
      EventBridge → Kinesis → real-time dashboard.
  roadmap:
    - version: "1.1.0"
      planned_date: "2026-09-20"
      planned_changes:
        - Add GraphQL API extension point (API Gateway + AppSync variant)
        - Add real-time business metrics (Kinesis + QuickSight)
        - Add fraud detection service integration (platform service adapter)
        - Update getting-started guide for Backstage Software Template
    - version: "2.0.0"
      planned_date: "2027-Q1"
      planned_changes:
        - Evaluate active-active multi-region (DynamoDB Global Tables v2)
        - Evaluate Lambda SnapStart for Java runtime variant
        - Incorporate AWS Bedrock for order anomaly detection
  deprecation_strategy: >
    When a major version (2.0.0) is published, v1.x will remain supported
    for 12 months. Adopting teams will receive migration guidance
    (migration/v1-to-v2.md) and 6 months advance notice. After 12 months,
    v1.x is deprecated and services must be migrated.
  feedback_channel: >
    Submit issues and feedback via the platform GitHub repository
    (Issues labelled "reference-architecture"). Architecture Board reviews
    feedback at quarterly governance meeting. SREs report operational
    issues via PagerDuty post-incident reviews.

glossary:
  - term: API Gateway
    definition: >
      AWS managed service providing REST API endpoint, TLS termination,
      Cognito JWT authorisation, and request validation.
  - term: CDK (Cloud Development Kit)
    definition: >
      AWS Cloud Development Kit v2 — TypeScript-based IaC framework that
      synthesises to CloudFormation templates.
  - term: Choreography
    definition: >
      Event-driven coordination pattern where services react to events
      published by other services without a central orchestrator.
  - term: Cognito
    definition: >
      AWS managed identity service providing user pools (authentication),
      JWT issuance, and MFA.
  - term: Correlation ID
    definition: >
      UUID v4 generated at order submission and threaded through all
      subsequent processing steps, logs, events, and database records
      for end-to-end tracing.
  - term: DLQ (Dead Letter Queue)
    definition: >
      SQS queue that receives messages failing processing after the
      maximum receive count (3 attempts). Used for error isolation
      and manual reprocessing.
  - term: DynamoDB
    definition: >
      AWS managed NoSQL key-value and document database with single-digit
      millisecond latency at any scale.
  - term: EventBridge
    definition: >
      AWS managed serverless event bus with content-based routing rules
      and schema registry.
  - term: Fitness Function
    definition: >
      Automated test or check in CI/CD that continuously validates a
      quality attribute target.
  - term: Golden Path
    definition: >
      Spotify-coined term for a well-lit, well-supported, easy-to-follow
      implementation path that reduces decision fatigue for development teams.
  - term: GSI (Global Secondary Index)
    definition: >
      DynamoDB index on non-primary-key attributes, enabling efficient
      queries by alternate access patterns.
  - term: IAM (Identity and Access Management)
    definition: >
      AWS service managing permissions. Each Lambda function has a
      least-privilege IAM role.
  - term: Idempotent
    definition: >
      An operation that produces the same result regardless of how many
      times it is executed with the same input. Required for all
      EventBridge subscribers to handle at-least-once delivery.
  - term: KMS (Key Management Service)
    definition: >
      AWS managed encryption key service. Customer-managed keys (CMKs)
      provide full key lifecycle control.
  - term: Lambda
    definition: >
      AWS serverless function-as-a-service compute. Executes code in
      response to events (API Gateway, EventBridge, SQS) with automatic
      scaling and pay-per-invocation billing.
  - term: PCI-DSS
    definition: >
      Payment Card Industry Data Security Standard. Level 1 applies to
      merchants processing > 6 million card transactions per year.
  - term: Provisioned Concurrency
    definition: >
      Lambda feature that pre-initialises a specified number of function
      instances, eliminating cold-start latency.
  - term: RPO (Recovery Point Objective)
    definition: "Maximum acceptable data loss measured in time (target: 1 hour)."
  - term: RTO (Recovery Time Objective)
    definition: "Maximum acceptable time to restore service after an incident (target: 4 hours)."
  - term: RULERS
    definition: >
      EAROS evidence-anchoring protocol: for each criterion, extract a
      direct quote or reference from the artifact before assigning a score.
  - term: Schema Registry
    definition: >
      EventBridge feature that stores and validates event schemas,
      ensuring publishers and subscribers agree on event structure.
  - term: Single-table design
    definition: >
      DynamoDB design pattern where multiple entity types coexist in
      one table, encoded via composite PK/SK patterns.
  - term: SLO (Service Level Objective)
    definition: >
      Internal target for service reliability (availability, latency).
      Distinct from SLA (contractual commitment with customers).
  - term: VPC (Virtual Private Cloud)
    definition: >
      AWS isolated network environment. Lambda functions in private
      VPC subnets have no direct internet access.
  - term: WAF (Web Application Firewall)
    definition: >
      AWS managed firewall enforcing OWASP Core Rule Set v3.2 on all
      inbound API traffic.
  - term: WMS (Warehouse Management System)
    definition: >
      External fulfilment system consuming order dispatch instructions
      from the SQS Fulfilment Queue via the WMS Adapter Lambda.
  - term: X-Ray
    definition: >
      AWS distributed tracing service. Active tracing on all Lambda
      functions provides end-to-end request traces with correlation IDs.