@precepts/standards 0.1.0

Files changed (34)
  1. package/LICENSE +30 -0
  2. package/README.md +115 -0
  3. package/package.json +40 -0
  4. package/schema/document-standard-template.md +139 -0
  5. package/schema/standards.schema.json +154 -0
  6. package/standards/integration/governance/_category_.json +1 -0
  7. package/standards/integration/governance/integration-styles.md +56 -0
  8. package/standards/integration/index.md +9 -0
  9. package/standards/integration/standards/_category_.json +1 -0
  10. package/standards/integration/standards/api/_category_.json +1 -0
  11. package/standards/integration/standards/api/error-handling.md +250 -0
  12. package/standards/integration/standards/api/resource-design.md +286 -0
  13. package/standards/integration/standards/data-formats/_category_.json +1 -0
  14. package/standards/integration/standards/data-formats/character-encoding.md +206 -0
  15. package/standards/integration/standards/data-formats/date-format.md +102 -0
  16. package/standards/integration/standards/data-formats/datetime-formats.md +265 -0
  17. package/standards/integration/standards/data-formats/monetary-format.md +61 -0
  18. package/standards/integration/standards/events/_category_.json +1 -0
  19. package/standards/integration/standards/events/event-envelope.md +270 -0
  20. package/standards/integration/standards/foundational/_category_.json +1 -0
  21. package/standards/integration/standards/foundational/naming-conventions.md +334 -0
  22. package/standards/integration/standards/observability/_category_.json +1 -0
  23. package/standards/integration/standards/observability/integration-observability.md +226 -0
  24. package/standards/integration/standards/resilience/_category_.json +1 -0
  25. package/standards/integration/standards/resilience/integration-resilience-patterns.md +291 -0
  26. package/standards/integration/standards/resilience/retry-policy.md +268 -0
  27. package/standards/integration/standards/resilience/timeout.md +269 -0
  28. package/standards/integration/standards/versioning/_category_.json +1 -0
  29. package/standards/integration/standards/versioning/backward-forward-compatibility.md +230 -0
  30. package/standards/product/Guidelines/_category_.json +1 -0
  31. package/standards/product/Guidelines/requirement-document.md +54 -0
  32. package/standards/product/index.md +9 -0
  33. package/standards/project-management/index.md +9 -0
  34. package/standards/ux/index.md +9 -0
@@ -0,0 +1,226 @@
1
+ ---
2
+ identifier: "INTG-STD-029"
3
+ name: "Integration Observability"
4
+ version: "1.0.0"
5
+ status: "MANDATORY"
6
+
7
+ domain: "INTEGRATION"
8
+ documentType: "standard"
9
+ category: "observability"
10
+ appliesTo: ["api", "events", "a2a", "files", "mcp", "webhooks", "grpc", "graphql", "batch", "streaming"]
11
+
12
+ lastUpdated: "2026-03-28"
13
+ owner: "Integration Architecture Board"
14
+
15
+ standardsCompliance:
16
+   iso: []
17
+   rfc: []
18
+   w3c: ["Trace-Context"]
19
+   other: ["OpenTelemetry-Specification", "OWASP-Logging-Cheat-Sheet"]
20
+
21
+ taxonomy:
22
+   capability: "observability"
23
+   subCapability: "logging-tracing-metrics"
24
+   layer: "infrastructure"
25
+
26
+ enforcement:
27
+   method: "hybrid"
28
+   validationRules:
29
+     traceContextHeader: "traceparent"
30
+     logFormat: "JSON"
31
+     requiredLogFields: ["timestamp", "level", "trace_id", "span_id", "service", "message"]
32
+   rejectionCriteria:
33
+     - "Missing traceparent header propagation"
34
+     - "Unstructured (non-JSON) log output"
35
+     - "PII or secrets in log entries"
36
+     - "Missing correlation ID in API responses"
37
+
38
+ dependsOn: ["INTG-STD-004"]
39
+ supersedes: ""
40
+ ---
41
+
42
+ # Observability
43
+
44
+ ## Purpose
45
+
46
+ Every integration action - API call, event, file transfer, or agent handshake - **MUST** be traceable from origin to destination. This standard establishes mandatory requirements for distributed tracing (W3C Trace Context), structured logging (JSON), and metrics collection (OpenTelemetry) across all integration touchpoints. It also codifies what **MUST NOT** appear in logs to prevent data leakage per the OWASP Logging Cheat Sheet.
47
+
48
+ ## Rules
49
+
50
+ ### R-1: W3C Trace Context Propagation
51
+
52
+ All integration endpoints **MUST** propagate the `traceparent` HTTP header per the W3C Trace Context specification:
53
+
54
+ ```
55
+ traceparent: {version}-{trace-id}-{parent-id}-{trace-flags}
56
+ Example: 00-0af7651916cd43dd8448eb211c80319c-b9c7c989f97918e1-01
57
+ ```
58
+
59
+ Services **MUST NOT** generate all-zero `trace-id` or `parent-id` values. If a request arrives without `traceparent`, the receiving service **MUST** generate a new trace context.
60
+
61
+ Services **SHOULD** propagate the `tracestate` header alongside `traceparent` and **MUST NOT** modify or strip `tracestate` entries they do not own.
62
+
63
+ For non-HTTP transports, trace context **MUST** be propagated via the protocol's native metadata mechanism (gRPC metadata keys, Kafka message headers, AMQP application properties, batch file manifest metadata).
64
+
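The propagation rules above can be sketched as follows. This is a minimal illustration using only the Python standard library, not a prescribed implementation: the function names are hypothetical, only version `00` is accepted, a malformed header is treated like a missing one, and marking a newly started trace as sampled (`01`) is an assumption. In practice an OpenTelemetry SDK propagator would normally do this work.

```python
import re
import secrets

# Same shape as the traceparent validation pattern in this standard (version 00 only).
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(value):
    """Return (trace_id, parent_id, flags), or None if the header is unusable."""
    match = TRACEPARENT_RE.fullmatch(value.strip()) if value else None
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero identifiers are invalid and must not be propagated.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def _random_hex(num_bytes):
    """Generate a random hex identifier that is never all zeros."""
    value = secrets.token_hex(num_bytes)
    while set(value) == {"0"}:
        value = secrets.token_hex(num_bytes)
    return value


def outgoing_traceparent(incoming):
    """Continue the incoming trace, or start a new one when it is missing or invalid."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        trace_id, flags = _random_hex(16), "01"  # assumption: new traces are sampled
    else:
        trace_id, _, flags = parsed
    # Each hop gets its own span id, which becomes the parent-id downstream.
    return f"00-{trace_id}-{_random_hex(8)}-{flags}"
```

A caller would pass the inbound header value (or `None`) to `outgoing_traceparent` and attach the result to every downstream request; any `tracestate` header would be forwarded unchanged alongside it.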
65
+ ### R-2: Correlation IDs
66
+
67
+ All API responses **MUST** include an `X-Request-ID` header containing a UUID v4 or ULID. If the incoming request includes `X-Request-ID`, the service **MUST** echo that value; otherwise it **MUST** generate one.
68
+
69
+ The `X-Request-ID` **MUST** be included in all log entries for that request via the `request_id` field. The `X-Request-ID` is distinct from `trace-id` - it is a business-level identifier that **MAY** be shared with API consumers for support purposes.
70
+
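As a rough sketch of the echo-or-generate rule, assuming request headers are available as a plain dictionary (the function names here are illustrative, not part of the standard):

```python
import uuid


def resolve_request_id(request_headers):
    """Echo the incoming X-Request-ID if present, otherwise generate a UUID v4."""
    for name, value in request_headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-request-id" and value.strip():
            return value.strip()
    return str(uuid.uuid4())


def response_headers_and_log_context(request_headers):
    """Return the response header and the log field that carry the same identifier."""
    request_id = resolve_request_id(request_headers)
    return {"X-Request-ID": request_id}, {"request_id": request_id}
```

The second return value is meant to be merged into every log entry written while handling the request, so that the `request_id` field matches the header returned to the consumer.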
71
+ ### R-3: Structured Logging Format
72
+
73
+ All integration components **MUST** emit logs in JSON format. Unstructured log output **MUST NOT** be used beyond local development.
74
+
75
+ **Required fields:**
76
+
77
+ | Field | Type | Description | Example |
78
+ | ----- | ---- | ----------- | ------- |
79
+ | `timestamp` | string | ISO 8601 with UTC (`Z`), microsecond precision | `"2026-03-28T14:32:01.482319Z"` |
80
+ | `level` | string | Log severity (uppercase) | `"INFO"` |
81
+ | `trace_id` | string | W3C trace identifier (32 hex chars) | `"0af7651916cd43dd8448eb211c80319c"` |
82
+ | `span_id` | string | Current span identifier (16 hex chars) | `"b9c7c989f97918e1"` |
83
+ | `service` | string | Service name (lowercase, hyphenated) | `"order-service"` |
84
+ | `message` | string | Human-readable event description | `"Payment authorization completed"` |
85
+
86
+ Optional fields **SHOULD** be included when applicable: `request_id`, `environment`, `version`, `operation`, `duration_ms`, `http.method`, `http.status_code`, `http.url` (sensitive parameters redacted), `error.type`, `error.message`.
87
+
88
+ The `service` field **MUST** match the OpenTelemetry `service.name` resource attribute.
89
+
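One possible way to produce the required fields with Python's standard `logging` module is sketched below. The class name `JsonLogFormatter` is hypothetical, and the sketch assumes that `trace_id`, `span_id`, and `request_id` are supplied per record through the `extra` argument; a real service would more likely pull them from the active OpenTelemetry span.

```python
import json
import logging
from datetime import datetime, timezone

# Python's level names differ from the names used in this standard.
_LEVEL_NAMES = {"WARNING": "WARN", "CRITICAL": "FATAL"}


class JsonLogFormatter(logging.Formatter):
    """Render each log record as a single JSON object with the required fields."""

    def __init__(self, service):
        super().__init__()
        self.service = service  # should match the OpenTelemetry service.name

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            # %f yields six digits, i.e. microsecond precision.
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%S.%f") + "Z",
            "level": _LEVEL_NAMES.get(record.levelname, record.levelname),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "service": self.service,
            "message": record.getMessage(),
        }
        request_id = getattr(record, "request_id", None)
        if request_id is not None:
            entry["request_id"] = request_id
        return json.dumps(entry)
```

To use it, attach the formatter to a handler (`handler.setFormatter(JsonLogFormatter("order-service"))`) and log with `logger.info("Order created", extra={"trace_id": ..., "span_id": ...})`.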
90
+ ### R-4: Prohibited Log Content
91
+
92
+ Log entries **MUST NOT** contain:
93
+
94
+ | Category | Examples |
95
+ | -------- | -------- |
96
+ | PII | Names, emails, phone numbers, government IDs, dates of birth |
97
+ | Authentication credentials | Passwords, API keys, bearer tokens, JWTs, session IDs |
98
+ | Cryptographic material | Private keys, certificates, encryption keys |
99
+ | Financial data | Full card numbers, bank account numbers, CVV codes |
100
+ | Health data | Medical records, diagnoses, treatment information |
101
+ | Full request/response bodies | Use truncated or summarized representations instead |
102
+
103
+ If debugging requires logging sensitive data, it **MUST** be masked before writing (e.g., `"d***@example.com"`, `"****-****-****-4242"`, `"sk-prod-****"`, or log schema shape only: `"body_keys: [\"name\", \"address\"]"`).
104
+
105
+ Log entries **MUST** be sanitized against log injection - newlines, control characters, and ANSI escape sequences **MUST** be escaped or stripped from user-supplied values (ref: CWE-117). Stack traces **SHOULD** only appear at `ERROR` or `FATAL` level and **MUST** be reviewed for leaked secrets.
106
+
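The masking and sanitization requirements could look roughly like this in Python. The helper names and the exact escaping format (`\xNN`) are assumptions made for illustration; the standard only requires that such characters be escaped or stripped.

```python
import re

_ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def sanitize_log_value(value):
    """Strip ANSI escape sequences, then escape any remaining control characters."""
    text = _ANSI_ESCAPE.sub("", str(value))
    return _CONTROL_CHARS.sub(lambda m: "\\x{:02x}".format(ord(m.group())), text)


def mask_email(email):
    """Keep only the first character of the local part, plus the domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return "***"
    return f"{local[0]}***@{domain}"


def mask_card_number(card_number):
    """Keep only the last four digits of a card number."""
    digits = re.sub(r"\D", "", card_number)
    return "****-****-****-" + digits[-4:] if len(digits) >= 4 else "****"
```

With these helpers, a user-supplied value containing a newline can no longer start a forged log line, and masked values match the formats shown in the rule above.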
107
+ ### R-5: Log Levels
108
+
109
+ Services **MUST** use the following log levels consistently:
110
+
111
+ | Level | When to Use |
112
+ | ----- | ----------- |
113
+ | `FATAL` | Unrecoverable failure; service cannot continue |
114
+ | `ERROR` | Operation failed; requires attention but service continues |
115
+ | `WARN` | Unexpected condition that does not prevent operation |
116
+ | `INFO` | Normal operational events worth recording |
117
+ | `DEBUG` | Diagnostic detail for troubleshooting |
118
+ | `TRACE` | Protocol-level verbosity |
119
+
120
+ Production environments **MUST** default to `INFO`. `DEBUG` and `TRACE` **MUST** be activatable at runtime without redeployment. `ERROR` **MUST** be reserved for conditions requiring investigation - client 4xx errors **SHOULD** be logged at `WARN`, not `ERROR`.
121
+
122
+ ### R-6: Metrics
123
+
124
+ All custom integration metrics **MUST** follow OpenTelemetry semantic naming conventions: dot-separated namespaces, lowercase, no units in names.
125
+
126
+ **Required metrics for every integration endpoint:**
127
+
128
+ | Metric Name | Type | Unit | Description |
129
+ | ----------- | ---- | ---- | ----------- |
130
+ | `integration.request.duration` | Histogram | `s` | Request-to-response time |
131
+ | `integration.request.count` | Counter | `{request}` | Total requests |
132
+ | `integration.request.error.count` | Counter | `{request}` | Failed requests |
133
+ | `integration.request.active` | UpDownCounter | `{request}` | In-flight requests |
134
+
135
+ **Additional metrics for event-driven integrations:**
136
+
137
+ | Metric Name | Type | Unit | Description |
138
+ | ----------- | ---- | ---- | ----------- |
139
+ | `integration.event.publish.count` | Counter | `{event}` | Events published |
140
+ | `integration.event.consume.count` | Counter | `{event}` | Events consumed |
141
+ | `integration.event.consume.duration` | Histogram | `s` | Event processing time |
142
+ | `integration.event.consume.lag` | Gauge | `{event}` | Consumer lag |
143
+ | `integration.event.dlq.count` | Counter | `{event}` | Dead-letter queue events |
144
+
145
+ All metrics **MUST** include resource attributes `service.name`, `service.version`, and `deployment.environment.name`. Common attributes **MUST** include `integration.type`, `integration.target`, `network.protocol.name`, and `error.type` where applicable.
146
+
147
+ ### R-7: Span Attributes
148
+
149
+ All integration operations **MUST** be instrumented as OpenTelemetry spans. Span names **MUST** follow protocol conventions (e.g., `GET /api/v2/orders` for HTTP, `orders.created publish` for messaging).
150
+
151
+ HTTP spans **MUST** include: `http.request.method`, `http.response.status_code`, `url.path`, `server.address`. Messaging spans **MUST** include: `messaging.system`, `messaging.destination.name`, `messaging.operation.type`.
152
+
153
+ Spans **MUST** set appropriate `SpanKind`: `SERVER`/`CLIENT` for HTTP/gRPC, `PRODUCER`/`CONSUMER` for messaging, `INTERNAL` for local processing. Span status **MUST** be set to `ERROR` on failure; HTTP 5xx **MUST** set span error status, 4xx **SHOULD NOT**.
154
+
155
+ ### R-8: Audit Traceability
156
+
157
+ Every state-changing integration operation **MUST** produce an `INFO` log entry including at minimum: `trace_id`, `span_id`, `request_id`, `operation`, `service`, and outcome. It **MUST** be possible to reconstruct the complete execution path of any transaction using `trace_id` across all participating services. Audit-relevant entries **MUST** be retained per the organization's data retention policy.
158
+
159
+ ## Examples
160
+
161
+ ### Structured Log Entry
162
+
163
+ ```json
164
+ {
165
+ "timestamp": "2026-03-28T14:32:01.482319Z",
166
+ "level": "INFO",
167
+ "trace_id": "0af7651916cd43dd8448eb211c80319c",
168
+ "span_id": "b9c7c989f97918e1",
169
+ "service": "order-service",
170
+ "request_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
171
+ "operation": "createOrder",
172
+ "http.method": "POST",
173
+ "http.status_code": 201,
174
+ "duration_ms": 142.7,
175
+ "message": "Order created successfully"
176
+ }
177
+ ```
178
+
179
+ ### Trace Context Headers
180
+
181
+ ```http
182
+ GET /api/v2/orders/12345 HTTP/1.1
183
+ Host: order-service.internal
184
+ traceparent: 00-0af7651916cd43dd8448eb211c80319c-b9c7c989f97918e1-01
185
+ tracestate: vendorA=eyJhbGciOiJIUzI,vendorB=abc123
186
+ X-Request-ID: f47ac10b-58cc-4372-a567-0e02b2c3d479
187
+ ```
188
+
189
+ ## Enforcement Rules
190
+
191
+ - **Gateway enforcement**: API gateways **MUST** generate `traceparent` and `X-Request-ID` for incoming external requests that lack them. Internal service-to-service requests missing `traceparent` **SHOULD** be flagged.
192
+ - **Build-time enforcement**: CI/CD pipelines **MUST** validate that all log output is valid JSON with required fields (`timestamp`, `level`, `trace_id`, `span_id`, `service`, `message`), ISO 8601 timestamps, and valid log levels. Non-JSON log output **MUST** fail validation.
193
+ - **Runtime enforcement**: Log aggregation systems **SHOULD** reject or quarantine entries missing required fields.
194
+ - **Security enforcement**: Log pipelines **SHOULD** include automated PII/credential pattern detection. Matches **MUST** trigger security team alerts. Repeated violations **MAY** result in deployment blocks.
195
+ - **Correlation ID check**: API gateways or integration test suites **MUST** verify all responses include `X-Request-ID`.
196
+
197
+ **Validation patterns:**
198
+
199
+ - traceparent: `^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$`
200
+ - X-Request-ID (UUID v4): `^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$`
201
+ - X-Request-ID (ULID): `^[0-9A-HJKMNP-TV-Z]{26}$`
202
+ - Timestamp (ISO 8601 UTC): `^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$`
203
+
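A build-time check based on the patterns above might be sketched as follows. The regular expressions are copied from this section; the function names and the wording of the violation messages are hypothetical.

```python
import json
import re

TRACEPARENT_RE = re.compile(r"^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")
UUID_V4_RE = re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$")
ULID_RE = re.compile(r"^[0-9A-HJKMNP-TV-Z]{26}$")
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$")

REQUIRED_LOG_FIELDS = ("timestamp", "level", "trace_id", "span_id", "service", "message")
LOG_LEVELS = {"FATAL", "ERROR", "WARN", "INFO", "DEBUG", "TRACE"}


def is_valid_request_id(value):
    """Accept either a UUID v4 or a ULID."""
    return bool(UUID_V4_RE.fullmatch(value) or ULID_RE.fullmatch(value))


def validate_log_line(line):
    """Return a list of violations for one log line; an empty list means it passed."""
    try:
        entry = json.loads(line)
    except ValueError:
        return ["not valid JSON"]
    if not isinstance(entry, dict):
        return ["not a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_LOG_FIELDS if name not in entry]
    timestamp = entry.get("timestamp")
    if timestamp is not None and not TIMESTAMP_RE.fullmatch(str(timestamp)):
        problems.append("timestamp is not ISO 8601 UTC")
    level = entry.get("level")
    if level is not None and str(level) not in LOG_LEVELS:
        problems.append("unknown log level")
    return problems
```

A CI step could run `validate_log_line` over captured log output from the test suite and fail the build when any line returns a non-empty list.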
204
+ ## References
205
+
206
+ - [W3C Trace Context Specification](https://www.w3.org/TR/trace-context/)
207
+ - [OpenTelemetry Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/)
208
+ - [OWASP Logging Cheat Sheet](https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html)
209
+
210
+ ## Rationale
211
+
212
+ **W3C Trace Context over proprietary headers** - Vendor-neutral W3C Recommendation supported by all major observability platforms, preventing lock-in and ensuring partner interoperability.
213
+
214
+ **JSON structured logging** - Machine-parseable without custom grammars, natively supported by all major log aggregation platforms, and enables field-level indexing for correlation across services.
215
+
216
+ **Separate X-Request-ID from trace-id** - The trace-id is an internal tracing concern that may be regenerated at trust boundaries; X-Request-ID is a business-facing identifier consumers can reference in support tickets.
217
+
218
+ **Prohibit PII in logs** - Logs are typically accessible to a wider audience than production databases, making aggregated log stores high-value targets (ref: CWE-532). Prevention is far more effective than post-hoc redaction.
219
+
220
+ **OpenTelemetry naming conventions** - CNCF-backed industry standard ensuring metrics and spans from different teams, languages, and frameworks are consistent and correlatable without manual mapping.
221
+
222
+ ## Version History
223
+
224
+ | Version | Date | Change |
225
+ | ------- | ---------- | ------------------ |
226
+ | 1.0.0 | 2026-03-28 | Initial definition |
@@ -0,0 +1 @@
1
+ {"label": "Resilience", "position": 5}
@@ -0,0 +1,291 @@
1
+ ---
2
+ identifier: "INTG-STD-033"
3
+ name: "Integration Resilience Patterns"
4
+ version: "1.0.0"
5
+ status: "MANDATORY"
6
+
7
+ domain: "INTEGRATION"
8
+ documentType: "standard"
9
+ category: "reliability"
10
+ appliesTo: ["api", "events", "a2a", "mcp", "webhooks", "grpc", "graphql", "batch"]
11
+
12
+ lastUpdated: "2026-03-28"
13
+ owner: "Integration Architecture Board"
14
+
15
+ standardsCompliance:
16
+   iso: []
17
+   rfc: []
18
+   w3c: []
19
+   other: ["Microsoft-Resilience-Patterns", "Resilience4j", "Release-It-Nygard"]
20
+
21
+ taxonomy:
22
+   capability: "reliability"
23
+   subCapability: "resilience-patterns"
24
+   layer: "infrastructure"
25
+
26
+ enforcement:
27
+   method: "review-based"
28
+   reviewChecklist:
29
+     - "Circuit breaker configured for all external dependencies"
30
+     - "Bulkhead isolation between independent dependency pools"
31
+     - "Fallback strategy defined for critical paths"
32
+     - "Health check endpoints implemented"
33
+     - "Resilience metrics exposed to monitoring"
34
+
35
+ dependsOn: ["INTG-STD-029", "INTG-STD-034", "INTG-STD-035"]
36
+ supersedes: ""
37
+ ---
38
+
39
+ # Resilience Patterns
40
+
41
+ ## Purpose
42
+
43
+ This standard defines **MANDATORY** resilience patterns for all integration points to ensure graceful degradation, automatic recovery, and full observability under failure. It covers circuit breaking, bulkhead isolation, fallback strategies, health checking, load shedding, pattern composition, and observability.
44
+
45
+ This standard works with INTG-STD-034 (Retry Policies) and INTG-STD-035 (Timeout Standards) to form a complete reliability framework. Retry and timeout behaviors are governed by those companion standards; this document governs the structural patterns that contain, isolate, and recover from failures.
46
+
47
+ ## Rules
48
+
49
+ ### R-1: Circuit Breaker Pattern
50
+
51
+ Every outbound integration call to an external dependency **MUST** be protected by a circuit breaker implementing three states:
52
+
53
+ ```
+            failure threshold
+                 reached
+   [CLOSED] -----------------> [OPEN]
+      ^                          |
+      | all probes      timeout  |
+      | succeed         expires  v
+      +---------- [HALF-OPEN] ---+
+                       |
+           probe fails -> back to OPEN
+ ```
64
+
65
+ **State definitions:**
66
+
67
+ - **CLOSED** - Normal operation. Requests pass through. When the failure rate or slow-call rate in the sliding window exceeds its threshold, the breaker transitions to OPEN.
68
+ - **OPEN** - Fail-fast mode. All requests rejected immediately. After the wait duration, transitions to HALF-OPEN.
69
+ - **HALF-OPEN** - A limited number of trial requests are permitted. If all succeed, transitions to CLOSED. If any fail, transitions back to OPEN.
70
+
71
+ **Configuration parameters** - all circuit breakers **MUST** be configurable with:
72
+
73
+ | Parameter | Description | Required Default |
74
+ |---|---|---|
75
+ | `failureRateThreshold` | Percentage of failures that triggers OPEN state | 50% |
76
+ | `slowCallRateThreshold` | Percentage of slow calls that triggers OPEN state | 80% |
77
+ | `slowCallDurationThreshold` | Duration above which a call is considered slow | Per INTG-STD-035 |
78
+ | `slidingWindowType` | COUNT_BASED or TIME_BASED | COUNT_BASED |
79
+ | `slidingWindowSize` | Number of calls (count) or seconds (time) in the window | 100 calls or 60s |
80
+ | `minimumNumberOfCalls` | Minimum calls before failure rate is calculated | 20 |
81
+ | `waitDurationInOpenState` | Time in OPEN before transitioning to HALF-OPEN | 30s |
82
+ | `permittedNumberOfCallsInHalfOpen` | Trial calls allowed in HALF-OPEN state | 5 |
83
+ | `automaticTransitionEnabled` | Whether to auto-transition from OPEN to HALF-OPEN | true |
84
+
85
+ Teams **MAY** override defaults per dependency based on documented SLA characteristics, but **MUST NOT** set `failureRateThreshold` below 25% or `waitDurationInOpenState` below 5 seconds to prevent flapping.
86
+
87
+ **Failure classification:**
88
+
89
+ - HTTP 5xx and 429 responses **MUST** be counted as failures
90
+ - Connection timeouts and read timeouts **MUST** be counted as failures
91
+ - HTTP 4xx responses (except 429) **MUST NOT** be counted as failures (client errors, not dependency failures)
92
+ - Responses exceeding `slowCallDurationThreshold` **MUST** be counted as slow calls
93
+
94
+ **Security constraints:**
95
+
96
+ - Circuit breakers **MUST NOT** cache or replay authentication tokens when probing in HALF-OPEN state. Each probe **MUST** carry fresh credentials.
97
+ - Circuit breaker state **MUST NOT** be externally modifiable without administrative authorization. Manual override endpoints **MUST** require RBAC permissions and **MUST** log every override with actor identity.
98
+
99
+ **Observability:**
100
+
101
+ - Every state transition **MUST** be logged at WARN level with: timestamp, dependency name, previous state, new state, failure rate, and sliding window statistics.
102
+ - Services **MUST** expose metrics: `circuit_breaker_state` (gauge), `circuit_breaker_failure_rate` (gauge), `circuit_breaker_calls_total` (counter by outcome), `circuit_breaker_state_transitions_total` (counter by from/to state).
103
+ - Teams **MUST** alert on: any transition to OPEN, OPEN state lasting longer than 5 minutes, and flapping (>3 transitions in 5 minutes).
104
+
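The three-state behavior described in R-1 can be sketched in Python as shown below. This is a simplified illustration rather than a reference implementation: it is count-based only, does not track slow calls, treats every raised exception as a failure (callers would need to exclude non-429 4xx responses themselves), omits logging and metrics, and is not thread-safe. Class and parameter names are hypothetical, loosely mirroring the configuration table.

```python
import time
from collections import deque

CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_rate_threshold=50.0, sliding_window_size=100,
                 minimum_number_of_calls=20, wait_duration_in_open_state=30.0,
                 permitted_calls_in_half_open=5, clock=time.monotonic):
        self.failure_rate_threshold = failure_rate_threshold
        self.minimum_number_of_calls = minimum_number_of_calls
        self.wait_duration = wait_duration_in_open_state
        self.permitted_calls_in_half_open = permitted_calls_in_half_open
        self.clock = clock
        self.state = CLOSED
        self.window = deque(maxlen=sliding_window_size)  # True marks a failed call
        self.opened_at = 0.0
        self.probes_started = 0
        self.probes_succeeded = 0

    def _open(self):
        self.state = OPEN
        self.opened_at = self.clock()

    def call(self, func, *args, **kwargs):
        if self.state == OPEN:
            if self.clock() - self.opened_at < self.wait_duration:
                raise CircuitOpenError("circuit is OPEN")  # fail fast
            self.state = HALF_OPEN  # wait duration expired: allow probes
            self.probes_started = self.probes_succeeded = 0
        if self.state == HALF_OPEN:
            if self.probes_started >= self.permitted_calls_in_half_open:
                raise CircuitOpenError("probe limit reached")
            self.probes_started += 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record(failed=True)
            raise
        self._record(failed=False)
        return result

    def _record(self, failed):
        if self.state == HALF_OPEN:
            if failed:
                self._open()  # any failed probe reopens the circuit
            else:
                self.probes_succeeded += 1
                if self.probes_succeeded >= self.permitted_calls_in_half_open:
                    self.state = CLOSED  # all probes succeeded
                    self.window.clear()
            return
        self.window.append(failed)
        if len(self.window) >= self.minimum_number_of_calls:
            failure_rate = 100.0 * sum(self.window) / len(self.window)
            if failure_rate > self.failure_rate_threshold:
                self._open()
```

Callers wrap each outbound call as `breaker.call(fetch_order, order_id)` and treat `CircuitOpenError` as non-retryable, routing to the fallback handler instead.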
105
+ ### R-2: Bulkhead Isolation
106
+
107
+ Every integration point **MUST** be isolated using bulkhead patterns so that resource exhaustion in one dependency does not starve others. Teams **MUST** implement at least one strategy per dependency:
108
+
109
+ - **Thread pool isolation** - dedicated thread pool per dependency. **SHOULD** be used when the dependency has unpredictable latency or full isolation is required.
110
+ - **Semaphore isolation** - counting semaphore limiting concurrent calls. **SHOULD** be used for predictable-latency dependencies or single-threaded/async runtimes.
111
+ - **Connection pool isolation** - separate connection pools per dependency. **MUST** be used for all database and persistent-connection dependencies regardless of other strategies.
112
+
113
+ **Sizing** - bulkhead sizes **MUST** be calculated from measured dependency characteristics:
114
+
115
+ ```
116
+ maxConcurrent = (peakRPS * p99LatencySeconds) * safetyFactor(1.5-2.0)
117
+ ```
118
+
119
+ Example: Payment API at 200 RPS, 150ms p99 - maxConcurrent = (200 * 0.15) * 1.5 = 45 slots. Teams **MUST** review sizes quarterly or after significant traffic changes.
120
+
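The sizing formula can be expressed as a small helper. The function name is hypothetical, and rounding up to a whole number of slots is an assumption, since the standard does not say how fractional results should be handled.

```python
import math


def bulkhead_size(peak_rps, p99_latency_seconds, safety_factor=1.5):
    """Estimate maxConcurrent from measured peak load and p99 latency."""
    if not 1.5 <= safety_factor <= 2.0:
        raise ValueError("safety factor is expected to be between 1.5 and 2.0")
    # Expected number of in-flight calls at peak, using p99 latency as a conservative bound.
    in_flight = peak_rps * p99_latency_seconds
    # Round before taking the ceiling so float noise does not add a spurious slot.
    return math.ceil(round(in_flight * safety_factor, 9))
```

For the Payment API example, `bulkhead_size(200, 0.150)` gives 45 slots, and the same inputs with a safety factor of 2.0 give 60.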
121
+ **Rejection handling** - when a bulkhead rejects a request, the system **MUST**: return immediately (fail-fast), count the rejection as a circuit breaker signal, log at WARN level, route to the fallback handler, and increment `bulkhead_rejections_total`.
122
+
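A semaphore-based bulkhead with fail-fast rejection might be sketched like this. The class name and the in-memory `rejections_total` counter are stand-ins for illustration; a real service would increment the `bulkhead_rejections_total` metric and would also feed the rejection into its circuit breaker, which this sketch leaves out.

```python
import logging
import threading

logger = logging.getLogger("bulkhead")


class SemaphoreBulkhead:
    """Limit concurrent calls to one dependency and reject instead of queueing."""

    def __init__(self, dependency, max_concurrent):
        self.dependency = dependency
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self.rejections_total = 0  # stands in for the bulkhead_rejections_total metric

    def call(self, func, fallback):
        # Non-blocking acquire: return immediately when no slot is free.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejections_total += 1
            logger.warning("bulkhead rejected call to %s", self.dependency)
            return fallback()
        try:
            return func()
        finally:
            self._slots.release()
```

Each dependency gets its own instance, so exhausting the slots for one dependency leaves the others unaffected.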
123
+ ### R-3: Fallback Strategies
124
+
125
+ Every integration point on a critical user-facing path **MUST** define a fallback strategy. Background processes **SHOULD** define fallbacks where feasible. A fallback is invoked when the circuit breaker is OPEN, the bulkhead rejects, retries are exhausted, or the timeout is exceeded.
126
+
127
+ Teams **MUST** select and document one or more strategies per dependency:
128
+
129
+ | Strategy | Description | When to Use |
130
+ |---|---|---|
131
+ | **Static Default** | Predefined hardcoded response | When a reasonable default exists |
132
+ | **Cache Fallback** | Last known good response from cache | When stale data is acceptable for a bounded period |
133
+ | **Graceful Degradation** | Reduced functionality, service stays operational | When partial results beat no results |
134
+ | **Alternative Service** | Route to a backup service | When a redundant provider exists |
135
+ | **Queued Retry** | Accept and process asynchronously later | When eventual consistency is acceptable |
136
+ | **Fail with Context** | Structured error with degradation info | When the caller must know and adapt |
137
+
138
+ **Security constraints** - fallback strategies **MUST NOT** bypass authentication or authorization. Cache fallbacks **MUST** be keyed to include the caller's authorization context (tenant, role, scope). If a fallback returns stale data, the response **MUST** include staleness metadata (e.g., `X-Fallback-Active: true` header and a `data-age` field).
139
+
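To illustrate the cache-keying and staleness requirements, here is a minimal in-memory sketch. The class name, the response shape, and the choice to express `data-age` in whole seconds are assumptions; the standard only requires that the authorization context be part of the key and that staleness metadata be returned.

```python
import time


class FallbackCache:
    """Last-known-good cache keyed by resource and caller authorization context."""

    def __init__(self, max_age_seconds, clock=time.time):
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}

    def store(self, resource, tenant, role, scope, value):
        # The key includes tenant, role, and scope so entries never cross callers.
        self._entries[(resource, tenant, role, scope)] = (value, self._clock())

    def fallback(self, resource, tenant, role, scope):
        """Return a stale response with staleness metadata, or None if unavailable."""
        entry = self._entries.get((resource, tenant, role, scope))
        if entry is None:
            return None
        value, stored_at = entry
        age = self._clock() - stored_at
        if age > self._max_age:
            return None  # too stale to serve
        return {
            "headers": {"X-Fallback-Active": "true"},
            "body": {"data": value, "data-age": int(age)},
        }
```

A caller from a different tenant, role, or scope gets `None` even when a cached value exists for the same resource, which is the property the security constraint is after.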
140
+ ### R-4: Health Check Patterns
141
+
142
+ Every service exposing integration endpoints **MUST** implement health check endpoints. Every service consuming external dependencies **MUST** perform dependency health checks. Services **MUST** implement at least two levels:
143
+
144
+ **Shallow health check (liveness)** - `GET /health/live`
145
+ - Verifies the process is running and can accept requests
146
+ - **MUST NOT** call external dependencies
147
+ - **MUST** respond within 100ms
148
+ - **MUST** return HTTP 200 if alive, 503 if not
149
+
150
+ **Deep health check (readiness)** - `GET /health/ready`
151
+ - Verifies all critical dependencies are operational
152
+ - **MUST** check connectivity to each critical dependency
153
+ - **MUST** respect a 5-second total timeout for all checks combined
154
+ - **MUST** return HTTP 200 if ready, 503 if any critical dependency is unhealthy
155
+ - **MUST** return structured health status in the response body
156
+
157
+ Circuit breakers **MAY** use dedicated health check endpoints to probe recovery in HALF-OPEN state. Health check probes **MUST NOT** be counted in circuit breaker failure statistics. Health check endpoints **MUST NOT** expose sensitive information. Deep health checks **SHOULD** require authentication when they expose dependency topology.
158
+
159
+ ### R-5: Load Shedding
160
+
161
+ Services on critical paths **MUST** implement load shedding to maintain quality for high-priority traffic under pressure.
162
+
163
+ All inbound requests **MUST** be classifiable into at least three priority tiers:
164
+
165
+ | Priority | Shedding Behavior |
166
+ |---|---|
167
+ | **CRITICAL** (revenue/safety-impacting) | Shed last - only under catastrophic load |
168
+ | **NORMAL** (standard business operations) | Shed when utilization exceeds 85% |
169
+ | **LOW** (deferrable operations) | Shed first when utilization exceeds 70% |
170
+
171
+ Shedding decisions **MUST** be based on measurable signals: bulkhead utilization, queue depth, p99 latency relative to SLA, CPU/memory utilization, or upstream circuit breaker states.
172
+
173
+ When shedding, the service **MUST**: return HTTP 503 with a `Retry-After` header, include a structured body indicating load shedding, log at INFO level (shedding is a designed behavior), and increment `load_shedding_total{priority="<tier>"}`.
174
+
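The tiered shedding decision could be sketched as below. The 70% and 85% thresholds come from the table above; the 98% threshold for CRITICAL traffic, the response body field names, and the default `Retry-After` value are assumptions, since the standard does not fix them.

```python
# Utilization above which each tier is shed (0.0 to 1.0).
SHED_THRESHOLDS = {
    "LOW": 0.70,
    "NORMAL": 0.85,
    "CRITICAL": 0.98,  # assumption: "catastrophic load" is not quantified in the standard
}


def admit(priority, utilization, retry_after_seconds=5):
    """Return None to admit the request, or a 503 response description to shed it."""
    if utilization <= SHED_THRESHOLDS[priority]:
        return None
    return {
        "status": 503,
        "headers": {"Retry-After": str(retry_after_seconds)},
        "body": {"error": "load_shedding", "priority": priority},
    }
```

The `utilization` input would come from one of the measurable signals listed above, such as bulkhead utilization or queue depth, and a shed request would also be logged at INFO and counted in `load_shedding_total`.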
175
+ ### R-6: Pattern Composition
176
+
177
+ Resilience patterns **MUST** be composed in the following order (outermost to innermost):
178
+
179
+ ```
180
+ Load Shedder -> Bulkhead -> Circuit Breaker -> Retry (INTG-STD-034) -> Timeout (INTG-STD-035) -> Call
181
+ ```
182
+
183
+ This means:
184
+ - Retry wraps the timeout-bounded call. A single retry attempt **MUST NOT** exceed the per-call timeout.
185
+ - Circuit breaker wraps retry. If the circuit is OPEN, retries do not execute.
186
+ - Bulkhead wraps circuit breaker. A rejected bulkhead request does not consume circuit breaker capacity.
187
+ - Load shedder wraps bulkhead. Shed requests never reach the resource pool.
188
+
189
+ **Timeout budget coordination** - the total timeout **MUST** satisfy:
190
+
191
+ ```
192
+ totalOperationTimeout >= retryAttempts * perCallTimeout + retryDelayBudget
193
+
194
+ Example: perCallTimeout=2s, retryAttempts=3 (total attempts), backoff=[0.5s,1.0s] -> 3*2s + 1.5s = 7.5s -> set totalTimeout=8s
195
+ ```
196
+
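The budget inequality above lends itself to an automated consistency check. In this sketch the function names are hypothetical, and `retry_attempts` is assumed to count total attempts (the first call plus retries), which is how the worked example reads.

```python
def minimum_total_timeout(per_call_timeout, retry_attempts, backoff_delays):
    """Smallest total operation timeout that lets every attempt run to completion."""
    # retry_attempts counts all attempts; backoff_delays holds the waits between them.
    return retry_attempts * per_call_timeout + sum(backoff_delays)


def budget_is_consistent(total_timeout, per_call_timeout, retry_attempts, backoff_delays):
    """Check totalOperationTimeout >= retryAttempts * perCallTimeout + retryDelayBudget."""
    required = minimum_total_timeout(per_call_timeout, retry_attempts, backoff_delays)
    return total_timeout >= required
```

With the example values (2 s per call, 3 attempts, backoff of 0.5 s and 1.0 s), the minimum is 7.5 s, so a total timeout of 8 s passes and 7 s does not.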
197
+ If the circuit breaker transitions to OPEN during a retry sequence, remaining retries **MUST** be abandoned immediately. Circuit breaker rejections are non-retryable.
198
+
199
+ **Anti-patterns that MUST be avoided:**
200
+
201
+ | Anti-Pattern | Correct Approach |
202
+ |---|---|
203
+ | Retry outside circuit breaker without coordination | Retry inside circuit breaker; CB rejection is non-retryable |
204
+ | Timeout longer than CB wait duration | Per-call timeout **MUST** be shorter than `waitDurationInOpenState` |
205
+ | Bulkhead inside circuit breaker | Bulkhead outside circuit breaker |
206
+ | Retry on circuit-breaker-rejected calls | Treat CB rejection as non-retryable |
207
+ | Per-call timeout exceeding total operation timeout | `perCallTimeout < totalOperationTimeout / retryAttempts` |
208
+
209
+ ### R-7: Observability
210
+
211
+ Every service **MUST** expose a resilience dashboard covering: circuit breaker state, bulkhead utilization, fallback activation rate, load shedding rate by tier, and health check status for all dependencies.
212
+
213
+ **Required metrics** (Prometheus, OpenTelemetry, or equivalent):
214
+
215
+ | Metric | Type | Labels |
216
+ |---|---|---|
217
+ | `circuit_breaker_state` | Gauge | `dependency` |
218
+ | `circuit_breaker_failure_rate` | Gauge | `dependency` |
219
+ | `circuit_breaker_calls_total` | Counter | `dependency`, `outcome` |
220
+ | `circuit_breaker_state_transitions_total` | Counter | `dependency`, `from`, `to` |
221
+ | `bulkhead_available_concurrent_calls` | Gauge | `dependency` |
222
+ | `bulkhead_max_concurrent_calls` | Gauge | `dependency` |
223
+ | `bulkhead_rejections_total` | Counter | `dependency` |
224
+ | `fallback_activations_total` | Counter | `dependency`, `strategy` |
225
+ | `load_shedding_total` | Counter | `priority` |
226
+ | `health_check_status` | Gauge | `dependency`, `level` |
227
+ | `health_check_duration_seconds` | Histogram | `dependency`, `level` |
228
+
229
+ **Structured logging** - all resilience events **MUST** be logged as structured JSON with: `timestamp` (ISO-8601), `level`, `dependency`, `pattern`, `event`, and `correlationId`.
230
+
231
+ **Alerting** - teams **MUST** configure alerts for:
232
+
233
+ | Condition | Severity |
234
+ |---|---|
235
+ | Circuit breaker transitions to OPEN | Warning - ack within 15 min |
236
+ | Circuit breaker OPEN > 5 minutes | High - ack within 5 min |
237
+ | Circuit breaker flapping (>3 transitions in 5 min) | High - ack within 5 min |
238
+ | Bulkhead utilization > 90% sustained 2 min | Warning - ack within 15 min |
239
+ | Fallback activation rate > 10% over 5 min | Warning - ack within 15 min |
240
+ | Load shedding CRITICAL tier requests | Critical - ack within 2 min |
241
+ | Deep health check failing > 2 min | High - ack within 5 min |
242
+
243
+ ## Examples
244
+
245
+ ### Circuit breaker composition ordering
246
+
247
+ ```
248
+ Inbound Request
249
+   -> Load Shedder (reject low-priority if overloaded)
250
+     -> Bulkhead (limit concurrency per dependency)
251
+       -> Circuit Breaker (fail-fast if dependency down)
252
+         -> Retry (recover from transient failures, per INTG-STD-034)
253
+           -> Timeout (bound call duration, per INTG-STD-035)
254
+             -> External Call
255
+ ```
256
+
257
+ If the circuit is OPEN, the request skips retry and timeout and goes directly to the fallback handler.
258
+
259
+ ## Enforcement Rules
260
+
261
+ - Every service exposing or consuming integration points **MUST** implement these resilience patterns before production deployment.
262
+ - Architecture reviews **MUST** verify implementation against the following checklist: circuit breaker configured per dependency, bulkhead isolation with measured sizing, fallback strategy documented and authZ-compliant, liveness/readiness endpoints implemented, patterns composed in correct order, timeout budgets consistent, all metrics exposed and alerts configured, and RBAC on circuit breaker overrides.
263
+ - Services without circuit breakers for external dependencies **MUST NOT** be approved for production.
264
+ - Fallback strategies **MUST** be tested via resilience testing (chaos engineering, dependency failure injection).
265
+ - Non-compliance discovered post-deployment **MUST** be remediated within one sprint or documented with an accepted risk exception signed by the service owner and integration architecture lead.
266
+ - Where tooling permits, CI/CD **SHOULD** validate: resilience config parsing, timeout budget consistency, HTTP client references to circuit breaker/bulkhead config, and health check endpoint definitions.
267
+
268
+ ## References
269
+
270
+ - [Microsoft Azure - Circuit Breaker Pattern](https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker)
271
+ - [Resilience4j - CircuitBreaker](https://resilience4j.readme.io/docs/circuitbreaker)
272
+ - [Release It! Second Edition](https://pragprog.com/titles/mnee2/release-it-second-edition/) - Michael Nygard's stability patterns
273
+ - [AWS - Advanced Multi-AZ Resilience Patterns](https://docs.aws.amazon.com/whitepapers/latest/advanced-multi-az-resilience-patterns/pattern-1-health-check-circuit-breaker.html)
274
+
275
+ ## Rationale
276
+
277
+ **Why these specific patterns?** The six patterns represent the minimum viable resilience set validated by over a decade of production experience at Netflix, Amazon, Google, and Microsoft, codified in Nygard's "Release It!" and implemented in Hystrix, Resilience4j, and Polly.
278
+
279
+ **Why mandate composition order?** Incorrect composition is a common, subtle failure source - e.g., retry outside circuit breaker causes retries to fight the breaker, delaying fail-fast and wasting resources.
280
+
281
+ **Why include security constraints?** Resilience patterns introduce alternative code paths that can inadvertently bypass security controls - cache fallbacks can leak data across authorization boundaries, and HALF-OPEN probes can replay stale tokens.
282
+
283
+ **Why detailed observability?** Without mandatory metrics and logging, teams cannot distinguish "breaker correctly protecting from failure" from "breaker incorrectly blocking all traffic due to misconfigured threshold."
284
+
285
+ **Why not mandate specific libraries?** This standard specifies behavior and configuration, not implementations. Resilience4j, Polly, and custom implementations all satisfy these requirements without limiting technology choice.
286
+
287
+ ## Version History
288
+
289
+ | Version | Date | Change |
290
+ | ------- | ---------- | ------------------ |
291
+ | 1.0.0 | 2026-03-28 | Initial definition |