@precepts/standards 0.1.0
- package/LICENSE +30 -0
- package/README.md +115 -0
- package/package.json +40 -0
- package/schema/document-standard-template.md +139 -0
- package/schema/standards.schema.json +154 -0
- package/standards/integration/governance/_category_.json +1 -0
- package/standards/integration/governance/integration-styles.md +56 -0
- package/standards/integration/index.md +9 -0
- package/standards/integration/standards/_category_.json +1 -0
- package/standards/integration/standards/api/_category_.json +1 -0
- package/standards/integration/standards/api/error-handling.md +250 -0
- package/standards/integration/standards/api/resource-design.md +286 -0
- package/standards/integration/standards/data-formats/_category_.json +1 -0
- package/standards/integration/standards/data-formats/character-encoding.md +206 -0
- package/standards/integration/standards/data-formats/date-format.md +102 -0
- package/standards/integration/standards/data-formats/datetime-formats.md +265 -0
- package/standards/integration/standards/data-formats/monetary-format.md +61 -0
- package/standards/integration/standards/events/_category_.json +1 -0
- package/standards/integration/standards/events/event-envelope.md +270 -0
- package/standards/integration/standards/foundational/_category_.json +1 -0
- package/standards/integration/standards/foundational/naming-conventions.md +334 -0
- package/standards/integration/standards/observability/_category_.json +1 -0
- package/standards/integration/standards/observability/integration-observability.md +226 -0
- package/standards/integration/standards/resilience/_category_.json +1 -0
- package/standards/integration/standards/resilience/integration-resilience-patterns.md +291 -0
- package/standards/integration/standards/resilience/retry-policy.md +268 -0
- package/standards/integration/standards/resilience/timeout.md +269 -0
- package/standards/integration/standards/versioning/_category_.json +1 -0
- package/standards/integration/standards/versioning/backward-forward-compatibility.md +230 -0
- package/standards/product/Guidelines/_category_.json +1 -0
- package/standards/product/Guidelines/requirement-document.md +54 -0
- package/standards/product/index.md +9 -0
- package/standards/project-management/index.md +9 -0
- package/standards/ux/index.md +9 -0
|
@@ -0,0 +1,268 @@
|
|
|
1
|
+
---
identifier: "INTG-STD-034"
name: "Retry Policy"
version: "1.0.0"
status: "MANDATORY"

domain: "INTEGRATION"
documentType: "standard"
category: "reliability"
appliesTo: ["api", "events", "a2a", "mcp", "webhooks", "grpc", "graphql", "batch"]

lastUpdated: "2026-03-28"
owner: "Integration Architecture Board"

standardsCompliance:
  iso: []
  rfc: ["RFC-9110", "RFC-6585"]
  w3c: []
  other: ["AWS-Builders-Library", "Google-Cloud-Retry-Strategy"]

taxonomy:
  capability: "reliability"
  subCapability: "retry-policy"
  layer: "infrastructure"

enforcement:
  method: "hybrid"
  validationRules:
    algorithm: "exponential-backoff-with-jitter"
    maxRetries: 5
    maxRetryDuration: "30s"
  rejectionCriteria:
    - "Fixed-interval retry without backoff"
    - "Retry on 4xx client errors (except 429 and 408)"
    - "Retry without idempotency guarantee on non-idempotent operations"
    - "Missing Retry-After header respect"

dependsOn: ["INTG-STD-029", "INTG-STD-035"]
supersedes: ""
---
|
|
41
|
+
|
|
42
|
+
# Retry Policy
|
|
43
|
+
|
|
44
|
+
## Purpose
|
|
45
|
+
|
|
46
|
+
Uncontrolled retries are a leading cause of cascading failures. When clients retry simultaneously with correlated timing, the resulting "thundering herd" overwhelms recovering services. This standard mandates exponential backoff with jitter, retry budgets, and idempotency requirements across all integration boundaries.
|
|
47
|
+
|
|
48
|
+
Companion standards: INTG-STD-033 (Resilience), INTG-STD-035 (Timeout).
|
|
49
|
+
|
|
50
|
+
> Normative language follows RFC 2119 semantics.
|
|
51
|
+
|
|
52
|
+
## Rules
|
|
53
|
+
|
|
54
|
+
### R-1: Exponential Backoff with Full Jitter
|
|
55
|
+
|
|
56
|
+
All retry implementations **MUST** use exponential backoff with full jitter as the default algorithm:
|
|
57
|
+
|
|
58
|
+
```
|
|
59
|
+
sleep = random_between(0, min(cap, base * 2 ^ attempt))
|
|
60
|
+
```
|
|
61
|
+
|
|
62
|
+
- `base` **MUST** default to 1 second
|
|
63
|
+
- `cap` **MUST** default to 30 seconds
|
|
64
|
+
- Decorrelated jitter **MAY** be used as an alternative
|
|
65
|
+
- Equal jitter and fixed-interval retry **MUST NOT** be used
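The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `full_jitter_delay` and the injectable `rng` parameter are assumptions made here for testability and are not defined by this standard.

```python
import random
from typing import Optional


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                      rng: Optional[random.Random] = None) -> float:
    """Return a sleep time in seconds for a zero-based retry attempt."""
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    source = rng if rng is not None else random
    # The upper bound grows exponentially but never exceeds the cap.
    upper = min(cap, base * (2 ** attempt))
    # Full jitter: draw uniformly from [0, upper].
    return source.uniform(0.0, upper)
```

With the defaults, attempt 0 sleeps up to 1 second, attempt 3 up to 8 seconds, and attempt 5 onward up to the 30-second cap.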
|
|
66
|
+
|
|
67
|
+
### R-2: Maximum Retry Count
|
|
68
|
+
|
|
69
|
+
All retry implementations **MUST** enforce a maximum retry count.
|
|
70
|
+
|
|
71
|
+
| Context | Default | Allowed Range |
|
|
72
|
+
|---------|---------|---------------|
|
|
73
|
+
| Synchronous API calls | 3 | 1-5 |
|
|
74
|
+
| Async event processing | 5 | 1-10 |
|
|
75
|
+
| Webhook delivery | 5 | 3-8 |
|
|
76
|
+
| Batch job items | 3 | 1-5 |
|
|
77
|
+
| gRPC unary calls | 3 | 1-5 |
|
|
78
|
+
|
|
79
|
+
Exceeding the upper bound requires Integration Architecture Board approval.
|
|
80
|
+
|
|
81
|
+
### R-3: Total Retry Duration
|
|
82
|
+
|
|
83
|
+
All retry loops **MUST** enforce a total duration cap regardless of retry count:
|
|
84
|
+
|
|
85
|
+
- Synchronous API calls: **MUST NOT** exceed 30 seconds
|
|
86
|
+
- Async events / webhooks: **MUST NOT** exceed 24 hours
|
|
87
|
+
|
|
88
|
+
Webhook retry schedule **SHOULD** use increasing intervals:
|
|
89
|
+
|
|
90
|
+
| Attempt | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|
|
91
|
+
|---------|---|---|---|---|---|---|---|---|
|
|
92
|
+
| Delay | 0s | 1s | 5s | 30s | 2m | 15m | 1h | 4h |
|
|
93
|
+
|
|
94
|
+
After the final attempt, the message **MUST** be routed to a DLQ and an alert **MUST** fire.
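As a quick check of the schedule above against the 24-hour cap, the following sketch sums the listed delays. The names `WEBHOOK_DELAYS` and `schedule_is_valid` are hypothetical, and the calculation ignores the time spent inside each delivery attempt.

```python
from datetime import timedelta

# Delays copied from the schedule table, one entry per attempt.
WEBHOOK_DELAYS = [
    timedelta(seconds=0), timedelta(seconds=1), timedelta(seconds=5),
    timedelta(seconds=30), timedelta(minutes=2), timedelta(minutes=15),
    timedelta(hours=1), timedelta(hours=4),
]
MAX_ASYNC_RETRY_DURATION = timedelta(hours=24)


def schedule_is_valid(delays, limit=MAX_ASYNC_RETRY_DURATION):
    """Check that delays never decrease and that their sum stays within the limit."""
    non_decreasing = all(a <= b for a, b in zip(delays, delays[1:]))
    return non_decreasing and sum(delays, timedelta()) <= limit
```

The eight delays add up to 19,056 seconds (about 5 hours 18 minutes), well inside the 24-hour limit.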
|
|
95
|
+
|
|
96
|
+
### R-4: Retryable vs Non-Retryable Classification
|
|
97
|
+
|
|
98
|
+
Implementations **MUST** classify failures before deciding whether to retry.
|
|
99
|
+
|
|
100
|
+
Retryable (**MUST** use backoff when retried, subject to the limits in R-2, R-3, R-6, and R-7):
|
|
101
|
+
|
|
102
|
+
| HTTP | gRPC | Network |
|
|
103
|
+
|------|------|---------|
|
|
104
|
+
| 408 Request Timeout | UNAVAILABLE (14) | Connection refused |
|
|
105
|
+
| 429 Too Many Requests | DEADLINE_EXCEEDED (4) | Connection reset |
|
|
106
|
+
| 500 Internal Server Error | RESOURCE_EXHAUSTED (8) | Socket timeout |
|
|
107
|
+
| 502 Bad Gateway | ABORTED (10) | DNS failure (max 2 attempts) |
|
|
108
|
+
| 503 Service Unavailable | | TLS handshake timeout |
|
|
109
|
+
| 504 Gateway Timeout | | |
|
|
110
|
+
|
|
111
|
+
Non-retryable (**MUST NOT** retry):
|
|
112
|
+
|
|
113
|
+
| HTTP | gRPC | Other |
|
|
114
|
+
|------|------|-------|
|
|
115
|
+
| 400 Bad Request | INVALID_ARGUMENT (3) | TLS certificate error |
|
|
116
|
+
| 401 Unauthorized | NOT_FOUND (5) | Serialization error |
|
|
117
|
+
| 403 Forbidden | PERMISSION_DENIED (7) | Invalid configuration |
|
|
118
|
+
| 404 Not Found | UNAUTHENTICATED (16) | Token permanently revoked |
|
|
119
|
+
| 409 Conflict | UNIMPLEMENTED (12) | |
|
|
120
|
+
| 422 Unprocessable Entity | | |
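The two tables above translate naturally into lookup sets. The sketch below is one possible encoding; the constant and function names are illustrative, and treating codes that appear in neither table as non-retryable is an assumption of this example rather than a rule stated here.

```python
# Status codes and gRPC codes copied from the retryable table.
RETRYABLE_HTTP = frozenset({408, 429, 500, 502, 503, 504})
RETRYABLE_GRPC = frozenset({
    "UNAVAILABLE", "DEADLINE_EXCEEDED", "RESOURCE_EXHAUSTED", "ABORTED",
})


def is_retryable_http(status: int) -> bool:
    """Return True only for HTTP status codes listed as retryable."""
    # Assumption: anything not explicitly listed is not retried.
    return status in RETRYABLE_HTTP


def is_retryable_grpc(code_name: str) -> bool:
    """Return True only for gRPC status names listed as retryable."""
    return code_name in RETRYABLE_GRPC
```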
|
|
121
|
+
|
|
122
|
+
### R-5: Retry-After Compliance
|
|
123
|
+
|
|
124
|
+
When a response includes a `Retry-After` header (RFC 9110 Section 10.2.3):
|
|
125
|
+
|
|
126
|
+
1. The client **MUST** wait at least the specified duration
|
|
127
|
+
2. `Retry-After` **MUST** take precedence over calculated backoff when greater
|
|
128
|
+
3. If `Retry-After` exceeds the time remaining under the total retry duration cap (R-3), the client **MUST** fail immediately instead of waiting
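A sketch of how a client might apply these three rules. `Retry-After` can carry either a number of seconds or an HTTP date, so both forms are parsed. The helper names are hypothetical, `now` is assumed to be a timezone-aware datetime, and `remaining` is assumed to be the seconds left before the overall retry deadline.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: str, now: datetime) -> Optional[float]:
    """Return the Retry-After delay in seconds, or None if it cannot be parsed."""
    value = value.strip()
    if value.isdigit():  # delay-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - now).total_seconds())


def next_delay(backoff: float, retry_after: Optional[float],
               remaining: float) -> Optional[float]:
    """Pick the wait time; None means fail now instead of retrying."""
    if retry_after is not None and retry_after > remaining:
        return None  # the server asked for more time than the client has left
    delay = backoff if retry_after is None else max(backoff, retry_after)
    # Clipping to remaining still waits at least retry_after (checked above).
    return min(delay, remaining)
```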
|
|
129
|
+
|
|
130
|
+
### R-6: Idempotency Requirements
|
|
131
|
+
|
|
132
|
+
Retries **MUST** be safe. A retry is safe only when the operation is idempotent or protected by an idempotency mechanism.
|
|
133
|
+
|
|
134
|
+
Safe to retry without additional measures: GET, HEAD, OPTIONS, PUT, DELETE.
|
|
135
|
+
|
|
136
|
+
**MUST** use `Idempotency-Key` header for retries: POST, PATCH.
|
|
137
|
+
|
|
138
|
+
Idempotency key rules:
|
|
139
|
+
- Client **MUST** generate a UUID v4 per logical operation
|
|
140
|
+
- Same key **MUST** be reused across all retry attempts
|
|
141
|
+
- Servers **MUST** store keys for minimum 24 hours and return cached responses for duplicates
|
|
142
|
+
- Keys **MUST** be at most 64 characters
|
|
143
|
+
|
|
144
|
+
Event consumers **MUST** implement idempotent processing using a deduplication store keyed on event ID (minimum 7-day window).
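The idempotency key rules above can be illustrated with a small in-memory sketch. `IdempotencyStore` and `new_idempotency_key` are hypothetical names; a real service would use a shared, durable store rather than a process-local dictionary, and the same pattern keyed on event ID would apply to consumer-side deduplication.

```python
import time
import uuid
from typing import Any, Callable, Dict, Tuple

KEY_TTL_SECONDS = 24 * 60 * 60  # minimum retention for stored keys


def new_idempotency_key() -> str:
    """Generate one key per logical operation (UUID v4, 36 characters)."""
    return str(uuid.uuid4())


class IdempotencyStore:
    """Process-local stand-in for a shared store; illustration only."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def execute(self, key: str, handler: Callable[[], Any]) -> Any:
        """Run handler once per key; replay the stored response for duplicates."""
        if len(key) > 64:
            raise ValueError("idempotency key longer than 64 characters")
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < KEY_TTL_SECONDS:
            return cached[1]  # duplicate request: no second side effect
        response = handler()
        self._entries[key] = (now, response)
        return response
```

The client generates the key once and sends it unchanged on every retry, so the server-side handler runs at most once within the retention window.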
|
|
145
|
+
|
|
146
|
+
### R-7: Retry Budget
|
|
147
|
+
|
|
148
|
+
All services **MUST** enforce a retry budget to prevent amplification:
|
|
149
|
+
|
|
150
|
+
- Retries **MUST NOT** exceed **20%** of total requests over a rolling 30-second window
|
|
151
|
+
- When budget is exhausted, retries **MUST** be suppressed and original error returned
|
|
152
|
+
- Budget **MUST** be tracked per downstream dependency
|
|
153
|
+
- Budget exhaustion **SHOULD** trigger circuit breaker open (INTG-STD-033)
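One way to track such a budget is a rolling window of request events per downstream dependency. The sketch below is an assumption-laden illustration: the class name `RetryBudget` and its methods are hypothetical, and it interprets "total requests" as first attempts plus retries inside the window.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class RetryBudget:
    """Rolling-window retry budget for a single downstream dependency."""

    def __init__(self, ratio: float = 0.2, window: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ratio = ratio
        self._window = window
        self._clock = clock
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_retry)

    def _prune(self, now: float) -> None:
        # Drop events that have fallen out of the rolling window.
        while self._events and now - self._events[0][0] > self._window:
            self._events.popleft()

    def record_request(self) -> None:
        """Record a first-attempt request."""
        now = self._clock()
        self._prune(now)
        self._events.append((now, False))

    def may_retry(self) -> bool:
        """Return True and record the retry if the budget still allows it."""
        now = self._clock()
        self._prune(now)
        retries = sum(1 for _, is_retry in self._events if is_retry)
        total = len(self._events)
        # Deny when this retry would push retries above ratio of all requests.
        if retries + 1 > self._ratio * (total + 1):
            return False
        self._events.append((now, True))
        return True
```

A caller would keep one `RetryBudget` instance per dependency and return the original error whenever `may_retry()` is false.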
|
|
154
|
+
|
|
155
|
+
### R-8: Dead-Letter Queue Routing
|
|
156
|
+
|
|
157
|
+
For async integrations (events, webhooks, queues):
|
|
158
|
+
|
|
159
|
+
1. After retry exhaustion, messages **MUST** be routed to a DLQ
|
|
160
|
+
2. DLQ messages **MUST** retain original payload, headers, and retry metadata
|
|
161
|
+
3. DLQ messages **MUST** trigger an alert
|
|
162
|
+
4. Messages **MUST NOT** be silently dropped
|
|
163
|
+
|
|
164
|
+
### R-9: Security
|
|
165
|
+
|
|
166
|
+
Retry logic **MUST NOT** introduce security vulnerabilities:
|
|
167
|
+
|
|
168
|
+
1. Retries **MUST** use the original auth context; if the token has expired, the client refreshes it and **MUST NOT** fall back to weaker credentials
|
|
169
|
+
2. Retry logs **MUST NOT** include request bodies, tokens, or PII
|
|
170
|
+
3. TLS certificate errors **MUST NOT** be retried (potential MITM)
|
|
171
|
+
4. Retry budget (R-7) is mandatory to prevent DDoS amplification
|
|
172
|
+
|
|
173
|
+
### R-10: Observability
|
|
174
|
+
|
|
175
|
+
Every retry attempt **MUST** be logged with: `correlation_id`, `dependency`, `attempt`, `max_attempts`, `backoff_ms`, `error_type`, `idempotency_key`.
|
|
176
|
+
|
|
177
|
+
Required metrics:
|
|
178
|
+
|
|
179
|
+
| Metric | Type |
|
|
180
|
+
|--------|------|
|
|
181
|
+
| `retry_attempts_total` | Counter (by service, dependency, attempt_number) |
|
|
182
|
+
| `retry_exhausted_total` | Counter (by service, dependency) |
|
|
183
|
+
| `retry_backoff_duration_seconds` | Histogram |
|
|
184
|
+
| `retry_budget_utilization_ratio` | Gauge |
|
|
185
|
+
| `dlq_messages_total` | Counter (by queue) |
|
|
186
|
+
|
|
187
|
+
## Examples
|
|
188
|
+
|
|
189
|
+
### Retry with full jitter
|
|
190
|
+
|
|
191
|
+
```
function retry(operation, budget, max_retries=3, base=1.0, cap=30.0, max_duration=30.0):
    # Total duration cap (R-3) applies regardless of retry count.
    deadline = now() + max_duration

    # attempt is zero-based; the range is inclusive (1 initial call + max_retries retries).
    for attempt in 0..max_retries:
        result = operation()
        if result.success:
            return result

        # Classify the failure before deciding to retry (R-4).
        if not is_retryable(result.error):
            fail(result.error)

        remaining = deadline - now()
        if attempt == max_retries or remaining <= 0:
            fail("retries exhausted", attempts=attempt+1)
        # Budget exhausted: suppress the retry and surface the original error (R-7).
        if not budget.may_retry():
            fail(result.error, reason="retry budget exhausted")

        # Full jitter (R-1), clipped to the time left before the deadline.
        delay = min(random(0, min(cap, base * 2^attempt)), remaining)
        log(attempt=attempt+1, backoff_ms=delay*1000, error=result.error)
        wait(delay)
```
|
|
213
|
+
|
|
214
|
+
### Idempotent retry on a non-idempotent operation
|
|
215
|
+
|
|
216
|
+
First attempt:
|
|
217
|
+
|
|
218
|
+
```
|
|
219
|
+
POST /v1/payments
|
|
220
|
+
Idempotency-Key: 7c4a8d09-ca95-4c6d-8f3b-91a7e6e0b9d2
|
|
221
|
+
|
|
222
|
+
{"amount": 100.00, "currency": "USD"}
|
|
223
|
+
```
|
|
224
|
+
|
|
225
|
+
Retry (same key - server returns cached response, no duplicate side effect):
|
|
226
|
+
|
|
227
|
+
```
|
|
228
|
+
POST /v1/payments
|
|
229
|
+
Idempotency-Key: 7c4a8d09-ca95-4c6d-8f3b-91a7e6e0b9d2
|
|
230
|
+
|
|
231
|
+
{"amount": 100.00, "currency": "USD"}
|
|
232
|
+
```
|
|
233
|
+
|
|
234
|
+
## Enforcement Rules
|
|
235
|
+
|
|
236
|
+
| Rule | Gate | Action |
|
|
237
|
+
|------|------|--------|
|
|
238
|
+
| Fixed-interval retry detected | CI lint | Block merge |
|
|
239
|
+
| POST/PATCH retry without idempotency key | CI lint | Block merge |
|
|
240
|
+
| Retry on non-retryable status code | CI lint | Block merge |
|
|
241
|
+
| Missing retry budget | Production readiness | Block deploy |
|
|
242
|
+
| Max retries exceeds allowed range | Architecture review | IAB approval |
|
|
243
|
+
| DLQ not configured for async consumers | Production readiness | Block deploy |
|
|
244
|
+
|
|
245
|
+
## References
|
|
246
|
+
|
|
247
|
+
- [RFC 9110 Section 10.2.3 - Retry-After](https://www.rfc-editor.org/rfc/rfc9110#section-10.2.3)
|
|
248
|
+
- [RFC 6585 - 429 Too Many Requests](https://www.rfc-editor.org/rfc/rfc6585)
|
|
249
|
+
- [AWS Builders' Library - Timeouts, retries, and backoff with jitter](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/)
|
|
250
|
+
- [AWS Architecture Blog - Exponential Backoff and Jitter](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/)
|
|
251
|
+
- [Google Cloud - Retry Strategy](https://cloud.google.com/storage/docs/retry-strategy)
|
|
252
|
+
- [Stripe - Idempotent Requests](https://docs.stripe.com/api/idempotent_requests)
|
|
253
|
+
|
|
254
|
+
## Rationale
|
|
255
|
+
|
|
256
|
+
**Full jitter over alternatives:** AWS research shows full jitter produces the least total work across competing clients. Equal jitter's guaranteed minimum floor creates clustering that partially defeats the purpose of jitter.
|
|
257
|
+
|
|
258
|
+
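**20% retry budget:** The 20% ratio matches the default retry budget in service proxies such as Envoy and Linkerd. Without a budget, a service at 1,000 req/s with a 50% failure rate, where every failed request is retried 3 times, amplifies to 2,500 req/s (500 failures x 3 retries = 1,500 extra requests). With retries capped at 20%, load stays at 1,200 req/s, which leaves manageable headroom for recovery.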
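The arithmetic behind these figures can be checked with a short sketch. It assumes the worst case in which every failed request consumes all of its retries, and it assumes the budget caps retries at 20% of original requests; both function names are illustrative.

```python
def worst_case_load(base_rps: float, failure_rate: float, retries: int) -> float:
    """Load when every failed request uses all of its retries."""
    return base_rps + base_rps * failure_rate * retries


def budgeted_load(base_rps: float, budget_ratio: float) -> float:
    """Load when retries are capped at budget_ratio of original requests."""
    return base_rps * (1 + budget_ratio)
```

Calling `worst_case_load(1000, 0.5, 3)` gives the unbudgeted figure, and `budgeted_load(1000, 0.2)` gives the capped one.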
|
|
259
|
+
|
|
260
|
+
**Idempotency keys:** Stripe's pattern makes retries safe at the protocol level without requiring retry logic to understand business semantics.
|
|
261
|
+
|
|
262
|
+
**DLQ over infinite retry:** Infinite retry causes unbounded queue growth and head-of-line blocking, and it masks bugs. A DLQ cleanly separates transient failures from problems needing human attention.
|
|
263
|
+
|
|
264
|
+
## Version History
|
|
265
|
+
|
|
266
|
+
| Version | Date | Change |
|
|
267
|
+
| ------- | ---------- | ------------------ |
|
|
268
|
+
| 1.0.0 | 2026-03-28 | Initial definition |
|
|
@@ -0,0 +1,269 @@
|
|
|
1
|
+
---
identifier: "INTG-STD-035"
name: "Timeout Standard"
version: "1.0.0"
status: "MANDATORY"

domain: "INTEGRATION"
documentType: "standard"
category: "reliability"
appliesTo: ["api", "events", "a2a", "mcp", "webhooks", "grpc", "graphql", "batch"]

lastUpdated: "2026-03-28"
owner: "Integration Architecture Board"

standardsCompliance:
  iso: []
  rfc: ["RFC-9110"]
  w3c: []
  other: ["Zalando-Engineering-Timeouts", "AWS-Builders-Library", "gRPC-Deadlines"]

taxonomy:
  capability: "reliability"
  subCapability: "timeout-management"
  layer: "infrastructure"

enforcement:
  method: "hybrid"
  validationRules:
    connectionTimeout: "2s"
    readTimeout: "5s"
    totalTimeout: "30s"
  rejectionCriteria:
    - "Missing timeout configuration on any external call"
    - "Infinite or unset timeouts"
    - "Timeout longer than upstream caller's deadline"

dependsOn: ["INTG-STD-029"]
supersedes: ""
---
|
|
40
|
+
|
|
41
|
+
# Timeout
|
|
42
|
+
|
|
43
|
+
## Purpose
|
|
44
|
+
|
|
45
|
+
Every external call - whether to an API, database, message broker, or file system - **MUST** have an explicit timeout. Unbounded waits are among the most common causes of cascading failures in distributed systems. This standard defines mandatory timeout categories, default values by integration type, deadline propagation rules, and observability requirements. It complements INTG-STD-033 (Resilience) and INTG-STD-034 (Retry).
|
|
46
|
+
|
|
47
|
+
## Rules
|
|
48
|
+
|
|
49
|
+
### R-1: Explicit Timeout on Every External Call
|
|
50
|
+
|
|
51
|
+
Every outbound call to an external dependency **MUST** have an explicit timeout configured. This includes HTTP/REST, gRPC, database queries, message broker operations, file/object storage, DNS lookups, cache reads/writes, and any third-party SDK call performing network I/O. Services **MUST NOT** rely on language or library default timeouts, because many libraries default to infinite timeouts.
|
|
52
|
+
|
|
53
|
+
### R-2: Separate Connection and Read Timeouts
|
|
54
|
+
|
|
55
|
+
Services **MUST** configure connection timeout and read timeout as independent values. Connection timeout governs TCP handshake completion; read timeout governs time waiting for the server response after the connection is established.
|
|
56
|
+
|
|
57
|
+
Connection timeout **SHOULD** follow: `connection_timeout = round_trip_time * 3`. For most intra-region calls, 2 seconds provides ample headroom. Read timeout **MUST** be based on measured downstream latency percentiles (p99.9 recommended), not guesswork.
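Both recommendations can be expressed as small helpers. The sketch below uses a nearest-rank percentile as one reasonable way to read "p99.9" from measured samples; the function names and the choice of percentile method are assumptions of this example.

```python
import math
from typing import Sequence


def connection_timeout(round_trip_time_s: float) -> float:
    """Apply connection_timeout = round_trip_time * 3."""
    return round_trip_time_s * 3


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of the samples; pct is in (0, 100]."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    # round() guards against float noise such as 999.0000000000001 before ceil.
    rank = max(1, math.ceil(round(pct / 100 * len(ordered), 9)))
    return ordered[rank - 1]
```

A read timeout would then be derived from `percentile(latency_samples, 99.9)` plus whatever headroom the team chooses, rather than picked by hand.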
|
|
58
|
+
|
|
59
|
+
### R-3: Default Timeout Values by Integration Type
|
|
60
|
+
|
|
61
|
+
The following defaults **MUST** be used unless a documented exception exists with architectural approval:
|
|
62
|
+
|
|
63
|
+
| Integration Type | Connection | Read | Total | Rationale |
|
|
64
|
+
| ---------------------- | ---------- | ----- | ------ | --------------------------------------------- |
|
|
65
|
+
| REST/HTTP API | 2s | 5s | 10s | Most API calls complete within 1-2s at p99 |
|
|
66
|
+
| gRPC (unary) | 2s | 5s | 10s | Comparable to REST for request-response |
|
|
67
|
+
| gRPC (streaming) | 2s | 30s | 300s | Streams require longer read windows |
|
|
68
|
+
| Database query | 2s | 3s | 5s | Queries beyond 3s indicate missing indexes |
|
|
69
|
+
| Database transaction | 2s | 3s | 10s | Multi-statement transactions need headroom |
|
|
70
|
+
| Message publish | 2s | 5s | 10s | Broker acknowledgment should be fast |
|
|
71
|
+
| Message consume | 2s | 30s | 60s | Long-poll patterns require extended waits |
|
|
72
|
+
| Cache (Redis/Memcache) | 1s | 1s | 2s | Cache misses should fail fast |
|
|
73
|
+
| File/Object storage | 5s | 60s | 120s | Large transfers need proportional budgets |
|
|
74
|
+
| SMTP/Email | 5s | 30s | 60s | Mail servers vary widely in response time |
|
|
75
|
+
| DNS resolution | 2s | N/A | 2s | DNS should resolve from local cache |
|
|
76
|
+
| MCP tool invocation | 2s | 10s | 15s | AI tool calls may involve upstream LLM calls |
|
|
77
|
+
| Webhook delivery | 2s | 5s | 10s | Receiver should acknowledge quickly |
|
|
78
|
+
|
|
79
|
+
Exceptions **MUST** be documented in the service's integration manifest with the dependency name, overridden value, justification, architecture team approval, and a review date no longer than 6 months out.
|
|
80
|
+
|
|
81
|
+
### R-4: Deadline Propagation
|
|
82
|
+
|
|
83
|
+
Services that receive inbound requests with a deadline **MUST** propagate a reduced deadline to downstream calls:
|
|
84
|
+
|
|
85
|
+
```
|
|
86
|
+
remaining_budget    = incoming_deadline - now() - safety_margin
downstream_deadline = now() + remaining_budget
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
A safety margin of 100-500ms is **RECOMMENDED**. Protocol-specific mechanisms:
|
|
90
|
+
|
|
91
|
+
| Protocol | Mechanism | Notes |
|
|
92
|
+
| -------- | ------------------------------------------------- | ------------------------------------------- |
|
|
93
|
+
| gRPC | `grpc-timeout` header (automatic) | Framework propagates via context |
|
|
94
|
+
| HTTP | `X-Request-Deadline` header (epoch millis) | Application layer must read and propagate |
|
|
95
|
+
| Kafka | Message header `x-deadline` or record timestamp + TTL | Consumer checks before processing |
|
|
96
|
+
| GraphQL | `extensions.deadline` field | Resolver checks remaining budget per field |
|
|
97
|
+
|
|
98
|
+
For gRPC, services **MUST NOT** create new contexts that discard the incoming deadline. For HTTP, services **SHOULD** include and honor the `X-Request-Deadline` header.
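For HTTP, reading and re-emitting the header might look like the sketch below. It assumes the header value is an absolute deadline in epoch milliseconds as listed in the table; the helper names and the 200 ms default margin are illustrative, and the lookup is case-sensitive, so a real framework's case-insensitive header map would be preferable.

```python
import time
from typing import Dict, Mapping, Optional

DEADLINE_HEADER = "X-Request-Deadline"  # absolute deadline, epoch milliseconds


def remaining_budget_ms(headers: Mapping[str, str], safety_margin_ms: int = 200,
                        now_ms: Optional[int] = None) -> Optional[int]:
    """Return the ms left for downstream work, or None if no valid deadline was sent."""
    raw = headers.get(DEADLINE_HEADER)
    if raw is None or not raw.strip().isdigit():
        return None
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return int(raw) - now_ms - safety_margin_ms


def downstream_headers(remaining_ms: int, now_ms: int) -> Dict[str, str]:
    """Build the header that carries the reduced deadline to the next hop."""
    return {DEADLINE_HEADER: str(now_ms + remaining_ms)}
```

A negative or very small result from `remaining_budget_ms` signals that the call should not be started at all.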
|
|
99
|
+
|
|
100
|
+
### R-5: Deadline Budget Enforcement
|
|
101
|
+
|
|
102
|
+
A service **MUST NOT** initiate a downstream call if the remaining deadline budget is less than the minimum time required to complete it. Instead, the service **MUST** return immediately with a timeout error, log the budget exhaustion event, and increment the `timeout.budget_exhausted_total` metric.
|
|
103
|
+
|
|
104
|
+
```
function call_downstream(downstream, incoming_deadline, safety_margin, downstream_min):
    # Time left for the downstream call after reserving the safety margin.
    remaining = incoming_deadline - now() - safety_margin
    if remaining < downstream_min:
        log_warn("Deadline budget exhausted",
                 remaining_ms=remaining, required_ms=downstream_min)
        emit_metric("timeout.budget_exhausted_total")
        return TIMEOUT_ERROR

    # Pass the reduced deadline on to the downstream call.
    downstream_deadline = now() + remaining
    return call(downstream, deadline=downstream_deadline)
```
|
|
116
|
+
|
|
117
|
+
### R-6: Protocol-Specific Timeout Rules
|
|
118
|
+
|
|
119
|
+
Services **MUST** apply the following protocol-specific rules:
|
|
120
|
+
|
|
121
|
+
| Protocol | Rule | Severity |
|
|
122
|
+
| --------------- | -------------------------------------------------------------------------------------- | -------- |
|
|
123
|
+
| HTTP/REST | Return `408` when server terminates a slow client; `504` when gateway times out upstream | ERROR |
|
|
124
|
+
| HTTP/REST | TLS handshake timeout **MUST** be included in total timeout budget | ERROR |
|
|
125
|
+
| gRPC | Every RPC **MUST** have a deadline set; calls without deadlines are unbounded | ERROR |
|
|
126
|
+
| gRPC | Servers **MUST** check context cancellation and abort work on expired deadlines | ERROR |
|
|
127
|
+
| Kafka | `request.timeout.ms` on producer and `max.poll.interval.ms` on consumer **MUST** be set | ERROR |
|
|
128
|
+
| RabbitMQ | Consumer ack timeout **MUST** be configured; message TTL **SHOULD** be set | ERROR |
|
|
129
|
+
| SQS | `VisibilityTimeout` **MUST** be at least 6x expected processing time | ERROR |
|
|
130
|
+
| Database | `statement_timeout` and `idle_in_transaction_session_timeout` **MUST** be configured | ERROR |
|
|
131
|
+
| File/Object | Per-part upload timeout and stalled transfer detection (30s recommended) **MUST** be set | ERROR |
|
|
132
|
+
|
|
133
|
+
### R-7: Client-Side and Server-Side Timeouts
|
|
134
|
+
|
|
135
|
+
Both client and server **MUST** configure timeouts independently. Client-side timeouts protect against slow servers; server-side timeouts protect against slow or malicious clients.
|
|
136
|
+
|
|
137
|
+
The client-side timeout **MUST** be at least as long as the server-side timeout for the same operation. If the client times out first, the server keeps processing a request whose result will be discarded, which wastes resources.
|
|
138
|
+
|
|
139
|
+
### R-8: Timeout and Circuit Breaker Interaction
|
|
140
|
+
|
|
141
|
+
Timeouts and circuit breakers (INTG-STD-033) **MUST** work together:
|
|
142
|
+
|
|
143
|
+
1. Each timeout event **MUST** increment the circuit breaker's failure counter. When the threshold is reached, the circuit **MUST** open.
|
|
144
|
+
2. When the circuit is open, requests **MUST** fail immediately without waiting for a timeout.
|
|
145
|
+
3. Half-open probe requests **SHOULD** use 50% of the normal timeout for faster degradation detection.
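The three interaction rules can be sketched as a small state machine. This is only an illustration: the failure threshold, the reset interval, and the class and method names are placeholders chosen here, since the actual circuit breaker parameters belong to INTG-STD-033.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Minimal breaker driven by timeout events; thresholds are illustrative."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def timeout_for(self, normal_timeout: float) -> Optional[float]:
        """Return the timeout to apply, or None when the call must fail fast."""
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_after:
                return None  # open circuit: fail immediately, no waiting
            self.state = "half_open"
        if self.state == "half_open":
            return normal_timeout * 0.5  # probe with half the normal timeout
        return normal_timeout

    def record_timeout(self) -> None:
        """Count a timeout as a failure; open the circuit at the threshold."""
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def record_success(self) -> None:
        """Close the circuit and reset the failure count."""
        self.state = "closed"
        self.failures = 0
```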
|
|
146
|
+
|
|
147
|
+
### R-9: Security - Timeouts as Defense
|
|
148
|
+
|
|
149
|
+
Timeouts **MUST** defend against resource exhaustion attacks:
|
|
150
|
+
|
|
151
|
+
- **Slowloris prevention:** Server read-header timeout **MUST** be 5s or less.
|
|
152
|
+
- **Slow POST prevention:** Server **MUST** enforce a minimum data rate; terminate connections below 500 bytes/second for more than 10 seconds.
|
|
153
|
+
- **Connection pool exhaustion:** Idle connections beyond 120s **SHOULD** be closed.
|
|
154
|
+
- **Query of death:** Database statement timeouts **MUST** prevent single queries from monopolizing resources.
|
|
155
|
+
|
|
156
|
+
Services **MUST NOT** extend timeouts under load. Longer timeouts during overload consume more resources and accelerate cascading failures. The correct response is to shed load via circuit breakers or rate limiting.
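The slow POST rule above lends itself to a simple check. The sketch below simplifies by using the average rate since the body started arriving, whereas a production server would more likely track a sliding window; the function name and parameters are hypothetical.

```python
def below_minimum_rate(bytes_received: int, elapsed_seconds: float,
                       min_rate: float = 500.0, grace_seconds: float = 10.0) -> bool:
    """Return True when a connection is slow enough to be terminated."""
    # Give every connection the grace period before judging its rate.
    if elapsed_seconds <= grace_seconds:
        return False
    return bytes_received / elapsed_seconds < min_rate
```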
|
|
157
|
+
|
|
158
|
+
### R-10: Timeout Metrics
|
|
159
|
+
|
|
160
|
+
Services **MUST** emit the following metrics for every timed external call:
|
|
161
|
+
|
|
162
|
+
| Metric | Type | Labels |
|
|
163
|
+
| ------------------------------------- | --------- | ----------------------------------------- |
|
|
164
|
+
| `external_call.duration_ms` | Histogram | `dependency`, `operation`, `result` |
|
|
165
|
+
| `external_call.timeout_total` | Counter | `dependency`, `operation`, `timeout_type` |
|
|
166
|
+
| `external_call.deadline_remaining_ms` | Histogram | `dependency`, `operation` |
|
|
167
|
+
| `timeout.budget_exhausted_total` | Counter | `dependency`, `operation` |
|
|
168
|
+
|
|
169
|
+
Where `timeout_type` is one of: `connection`, `read`, `write`, `total`, `deadline_exceeded`. `result` is one of: `success`, `timeout`, `error`.
|
|
170
|
+
|
|
171
|
+
### R-11: Timeout Logging
|
|
172
|
+
|
|
173
|
+
Every timeout event **MUST** be logged with at minimum: `dependency`, `operation`, `timeout_type`, `configured_timeout_ms`, and `elapsed_ms`. Additional **RECOMMENDED** fields: `trace_id`, `span_id`, `deadline_remaining_ms`, `retry_attempt`, `circuit_breaker_state`.
|
|
174
|
+
|
|
175
|
+
### R-12: Timeout Alerting
|
|
176
|
+
|
|
177
|
+
Services **MUST** configure alerts for:
|
|
178
|
+
|
|
179
|
+
| Condition | Severity | Action |
|
|
180
|
+
| ------------------------------------------ | -------- | ---------------------------------------------- |
|
|
181
|
+
| Timeout rate above 5% for a dependency | WARNING | Investigate dependency health |
|
|
182
|
+
| Timeout rate above 20% for a dependency | CRITICAL | Trigger incident; circuit breaker should open |
|
|
183
|
+
| Budget exhaustion rate above 1% | WARNING | Review timeout budgets and call chain |
|
|
184
|
+
| p99 latency above 80% of configured timeout | WARNING | Timeout too tight or dependency is degrading |
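The alert conditions in the table can be evaluated with a function like the one below. It is a sketch with hypothetical names: rates are assumed to be fractions between 0 and 1, and emitting only the CRITICAL alert when the timeout rate exceeds 20% (rather than both WARNING and CRITICAL) is a design choice of this example.

```python
from typing import List, Tuple


def timeout_alerts(timeout_rate: float, budget_exhaustion_rate: float,
                   p99_latency_ms: float,
                   configured_timeout_ms: float) -> List[Tuple[str, str]]:
    """Return (severity, condition) pairs for one dependency."""
    alerts: List[Tuple[str, str]] = []
    if timeout_rate > 0.20:
        alerts.append(("CRITICAL", "timeout rate above 20%"))
    elif timeout_rate > 0.05:
        alerts.append(("WARNING", "timeout rate above 5%"))
    if budget_exhaustion_rate > 0.01:
        alerts.append(("WARNING", "budget exhaustion rate above 1%"))
    if p99_latency_ms > 0.80 * configured_timeout_ms:
        alerts.append(("WARNING", "p99 latency above 80% of configured timeout"))
    return alerts
```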
|
|
185
|
+
|
|
186
|
+
## Enforcement Rules
|
|
187
|
+
|
|
188
|
+
The following **MUST** be enforced via static analysis, configuration scanning, or integration tests:
|
|
189
|
+
|
|
190
|
+
| Rule | Check | Severity |
|
|
191
|
+
| ------- | ----------------------------------------------------------------- | -------- |
|
|
192
|
+
| TMO-001 | Every HTTP client has explicit connection timeout | ERROR |
|
|
193
|
+
| TMO-002 | Every HTTP client has explicit read timeout | ERROR |
|
|
194
|
+
| TMO-003 | Connection timeout is at most 5s | ERROR |
|
|
195
|
+
| TMO-004 | Read timeout is at most 30s (exceptions require approval) | WARNING |
|
|
196
|
+
| TMO-005 | Total timeout is at most 120s (exceptions require approval) | WARNING |
|
|
197
|
+
| TMO-006 | Database `statement_timeout` is configured | ERROR |
|
|
198
|
+
| TMO-007 | Kafka `max.poll.interval.ms` is at most 300s | WARNING |
|
|
199
|
+
| TMO-008 | No infinite or zero timeout values in config | ERROR |
|
|
200
|
+
| TMO-009 | Server read-header timeout is at most 5s | ERROR |
|
|
201
|
+
| TMO-010 | gRPC calls have deadline set | ERROR |
|
|
202
|
+
|
|
203
|
+
Additional enforcement:
|
|
204
|
+
|
|
205
|
+
- **Gateway:** API gateways **MUST** enforce a maximum total timeout; requests exceeding it receive `504`.
|
|
206
|
+
- **Code review:** PRs introducing new external calls **MUST** include timeout configuration.
|
|
207
|
+
- **Runtime:** Services **SHOULD** log a warning when any call exceeds 80% of its configured timeout.
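A configuration scanner for a subset of the checks above (TMO-001 to TMO-005 and TMO-008) might look like the following sketch. The configuration keys `connection_timeout_s`, `read_timeout_s`, and `total_timeout_s` are invented for this example and do not correspond to any particular client library.

```python
import math
from typing import List, Mapping, Optional, Tuple


def lint_http_client(cfg: Mapping[str, Optional[float]]) -> List[Tuple[str, str]]:
    """Return (rule, severity) findings for one HTTP client configuration."""
    findings: List[Tuple[str, str]] = []
    conn = cfg.get("connection_timeout_s")
    read = cfg.get("read_timeout_s")
    total = cfg.get("total_timeout_s")
    if conn is None:
        findings.append(("TMO-001", "ERROR"))  # no explicit connection timeout
    if read is None:
        findings.append(("TMO-002", "ERROR"))  # no explicit read timeout
    # TMO-008: zero, negative, or infinite values count as unbounded.
    if any(v is not None and (v <= 0 or math.isinf(v)) for v in (conn, read, total)):
        findings.append(("TMO-008", "ERROR"))
    if conn is not None and conn > 5:
        findings.append(("TMO-003", "ERROR"))
    if read is not None and read > 30:
        findings.append(("TMO-004", "WARNING"))
    if total is not None and total > 120:
        findings.append(("TMO-005", "WARNING"))
    return findings
```

An empty result means the configuration passes these particular checks; the remaining rules (database, Kafka, server, and gRPC settings) would need their own scanners.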
|
|
208
|
+
|
|
209
|
+
## Examples
|
|
210
|
+
|
|
211
|
+
### Deadline Propagation
|
|
212
|
+
|
|
213
|
+
The following pseudocode illustrates how a service receives an upstream deadline and propagates a reduced deadline to each downstream call:
|
|
214
|
+
|
|
215
|
+
```
function handle_request(request):
    deadline = parse_deadline_header(request)
    if deadline is null:
        deadline = now() + DEFAULT_TIMEOUT

    # First downstream call
    result_a = fetch_from_service_a(request, deadline)

    # Recalculate remaining budget before next call
    remaining = deadline - now() - SAFETY_MARGIN
    if remaining < MIN_DOWNSTREAM_TIMEOUT:
        return error(504, "Deadline budget exhausted")

    # fetch_from_service_b mirrors fetch_from_service_a
    result_b = fetch_from_service_b(request, deadline)
    return combine(result_a, result_b)

function fetch_from_service_a(request, upstream_deadline):
    remaining = upstream_deadline - now() - SAFETY_MARGIN
    if remaining < MIN_DOWNSTREAM_TIMEOUT:
        raise TimeoutError("Budget exhausted before calling Service A")

    # Propagate the reduced deadline (R-4), not the original one
    downstream_deadline = now() + remaining
    return http_call(
        url = SERVICE_A_URL,
        timeout = remaining,
        headers = {"X-Request-Deadline": downstream_deadline}
    )
```
|
|
243
|
+
|
|
244
|
+
## Rationale
|
|
245
|
+
|
|
246
|
+
**Why separate connection and read timeouts?** They measure different failure modes. Connection timeout detects network unreachability (host down); read timeout measures server processing speed. Conflating them either slows failure detection or sets unrealistic response expectations.
|
|
247
|
+
|
|
248
|
+
**Why mandate deadline propagation?** Without it, a request that triggers five sequential downstream calls with 10s timeouts each can hold resources for up to 50s - long after the client has disconnected - while downstream services continue wasted work.
|
|
249
|
+
|
|
250
|
+
**Why aggressive defaults?** Short timeouts force architectural discipline. A 3s database timeout surfaces missing indexes during development, not production incidents. Services needing longer timeouts can request documented exceptions.
|
|
251
|
+
|
|
252
|
+
**Why not just circuit breakers?** Timeouts bound a single call's duration; circuit breakers prevent repeated calls to failing dependencies. Without timeouts, circuit breakers have no signal for slow failures. Both are required; neither is sufficient alone.
|
|
253
|
+
|
|
254
|
+
**Why never extend timeouts under load?** Longer timeouts during overload mean more in-flight requests, more consumed resources, and faster cascading failure. The correct response is load shedding, not longer waits.
|
|
255
|
+
|
|
256
|
+
## References
|
|
257
|
+
|
|
258
|
+
- [**RFC 9110**](https://httpwg.org/specs/rfc9110.html) - HTTP Semantics (408 Request Timeout, 504 Gateway Timeout)
|
|
259
|
+
- [**Zalando Engineering - Timeouts**](https://engineering.zalando.com/posts/2023/07/all-you-need-to-know-about-timeouts.html) - Connection timeout formula, latency percentile baselines
|
|
260
|
+
- [**AWS Builders' Library - Timeouts, Retries, and Backoff**](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/) - False-timeout rate, retry multiplication risks
|
|
261
|
+
- [**gRPC Deadlines**](https://grpc.io/docs/guides/deadlines/) - Automatic deadline propagation, DEADLINE_EXCEEDED
|
|
262
|
+
- **INTG-STD-033** - Resilience Standard (circuit breakers, bulkheads, fallbacks)
|
|
263
|
+
- **INTG-STD-034** - Retry Standard (retry policies, backoff, idempotency)
|
|
264
|
+
|
|
265
|
+
## Version History
|
|
266
|
+
|
|
267
|
+
| Version | Date | Change |
|
|
268
|
+
| ------- | ---------- | ------------------ |
|
|
269
|
+
| 1.0.0 | 2026-03-28 | Initial definition |
|
|
@@ -0,0 +1 @@
|
|
|
1
|
+
{"label": "Versioning", "position": 7}
|