@precepts/standards 0.1.0

Files changed (34)
  1. package/LICENSE +30 -0
  2. package/README.md +115 -0
  3. package/package.json +40 -0
  4. package/schema/document-standard-template.md +139 -0
  5. package/schema/standards.schema.json +154 -0
  6. package/standards/integration/governance/_category_.json +1 -0
  7. package/standards/integration/governance/integration-styles.md +56 -0
  8. package/standards/integration/index.md +9 -0
  9. package/standards/integration/standards/_category_.json +1 -0
  10. package/standards/integration/standards/api/_category_.json +1 -0
  11. package/standards/integration/standards/api/error-handling.md +250 -0
  12. package/standards/integration/standards/api/resource-design.md +286 -0
  13. package/standards/integration/standards/data-formats/_category_.json +1 -0
  14. package/standards/integration/standards/data-formats/character-encoding.md +206 -0
  15. package/standards/integration/standards/data-formats/date-format.md +102 -0
  16. package/standards/integration/standards/data-formats/datetime-formats.md +265 -0
  17. package/standards/integration/standards/data-formats/monetary-format.md +61 -0
  18. package/standards/integration/standards/events/_category_.json +1 -0
  19. package/standards/integration/standards/events/event-envelope.md +270 -0
  20. package/standards/integration/standards/foundational/_category_.json +1 -0
  21. package/standards/integration/standards/foundational/naming-conventions.md +334 -0
  22. package/standards/integration/standards/observability/_category_.json +1 -0
  23. package/standards/integration/standards/observability/integration-observability.md +226 -0
  24. package/standards/integration/standards/resilience/_category_.json +1 -0
  25. package/standards/integration/standards/resilience/integration-resilience-patterns.md +291 -0
  26. package/standards/integration/standards/resilience/retry-policy.md +268 -0
  27. package/standards/integration/standards/resilience/timeout.md +269 -0
  28. package/standards/integration/standards/versioning/_category_.json +1 -0
  29. package/standards/integration/standards/versioning/backward-forward-compatibility.md +230 -0
  30. package/standards/product/Guidelines/_category_.json +1 -0
  31. package/standards/product/Guidelines/requirement-document.md +54 -0
  32. package/standards/product/index.md +9 -0
  33. package/standards/project-management/index.md +9 -0
  34. package/standards/ux/index.md +9 -0
@@ -0,0 +1,268 @@
1
+ ---
2
+ identifier: "INTG-STD-034"
3
+ name: "Retry Policy"
4
+ version: "1.0.0"
5
+ status: "MANDATORY"
6
+
7
+ domain: "INTEGRATION"
8
+ documentType: "standard"
9
+ category: "reliability"
10
+ appliesTo: ["api", "events", "a2a", "mcp", "webhooks", "grpc", "graphql", "batch"]
11
+
12
+ lastUpdated: "2026-03-28"
13
+ owner: "Integration Architecture Board"
14
+
15
+ standardsCompliance:
16
+ iso: []
17
+ rfc: ["RFC-9110", "RFC-6585"]
18
+ w3c: []
19
+ other: ["AWS-Builders-Library", "Google-Cloud-Retry-Strategy"]
20
+
21
+ taxonomy:
22
+ capability: "reliability"
23
+ subCapability: "retry-policy"
24
+ layer: "infrastructure"
25
+
26
+ enforcement:
27
+ method: "hybrid"
28
+ validationRules:
29
+ algorithm: "exponential-backoff-with-jitter"
30
+ maxRetries: 5
31
+ maxRetryDuration: "30s"
32
+ rejectionCriteria:
33
+ - "Fixed-interval retry without backoff"
34
+ - "Retry on 4xx client errors (except 429 and 408)"
35
+ - "Retry without idempotency guarantee on non-idempotent operations"
36
+ - "Missing Retry-After header respect"
37
+
38
+ dependsOn: ["INTG-STD-029", "INTG-STD-035"]
39
+ supersedes: ""
40
+ ---
41
+
42
+ # Retry Policy
43
+
44
+ ## Purpose
45
+
46
+ Uncontrolled retries are a leading cause of cascading failures. When clients retry simultaneously with correlated timing, the resulting "thundering herd" overwhelms recovering services. This standard mandates exponential backoff with jitter, retry budgets, and idempotency requirements across all integration boundaries.
47
+
48
+ Companion standards: INTG-STD-033 (Resilience), INTG-STD-035 (Timeout).
49
+
50
+ > Normative language follows RFC 2119 semantics.
51
+
52
+ ## Rules
53
+
54
+ ### R-1: Exponential Backoff with Full Jitter
55
+
56
+ All retry implementations **MUST** use exponential backoff with full jitter as the default algorithm:
57
+
58
+ ```
59
+ sleep = random_between(0, min(cap, base * 2 ^ attempt))
60
+ ```
61
+
62
+ - `base` **MUST** default to 1 second
63
+ - `cap` **MUST** default to 30 seconds
64
+ - Decorrelated jitter **MAY** be used as an alternative
65
+ - Equal jitter and fixed-interval retry **MUST NOT** be used
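A minimal Python sketch of the R-1 formula, assuming a zero-based `attempt` counter. The function name `full_jitter_delay` and the `rng` parameter are illustrative and not part of the standard.

```python
import random


def full_jitter_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Return a sleep time in seconds for a zero-based retry attempt."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Clamp the exponent so 2 ** attempt cannot grow without bound; this is
    # harmless whenever cap is far below base * 2 ** 32 (an assumption here).
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    # Full jitter: draw uniformly from the whole interval [0, ceiling].
    return rng.uniform(0.0, ceiling)
```

With the defaults, the ceiling grows 1s, 2s, 4s, 8s, 16s and then stays at the 30s cap; the actual delay is a random value below that ceiling.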
66
+
67
+ ### R-2: Maximum Retry Count
68
+
69
+ All retry implementations **MUST** enforce a maximum retry count.
70
+
71
+ | Context | Default | Allowed Range |
72
+ |---------|---------|---------------|
73
+ | Synchronous API calls | 3 | 1-5 |
74
+ | Async event processing | 5 | 1-10 |
75
+ | Webhook delivery | 5 | 3-8 |
76
+ | Batch job items | 3 | 1-5 |
77
+ | gRPC unary calls | 3 | 1-5 |
78
+
79
+ Exceeding the upper bound requires Integration Architecture Board approval.
80
+
81
+ ### R-3: Total Retry Duration
82
+
83
+ All retry loops **MUST** enforce a total duration cap regardless of retry count:
84
+
85
+ - Synchronous API calls: **MUST NOT** exceed 30 seconds
86
+ - Async events / webhooks: **MUST NOT** exceed 24 hours
87
+
88
+ Webhook retry schedule **SHOULD** use increasing intervals:
89
+
90
+ | Attempt | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
91
+ |---------|---|---|---|---|---|---|---|---|
92
+ | Delay | 0s | 1s | 5s | 30s | 2m | 15m | 1h | 4h |
93
+
94
+ After the final attempt, the message **MUST** be routed to a DLQ and an alert **MUST** fire.
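As a quick sanity check, the suggested webhook schedule can be summed to confirm it stays inside the 24-hour cap. This is a sketch only; the names `WEBHOOK_DELAYS` and `cumulative_offsets` are hypothetical.

```python
from datetime import timedelta

# Delays before attempts 1..8, copied from the schedule table above.
WEBHOOK_DELAYS = [
    timedelta(seconds=0),
    timedelta(seconds=1),
    timedelta(seconds=5),
    timedelta(seconds=30),
    timedelta(minutes=2),
    timedelta(minutes=15),
    timedelta(hours=1),
    timedelta(hours=4),
]
MAX_TOTAL = timedelta(hours=24)


def cumulative_offsets(delays):
    """Return the elapsed time since the first attempt for each attempt."""
    total = timedelta(0)
    offsets = []
    for delay in delays:
        total += delay
        offsets.append(total)
    return offsets
```

The final attempt lands 5 hours 17 minutes 36 seconds after the first, well under the 24-hour limit.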
95
+
96
+ ### R-4: Retryable vs Non-Retryable Classification
97
+
98
+ Implementations **MUST** classify failures before deciding whether to retry.
99
+
100
+ Retryable (**MUST** retry with backoff):
101
+
102
+ | HTTP | gRPC | Network |
103
+ |------|------|---------|
104
+ | 408 Request Timeout | UNAVAILABLE (14) | Connection refused |
105
+ | 429 Too Many Requests | DEADLINE_EXCEEDED (4) | Connection reset |
106
+ | 500 Internal Server Error | RESOURCE_EXHAUSTED (8) | Socket timeout |
107
+ | 502 Bad Gateway | ABORTED (10) | DNS failure (max 2 attempts) |
108
+ | 503 Service Unavailable | | TLS handshake timeout |
109
+ | 504 Gateway Timeout | | |
110
+
111
+ Non-retryable (**MUST NOT** retry):
112
+
113
+ | HTTP | gRPC | Other |
114
+ |------|------|-------|
115
+ | 400 Bad Request | INVALID_ARGUMENT (3) | TLS certificate error |
116
+ | 401 Unauthorized | NOT_FOUND (5) | Serialization error |
117
+ | 403 Forbidden | PERMISSION_DENIED (7) | Invalid configuration |
118
+ | 404 Not Found | UNAUTHENTICATED (16) | Token permanently revoked |
119
+ | 409 Conflict | UNIMPLEMENTED (12) | |
120
+ | 422 Unprocessable Entity | | |
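The two tables can be encoded as lookup sets. The sketch below assumes that any status or code not listed as retryable is treated as non-retryable, which the standard does not state explicitly; the function names are illustrative.

```python
# Retryable HTTP statuses and gRPC codes from the R-4 tables.
RETRYABLE_HTTP_STATUS = frozenset({408, 429, 500, 502, 503, 504})
RETRYABLE_GRPC_CODES = frozenset({4, 8, 10, 14})  # DEADLINE_EXCEEDED, RESOURCE_EXHAUSTED, ABORTED, UNAVAILABLE


def is_retryable_http(status):
    """Return True only for statuses listed as retryable (assumed default: no retry)."""
    return status in RETRYABLE_HTTP_STATUS


def is_retryable_grpc(code):
    """Return True only for gRPC codes listed as retryable (assumed default: no retry)."""
    return code in RETRYABLE_GRPC_CODES
```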
121
+
122
+ ### R-5: Retry-After Compliance
123
+
124
+ When a response includes a `Retry-After` header (RFC 9110 Section 10.2.3):
125
+
126
+ 1. The client **MUST** wait at least the specified duration
127
+ 2. `Retry-After` **MUST** take precedence over calculated backoff when greater
128
+ 3. If `Retry-After` exceeds remaining retry budget, the client **MUST** fail immediately
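One way to apply the three rules above, sketched in Python. `Retry-After` may carry either delay-seconds or an HTTP-date, so both forms are parsed. The helper names `parse_retry_after` and `next_delay`, and the choice to return `None` for "fail immediately", are assumptions made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now):
    """Return the requested wait in seconds, or None if the header is unparseable."""
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delay-seconds form
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - now).total_seconds())


def next_delay(backoff, retry_after, remaining_budget):
    """Return seconds to wait before retrying, or None to fail immediately."""
    if retry_after is not None and retry_after > remaining_budget:
        return None  # rule 3: the server asks for more time than is left
    if retry_after is None:
        delay = backoff
    else:
        delay = max(backoff, retry_after)  # rules 1 and 2
    # Never sleep past the remaining budget; still >= retry_after here.
    return min(delay, remaining_budget)
```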
129
+
130
+ ### R-6: Idempotency Requirements
131
+
132
+ Retries **MUST** be safe. A retry is safe only when the operation is idempotent or protected by an idempotency mechanism.
133
+
134
+ Safe to retry without additional measures: GET, HEAD, OPTIONS, PUT, DELETE.
135
+
136
+ **MUST** use `Idempotency-Key` header for retries: POST, PATCH.
137
+
138
+ Idempotency key rules:
139
+ - Client **MUST** generate a UUID v4 per logical operation
140
+ - Same key **MUST** be reused across all retry attempts
141
+ - Servers **MUST** store keys for minimum 24 hours and return cached responses for duplicates
142
+ - Keys **MUST** be at most 64 characters
143
+
144
+ Event consumers **MUST** implement idempotent processing using a deduplication store keyed on event ID (minimum 7-day window).
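The key rules can be illustrated with a small in-memory sketch. A production server would use a shared, durable store; the class name `IdempotencyStore`, its `execute` method, and the single-process dictionary are all assumptions, and the code is not thread-safe.

```python
import time
import uuid

MAX_KEY_LENGTH = 64
DEFAULT_TTL_SECONDS = 24 * 60 * 60  # minimum retention from R-6


def new_idempotency_key():
    """Client side: one UUID v4 per logical operation, reused on every retry."""
    return str(uuid.uuid4())


class IdempotencyStore:
    """Server side: cache the first response for each key until it expires."""

    def __init__(self, ttl_seconds=DEFAULT_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, response)

    def execute(self, key, handler):
        if not key or len(key) > MAX_KEY_LENGTH:
            raise ValueError("idempotency key must be 1-64 characters")
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # duplicate: return cached response, no side effect
        response = handler()
        self._entries[key] = (now + self._ttl, response)
        return response
```

An event consumer could apply the same pattern keyed on event ID, with the TTL raised to at least 7 days.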
145
+
146
+ ### R-7: Retry Budget
147
+
148
+ All services **MUST** enforce a retry budget to prevent amplification:
149
+
150
+ - Maximum **20%** of total requests **MAY** be retries over a rolling 30-second window
151
+ - When budget is exhausted, retries **MUST** be suppressed and original error returned
152
+ - Budget **MUST** be tracked per downstream dependency
153
+ - Budget exhaustion **SHOULD** trigger circuit breaker open (INTG-STD-033)
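A rolling-window budget might look like the following sketch. It counts retries as part of total requests, as the first bullet describes. The class name `RetryBudget` and the methods `record_request` and `try_retry` are hypothetical; one instance per downstream dependency is assumed, and the code is not thread-safe.

```python
import time
from collections import deque


class RetryBudget:
    """Allow retries only while they stay within `ratio` of windowed traffic."""

    def __init__(self, ratio=0.2, window_seconds=30.0, clock=time.monotonic):
        self._ratio = ratio
        self._window = window_seconds
        self._clock = clock
        self._events = deque()  # (timestamp, is_retry), oldest first
        self._retries = 0

    def _evict(self, now):
        # Drop events that have aged out of the rolling window.
        while self._events and now - self._events[0][0] >= self._window:
            _, was_retry = self._events.popleft()
            if was_retry:
                self._retries -= 1

    def record_request(self):
        """Record an original (non-retry) request."""
        now = self._clock()
        self._evict(now)
        self._events.append((now, False))

    def try_retry(self):
        """Record and permit a retry if the budget allows it, else return False."""
        now = self._clock()
        self._evict(now)
        total_after = len(self._events) + 1
        if self._retries + 1 > self._ratio * total_after:
            return False  # budget exhausted: caller returns the original error
        self._events.append((now, True))
        self._retries += 1
        return True
```

Note that this sketch permits no retries when the window holds no traffic; real implementations often add a small minimum allowance, which the standard does not specify.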
154
+
155
+ ### R-8: Dead-Letter Queue Routing
156
+
157
+ For async integrations (events, webhooks, queues):
158
+
159
+ 1. After retry exhaustion, messages **MUST** be routed to a DLQ
160
+ 2. DLQ messages **MUST** retain original payload, headers, and retry metadata
161
+ 3. DLQ messages **MUST** trigger an alert
162
+ 4. Messages **MUST NOT** be silently dropped
163
+
164
+ ### R-9: Security
165
+
166
+ Retry logic **MUST NOT** introduce security vulnerabilities:
167
+
168
+ 1. Retries **MUST** use the original auth context (refresh token if expired, never degrade)
169
+ 2. Retry logs **MUST NOT** include request bodies, tokens, or PII
170
+ 3. TLS certificate errors **MUST NOT** be retried (potential MITM)
171
+ 4. Retry budget (R-7) is mandatory to prevent DDoS amplification
172
+
173
+ ### R-10: Observability
174
+
175
+ Every retry attempt **MUST** be logged with: `correlation_id`, `dependency`, `attempt`, `max_attempts`, `backoff_ms`, `error_type`, `idempotency_key`.
176
+
177
+ Required metrics:
178
+
179
+ | Metric | Type |
180
+ |--------|------|
181
+ | `retry_attempts_total` | Counter (by service, dependency, attempt_number) |
182
+ | `retry_exhausted_total` | Counter (by service, dependency) |
183
+ | `retry_backoff_duration_seconds` | Histogram |
184
+ | `retry_budget_utilization_ratio` | Gauge |
185
+ | `dlq_messages_total` | Counter (by queue) |
186
+
187
+ ## Examples
188
+
189
+ ### Retry with full jitter
190
+
191
+ ```
+ function retry(operation, budget, max_retries=3, base=1.0, cap=30.0, max_duration=30.0):
+     # Total duration cap (R-3); 30s is the synchronous API limit
+     deadline = now() + max_duration
+
+     # Inclusive range: one initial attempt plus up to max_retries retries
+     for attempt in 0..max_retries:
+         result = operation()
+         if result.success:
+             return result
+
+         # Classify before retrying (R-4)
+         if not is_retryable(result.error):
+             fail(result.error)
+
+         remaining = deadline - now()
+         if attempt == max_retries or remaining <= 0:
+             fail("retries exhausted", attempts=attempt+1)
+         # Retry budget check (R-7)
+         if not budget.may_retry():
+             fail("retry budget exhausted")
+
+         # Full jitter (R-1), never sleeping past the deadline
+         delay = min(random(0, min(cap, base * 2^attempt)), remaining)
+         log(attempt=attempt+1, backoff_ms=delay*1000, error=result.error)
+         wait(delay)
+ ```
213
+
214
+ ### Idempotent retry on a non-idempotent operation
215
+
216
+ First attempt:
217
+
218
+ ```
219
+ POST /v1/payments
220
+ Idempotency-Key: 7c4a8d09-ca95-4c6d-8f3b-91a7e6e0b9d2
221
+
222
+ {"amount": 100.00, "currency": "USD"}
223
+ ```
224
+
225
+ Retry (same key - server returns cached response, no duplicate side effect):
226
+
227
+ ```
228
+ POST /v1/payments
229
+ Idempotency-Key: 7c4a8d09-ca95-4c6d-8f3b-91a7e6e0b9d2
230
+
231
+ {"amount": 100.00, "currency": "USD"}
232
+ ```
233
+
234
+ ## Enforcement Rules
235
+
236
+ | Rule | Gate | Action |
237
+ |------|------|--------|
238
+ | Fixed-interval retry detected | CI lint | Block merge |
239
+ | POST/PATCH retry without idempotency key | CI lint | Block merge |
240
+ | Retry on non-retryable status code | CI lint | Block merge |
241
+ | Missing retry budget | Production readiness | Block deploy |
242
+ | Max retries exceeds allowed range | Architecture review | IAB approval |
243
+ | DLQ not configured for async consumers | Production readiness | Block deploy |
244
+
245
+ ## References
246
+
247
+ - [RFC 9110 Section 10.2.3 - Retry-After](https://www.rfc-editor.org/rfc/rfc9110#section-10.2.3)
248
+ - [RFC 6585 - 429 Too Many Requests](https://www.rfc-editor.org/rfc/rfc6585)
249
+ - [AWS Builders' Library - Timeouts, retries, and backoff with jitter](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/)
250
+ - [AWS Architecture Blog - Exponential Backoff and Jitter](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/)
251
+ - [Google Cloud - Retry Strategy](https://cloud.google.com/storage/docs/retry-strategy)
252
+ - [Stripe - Idempotent Requests](https://docs.stripe.com/api/idempotent_requests)
253
+
254
+ ## Rationale
255
+
256
+ **Full jitter over alternatives:** AWS research shows full jitter produces the least total work across competing clients. Equal jitter's guaranteed minimum floor creates clustering that partially defeats the purpose of jitter.
257
+
258
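+ **20% retry budget:** Google's gRPC default. Without a budget, a service at 1,000 req/s where 50% of requests fail and each failure is retried 3 times amplifies to 2,500 req/s (1,000 originals + 500 x 3 retries, assuming the retries fail too). At 20%, it stays at 1,200 req/s (1,000 originals + 200 retries) - manageable headroom for recovery.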
259
+
260
+ **Idempotency keys:** Stripe's pattern makes retries safe at the protocol level without requiring retry logic to understand business semantics.
261
+
262
+ **DLQ over infinite retry:** Infinite retry causes unbounded queue growth, head-of-line blocking, and masks bugs. DLQ cleanly separates transient failures from problems needing human attention.
263
+
264
+ ## Version History
265
+
266
+ | Version | Date | Change |
267
+ | ------- | ---------- | ------------------ |
268
+ | 1.0.0 | 2026-03-28 | Initial definition |
@@ -0,0 +1,269 @@
1
+ ---
2
+ identifier: "INTG-STD-035"
3
+ name: "Timeout Standard"
4
+ version: "1.0.0"
5
+ status: "MANDATORY"
6
+
7
+ domain: "INTEGRATION"
8
+ documentType: "standard"
9
+ category: "reliability"
10
+ appliesTo: ["api", "events", "a2a", "mcp", "webhooks", "grpc", "graphql", "batch"]
11
+
12
+ lastUpdated: "2026-03-28"
13
+ owner: "Integration Architecture Board"
14
+
15
+ standardsCompliance:
16
+ iso: []
17
+ rfc: ["RFC-9110"]
18
+ w3c: []
19
+ other: ["Zalando-Engineering-Timeouts", "AWS-Builders-Library", "gRPC-Deadlines"]
20
+
21
+ taxonomy:
22
+ capability: "reliability"
23
+ subCapability: "timeout-management"
24
+ layer: "infrastructure"
25
+
26
+ enforcement:
27
+ method: "hybrid"
28
+ validationRules:
29
+ connectionTimeout: "2s"
30
+ readTimeout: "5s"
31
+ totalTimeout: "30s"
32
+ rejectionCriteria:
33
+ - "Missing timeout configuration on any external call"
34
+ - "Infinite or unset timeouts"
35
+ - "Timeout longer than upstream caller's deadline"
36
+
37
+ dependsOn: ["INTG-STD-029"]
38
+ supersedes: ""
39
+ ---
40
+
41
+ # Timeout
42
+
43
+ ## Purpose
44
+
45
+ Every external call - whether to an API, database, message broker, or file system - **MUST** have an explicit timeout. Unbounded waits are the single most common cause of cascading failures in distributed systems. This standard defines mandatory timeout categories, default values by integration type, deadline propagation rules, and observability requirements. It complements INTG-STD-033 (Resilience) and INTG-STD-034 (Retry).
46
+
47
+ ## Rules
48
+
49
+ ### R-1: Explicit Timeout on Every External Call
50
+
51
+ Every outbound call to an external dependency **MUST** have an explicit timeout configured. This includes HTTP/REST, gRPC, database queries, message broker operations, file/object storage, DNS lookups, cache reads/writes, and any third-party SDK call performing network I/O. Relying on language or library default timeouts is **NOT** acceptable - many libraries default to infinite timeouts.
52
+
53
+ ### R-2: Separate Connection and Read Timeouts
54
+
55
+ Services **MUST** configure connection timeout and read timeout as independent values. Connection timeout governs TCP handshake completion; read timeout governs time waiting for the server response after the connection is established.
56
+
57
+ Connection timeout **SHOULD** follow: `connection_timeout = round_trip_time * 3`. For most intra-region calls, 2 seconds provides ample headroom. Read timeout **MUST** be based on measured downstream latency percentiles (p99.9 recommended), not guesswork.
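The sizing guidance above could be applied to measured samples as in this sketch. The nearest-rank percentile method and the helper names are assumptions; the standard only names the multiplier and the p99.9 baseline.

```python
def connection_timeout(rtt_seconds, multiplier=3.0):
    """Connection timeout as a multiple of the measured round-trip time."""
    return rtt_seconds * multiplier


def nearest_rank_percentile(samples, per_mille):
    """Return the nearest-rank percentile; per_mille=999 means p99.9."""
    if not samples:
        raise ValueError("at least one latency sample is required")
    if not 0 < per_mille <= 1000:
        raise ValueError("per_mille must be in 1..1000")
    ordered = sorted(samples)
    # Ceiling division in integer arithmetic avoids floating-point rounding.
    rank = max(1, -(-per_mille * len(ordered) // 1000))
    return ordered[rank - 1]
```

For example, a measured round trip of 0.5s gives a 1.5s connection timeout, and the read timeout would be set from `nearest_rank_percentile(latencies, 999)` plus whatever headroom the team chooses.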
58
+
59
+ ### R-3: Default Timeout Values by Integration Type
60
+
61
+ The following defaults **MUST** be used unless a documented exception exists with architectural approval:
62
+
63
+ | Integration Type | Connection | Read | Total | Rationale |
64
+ | ---------------------- | ---------- | ----- | ------ | --------------------------------------------- |
65
+ | REST/HTTP API | 2s | 5s | 10s | Most API calls complete within 1-2s at p99 |
66
+ | gRPC (unary) | 2s | 5s | 10s | Comparable to REST for request-response |
67
+ | gRPC (streaming) | 2s | 30s | 300s | Streams require longer read windows |
68
+ | Database query | 2s | 3s | 5s | Queries beyond 3s indicate missing indexes |
69
+ | Database transaction | 2s | 3s | 10s | Multi-statement transactions need headroom |
70
+ | Message publish | 2s | 5s | 10s | Broker acknowledgment should be fast |
71
+ | Message consume | 2s | 30s | 60s | Long-poll patterns require extended waits |
72
+ | Cache (Redis/Memcache) | 1s | 1s | 2s | Cache misses should fail fast |
73
+ | File/Object storage | 5s | 60s | 120s | Large transfers need proportional budgets |
74
+ | SMTP/Email | 5s | 30s | 60s | Mail servers vary widely in response time |
75
+ | DNS resolution | 2s | N/A | 2s | DNS should resolve from local cache |
76
+ | MCP tool invocation | 2s | 10s | 15s | AI tool calls may involve upstream LLM calls |
77
+ | Webhook delivery | 2s | 5s | 10s | Receiver should acknowledge quickly |
78
+
79
+ Exceptions **MUST** be documented in the service's integration manifest with the dependency name, overridden value, justification, architecture team approval, and a review date no longer than 6 months out.
80
+
81
+ ### R-4: Deadline Propagation
82
+
83
+ Services that receive inbound requests with a deadline **MUST** propagate a reduced deadline to downstream calls:
84
+
85
+ ```
86
+ downstream_deadline = incoming_deadline - elapsed_time - safety_margin
87
+ ```
88
+
89
+ A safety margin of 100-500ms is **RECOMMENDED**. Protocol-specific mechanisms:
90
+
91
+ | Protocol | Mechanism | Notes |
92
+ | -------- | ------------------------------------------------- | ------------------------------------------- |
93
+ | gRPC | `grpc-timeout` header (automatic) | Framework propagates via context |
94
+ | HTTP | `X-Request-Deadline` header (epoch millis) | Application layer must read and propagate |
95
+ | Kafka | Message header `x-deadline` or record timestamp + TTL | Consumer checks before processing |
96
+ | GraphQL | `extensions.deadline` field | Resolver checks remaining budget per field |
97
+
98
+ For gRPC, services **MUST NOT** create new contexts that discard the incoming deadline. For HTTP, services **SHOULD** include and honor the `X-Request-Deadline` header.
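For the HTTP case, a sketch of reading and re-emitting `X-Request-Deadline` (epoch milliseconds) could look like this. The helper names, the 200ms margin, the 10s fallback, and the assumption that header keys arrive already normalized are illustrative choices, not requirements of the standard.

```python
DEADLINE_HEADER = "X-Request-Deadline"


def incoming_deadline_ms(headers, now_ms, default_timeout_ms=10_000):
    """Read the caller's absolute deadline, or derive one from a default timeout."""
    raw = headers.get(DEADLINE_HEADER)
    if raw is None:
        return now_ms + default_timeout_ms
    return int(raw)


def downstream_deadline_ms(deadline_ms, safety_margin_ms=200):
    """Reduce the absolute deadline by the safety margin before propagating it."""
    return deadline_ms - safety_margin_ms


def downstream_timeout_seconds(deadline_ms, now_ms, safety_margin_ms=200):
    """Remaining time to give the downstream call, in seconds (may be <= 0)."""
    return (downstream_deadline_ms(deadline_ms, safety_margin_ms) - now_ms) / 1000.0


def outbound_headers(deadline_ms, safety_margin_ms=200):
    """Headers carrying the reduced deadline to the next hop."""
    return {DEADLINE_HEADER: str(downstream_deadline_ms(deadline_ms, safety_margin_ms))}
```

Because the deadline is an absolute timestamp, elapsed time is accounted for automatically when the remaining timeout is computed against the current clock, for instance `int(time.time() * 1000)`.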
99
+
100
+ ### R-5: Deadline Budget Enforcement
101
+
102
+ A service **MUST NOT** initiate a downstream call if the remaining deadline budget is less than the minimum time required to complete it. Instead, the service **MUST** return immediately with a timeout error, log the budget exhaustion event, and increment the `timeout.budget_exhausted_total` metric.
103
+
104
+ ```
+ function call_downstream(incoming_deadline, safety_margin, downstream_min):
+     remaining = incoming_deadline - now() - safety_margin
+     if remaining < downstream_min:
+         log_warn("Deadline budget exhausted",
+                  remaining_ms=remaining, required_ms=downstream_min)
+         emit_metric("timeout.budget_exhausted_total")
+         return TIMEOUT_ERROR
+
+     downstream_deadline = now() + remaining
+     return call(downstream, deadline=downstream_deadline)
+ ```
116
+
117
+ ### R-6: Protocol-Specific Timeout Rules
118
+
119
+ Services **MUST** apply the following protocol-specific rules:
120
+
121
+ | Protocol | Rule | Severity |
122
+ | --------------- | -------------------------------------------------------------------------------------- | -------- |
123
+ | HTTP/REST | Return `408` when server terminates a slow client; `504` when gateway times out upstream | ERROR |
124
+ | HTTP/REST | TLS handshake timeout **MUST** be included in total timeout budget | ERROR |
125
+ | gRPC | Every RPC **MUST** have a deadline set; calls without deadlines are unbounded | ERROR |
126
+ | gRPC | Servers **MUST** check context cancellation and abort work on expired deadlines | ERROR |
127
+ | Kafka | `request.timeout.ms` on producer and `max.poll.interval.ms` on consumer **MUST** be set | ERROR |
128
+ | RabbitMQ | Consumer ack timeout **MUST** be configured; message TTL **SHOULD** be set | ERROR |
129
+ | SQS | `VisibilityTimeout` **MUST** be at least 6x expected processing time | ERROR |
130
+ | Database | `statement_timeout` and `idle_in_transaction_session_timeout` **MUST** be configured | ERROR |
131
+ | File/Object | Per-part upload timeout and stalled transfer detection (30s recommended) **MUST** be set | ERROR |
132
+
133
+ ### R-7: Client-Side and Server-Side Timeouts
134
+
135
+ Both client and server **MUST** configure timeouts independently. Client-side timeouts protect against slow servers; server-side timeouts protect against slow or malicious clients.
136
+
137
+ The client-side timeout **MUST** be at least the server-side timeout for the same operation. If the client times out first, the server wastes resources processing a request whose result will be discarded.
138
+
139
+ ### R-8: Timeout and Circuit Breaker Interaction
140
+
141
+ Timeouts and circuit breakers (INTG-STD-033) **MUST** work together:
142
+
143
+ 1. Each timeout event **MUST** increment the circuit breaker's failure counter. When the threshold is reached, the circuit **MUST** open.
144
+ 2. When the circuit is open, requests **MUST** fail immediately without waiting for a timeout.
145
+ 3. Half-open probe requests **SHOULD** use 50% of the normal timeout for faster degradation detection.
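The three interaction rules can be sketched as a small state machine. The failure threshold and the reset interval below are placeholder assumptions (INTG-STD-033 would define the real values), and the class and method names are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class TimeoutAwareBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold  # assumed value
        self._reset_after = reset_after      # assumed value, in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_after:
            return "half_open"
        return "open"

    def timeout_for(self, normal_timeout):
        """Return the timeout to use, or fail fast when the circuit is open."""
        current = self.state()
        if current == "open":
            raise CircuitOpenError("circuit open; failing without waiting")
        # Half-open probes use 50% of the normal timeout.
        return normal_timeout * 0.5 if current == "half_open" else normal_timeout

    def record_timeout(self):
        current = self.state()
        if current == "open":
            return  # late timeouts from in-flight calls do not extend the open period
        if current == "half_open":
            self._opened_at = self._clock()  # probe failed: re-open
            return
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()

    def record_success(self):
        self._failures = 0
        self._opened_at = None
```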
146
+
147
+ ### R-9: Security - Timeouts as Defense
148
+
149
+ Timeouts **MUST** defend against resource exhaustion attacks:
150
+
151
+ - **Slowloris prevention:** Server read-header timeout **MUST** be 5s or less.
152
+ - **Slow POST prevention:** Server **MUST** enforce a minimum data rate; terminate connections below 500 bytes/second for more than 10 seconds.
153
+ - **Connection pool exhaustion:** Idle connections beyond 120s **SHOULD** be closed.
154
+ - **Query of death:** Database statement timeouts **MUST** prevent single queries from monopolizing resources.
155
+
156
+ Services **MUST NOT** extend timeouts under load. Longer timeouts during overload consume more resources and accelerate cascading failures. The correct response is to shed load via circuit breakers or rate limiting.
157
+
158
+ ### R-10: Timeout Metrics
159
+
160
+ Services **MUST** emit the following metrics for every timed external call:
161
+
162
+ | Metric | Type | Labels |
163
+ | ------------------------------------- | --------- | ----------------------------------------- |
164
+ | `external_call.duration_ms` | Histogram | `dependency`, `operation`, `result` |
165
+ | `external_call.timeout_total` | Counter | `dependency`, `operation`, `timeout_type` |
166
+ | `external_call.deadline_remaining_ms` | Histogram | `dependency`, `operation` |
167
+ | `timeout.budget_exhausted_total` | Counter | `dependency`, `operation` |
168
+
169
+ Where `timeout_type` is one of: `connection`, `read`, `write`, `total`, `deadline_exceeded`. `result` is one of: `success`, `timeout`, `error`.
170
+
171
+ ### R-11: Timeout Logging
172
+
173
+ Every timeout event **MUST** be logged with at minimum: `dependency`, `operation`, `timeout_type`, `configured_timeout_ms`, and `elapsed_ms`. Additional **RECOMMENDED** fields: `trace_id`, `span_id`, `deadline_remaining_ms`, `retry_attempt`, `circuit_breaker_state`.
174
+
175
+ ### R-12: Timeout Alerting
176
+
177
+ Services **MUST** configure alerts for:
178
+
179
+ | Condition | Severity | Action |
180
+ | ------------------------------------------ | -------- | ---------------------------------------------- |
181
+ | Timeout rate above 5% for a dependency | WARNING | Investigate dependency health |
182
+ | Timeout rate above 20% for a dependency | CRITICAL | Trigger incident; circuit breaker should open |
183
+ | Budget exhaustion rate above 1% | WARNING | Review timeout budgets and call chain |
184
+ | p99 latency above 80% of configured timeout | WARNING | Timeout too tight or dependency is degrading |
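The rate thresholds in the table could be evaluated as in this sketch. The function names and the choice to return `None` when no alert applies are illustrative; how rates are windowed is left to the monitoring system.

```python
def timeout_rate_severity(timeouts, calls):
    """Map a dependency's timeout rate to the alert severity in the table."""
    if calls <= 0:
        return None  # no traffic, nothing to evaluate
    rate = timeouts / calls
    if rate > 0.20:
        return "CRITICAL"
    if rate > 0.05:
        return "WARNING"
    return None


def latency_near_timeout(p99_ms, configured_timeout_ms):
    """True when p99 latency exceeds 80% of the configured timeout."""
    return p99_ms > 0.8 * configured_timeout_ms
```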
185
+
186
+ ## Enforcement Rules
187
+
188
+ The following **MUST** be enforced via static analysis, configuration scanning, or integration tests:
189
+
190
+ | Rule | Check | Severity |
191
+ | ------- | ----------------------------------------------------------------- | -------- |
192
+ | TMO-001 | Every HTTP client has explicit connection timeout | ERROR |
193
+ | TMO-002 | Every HTTP client has explicit read timeout | ERROR |
194
+ | TMO-003 | Connection timeout is at most 5s | ERROR |
195
+ | TMO-004 | Read timeout is at most 30s (exceptions require approval) | WARNING |
196
+ | TMO-005 | Total timeout is at most 120s (exceptions require approval) | WARNING |
197
+ | TMO-006 | Database `statement_timeout` is configured | ERROR |
198
+ | TMO-007 | Kafka `max.poll.interval.ms` is at most 300s | WARNING |
199
+ | TMO-008 | No infinite or zero timeout values in config | ERROR |
200
+ | TMO-009 | Server read-header timeout is at most 5s | ERROR |
201
+ | TMO-010 | gRPC calls have deadline set | ERROR |
202
+
203
+ Additional enforcement:
204
+
205
+ - **Gateway:** API gateways **MUST** enforce a maximum total timeout; requests exceeding it receive `504`.
206
+ - **Code review:** PRs introducing new external calls **MUST** include timeout configuration.
207
+ - **Runtime:** Services **SHOULD** log a warning when any call exceeds 80% of its configured timeout.
208
+
209
+ ## Examples
210
+
211
+ ### Deadline Propagation
212
+
213
+ The following pseudocode illustrates how a service receives an upstream deadline and propagates a reduced deadline to each downstream call:
214
+
215
+ ```
+ function handle_request(request):
+     deadline = parse_deadline_header(request)
+     if deadline is null:
+         deadline = now() + DEFAULT_TIMEOUT
+
+     # First downstream call
+     result_a = fetch_from_service_a(request, deadline)
+
+     # Recalculate remaining budget before next call
+     remaining = deadline - now() - SAFETY_MARGIN
+     if remaining < MIN_DOWNSTREAM_TIMEOUT:
+         return error(408, "Deadline budget exhausted")
+
+     result_b = fetch_from_service_b(request, deadline)
+     return combine(result_a, result_b)
+
+ function fetch_from_service_a(request, upstream_deadline):
+     remaining = upstream_deadline - now() - SAFETY_MARGIN
+     if remaining < MIN_DOWNSTREAM_TIMEOUT:
+         raise TimeoutError("Budget exhausted before calling Service A")
+
+     # Propagate the reduced deadline, not the original one (R-4)
+     return http_call(
+         url = SERVICE_A_URL,
+         timeout = remaining,
+         headers = {"X-Request-Deadline": upstream_deadline - SAFETY_MARGIN}
+     )
+ ```
243
+
244
+ ## Rationale
245
+
246
+ **Why separate connection and read timeouts?** They measure different failure modes. Connection timeout detects network unreachability (host down); read timeout measures server processing speed. Conflating them either slows failure detection or sets unrealistic response expectations.
247
+
248
+ **Why mandate deadline propagation?** Without it, a 5-service chain with 10s timeouts per hop can block the originator for 50s - long after the client has disconnected - while downstream services continue wasted work.
249
+
250
+ **Why aggressive defaults?** Short timeouts force architectural discipline. A 3s database timeout surfaces missing indexes during development, not production incidents. Services needing longer timeouts can request documented exceptions.
251
+
252
+ **Why not just circuit breakers?** Timeouts bound a single call's duration; circuit breakers prevent repeated calls to failing dependencies. Without timeouts, circuit breakers have no signal for slow failures. Both are required; neither is sufficient alone.
253
+
254
+ **Why never extend timeouts under load?** Longer timeouts during overload mean more in-flight requests, more consumed resources, and faster cascading failure. The correct response is load shedding, not longer waits.
255
+
256
+ ## References
257
+
258
+ - [**RFC 9110**](https://httpwg.org/specs/rfc9110.html) - HTTP Semantics (408 Request Timeout, 504 Gateway Timeout)
259
+ - [**Zalando Engineering - Timeouts**](https://engineering.zalando.com/posts/2023/07/all-you-need-to-know-about-timeouts.html) - Connection timeout formula, latency percentile baselines
260
+ - [**AWS Builders' Library - Timeouts, Retries, and Backoff**](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/) - False-timeout rate, retry multiplication risks
261
+ - [**gRPC Deadlines**](https://grpc.io/docs/guides/deadlines/) - Automatic deadline propagation, DEADLINE_EXCEEDED
262
+ - **INTG-STD-033** - Resilience Standard (circuit breakers, bulkheads, fallbacks)
263
+ - **INTG-STD-034** - Retry Standard (retry policies, backoff, idempotency)
264
+
265
+ ## Version History
266
+
267
+ | Version | Date | Change |
268
+ | ------- | ---------- | ------------------ |
269
+ | 1.0.0 | 2026-03-28 | Initial definition |
@@ -0,0 +1 @@
1
+ {"label": "Versioning", "position": 7}