dojo.md 0.1.0 → 0.2.0

Files changed (243)
  1. package/courses/GENERATION_LOG.md +27 -0
  2. package/courses/api-error-handling/course.yaml +16 -0
  3. package/courses/api-error-handling/scenarios/level-1/error-response-format.yaml +131 -0
  4. package/courses/api-error-handling/scenarios/level-1/http-status-codes-basics.yaml +90 -0
  5. package/courses/api-error-handling/scenarios/level-1/rate-limiting-basics.yaml +135 -0
  6. package/courses/api-error-handling/scenarios/level-1/request-validation-errors.yaml +208 -0
  7. package/courses/api-error-handling/scenarios/level-2/circuit-breaker-pattern.yaml +189 -0
  8. package/courses/api-error-handling/scenarios/level-2/idempotency-retry-logic.yaml +159 -0
  9. package/courses/api-error-handling/scenarios/level-2/rfc-7807-problem-details.yaml +178 -0
  10. package/courses/api-error-handling/scenarios/level-2/webhook-error-handling.yaml +211 -0
  11. package/courses/api-error-handling/scenarios/level-3/distributed-tracing-errors.yaml +275 -0
  12. package/courses/github-actions-cicd/course.yaml +10 -0
  13. package/courses/github-actions-cicd/scenarios/level-1/actions-and-runners.yaml +58 -0
  14. package/courses/github-actions-cicd/scenarios/level-1/basic-workflow-syntax.yaml +52 -0
  15. package/courses/github-actions-cicd/scenarios/level-1/branch-protection-checks.yaml +63 -0
  16. package/courses/github-actions-cicd/scenarios/level-1/environment-variables-secrets.yaml +65 -0
  17. package/courses/github-actions-cicd/scenarios/level-1/first-cicd-shift.yaml +62 -0
  18. package/courses/github-actions-cicd/scenarios/level-1/job-dependencies-outputs.yaml +62 -0
  19. package/courses/github-actions-cicd/scenarios/level-1/simple-ci-pipeline.yaml +57 -0
  20. package/courses/github-actions-cicd/scenarios/level-1/workflow-debugging.yaml +90 -0
  21. package/courses/github-actions-cicd/scenarios/level-1/workflow-status-notifications.yaml +59 -0
  22. package/courses/github-actions-cicd/scenarios/level-1/workflow-triggers.yaml +56 -0
  23. package/courses/github-actions-cicd/scenarios/level-2/concurrency-control.yaml +58 -0
  24. package/courses/github-actions-cicd/scenarios/level-2/conditional-execution.yaml +60 -0
  25. package/courses/github-actions-cicd/scenarios/level-2/custom-actions-development.yaml +55 -0
  26. package/courses/github-actions-cicd/scenarios/level-2/dependency-caching.yaml +58 -0
  27. package/courses/github-actions-cicd/scenarios/level-2/deployment-workflows.yaml +61 -0
  28. package/courses/github-actions-cicd/scenarios/level-2/github-packages-publishing.yaml +59 -0
  29. package/courses/github-actions-cicd/scenarios/level-2/intermediate-cicd-shift.yaml +68 -0
  30. package/courses/github-actions-cicd/scenarios/level-2/matrix-builds.yaml +59 -0
  31. package/courses/github-actions-cicd/scenarios/level-2/reusable-workflows.yaml +61 -0
  32. package/courses/github-actions-cicd/scenarios/level-2/workflow-cost-optimization.yaml +61 -0
  33. package/courses/github-actions-cicd/scenarios/level-3/advanced-cicd-shift.yaml +64 -0
  34. package/courses/github-actions-cicd/scenarios/level-3/compliance-automation.yaml +68 -0
  35. package/courses/github-actions-cicd/scenarios/level-3/docker-action-development.yaml +65 -0
  36. package/courses/github-actions-cicd/scenarios/level-3/github-environments.yaml +65 -0
  37. package/courses/github-actions-cicd/scenarios/level-3/monorepo-ci.yaml +68 -0
  38. package/courses/github-actions-cicd/scenarios/level-3/oidc-cloud-deployments.yaml +55 -0
  39. package/courses/github-actions-cicd/scenarios/level-3/release-automation.yaml +61 -0
  40. package/courses/github-actions-cicd/scenarios/level-3/security-hardening.yaml +63 -0
  41. package/courses/github-actions-cicd/scenarios/level-3/self-hosted-runners.yaml +60 -0
  42. package/courses/github-actions-cicd/scenarios/level-3/workflow-optimization.yaml +59 -0
  43. package/courses/github-actions-cicd/scenarios/level-4/cicd-data-architecture.yaml +63 -0
  44. package/courses/github-actions-cicd/scenarios/level-4/cicd-economics-roi.yaml +63 -0
  45. package/courses/github-actions-cicd/scenarios/level-4/cicd-executive-communication.yaml +58 -0
  46. package/courses/github-actions-cicd/scenarios/level-4/cicd-incident-response.yaml +60 -0
  47. package/courses/github-actions-cicd/scenarios/level-4/cicd-org-design.yaml +59 -0
  48. package/courses/github-actions-cicd/scenarios/level-4/cicd-platform-architecture.yaml +63 -0
  49. package/courses/github-actions-cicd/scenarios/level-4/cicd-training-program.yaml +65 -0
  50. package/courses/github-actions-cicd/scenarios/level-4/cicd-vendor-evaluation.yaml +59 -0
  51. package/courses/github-actions-cicd/scenarios/level-4/enterprise-cicd-governance.yaml +55 -0
  52. package/courses/github-actions-cicd/scenarios/level-4/expert-cicd-shift.yaml +60 -0
  53. package/courses/github-actions-cicd/scenarios/level-5/cicd-ai-future.yaml +63 -0
  54. package/courses/github-actions-cicd/scenarios/level-5/cicd-behavioral-science.yaml +70 -0
  55. package/courses/github-actions-cicd/scenarios/level-5/cicd-board-strategy.yaml +56 -0
  56. package/courses/github-actions-cicd/scenarios/level-5/cicd-consulting-engagement.yaml +61 -0
  57. package/courses/github-actions-cicd/scenarios/level-5/cicd-industry-benchmarks.yaml +63 -0
  58. package/courses/github-actions-cicd/scenarios/level-5/cicd-ma-integration.yaml +73 -0
  59. package/courses/github-actions-cicd/scenarios/level-5/cicd-product-development.yaml +68 -0
  60. package/courses/github-actions-cicd/scenarios/level-5/cicd-regulatory-landscape.yaml +72 -0
  61. package/courses/github-actions-cicd/scenarios/level-5/comprehensive-cicd-system.yaml +66 -0
  62. package/courses/github-actions-cicd/scenarios/level-5/master-cicd-shift.yaml +76 -0
  63. package/courses/github-pr-review/scenarios/level-2/api-change-review.yaml +82 -0
  64. package/courses/github-pr-review/scenarios/level-2/automated-review-tooling.yaml +53 -0
  65. package/courses/github-pr-review/scenarios/level-2/cross-team-review.yaml +61 -0
  66. package/courses/github-pr-review/scenarios/level-2/intermediate-review-shift.yaml +66 -0
  67. package/courses/github-pr-review/scenarios/level-2/performance-review-patterns.yaml +99 -0
  68. package/courses/github-pr-review/scenarios/level-2/review-disagreement-resolution.yaml +64 -0
  69. package/courses/github-pr-review/scenarios/level-2/review-metrics-analysis.yaml +63 -0
  70. package/courses/github-pr-review/scenarios/level-2/review-turnaround-sla.yaml +54 -0
  71. package/courses/github-pr-review/scenarios/level-2/stacked-pr-review.yaml +65 -0
  72. package/courses/github-pr-review/scenarios/level-3/advanced-review-shift.yaml +65 -0
  73. package/courses/github-pr-review/scenarios/level-3/ai-powered-review.yaml +58 -0
  74. package/courses/github-pr-review/scenarios/level-3/compliance-review-process.yaml +64 -0
  75. package/courses/github-pr-review/scenarios/level-3/cross-functional-review.yaml +60 -0
  76. package/courses/github-pr-review/scenarios/level-3/incident-driven-review.yaml +63 -0
  77. package/courses/github-pr-review/scenarios/level-3/large-scale-review-operations.yaml +55 -0
  78. package/courses/github-pr-review/scenarios/level-3/monorepo-review-process.yaml +68 -0
  79. package/courses/github-pr-review/scenarios/level-3/review-automation-platform.yaml +61 -0
  80. package/courses/github-pr-review/scenarios/level-3/review-culture-design.yaml +62 -0
  81. package/courses/github-pr-review/scenarios/level-3/review-data-pipeline.yaml +62 -0
  82. package/courses/github-pr-review/scenarios/level-4/enterprise-review-operations.yaml +61 -0
  83. package/courses/github-pr-review/scenarios/level-4/expert-review-shift.yaml +62 -0
  84. package/courses/github-pr-review/scenarios/level-4/review-data-architecture.yaml +69 -0
  85. package/courses/github-pr-review/scenarios/level-4/review-economics-roi.yaml +63 -0
  86. package/courses/github-pr-review/scenarios/level-4/review-executive-communication.yaml +61 -0
  87. package/courses/github-pr-review/scenarios/level-4/review-incident-postmortem.yaml +69 -0
  88. package/courses/github-pr-review/scenarios/level-4/review-org-design.yaml +62 -0
  89. package/courses/github-pr-review/scenarios/level-4/review-platform-architecture.yaml +64 -0
  90. package/courses/github-pr-review/scenarios/level-4/review-training-program.yaml +66 -0
  91. package/courses/github-pr-review/scenarios/level-4/review-vendor-evaluation.yaml +76 -0
  92. package/courses/github-pr-review/scenarios/level-5/comprehensive-review-system.yaml +68 -0
  93. package/courses/github-pr-review/scenarios/level-5/master-review-shift.yaml +73 -0
  94. package/courses/github-pr-review/scenarios/level-5/review-ai-future.yaml +69 -0
  95. package/courses/github-pr-review/scenarios/level-5/review-behavioral-science.yaml +66 -0
  96. package/courses/github-pr-review/scenarios/level-5/review-board-strategy.yaml +62 -0
  97. package/courses/github-pr-review/scenarios/level-5/review-consulting-engagement.yaml +62 -0
  98. package/courses/github-pr-review/scenarios/level-5/review-devtools-product.yaml +71 -0
  99. package/courses/github-pr-review/scenarios/level-5/review-industry-benchmarks.yaml +64 -0
  100. package/courses/github-pr-review/scenarios/level-5/review-ma-integration.yaml +76 -0
  101. package/courses/github-pr-review/scenarios/level-5/review-regulatory-landscape.yaml +78 -0
  102. package/courses/postgresql-query-optimization/course.yaml +11 -0
  103. package/courses/postgresql-query-optimization/scenarios/level-1/explain-analyze-basics.yaml +80 -0
  104. package/courses/postgresql-query-optimization/scenarios/level-1/first-optimization-shift.yaml +77 -0
  105. package/courses/postgresql-query-optimization/scenarios/level-1/index-fundamentals.yaml +76 -0
  106. package/courses/postgresql-query-optimization/scenarios/level-1/join-basics.yaml +73 -0
  107. package/courses/postgresql-query-optimization/scenarios/level-1/n-plus-one-queries.yaml +62 -0
  108. package/courses/postgresql-query-optimization/scenarios/level-1/query-rewriting-basics.yaml +69 -0
  109. package/courses/postgresql-query-optimization/scenarios/level-1/select-star-problems.yaml +69 -0
  110. package/courses/postgresql-query-optimization/scenarios/level-1/slow-query-diagnosis.yaml +63 -0
  111. package/courses/postgresql-query-optimization/scenarios/level-1/vacuum-and-statistics.yaml +62 -0
  112. package/courses/postgresql-query-optimization/scenarios/level-1/where-clause-optimization.yaml +74 -0
  113. package/courses/postgresql-query-optimization/scenarios/level-2/autovacuum-tuning.yaml +76 -0
  114. package/courses/postgresql-query-optimization/scenarios/level-2/composite-index-design.yaml +81 -0
  115. package/courses/postgresql-query-optimization/scenarios/level-2/covering-indexes.yaml +74 -0
  116. package/courses/postgresql-query-optimization/scenarios/level-2/cte-optimization.yaml +83 -0
  117. package/courses/postgresql-query-optimization/scenarios/level-2/intermediate-optimization-shift.yaml +66 -0
  118. package/courses/postgresql-query-optimization/scenarios/level-2/join-optimization.yaml +72 -0
  119. package/courses/postgresql-query-optimization/scenarios/level-2/partial-and-expression-indexes.yaml +75 -0
  120. package/courses/postgresql-query-optimization/scenarios/level-2/query-planner-settings.yaml +62 -0
  121. package/courses/postgresql-query-optimization/scenarios/level-2/subquery-optimization.yaml +67 -0
  122. package/courses/postgresql-query-optimization/scenarios/level-2/window-function-optimization.yaml +63 -0
  123. package/courses/postgresql-query-optimization/scenarios/level-3/advanced-optimization-shift.yaml +71 -0
  124. package/courses/postgresql-query-optimization/scenarios/level-3/connection-pooling.yaml +60 -0
  125. package/courses/postgresql-query-optimization/scenarios/level-3/full-text-search-optimization.yaml +66 -0
  126. package/courses/postgresql-query-optimization/scenarios/level-3/jsonb-optimization.yaml +88 -0
  127. package/courses/postgresql-query-optimization/scenarios/level-3/lock-contention-analysis.yaml +80 -0
  128. package/courses/postgresql-query-optimization/scenarios/level-3/materialized-view-optimization.yaml +73 -0
  129. package/courses/postgresql-query-optimization/scenarios/level-3/parallel-query-execution.yaml +74 -0
  130. package/courses/postgresql-query-optimization/scenarios/level-3/partitioning-strategies.yaml +71 -0
  131. package/courses/postgresql-query-optimization/scenarios/level-3/specialized-index-types.yaml +67 -0
  132. package/courses/postgresql-query-optimization/scenarios/level-3/write-optimization.yaml +65 -0
  133. package/courses/postgresql-query-optimization/scenarios/level-4/data-architecture-analytics.yaml +64 -0
  134. package/courses/postgresql-query-optimization/scenarios/level-4/database-executive-communication.yaml +64 -0
  135. package/courses/postgresql-query-optimization/scenarios/level-4/database-migration-planning.yaml +57 -0
  136. package/courses/postgresql-query-optimization/scenarios/level-4/enterprise-database-governance.yaml +52 -0
  137. package/courses/postgresql-query-optimization/scenarios/level-4/expert-optimization-shift.yaml +73 -0
  138. package/courses/postgresql-query-optimization/scenarios/level-4/high-availability-architecture.yaml +62 -0
  139. package/courses/postgresql-query-optimization/scenarios/level-4/optimizer-internals.yaml +69 -0
  140. package/courses/postgresql-query-optimization/scenarios/level-4/performance-sla-design.yaml +58 -0
  141. package/courses/postgresql-query-optimization/scenarios/level-4/read-replica-optimization.yaml +62 -0
  142. package/courses/postgresql-query-optimization/scenarios/level-4/vendor-evaluation.yaml +73 -0
  143. package/courses/rest-api-error-handling/course.yaml +11 -0
  144. package/courses/rest-api-error-handling/scenarios/level-1/authentication-errors.yaml +71 -0
  145. package/courses/rest-api-error-handling/scenarios/level-1/content-negotiation-errors.yaml +63 -0
  146. package/courses/rest-api-error-handling/scenarios/level-1/error-logging-basics.yaml +63 -0
  147. package/courses/rest-api-error-handling/scenarios/level-1/error-response-format.yaml +58 -0
  148. package/courses/rest-api-error-handling/scenarios/level-1/first-error-handling-shift.yaml +67 -0
  149. package/courses/rest-api-error-handling/scenarios/level-1/http-status-codes.yaml +46 -0
  150. package/courses/rest-api-error-handling/scenarios/level-1/not-found-errors.yaml +52 -0
  151. package/courses/rest-api-error-handling/scenarios/level-1/rate-limiting-errors.yaml +56 -0
  152. package/courses/rest-api-error-handling/scenarios/level-1/request-validation-errors.yaml +59 -0
  153. package/courses/rest-api-error-handling/scenarios/level-1/server-error-handling.yaml +55 -0
  154. package/courses/rest-api-error-handling/scenarios/level-2/api-versioning-errors.yaml +66 -0
  155. package/courses/rest-api-error-handling/scenarios/level-2/batch-request-errors.yaml +61 -0
  156. package/courses/rest-api-error-handling/scenarios/level-2/circuit-breaker-pattern.yaml +52 -0
  157. package/courses/rest-api-error-handling/scenarios/level-2/error-code-taxonomy.yaml +62 -0
  158. package/courses/rest-api-error-handling/scenarios/level-2/error-monitoring-alerting.yaml +53 -0
  159. package/courses/rest-api-error-handling/scenarios/level-2/intermediate-error-shift.yaml +69 -0
  160. package/courses/rest-api-error-handling/scenarios/level-2/pagination-errors.yaml +66 -0
  161. package/courses/rest-api-error-handling/scenarios/level-2/retry-and-idempotency.yaml +60 -0
  162. package/courses/rest-api-error-handling/scenarios/level-2/rfc7807-problem-details.yaml +60 -0
  163. package/courses/rest-api-error-handling/scenarios/level-2/webhook-error-handling.yaml +55 -0
  164. package/courses/rest-api-error-handling/scenarios/level-3/advanced-error-shift.yaml +72 -0
  165. package/courses/rest-api-error-handling/scenarios/level-3/api-gateway-errors.yaml +71 -0
  166. package/courses/rest-api-error-handling/scenarios/level-3/async-api-errors.yaml +67 -0
  167. package/courses/rest-api-error-handling/scenarios/level-3/caching-error-scenarios.yaml +65 -0
  168. package/courses/rest-api-error-handling/scenarios/level-3/chaos-engineering-apis.yaml +62 -0
  169. package/courses/rest-api-error-handling/scenarios/level-3/database-error-handling.yaml +79 -0
  170. package/courses/rest-api-error-handling/scenarios/level-3/distributed-error-propagation.yaml +63 -0
  171. package/courses/rest-api-error-handling/scenarios/level-3/error-budgets-sre.yaml +61 -0
  172. package/courses/rest-api-error-handling/scenarios/level-3/error-correlation.yaml +58 -0
  173. package/courses/rest-api-error-handling/scenarios/level-3/graphql-vs-rest-errors.yaml +73 -0
  174. package/courses/rest-api-error-handling/scenarios/level-4/compliance-error-handling.yaml +65 -0
  175. package/courses/rest-api-error-handling/scenarios/level-4/enterprise-error-governance.yaml +62 -0
  176. package/courses/rest-api-error-handling/scenarios/level-4/error-analytics-platform.yaml +65 -0
  177. package/courses/rest-api-error-handling/scenarios/level-4/error-cost-optimization.yaml +63 -0
  178. package/courses/rest-api-error-handling/scenarios/level-4/error-executive-communication.yaml +60 -0
  179. package/courses/rest-api-error-handling/scenarios/level-4/error-handling-architecture.yaml +67 -0
  180. package/courses/rest-api-error-handling/scenarios/level-4/error-org-design.yaml +68 -0
  181. package/courses/rest-api-error-handling/scenarios/level-4/error-sla-design.yaml +65 -0
  182. package/courses/rest-api-error-handling/scenarios/level-4/error-training-program.yaml +61 -0
  183. package/courses/rest-api-error-handling/scenarios/level-4/expert-error-shift.yaml +63 -0
  184. package/courses/rest-api-error-handling/scenarios/level-5/comprehensive-error-system.yaml +68 -0
  185. package/courses/rest-api-error-handling/scenarios/level-5/error-ai-future.yaml +75 -0
  186. package/courses/rest-api-error-handling/scenarios/level-5/error-behavioral-science.yaml +73 -0
  187. package/courses/rest-api-error-handling/scenarios/level-5/error-board-strategy.yaml +60 -0
  188. package/courses/rest-api-error-handling/scenarios/level-5/error-consulting-engagement.yaml +58 -0
  189. package/courses/rest-api-error-handling/scenarios/level-5/error-industry-benchmarks.yaml +72 -0
  190. package/courses/rest-api-error-handling/scenarios/level-5/error-ma-integration.yaml +68 -0
  191. package/courses/rest-api-error-handling/scenarios/level-5/error-product-development.yaml +66 -0
  192. package/courses/rest-api-error-handling/scenarios/level-5/error-regulatory-landscape.yaml +80 -0
  193. package/courses/rest-api-error-handling/scenarios/level-5/master-error-shift.yaml +73 -0
  194. package/dist/cli/commands/add.d.ts.map +1 -1
  195. package/dist/cli/commands/add.js +6 -5
  196. package/dist/cli/commands/add.js.map +1 -1
  197. package/dist/cli/commands/generate.d.ts.map +1 -1
  198. package/dist/cli/commands/generate.js +4 -0
  199. package/dist/cli/commands/generate.js.map +1 -1
  200. package/dist/cli/commands/list.d.ts.map +1 -1
  201. package/dist/cli/commands/list.js +6 -18
  202. package/dist/cli/commands/list.js.map +1 -1
  203. package/dist/cli/commands/train.d.ts.map +1 -1
  204. package/dist/cli/commands/train.js +18 -18
  205. package/dist/cli/commands/train.js.map +1 -1
  206. package/dist/cli/index.js +93 -55
  207. package/dist/cli/index.js.map +1 -1
  208. package/dist/cli/run-demo.js +2 -1
  209. package/dist/cli/run-demo.js.map +1 -1
  210. package/dist/cli/setup.d.ts +18 -0
  211. package/dist/cli/setup.d.ts.map +1 -0
  212. package/dist/cli/setup.js +154 -0
  213. package/dist/cli/setup.js.map +1 -0
  214. package/dist/engine/agent-bridge.d.ts +5 -2
  215. package/dist/engine/agent-bridge.d.ts.map +1 -1
  216. package/dist/engine/agent-bridge.js +36 -9
  217. package/dist/engine/agent-bridge.js.map +1 -1
  218. package/dist/engine/loader.d.ts +21 -0
  219. package/dist/engine/loader.d.ts.map +1 -1
  220. package/dist/engine/loader.js +54 -1
  221. package/dist/engine/loader.js.map +1 -1
  222. package/dist/engine/training-loop.d.ts.map +1 -1
  223. package/dist/engine/training-loop.js +1 -0
  224. package/dist/engine/training-loop.js.map +1 -1
  225. package/dist/engine/training.d.ts.map +1 -1
  226. package/dist/engine/training.js +1 -0
  227. package/dist/engine/training.js.map +1 -1
  228. package/dist/generator/skill-generator.d.ts +1 -1
  229. package/dist/generator/skill-generator.d.ts.map +1 -1
  230. package/dist/generator/skill-generator.js +21 -2
  231. package/dist/generator/skill-generator.js.map +1 -1
  232. package/dist/mcp/server.d.ts.map +1 -1
  233. package/dist/mcp/server.js +11 -26
  234. package/dist/mcp/server.js.map +1 -1
  235. package/dist/mcp/session-manager.d.ts +3 -1
  236. package/dist/mcp/session-manager.d.ts.map +1 -1
  237. package/dist/mcp/session-manager.js +44 -22
  238. package/dist/mcp/session-manager.js.map +1 -1
  239. package/dist/types/schemas.d.ts +38 -13
  240. package/dist/types/schemas.d.ts.map +1 -1
  241. package/dist/types/schemas.js +9 -5
  242. package/dist/types/schemas.js.map +1 -1
  243. package/package.json +1 -1
@@ -0,0 +1,60 @@
+ meta:
+   id: rfc7807-problem-details
+   level: 2
+   course: rest-api-error-handling
+   type: output
+   description: "Implement RFC 7807 Problem Details — adopt the standard error format for consistent, machine-readable API error responses"
+   tags: [REST, API, RFC-7807, problem-details, standard, intermediate]
+
+ state: {}
+
+ trigger: |
+   Your company has 12 microservices, each with its own error format.
+   The API gateway team is frustrated because they can't build unified
+   error handling. The CTO mandates adopting RFC 7807 (Problem Details
+   for HTTP APIs) as the company standard.
+
+   Current service error formats:
+   - User service: { "error": { "code": "USR001", "msg": "..." } }
+   - Order service: { "status": "error", "message": "..." }
+   - Payment service: { "errors": [{ "field": "...", "message": "..." }] }
+   - Inventory service: { "fault": { "type": "...", "detail": "..." } }
+   - Notification service: plain text error messages
+
+   You need to migrate all services to RFC 7807. The standard defines:
+   - type (URI): identifies the error type
+   - title: short human-readable summary
+   - status: HTTP status code
+   - detail: human-readable explanation specific to this occurrence
+   - instance (URI): identifies this specific occurrence
+
+   Challenges to address:
+   1. How to define the "type" URI — what namespace? Public or internal?
+   2. How to handle validation errors (multiple fields) within RFC 7807
+   3. How to extend the standard with custom fields (error codes,
+      timestamps, retry-after)
+   4. How the API gateway should aggregate errors from multiple services
+   5. How to version error types when error semantics change
+   6. Backward compatibility — how existing API consumers handle the
+      format change
+
+   Task: Design the RFC 7807 implementation for the company. Write:
+   the error type URI scheme, example Problem Details responses for
+   5 common error scenarios (validation, not found, auth, rate limit,
+   server error), the extension strategy for custom fields, the
+   migration plan for the 5 services, and the API gateway error
+   aggregation approach.
+
+ assertions:
+   - type: llm_judge
+     criteria: "RFC 7807 implementation is technically correct — all required fields (type, title, status, detail) are present, type URIs are well-designed (resolvable or documented namespace), instance URIs uniquely identify error occurrences, and the Content-Type is application/problem+json"
+     weight: 0.35
+     description: "Technically correct RFC 7807"
+   - type: llm_judge
+     criteria: "Extension strategy is practical — handles validation errors with an 'errors' array extension, adds useful custom fields (error_code, timestamp, request_id) without conflicting with standard fields, and the type URI scheme supports versioning and categorization"
+     weight: 0.35
+     description: "Practical extension strategy"
+   - type: llm_judge
+     criteria: "Migration plan is realistic — phases the 5 services by complexity, addresses backward compatibility (content negotiation, dual-format period), handles API gateway aggregation of errors from services at different migration stages, and includes client SDK updates"
+     weight: 0.30
+     description: "Realistic migration plan"
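For orientation, the kind of response body this scenario asks for can be sketched in a few lines of Python. This is only an illustration, not part of the package: the `https://errors.example.com/v1` namespace, the `make_problem` helper, the `urn:uuid:` instance convention, and the `errors` and `request_id` extension members are all assumptions chosen for the example. The scenario leaves those design choices to the trainee.

```python
import json
import uuid

# Hypothetical namespace for error type URIs; the scenario does not prescribe one.
ERROR_TYPE_BASE = "https://errors.example.com/v1"
PROBLEM_CONTENT_TYPE = "application/problem+json"


def make_problem(slug, title, status, detail, instance=None, extensions=None):
    """Build an RFC 7807 style problem body as a plain dict."""
    problem = {
        "type": f"{ERROR_TYPE_BASE}/{slug}",
        "title": title,
        "status": status,
        "detail": detail,
        # Assumed convention: one URN per occurrence when none is supplied.
        "instance": instance or f"urn:uuid:{uuid.uuid4()}",
    }
    for key, value in (extensions or {}).items():
        # Extension members must not shadow the standard members.
        if key in problem:
            raise ValueError(f"extension {key!r} collides with a standard member")
        problem[key] = value
    return problem


validation_problem = make_problem(
    "validation-error",
    "Request validation failed",
    422,
    "2 fields failed validation.",
    instance="urn:uuid:00000000-0000-0000-0000-000000000001",
    extensions={
        "errors": [
            {"field": "email", "message": "must be a valid address"},
            {"field": "quantity", "message": "must be greater than 0"},
        ],
        "request_id": "req-123",
    },
)

headers = {"Content-Type": PROBLEM_CONTENT_TYPE}
print(headers)
print(json.dumps(validation_problem, indent=2))
```

Keeping per-field validation failures in a single `errors` extension array leaves the five standard members untouched, which is the property the second assertion checks for.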
@@ -0,0 +1,55 @@
+ meta:
+   id: webhook-error-handling
+   level: 2
+   course: rest-api-error-handling
+   type: output
+   description: "Handle webhook delivery errors — design retry, dead letter queues, and error reporting for outbound webhook systems"
+   tags: [REST, API, webhooks, retry, delivery, dead-letter, intermediate]
+
+ state: {}
+
+ trigger: |
+   Your SaaS platform sends webhooks to customer endpoints when events
+   occur (order created, payment received, subscription changed). The
+   webhook system is unreliable and customers are losing data.
+
+   Current problems:
+   1. If a customer's endpoint returns 500, you retry 3 times with no
+      delay and then give up permanently — events are lost
+   2. If a customer's endpoint is slow (>5s), you timeout and retry,
+      causing duplicate deliveries
+   3. Customer's endpoint returns 301 redirect — you don't follow it
+      and mark the delivery as failed
+   4. Customer's SSL certificate expired — you fail silently, no
+      notification to the customer
+   5. Customer's endpoint returns 200 but the response body says
+      { "error": "processing failed" } — you mark it as delivered
+   6. Events arrive out of order (order.paid before order.created)
+      because retries for the earlier event haven't completed
+   7. A customer's endpoint is down for 3 days — you've queued 50,000
+      events and when they come back online, you flood them
+
+   Customer complaints:
+   - "We missed 200 payment events last month and our accounting is off"
+   - "We get duplicate events and our system processes them twice"
+   - "We had no idea our endpoint was failing until we lost data"
+
+   Task: Redesign the webhook error handling system. Write: the retry
+   policy (backoff schedule, max attempts, retry-after logic), the
+   dead letter queue design, the customer notification system for
+   delivery failures, the duplicate prevention mechanism, and the
+   endpoint health monitoring and automatic disable logic.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Retry policy is robust — uses exponential backoff with jitter (not fixed intervals), retries over hours/days (not seconds), has a maximum retry window (e.g., 72 hours), and handles different failure types appropriately (4xx vs 5xx vs timeout vs DNS failure)"
+     weight: 0.35
+     description: "Robust retry policy"
+   - type: llm_judge
+     criteria: "Dead letter queue and notification system prevents data loss — failed events after max retries go to a DLQ that customers can inspect and replay, customers are notified of delivery failures (email/dashboard), and the system tracks delivery status with full history per event"
+     weight: 0.35
+     description: "Data loss prevention"
+   - type: llm_judge
+     criteria: "Addresses all 7 problems — duplicate prevention (idempotency keys in webhook payloads), ordering guarantees or handling (sequence numbers), flood protection (rate limiting replays), SSL/redirect handling, response body validation, and automatic endpoint disabling with re-enable flow"
+     weight: 0.30
+     description: "All problems addressed"
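The retry policy named in the first assertion above (exponential backoff with jitter, bounded by a maximum retry window) can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's reference answer: the function names, the 60-second base, the 6-hour per-delay cap, and the rule that only timeouts, 408, 429, and 5xx responses are retried are all hypothetical choices. The sketch uses the "equal jitter" variant, where half of each delay is fixed and half is random, so every delay has a lower bound.

```python
import random


def backoff_schedule(base_seconds=60.0, factor=2.0, cap_seconds=6 * 3600.0,
                     max_window_seconds=72 * 3600.0, rng=None):
    """Return the delays (in seconds) between delivery attempts.

    Each delay is half of the exponential ceiling plus a random amount up to
    the other half. Delays stop once the cumulative wait would exceed the
    retry window.
    """
    rng = rng or random.Random()
    delays = []
    elapsed = 0.0
    attempt = 0
    while True:
        ceiling = min(cap_seconds, base_seconds * factor ** attempt)
        delay = ceiling / 2 + rng.uniform(0, ceiling / 2)
        if elapsed + delay > max_window_seconds:
            break
        delays.append(delay)
        elapsed += delay
        attempt += 1
    return delays


def should_retry(status_code):
    """Assumed policy: None stands for a timeout or connection failure."""
    if status_code is None:
        return True
    if status_code in (408, 429):
        return True
    # Other 4xx responses are treated as permanent; 5xx as transient.
    return 500 <= status_code <= 599


schedule = backoff_schedule(rng=random.Random(0))
print(f"{len(schedule)} retries over {sum(schedule) / 3600:.1f} hours")
```

Once the schedule is exhausted, the event would move to the dead letter queue rather than being dropped, which is what the second assertion asks for.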
@@ -0,0 +1,72 @@
+ meta:
+   id: advanced-error-shift
+   level: 3
+   course: rest-api-error-handling
+   type: output
+   description: "Advanced error handling shift — manage a distributed system failure with data consistency implications"
+   tags: [REST, API, error-handling, shift-simulation, distributed, advanced]
+
+ state: {}
+
+ trigger: |
+   You're the principal engineer on-call for a financial services
+   platform. At 11:45 PM on a Friday (month-end processing), you
+   receive a cascade of alerts.
+
+   The system:
+   - 30 microservices processing $50M in daily transactions
+   - PostgreSQL primary with 4 read replicas
+   - Redis cluster (6 nodes) for caching and distributed locking
+   - RabbitMQ for async event processing
+   - API gateway serving 500 requests/second
+
+   11:45 PM — Alert: Redis cluster node 3 failed
+   - Redis cluster reshards, 15-second disruption
+   - Distributed locks are lost during reshard
+   - Two instances of the payment processor run simultaneously
+     (lock was protecting against this)
+
+   11:46 PM — Alert: Duplicate transactions detected
+   - 23 payments processed twice (total $47,000 in duplicates)
+   - Some customers charged twice, some merchants paid twice
+   - The idempotency keys were stored in the Redis node that failed
+
+   11:48 PM — Alert: Read replica lag at 45 seconds
+   - Month-end batch job is hammering the primary
+   - API reads from replicas are returning stale data
+   - Account balance checks are approving payments they shouldn't
+     (stale balance data)
+
+   11:50 PM — Alert: RabbitMQ dead letter queue growing
+   - 5,000 events in DLQ — all "balance update" events
+   - These events normally update cached balances in Redis
+   - But Redis was resharding, so events failed delivery
+   - Now balances in Redis don't match balances in PostgreSQL
+
+   11:55 PM — Business impact:
+   - 23 duplicate transactions ($47K)
+   - Unknown number of over-approved payments (stale balances)
+   - Event backlog growing (15,000 now in DLQ)
+   - Month-end reconciliation report will be wrong
+   - Customer-facing error rate: 8% and climbing
+
+   Task: Navigate this crisis from the error handling perspective.
+   Write: the triage prioritization, the immediate containment
+   actions, the data consistency recovery plan (duplicate transactions,
+   stale balances, DLQ replay), the API error responses during
+   recovery (what clients see), and the post-incident architectural
+   changes to prevent this cascade.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Triage is prioritized correctly — stops the bleeding first (halt new payments to prevent more duplicates), then contains (identify all affected transactions), then recovers (resolve duplicates, replay DLQ, reconcile balances). Financial integrity takes precedence over availability"
+     weight: 0.35
+     description: "Correct triage prioritization"
+   - type: llm_judge
+     criteria: "Data consistency recovery is thorough — identifies all 23 duplicates for reversal, detects over-approved payments from stale balances, plans DLQ replay with idempotency checks, and includes a full reconciliation process to verify PostgreSQL and Redis are consistent before resuming"
+     weight: 0.35
+     description: "Thorough data consistency recovery"
+   - type: llm_judge
+     criteria: "Post-incident changes prevent cascade — moves idempotency keys to PostgreSQL (not Redis), implements read-your-writes consistency for balance checks, adds Redis cluster monitoring with automatic payment pause, and designs the DLQ replay to handle out-of-order processing safely"
+     weight: 0.30
+     description: "Cascade prevention changes"
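The third assertion above mentions moving idempotency keys out of Redis and into the primary database. A rough sketch of that idea follows, with loud caveats: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs anywhere, and the `processed_payments` table and `process_payment_once` function are hypothetical names invented for the example. The point it illustrates is that a uniqueness constraint in a durable, transactional store rejects the second attempt even when a cache node or lock is lost.

```python
import sqlite3


def open_store():
    """Create an in-memory database standing in for the durable primary."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE processed_payments ("
        " idempotency_key TEXT PRIMARY KEY,"
        " amount_cents INTEGER NOT NULL)"
    )
    return conn


def process_payment_once(conn, idempotency_key, amount_cents):
    """Record a payment; return False if the key was already processed."""
    try:
        # The connection context manager commits on success and rolls back
        # on error, so the key and the payment row succeed or fail together.
        with conn:
            conn.execute(
                "INSERT INTO processed_payments (idempotency_key, amount_cents)"
                " VALUES (?, ?)",
                (idempotency_key, amount_cents),
            )
            # In a real system the ledger update would go here, inside the
            # same transaction as the key insert.
    except sqlite3.IntegrityError:
        return False
    return True


conn = open_store()
first_attempt = process_payment_once(conn, "pay-001", 2500)
second_attempt = process_payment_once(conn, "pay-001", 2500)
row_count = conn.execute("SELECT COUNT(*) FROM processed_payments").fetchone()[0]
print(first_attempt, second_attempt, row_count)
```

The same check also makes DLQ replay safer: an event whose key is already recorded becomes a no-op instead of a second charge.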
@@ -0,0 +1,71 @@
1
+ meta:
2
+ id: api-gateway-errors
3
+ level: 3
4
+ course: rest-api-error-handling
5
+ type: output
6
+ description: "Design API gateway error handling — manage errors at the gateway layer for routing, transformation, and aggregation"
7
+ tags: [REST, API, gateway, routing, transformation, aggregation, advanced]
8
+
9
+ state: {}
10
+
11
+ trigger: |
12
+ You're architecting the error handling for an API gateway that
13
+ fronts 40 microservices. The gateway handles authentication, rate
14
+ limiting, routing, request/response transformation, and response
15
+ aggregation.
16
+
17
+ Current problems:
18
+
19
+ 1. Gateway errors vs backend errors are indistinguishable:
20
+ When the gateway itself fails (routing error, transform error),
21
+ it returns the same format as backend errors. Clients can't tell
22
+ if their request never reached the backend or if the backend
23
+ failed.
24
+
25
+ 2. Error format translation:
26
+ - Service A returns RFC 7807
27
+ - Service B returns { "error": "...", "code": 123 }
28
+ - Service C returns XML errors
29
+ - Service D returns plain text
30
+ Gateway currently passes through whatever the backend returns.
31
+
32
+ 3. Aggregated endpoint errors:
33
+ GET /dashboard calls 5 services. If 2 fail:
34
+ - Currently returns 500 (entire dashboard fails)
35
+ - Wanted: partial response with errors for failed sections
36
+
37
+ 4. Authentication at gateway vs service level:
38
+ Gateway validates JWT, service validates permissions. If the
39
+ service returns 403, should the gateway override, pass through,
40
+ or add context?
41
+
42
+ 5. Timeout ambiguity:
43
+ Gateway timeout is 30s, but Service A has internal timeout of
44
+ 10s. When Service A times out at 10s and returns 504, gateway
45
+ sees 504 from Service A — is this a gateway timeout or a service
46
+ timeout? The two cases call for different error responses.
47
+
48
+ 6. Circuit breaker at gateway level:
49
+ When a service is down, the gateway's circuit breaker returns
50
+ 503 — but the response lacks the backend's normal error format,
51
+ confusing clients.
52
+
53
+ Task: Design the gateway error handling architecture. Write: the
54
+ gateway error vs backend error distinction (with different error
55
+ envelope), the error format normalization layer, the partial
56
+ failure response format for aggregated endpoints, the auth error
57
+ handling strategy, and the circuit breaker error responses.
58
+
59
+ assertions:
60
+ - type: llm_judge
61
+ criteria: "Gateway vs backend errors are clearly distinguished — gateway errors have a different envelope or indicator (e.g., error source field), clients can programmatically determine whether to retry at the same endpoint or if the backend is down, and each error includes enough context for debugging"
62
+ weight: 0.35
63
+ description: "Clear gateway vs backend distinction"
64
+ - type: llm_judge
65
+ criteria: "Error format normalization handles all 4 service formats — transforms RFC 7807, custom JSON, XML, and plain text into a unified format, preserves original error details in a nested field for debugging, and the aggregated endpoint returns partial success with per-section error details"
66
+ weight: 0.35
67
+ description: "Format normalization"
68
+ - type: llm_judge
69
+ criteria: "Advanced scenarios are handled — auth error layering (gateway vs service-level), timeout disambiguation, circuit breaker responses match the expected backend format, and the design considers caching error responses to reduce backend load during outages"
70
+ weight: 0.30
71
+ description: "Advanced scenario handling"
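A minimal sketch of the normalization layer this scenario asks for, assuming a hypothetical envelope with a `source` field that separates gateway errors from backend errors. The envelope field names and the `<message>` element looked up in XML bodies are illustrative assumptions, not part of the course files.

```python
import json
import xml.etree.ElementTree as ET


def gateway_error(status, code, message):
    """Error produced by the gateway itself (routing, transform, open circuit)."""
    return {"source": "gateway", "status": status, "code": code,
            "message": message, "original": None}


def normalize_backend_error(status, content_type, body):
    """Wrap a backend error body of any of the four formats in one envelope."""
    message, code = body.strip(), None  # plain text is used as-is
    if "json" in content_type:
        try:
            data = json.loads(body)
        except ValueError:
            data = None
        if isinstance(data, dict):
            # RFC 7807 carries "detail"/"title"/"type"; Service B uses "error"/"code".
            message = data.get("detail") or data.get("title") or data.get("error") or message
            code = data.get("type") or data.get("code")
    elif "xml" in content_type:
        try:
            # Assumes the XML error has a <message> element below the root.
            found = ET.fromstring(body).findtext(".//message")
            message = found.strip() if found else message
        except ET.ParseError:
            pass
    # The untouched backend payload is kept in a nested field for debugging.
    return {"source": "backend", "status": status, "code": code, "message": message,
            "original": {"content_type": content_type, "body": body}}
```

Keeping the raw body under `original` lets clients branch on `source` and `status` alone while support staff can still see what the service actually sent.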
@@ -0,0 +1,67 @@
1
+ meta:
2
+ id: async-api-errors
3
+ level: 3
4
+ course: rest-api-error-handling
5
+ type: output
6
+ description: "Handle async API errors — design error handling for long-running operations, polling, and callback-based APIs"
7
+ tags: [REST, API, async, long-running, polling, callbacks, advanced]
8
+
9
+ state: {}
10
+
11
+ trigger: |
12
+ Your document processing API handles large file uploads that take
13
+ 1-30 minutes to process. You've implemented an async pattern:
14
+
15
+ 1. POST /documents → 202 Accepted { "job_id": "abc123" }
16
+ 2. GET /documents/jobs/abc123 → { "status": "processing" }
17
+ 3. GET /documents/jobs/abc123 → { "status": "completed", "result": {...} }
18
+
19
+ But error handling for this async flow is broken:
20
+
21
+ Problem 1 — Immediate validation vs deferred validation:
22
+ User uploads a 500MB file. Should you validate the file format
23
+ immediately (synchronous, but adds latency) or defer validation
24
+ to the background job (fast acceptance, but error arrives later)?
25
+ Currently you accept everything and fail later.
26
+
27
+ Problem 2 — Job failure notification:
28
+ When processing fails (corrupted PDF, unsupported format, OCR
29
+ failure), the job status just shows "failed" with no details.
30
+ Polling clients have to guess what went wrong.
31
+
32
+ Problem 3 — Partial failure:
33
+ A 100-page document: 95 pages processed successfully, 5 pages had
34
+ OCR errors. Is this a success or failure? Currently it's marked
35
+ as "failed" and the 95 good pages are thrown away.
36
+
37
+ Problem 4 — Timeout and abandonment:
38
+ Job has been "processing" for 2 hours (expected max: 30 min).
39
+ Is it stuck? Dead? Still running? No way to tell.
40
+
41
+ Problem 5 — Callback errors:
42
+ Client registered a callback URL. The callback fails (404, timeout).
43
+ The result is ready but the client doesn't know. No retry.
44
+
45
+ Problem 6 — Concurrent operations:
46
+ Two jobs on the same document submitted accidentally. They race
47
+ and produce conflicting results.
48
+
49
+ Task: Design the async error handling system. Write: the sync vs
50
+ async validation strategy, the job failure response format (with
51
+ error details, partial results, progress), the timeout detection
52
+ and recovery mechanism, the callback retry and dead-letter system,
53
+ and the concurrency control approach.
54
+
55
+ assertions:
56
+ - type: llm_judge
57
+ criteria: "Validation strategy balances immediacy and thoroughness — validates file size, format header, and auth synchronously (fail fast on obvious issues), defers content validation to async processing (OCR quality, page parsing), and communicates which validations are immediate vs deferred"
58
+ weight: 0.35
59
+ description: "Balanced validation strategy"
60
+ - type: llm_judge
61
+ criteria: "Job failure format is detailed and handles partial success — includes error type, affected items (which pages failed), partial results (the 95 good pages), progress percentage, and distinguishes between retryable failures (temporary resource issues) and permanent failures (unsupported format)"
62
+ weight: 0.35
63
+ description: "Detailed job failure format"
64
+ - type: llm_judge
65
+ criteria: "Timeout, callback, and concurrency issues are all addressed — stuck job detection with heartbeats or deadlines, callback retry with exponential backoff and dead letter queue, and duplicate job prevention (idempotency key or resource locking)"
66
+ weight: 0.30
67
+ description: "All async issues addressed"
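One possible shape for the job status payload described above, sketched under assumptions: the `completed_with_errors` status, the `PageError` type, and the error codes are hypothetical names chosen for illustration. The idea is that the 95 good pages in Problem 3 are reported rather than discarded.

```python
from dataclasses import dataclass


@dataclass
class PageError:
    page: int
    code: str        # illustrative code such as "ocr_failed"
    retryable: bool  # temporary resource issue vs permanent failure


def job_result(job_id, total_pages, errors):
    """Build a polling response that separates full, partial, and total failure."""
    failed_pages = sorted({e.page for e in errors})
    succeeded = total_pages - len(failed_pages)
    if not failed_pages:
        status = "completed"
    elif succeeded == 0:
        status = "failed"
    else:
        status = "completed_with_errors"
    return {
        "job_id": job_id,
        "status": status,
        "progress": {"total_pages": total_pages, "succeeded": succeeded,
                     "failed": len(failed_pages)},
        "errors": [{"page": e.page, "code": e.code, "retryable": e.retryable}
                   for e in sorted(errors, key=lambda e: e.page)],
    }
```

A client polling `GET /documents/jobs/abc123` could then fetch the successful pages and retry only the pages flagged as retryable.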
@@ -0,0 +1,65 @@
1
+ meta:
2
+ id: caching-error-scenarios
3
+ level: 3
4
+ course: rest-api-error-handling
5
+ type: output
6
+ description: "Handle caching error scenarios — manage stale data, cache failures, and thundering herds in API caching layers"
7
+ tags: [REST, API, caching, Redis, stale-data, thundering-herd, advanced]
8
+
9
+ state: {}
10
+
11
+ trigger: |
12
+ Your API has a Redis caching layer that serves 80% of read traffic.
13
+ Recent incidents revealed that your caching error handling is
14
+ inadequate.
15
+
16
+ Incident 1 — Cache stampede (thundering herd):
17
+ Popular product page cache expired. 10,000 concurrent requests
18
+ all missed the cache simultaneously and hit the database. Database
19
+ connection pool exhausted, API returned 503 for 45 seconds.
20
+
21
+ Incident 2 — Stale data served during outage:
22
+ Redis cluster failed over (30 seconds). During failover, the API
23
+ fell back to direct database queries. When Redis came back, it
24
+ served stale cached data (pre-outage prices) for products whose
25
+ prices were updated during the outage. Customers saw wrong prices.
26
+
27
+ Incident 3 — Negative caching trap:
28
+ A product was temporarily out of stock. The "out of stock" response
29
+ was cached for 1 hour. When inventory was restocked 10 minutes
30
+ later, customers still saw "out of stock" for 50 minutes.
31
+
32
+ Incident 4 — Cache poisoning:
33
+ A downstream service returned an error (500), which was cached as
34
+ if it were a valid response. The cached error was served to all
35
+ users for the TTL duration (15 minutes).
36
+
37
+ Incident 5 — Inconsistent cache state:
38
+ User updated their profile. The profile cache was updated, but the
39
+ user list cache still showed the old data. Different endpoints
40
+ returned different data for the same user.
41
+
42
+ Incident 6 — Cache memory exhaustion:
43
+ Redis ran out of memory. Eviction policy kicked in and removed
44
+ frequently-accessed keys. Error rates spiked because "cache misses"
45
+ on critical data overwhelmed the database.
46
+
47
+ Task: Design the caching error handling strategy. For each of the
48
+ 6 incidents, write: the root cause, the prevention mechanism, the
49
+ API's behavior when the cache fails (degrade gracefully vs error),
50
+ and the cache headers the API should return to communicate data
51
+ freshness to clients.
52
+
53
+ assertions:
54
+ - type: llm_judge
55
+ criteria: "All 6 incidents have prevention mechanisms — stampede protection (singleflight/lock, stale-while-revalidate), stale data detection (version tracking or invalidation), negative caching with short TTL, error response exclusion from cache, cache invalidation patterns (event-driven), and memory management (eviction policies, monitoring)"
56
+ weight: 0.35
57
+ description: "All incidents prevented"
58
+ - type: llm_judge
59
+ criteria: "Graceful degradation strategy is clear — defines when to serve stale data vs return an error vs fall back to the database, uses Cache-Control headers to communicate freshness, and handles the Redis failure scenario with circuit breakers and fallback behavior"
60
+ weight: 0.35
61
+ description: "Clear degradation strategy"
62
+ - type: llm_judge
63
+ criteria: "Cache headers communicate data state — uses Cache-Control, ETag, Age, and custom headers to tell clients if data is fresh, stale, or from a fallback source. The strategy prevents clients from caching error responses and handles CDN caching layers"
64
+ weight: 0.30
65
+ description: "Communicative cache headers"
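The stampede protection named in the first assertion (singleflight) can be sketched as follows. This is an in-process illustration using threads, with a hypothetical `SingleFlight` class; a multi-instance API would need a shared lock (for example in Redis) or stale-while-revalidate instead.

```python
import threading


class SingleFlight:
    """Collapse concurrent loads of the same key into a single call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> in-flight call record

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = {"done": threading.Event(), "value": None, "error": None}
                self._calls[key] = call
        if leader:
            try:
                call["value"] = fn()  # only the leader hits the database
            except Exception as exc:
                call["error"] = exc
            finally:
                with self._lock:
                    del self._calls[key]
                call["done"].set()
        else:
            call["done"].wait()  # followers wait for the leader's result
        if call["error"] is not None:
            raise call["error"]
        return call["value"]
```

With this in front of the cache-miss path, the 10,000 concurrent misses in Incident 1 would produce one database query per process instead of one per request. Note that a failed load is shared with the waiting callers but not remembered afterwards, so an error is not cached (Incident 4).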
@@ -0,0 +1,62 @@
1
+ meta:
2
+ id: chaos-engineering-apis
3
+ level: 3
4
+ course: rest-api-error-handling
5
+ type: output
6
+ description: "Apply chaos engineering to APIs — design experiments that validate error handling under controlled failure conditions"
7
+ tags: [REST, API, chaos-engineering, fault-injection, resilience, advanced]
8
+
9
+ state: {}
10
+
11
+ trigger: |
12
+ Your API platform claims to handle failures gracefully, but you've
13
+ never tested it. After a major outage where circuit breakers didn't
14
+ fire, retries caused thundering herds, and error messages were
15
+ wrong, the CTO mandates chaos engineering.
16
+
17
+ Your API architecture:
18
+ - 15 microservices
19
+ - API gateway (Kong)
20
+ - PostgreSQL (primary + 2 replicas)
21
+ - Redis (caching + rate limiting)
22
+ - RabbitMQ (async processing)
23
+ - 3 external APIs (Stripe, SendGrid, Twilio)
24
+
25
+ Failure modes to test:
26
+ 1. Latency injection: What happens when a service responds in 5s
27
+ instead of 50ms? Do timeouts fire? Do circuit breakers trip?
28
+ 2. Error injection: What if a service returns 500 for 10% of
29
+ requests? Is the error budget consumed correctly?
30
+ 3. Connection failure: What if Redis goes down? Does rate limiting
31
+ fail open or closed? Do cached error responses work?
32
+ 4. Partial network partition: Service A can reach B but not C.
33
+ Does the error propagation handle this?
34
+ 5. Data corruption: What if the database returns truncated data?
35
+ Does the API validate response data from downstream?
36
+ 6. Clock skew: What if JWT validation fails because of 30-second
37
+ clock drift between services?
38
+ 7. Resource exhaustion: What if thread pools or connection pools
39
+ fill up? Does the API respond with 503 or just hang?
40
+ 8. Cascading failure: What if a failure in service D causes A, B,
41
+ and C to degrade?
42
+
43
+ Task: Design the chaos engineering program for your APIs. Write:
44
+ the experiment design for each of the 8 failure modes (hypothesis,
45
+ injection method, success criteria, abort conditions), the safety
46
+ guardrails (blast radius limits, kill switches), the progressive
47
+ rollout (start in staging, expand to production), and a template
47
+ for recording the error handling improvements discovered.
49
+
50
+ assertions:
51
+ - type: llm_judge
52
+ criteria: "Experiment designs are rigorous — each has a clear hypothesis (e.g., 'circuit breaker should trip within 10s of 50% error rate'), specific injection method (proxy, sidecar, library), measurable success criteria, and abort conditions that prevent customer impact"
53
+ weight: 0.35
54
+ description: "Rigorous experiment designs"
55
+ - type: llm_judge
56
+ criteria: "All 8 failure modes are tested with practical approaches — latency injection via proxy, error injection via feature flags, connection failures via network policies, and resource exhaustion via controlled load. Each experiment validates specific error handling behavior (not just 'does it survive')"
57
+ weight: 0.35
58
+ description: "All failure modes tested"
59
+ - type: llm_judge
60
+ criteria: "Safety guardrails are comprehensive — blast radius limits (percentage of traffic, specific services), kill switches (automatic and manual), staging-first progression, and the results template captures what error handling worked, what didn't, and the remediation plan"
61
+ weight: 0.30
62
+ description: "Comprehensive safety guardrails"
@@ -0,0 +1,79 @@
1
+ meta:
2
+ id: database-error-handling
3
+ level: 3
4
+ course: rest-api-error-handling
5
+ type: output
6
+ description: "Handle database errors in APIs — translate database failures into appropriate API responses without leaking internals"
7
+ tags: [REST, API, database, PostgreSQL, errors, translation, advanced]
8
+
9
+ state: {}
10
+
11
+ trigger: |
12
+ Your API uses PostgreSQL and you're auditing how database errors
13
+ surface to API consumers. You find that many database errors leak
14
+ directly into API responses or are handled incorrectly.
15
+
16
+ Problematic error translations found in your codebase:
17
+
18
+ 1. Unique constraint violation:
19
+ DB: ERROR 23505: duplicate key value violates unique constraint
20
+ "users_email_key"
21
+ API returns: 500 { "error": "duplicate key value violates unique
22
+ constraint \"users_email_key\"" }
23
+ Should be: 409 Conflict with user-friendly message
24
+
25
+ 2. Foreign key violation:
26
+ DB: ERROR 23503: insert or update on table "orders" violates
27
+ foreign key constraint "orders_user_id_fkey"
28
+ API returns: 500 with the raw DB error
29
+ Should be: 400 or 422 with "Referenced user does not exist"
30
+
31
+ 3. Connection pool exhausted:
32
+ DB: TimeoutError: acquiring connection from pool timed out
33
+ API returns: hangs until client timeout
34
+ Should be: 503 with Retry-After
35
+
36
+ 4. Deadlock detected:
37
+ DB: ERROR 40P01: deadlock detected
38
+ API returns: 500 { "error": "deadlock detected" }
39
+ Should be: automatically retried, then 503 if retries exhausted
40
+
41
+ 5. Check constraint violation:
42
+ DB: ERROR 23514: new row for relation "products" violates check
43
+ constraint "products_price_check" (price must be > 0)
44
+ API returns: 500 with the constraint name
45
+ Should be: 422 with "Price must be greater than zero"
46
+
47
+ 6. Read replica lag:
48
+ User creates a post (writes to primary), immediately fetches
49
+ their posts (reads from replica) — post is missing
50
+ API returns: 200 { "posts": [] } (incorrect but no "error")
51
+ Should be: consistency-aware response
52
+
53
+ 7. Long query timeout:
54
+ Complex search query exceeds statement_timeout (30s)
55
+ DB: ERROR 57014: canceling statement due to statement timeout
56
+ API returns: 504 with no context
57
+ Should be: 408 or 504 with search refinement suggestions
58
+
59
+ Task: Design the database error translation layer. Write: the
60
+ mapping from PostgreSQL error codes to HTTP status codes, the
61
+ error message translation strategy (constraint names to human
62
+ messages), the automatic retry logic for transient errors
63
+ (deadlocks, connection issues), the read replica consistency
64
+ handling, and the implementation architecture (middleware vs
65
+ per-query vs ORM-level).
66
+
67
+ assertions:
68
+ - type: llm_judge
69
+ criteria: "Error code mapping is comprehensive — maps PostgreSQL error classes (23xxx constraint violations, 40xxx transaction errors, 53xxx resource errors, 57xxx timeout errors) to appropriate HTTP status codes, with different handling for client-caused vs server-caused database errors"
70
+ weight: 0.35
71
+ description: "Comprehensive error code mapping"
72
+ - type: llm_judge
73
+ criteria: "Error message translation is secure and helpful — constraint names are translated to user-friendly messages without revealing table/column names, unique violations identify the conflicting field, and the translation is configurable (not hardcoded). Internal database details never leak to the API response"
74
+ weight: 0.35
75
+ description: "Secure helpful message translation"
76
+ - type: llm_judge
77
+ criteria: "Transient error handling is robust — deadlocks are auto-retried with backoff, connection pool exhaustion triggers 503 with monitoring alerts, read replica lag is detected and handled (read-your-writes consistency or causal consistency), and the architecture centralizes translation without coupling business logic to database specifics"
78
+ weight: 0.30
79
+ description: "Robust transient error handling"
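A minimal sketch of the translation table for the seven cases listed in this scenario. The SQLSTATE codes are standard PostgreSQL codes; the `CONSTRAINT_MESSAGES` table, the function name, and the response fields are hypothetical and would normally live in configuration rather than code.

```python
# Hypothetical lookup: constraint name -> (public field name, public message).
CONSTRAINT_MESSAGES = {
    "users_email_key": ("email", "An account with this email already exists."),
    "orders_user_id_fkey": ("user_id", "Referenced user does not exist."),
    "products_price_check": ("price", "Price must be greater than zero."),
}

RETRYABLE = {"40P01", "40001"}  # deadlock_detected, serialization_failure


def translate_db_error(sqlstate, constraint=None):
    """Map a PostgreSQL SQLSTATE to an API response without leaking internals."""
    field, message = CONSTRAINT_MESSAGES.get(constraint, (None, None))
    if sqlstate == "23505":            # unique_violation
        status = 409
    elif sqlstate.startswith("23"):    # other integrity constraint violations
        status = 422
    elif sqlstate in RETRYABLE or sqlstate.startswith("53"):  # transient or resource errors
        status, field, message = 503, None, "The service is temporarily unavailable. Please retry."
    elif sqlstate == "57014":          # query_canceled (statement timeout)
        status, field, message = 504, None, "The query took too long. Try narrowing the search."
    else:
        status, field, message = 500, None, "Internal server error."
    return {"status": status, "field": field,
            "message": message or "The submitted data is not valid.",
            "retry_in_process": sqlstate in RETRYABLE}
```

The `retry_in_process` flag marks errors the API should retry itself before answering, so a 503 is only returned for deadlocks once retries are exhausted.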
@@ -0,0 +1,63 @@
1
+ meta:
2
+ id: distributed-error-propagation
3
+ level: 3
4
+ course: rest-api-error-handling
5
+ type: output
6
+ description: "Design distributed error propagation — manage error context across microservice call chains without information loss"
7
+ tags: [REST, API, distributed, microservices, error-propagation, advanced]
8
+
9
+ state: {}
10
+
11
+ trigger: |
12
+ Your e-commerce platform has a 6-service call chain for placing an
13
+ order:
14
+
15
+ API Gateway → Order Service → Inventory Service → Payment Service
16
+ → Shipping Service
17
+ → Notification Service
18
+
19
+ A customer reports: "I got an error 'Something went wrong' when
20
+ placing my order." Your support team traces the request and finds:
21
+
22
+ - API Gateway returned: 500 { "error": "Internal server error" }
23
+ - Order Service logged: "PaymentError: upstream service failed"
24
+ - Payment Service logged: "HTTP 503 from Stripe"
25
+ - Stripe's actual error: "Card issuer declined — insufficient funds"
26
+
27
+ The real error (insufficient funds) is a 402 client error, but by
28
+ the time it reached the customer, it became a 500 server error.
29
+ Nobody in the chain translated it correctly.
30
+
31
+ Other propagation problems:
32
+ 1. Inventory Service returns 409 (conflict: item reserved by another
33
+ user) → Order Service converts to 500 → customer sees "server
34
+ error" instead of "item no longer available"
35
+ 2. Shipping Service returns 422 (address validation failed) → gets
36
+ lost in the chain → customer's order fails with no useful message
37
+ 3. Two services fail simultaneously (payment and shipping) → only
38
+ one error reaches the customer
39
+ 4. A service returns 200 with an error in the body (legacy service)
40
+ → downstream treats it as success
41
+ 5. Timeout at layer 3 of 6 → unclear if the operation partially
42
+ completed
43
+
44
+ Task: Design the error propagation framework. Write: the error
45
+ context object that flows through the service chain (preserving
46
+ original error while adding context at each hop), the error
47
+ translation rules (when to propagate vs wrap vs replace errors),
48
+ the distributed tracing integration, the multi-error aggregation
49
+ strategy, and the client-facing error derivation logic.
50
+
51
+ assertions:
52
+ - type: llm_judge
53
+ criteria: "Error context preservation is thorough — errors carry the original cause through the chain (structured error chain, not string concatenation), each service adds its context without losing upstream detail, and the full chain is available in logs/traces while only safe information reaches the client"
54
+ weight: 0.35
55
+ description: "Thorough error context preservation"
56
+ - type: llm_judge
57
+ criteria: "Error translation rules are correct — client errors (4xx) from downstream are translated appropriately for the current context (402 from Stripe becomes a meaningful payment error for the customer), server errors (5xx) are not exposed to the client as the downstream service's error, and the 5 specific problems are all addressed"
58
+ weight: 0.35
59
+ description: "Correct translation rules"
60
+ - type: llm_judge
61
+ criteria: "Distributed tracing integration is practical — uses correlation IDs (trace ID, span ID) to link errors across services, integrates with standard tracing (OpenTelemetry), and enables support to trace a customer's error back through the full service chain in minutes rather than hours"
62
+ weight: 0.30
63
+ description: "Practical tracing integration"
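The error context object described in the task could look roughly like this. It is a sketch under assumptions: the `ErrorContext` type, its fields, and the rule that upstream 5xx becomes 502 at each hop are illustrative choices, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorContext:
    service: str
    status: int
    code: str
    detail: str                           # internal detail, for logs and traces only
    public_message: Optional[str] = None  # text that is safe to show the customer
    cause: Optional["ErrorContext"] = None

    def wrap(self, service, detail):
        """Add one hop of context while keeping the upstream error as the cause."""
        if 400 <= self.status < 500:
            # Client errors keep their status, code, and customer-facing message.
            return ErrorContext(service, self.status, self.code, detail,
                                self.public_message, self)
        # Assumption: upstream server errors are replaced by a generic 502.
        return ErrorContext(service, 502, "upstream_error", detail, None, self)

    def chain(self):
        """Service names from the outermost hop down to the root cause."""
        node, services = self, []
        while node is not None:
            services.append(node.service)
            node = node.cause
        return services

    def client_view(self, trace_id):
        """Only safe fields reach the client; the full chain stays in logs."""
        return {"status": self.status, "code": self.code,
                "message": self.public_message or "Something went wrong.",
                "trace_id": trace_id}
```

Under these rules the declined card from the example would arrive at the customer as a 402 with a usable message, while the structured chain remains available to support through the trace ID.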
@@ -0,0 +1,61 @@
1
+ meta:
2
+ id: error-budgets-sre
3
+ level: 3
4
+ course: rest-api-error-handling
5
+ type: output
6
+ description: "Implement error budgets — design SLO/SLI/SLA framework for API reliability with error budget policies"
7
+ tags: [REST, API, SRE, error-budgets, SLO, SLI, SLA, advanced]
8
+
9
+ state: {}
10
+
11
+ trigger: |
12
+ Your API platform serves 50M requests/day across 200 endpoints for
13
+ 3,000 API consumers. The VP of Engineering wants to move from
14
+ "fight every error" to "error budget-based reliability management."
15
+
16
+ Current state:
17
+ - Overall error rate: 0.3% (150,000 errors/day)
18
+ - But it's unevenly distributed:
19
+ - GET endpoints: 0.05% error rate (mostly 404s from invalid IDs)
20
+ - POST /payments: 2.1% error rate (mostly declined cards — 402)
21
+ - POST /orders: 0.8% error rate (inventory conflicts — 409)
22
+ - Search endpoints: 0.4% error rate (timeout under load)
23
+ - P99 latency: 800ms overall (but payment is 2.5s)
24
+ - Availability: 99.85% (monthly)
25
+
26
+ Questions from the VP:
27
+ 1. "What's our SLO? How do we define 'error' for error budget
28
+ purposes? Is a 402 Declined Card an 'error'?"
29
+ 2. "How much error budget do we have? When should we stop shipping
30
+ features and fix reliability?"
31
+ 3. "How do we allocate error budget across teams? The payments team
32
+ has a higher 'error rate' but most are expected business errors."
33
+ 4. "What's the difference between our SLO (internal target) and SLA
34
+ (customer promise)? How much buffer do we need?"
35
+ 5. "How do we calculate error budget burn rate? I want an alert when
36
+ we're burning budget too fast."
37
+
38
+ Your API consumers have different needs:
39
+ - Enterprise (10 consumers): need 99.99% availability, <200ms p99
40
+ - Standard (200 consumers): need 99.9% availability, <500ms p99
41
+ - Free tier (2,790 consumers): best effort
42
+
43
+ Task: Design the complete error budget framework. Write: the SLI
44
+ definitions (what counts as an error), the SLO targets per tier,
45
+ the error budget calculation (with the 402/409 classification
46
+ decision), the burn rate alerting system, and the error budget
47
+ policy (what happens when budget is exhausted).
48
+
49
+ assertions:
50
+ - type: llm_judge
51
+ criteria: "SLI/SLO definitions are precise — clearly defines what counts as an error (5xx yes, 4xx depends on type), distinguishes between expected business errors (402 declined, 409 conflict) and unexpected errors (500, 503), and sets different SLOs for different endpoint categories and consumer tiers"
52
+ weight: 0.35
53
+ description: "Precise SLI/SLO definitions"
54
+ - type: llm_judge
55
+ criteria: "Error budget math is correct: calculates the budget from the SLO (e.g., 99.9% = 0.1% budget = 50,000 errors per day at 50M req/day, or about 1.5M errors over a 30-day month), explains burn rate computation (fast burn vs slow burn), and the alerting thresholds catch budget exhaustion before it happens"
56
+ weight: 0.35
57
+ description: "Correct error budget math"
58
+ - type: llm_judge
59
+ criteria: "Error budget policy is actionable — defines what happens when budget is exhausted (feature freeze, reliability sprint), how to allocate budget across teams, the SLO-to-SLA buffer rationale, and how the policy integrates with sprint planning and release decisions"
60
+ weight: 0.30
61
+ description: "Actionable budget policy"
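The budget arithmetic in this scenario can be checked with a few lines. This is a sketch with hypothetical function names; the rule that only 5xx responses consume budget is one possible answer to the 402/409 classification question, stated here as an assumption.

```python
def is_budget_error(status):
    """Assumption: only 5xx consumes budget; 402 and 409 are business outcomes."""
    return status >= 500


def error_budget(slo, requests):
    """Number of failed requests the SLO tolerates over `requests` requests."""
    return round((1 - slo) * requests)


def burn_rate(error_rate, slo):
    """How many times faster than the sustainable pace the budget is being spent."""
    return error_rate / (1 - slo)


def budget_consumed(rate, window_hours, period_hours=30 * 24):
    """Fraction of the period's budget spent in one window at a given burn rate."""
    return rate * window_hours / period_hours
```

At 50M requests per day, a 99.9% SLO allows `error_budget(0.999, 50_000_000)` = 50,000 errors per day, or 1,500,000 over 30 days. If every one of the current 0.3% errors counted against that SLO, `burn_rate(0.003, 0.999)` would be about 3, which is why the classification decision matters. A burn rate of 14.4 sustained for one hour spends 2% of a 30-day budget, a commonly used fast-burn alert threshold.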