dojo.md 0.1.0 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (243)
  1. package/courses/GENERATION_LOG.md +27 -0
  2. package/courses/api-error-handling/course.yaml +16 -0
  3. package/courses/api-error-handling/scenarios/level-1/error-response-format.yaml +131 -0
  4. package/courses/api-error-handling/scenarios/level-1/http-status-codes-basics.yaml +90 -0
  5. package/courses/api-error-handling/scenarios/level-1/rate-limiting-basics.yaml +135 -0
  6. package/courses/api-error-handling/scenarios/level-1/request-validation-errors.yaml +208 -0
  7. package/courses/api-error-handling/scenarios/level-2/circuit-breaker-pattern.yaml +189 -0
  8. package/courses/api-error-handling/scenarios/level-2/idempotency-retry-logic.yaml +159 -0
  9. package/courses/api-error-handling/scenarios/level-2/rfc-7807-problem-details.yaml +178 -0
  10. package/courses/api-error-handling/scenarios/level-2/webhook-error-handling.yaml +211 -0
  11. package/courses/api-error-handling/scenarios/level-3/distributed-tracing-errors.yaml +275 -0
  12. package/courses/github-actions-cicd/course.yaml +10 -0
  13. package/courses/github-actions-cicd/scenarios/level-1/actions-and-runners.yaml +58 -0
  14. package/courses/github-actions-cicd/scenarios/level-1/basic-workflow-syntax.yaml +52 -0
  15. package/courses/github-actions-cicd/scenarios/level-1/branch-protection-checks.yaml +63 -0
  16. package/courses/github-actions-cicd/scenarios/level-1/environment-variables-secrets.yaml +65 -0
  17. package/courses/github-actions-cicd/scenarios/level-1/first-cicd-shift.yaml +62 -0
  18. package/courses/github-actions-cicd/scenarios/level-1/job-dependencies-outputs.yaml +62 -0
  19. package/courses/github-actions-cicd/scenarios/level-1/simple-ci-pipeline.yaml +57 -0
  20. package/courses/github-actions-cicd/scenarios/level-1/workflow-debugging.yaml +90 -0
  21. package/courses/github-actions-cicd/scenarios/level-1/workflow-status-notifications.yaml +59 -0
  22. package/courses/github-actions-cicd/scenarios/level-1/workflow-triggers.yaml +56 -0
  23. package/courses/github-actions-cicd/scenarios/level-2/concurrency-control.yaml +58 -0
  24. package/courses/github-actions-cicd/scenarios/level-2/conditional-execution.yaml +60 -0
  25. package/courses/github-actions-cicd/scenarios/level-2/custom-actions-development.yaml +55 -0
  26. package/courses/github-actions-cicd/scenarios/level-2/dependency-caching.yaml +58 -0
  27. package/courses/github-actions-cicd/scenarios/level-2/deployment-workflows.yaml +61 -0
  28. package/courses/github-actions-cicd/scenarios/level-2/github-packages-publishing.yaml +59 -0
  29. package/courses/github-actions-cicd/scenarios/level-2/intermediate-cicd-shift.yaml +68 -0
  30. package/courses/github-actions-cicd/scenarios/level-2/matrix-builds.yaml +59 -0
  31. package/courses/github-actions-cicd/scenarios/level-2/reusable-workflows.yaml +61 -0
  32. package/courses/github-actions-cicd/scenarios/level-2/workflow-cost-optimization.yaml +61 -0
  33. package/courses/github-actions-cicd/scenarios/level-3/advanced-cicd-shift.yaml +64 -0
  34. package/courses/github-actions-cicd/scenarios/level-3/compliance-automation.yaml +68 -0
  35. package/courses/github-actions-cicd/scenarios/level-3/docker-action-development.yaml +65 -0
  36. package/courses/github-actions-cicd/scenarios/level-3/github-environments.yaml +65 -0
  37. package/courses/github-actions-cicd/scenarios/level-3/monorepo-ci.yaml +68 -0
  38. package/courses/github-actions-cicd/scenarios/level-3/oidc-cloud-deployments.yaml +55 -0
  39. package/courses/github-actions-cicd/scenarios/level-3/release-automation.yaml +61 -0
  40. package/courses/github-actions-cicd/scenarios/level-3/security-hardening.yaml +63 -0
  41. package/courses/github-actions-cicd/scenarios/level-3/self-hosted-runners.yaml +60 -0
  42. package/courses/github-actions-cicd/scenarios/level-3/workflow-optimization.yaml +59 -0
  43. package/courses/github-actions-cicd/scenarios/level-4/cicd-data-architecture.yaml +63 -0
  44. package/courses/github-actions-cicd/scenarios/level-4/cicd-economics-roi.yaml +63 -0
  45. package/courses/github-actions-cicd/scenarios/level-4/cicd-executive-communication.yaml +58 -0
  46. package/courses/github-actions-cicd/scenarios/level-4/cicd-incident-response.yaml +60 -0
  47. package/courses/github-actions-cicd/scenarios/level-4/cicd-org-design.yaml +59 -0
  48. package/courses/github-actions-cicd/scenarios/level-4/cicd-platform-architecture.yaml +63 -0
  49. package/courses/github-actions-cicd/scenarios/level-4/cicd-training-program.yaml +65 -0
  50. package/courses/github-actions-cicd/scenarios/level-4/cicd-vendor-evaluation.yaml +59 -0
  51. package/courses/github-actions-cicd/scenarios/level-4/enterprise-cicd-governance.yaml +55 -0
  52. package/courses/github-actions-cicd/scenarios/level-4/expert-cicd-shift.yaml +60 -0
  53. package/courses/github-actions-cicd/scenarios/level-5/cicd-ai-future.yaml +63 -0
  54. package/courses/github-actions-cicd/scenarios/level-5/cicd-behavioral-science.yaml +70 -0
  55. package/courses/github-actions-cicd/scenarios/level-5/cicd-board-strategy.yaml +56 -0
  56. package/courses/github-actions-cicd/scenarios/level-5/cicd-consulting-engagement.yaml +61 -0
  57. package/courses/github-actions-cicd/scenarios/level-5/cicd-industry-benchmarks.yaml +63 -0
  58. package/courses/github-actions-cicd/scenarios/level-5/cicd-ma-integration.yaml +73 -0
  59. package/courses/github-actions-cicd/scenarios/level-5/cicd-product-development.yaml +68 -0
  60. package/courses/github-actions-cicd/scenarios/level-5/cicd-regulatory-landscape.yaml +72 -0
  61. package/courses/github-actions-cicd/scenarios/level-5/comprehensive-cicd-system.yaml +66 -0
  62. package/courses/github-actions-cicd/scenarios/level-5/master-cicd-shift.yaml +76 -0
  63. package/courses/github-pr-review/scenarios/level-2/api-change-review.yaml +82 -0
  64. package/courses/github-pr-review/scenarios/level-2/automated-review-tooling.yaml +53 -0
  65. package/courses/github-pr-review/scenarios/level-2/cross-team-review.yaml +61 -0
  66. package/courses/github-pr-review/scenarios/level-2/intermediate-review-shift.yaml +66 -0
  67. package/courses/github-pr-review/scenarios/level-2/performance-review-patterns.yaml +99 -0
  68. package/courses/github-pr-review/scenarios/level-2/review-disagreement-resolution.yaml +64 -0
  69. package/courses/github-pr-review/scenarios/level-2/review-metrics-analysis.yaml +63 -0
  70. package/courses/github-pr-review/scenarios/level-2/review-turnaround-sla.yaml +54 -0
  71. package/courses/github-pr-review/scenarios/level-2/stacked-pr-review.yaml +65 -0
  72. package/courses/github-pr-review/scenarios/level-3/advanced-review-shift.yaml +65 -0
  73. package/courses/github-pr-review/scenarios/level-3/ai-powered-review.yaml +58 -0
  74. package/courses/github-pr-review/scenarios/level-3/compliance-review-process.yaml +64 -0
  75. package/courses/github-pr-review/scenarios/level-3/cross-functional-review.yaml +60 -0
  76. package/courses/github-pr-review/scenarios/level-3/incident-driven-review.yaml +63 -0
  77. package/courses/github-pr-review/scenarios/level-3/large-scale-review-operations.yaml +55 -0
  78. package/courses/github-pr-review/scenarios/level-3/monorepo-review-process.yaml +68 -0
  79. package/courses/github-pr-review/scenarios/level-3/review-automation-platform.yaml +61 -0
  80. package/courses/github-pr-review/scenarios/level-3/review-culture-design.yaml +62 -0
  81. package/courses/github-pr-review/scenarios/level-3/review-data-pipeline.yaml +62 -0
  82. package/courses/github-pr-review/scenarios/level-4/enterprise-review-operations.yaml +61 -0
  83. package/courses/github-pr-review/scenarios/level-4/expert-review-shift.yaml +62 -0
  84. package/courses/github-pr-review/scenarios/level-4/review-data-architecture.yaml +69 -0
  85. package/courses/github-pr-review/scenarios/level-4/review-economics-roi.yaml +63 -0
  86. package/courses/github-pr-review/scenarios/level-4/review-executive-communication.yaml +61 -0
  87. package/courses/github-pr-review/scenarios/level-4/review-incident-postmortem.yaml +69 -0
  88. package/courses/github-pr-review/scenarios/level-4/review-org-design.yaml +62 -0
  89. package/courses/github-pr-review/scenarios/level-4/review-platform-architecture.yaml +64 -0
  90. package/courses/github-pr-review/scenarios/level-4/review-training-program.yaml +66 -0
  91. package/courses/github-pr-review/scenarios/level-4/review-vendor-evaluation.yaml +76 -0
  92. package/courses/github-pr-review/scenarios/level-5/comprehensive-review-system.yaml +68 -0
  93. package/courses/github-pr-review/scenarios/level-5/master-review-shift.yaml +73 -0
  94. package/courses/github-pr-review/scenarios/level-5/review-ai-future.yaml +69 -0
  95. package/courses/github-pr-review/scenarios/level-5/review-behavioral-science.yaml +66 -0
  96. package/courses/github-pr-review/scenarios/level-5/review-board-strategy.yaml +62 -0
  97. package/courses/github-pr-review/scenarios/level-5/review-consulting-engagement.yaml +62 -0
  98. package/courses/github-pr-review/scenarios/level-5/review-devtools-product.yaml +71 -0
  99. package/courses/github-pr-review/scenarios/level-5/review-industry-benchmarks.yaml +64 -0
  100. package/courses/github-pr-review/scenarios/level-5/review-ma-integration.yaml +76 -0
  101. package/courses/github-pr-review/scenarios/level-5/review-regulatory-landscape.yaml +78 -0
  102. package/courses/postgresql-query-optimization/course.yaml +11 -0
  103. package/courses/postgresql-query-optimization/scenarios/level-1/explain-analyze-basics.yaml +80 -0
  104. package/courses/postgresql-query-optimization/scenarios/level-1/first-optimization-shift.yaml +77 -0
  105. package/courses/postgresql-query-optimization/scenarios/level-1/index-fundamentals.yaml +76 -0
  106. package/courses/postgresql-query-optimization/scenarios/level-1/join-basics.yaml +73 -0
  107. package/courses/postgresql-query-optimization/scenarios/level-1/n-plus-one-queries.yaml +62 -0
  108. package/courses/postgresql-query-optimization/scenarios/level-1/query-rewriting-basics.yaml +69 -0
  109. package/courses/postgresql-query-optimization/scenarios/level-1/select-star-problems.yaml +69 -0
  110. package/courses/postgresql-query-optimization/scenarios/level-1/slow-query-diagnosis.yaml +63 -0
  111. package/courses/postgresql-query-optimization/scenarios/level-1/vacuum-and-statistics.yaml +62 -0
  112. package/courses/postgresql-query-optimization/scenarios/level-1/where-clause-optimization.yaml +74 -0
  113. package/courses/postgresql-query-optimization/scenarios/level-2/autovacuum-tuning.yaml +76 -0
  114. package/courses/postgresql-query-optimization/scenarios/level-2/composite-index-design.yaml +81 -0
  115. package/courses/postgresql-query-optimization/scenarios/level-2/covering-indexes.yaml +74 -0
  116. package/courses/postgresql-query-optimization/scenarios/level-2/cte-optimization.yaml +83 -0
  117. package/courses/postgresql-query-optimization/scenarios/level-2/intermediate-optimization-shift.yaml +66 -0
  118. package/courses/postgresql-query-optimization/scenarios/level-2/join-optimization.yaml +72 -0
  119. package/courses/postgresql-query-optimization/scenarios/level-2/partial-and-expression-indexes.yaml +75 -0
  120. package/courses/postgresql-query-optimization/scenarios/level-2/query-planner-settings.yaml +62 -0
  121. package/courses/postgresql-query-optimization/scenarios/level-2/subquery-optimization.yaml +67 -0
  122. package/courses/postgresql-query-optimization/scenarios/level-2/window-function-optimization.yaml +63 -0
  123. package/courses/postgresql-query-optimization/scenarios/level-3/advanced-optimization-shift.yaml +71 -0
  124. package/courses/postgresql-query-optimization/scenarios/level-3/connection-pooling.yaml +60 -0
  125. package/courses/postgresql-query-optimization/scenarios/level-3/full-text-search-optimization.yaml +66 -0
  126. package/courses/postgresql-query-optimization/scenarios/level-3/jsonb-optimization.yaml +88 -0
  127. package/courses/postgresql-query-optimization/scenarios/level-3/lock-contention-analysis.yaml +80 -0
  128. package/courses/postgresql-query-optimization/scenarios/level-3/materialized-view-optimization.yaml +73 -0
  129. package/courses/postgresql-query-optimization/scenarios/level-3/parallel-query-execution.yaml +74 -0
  130. package/courses/postgresql-query-optimization/scenarios/level-3/partitioning-strategies.yaml +71 -0
  131. package/courses/postgresql-query-optimization/scenarios/level-3/specialized-index-types.yaml +67 -0
  132. package/courses/postgresql-query-optimization/scenarios/level-3/write-optimization.yaml +65 -0
  133. package/courses/postgresql-query-optimization/scenarios/level-4/data-architecture-analytics.yaml +64 -0
  134. package/courses/postgresql-query-optimization/scenarios/level-4/database-executive-communication.yaml +64 -0
  135. package/courses/postgresql-query-optimization/scenarios/level-4/database-migration-planning.yaml +57 -0
  136. package/courses/postgresql-query-optimization/scenarios/level-4/enterprise-database-governance.yaml +52 -0
  137. package/courses/postgresql-query-optimization/scenarios/level-4/expert-optimization-shift.yaml +73 -0
  138. package/courses/postgresql-query-optimization/scenarios/level-4/high-availability-architecture.yaml +62 -0
  139. package/courses/postgresql-query-optimization/scenarios/level-4/optimizer-internals.yaml +69 -0
  140. package/courses/postgresql-query-optimization/scenarios/level-4/performance-sla-design.yaml +58 -0
  141. package/courses/postgresql-query-optimization/scenarios/level-4/read-replica-optimization.yaml +62 -0
  142. package/courses/postgresql-query-optimization/scenarios/level-4/vendor-evaluation.yaml +73 -0
  143. package/courses/rest-api-error-handling/course.yaml +11 -0
  144. package/courses/rest-api-error-handling/scenarios/level-1/authentication-errors.yaml +71 -0
  145. package/courses/rest-api-error-handling/scenarios/level-1/content-negotiation-errors.yaml +63 -0
  146. package/courses/rest-api-error-handling/scenarios/level-1/error-logging-basics.yaml +63 -0
  147. package/courses/rest-api-error-handling/scenarios/level-1/error-response-format.yaml +58 -0
  148. package/courses/rest-api-error-handling/scenarios/level-1/first-error-handling-shift.yaml +67 -0
  149. package/courses/rest-api-error-handling/scenarios/level-1/http-status-codes.yaml +46 -0
  150. package/courses/rest-api-error-handling/scenarios/level-1/not-found-errors.yaml +52 -0
  151. package/courses/rest-api-error-handling/scenarios/level-1/rate-limiting-errors.yaml +56 -0
  152. package/courses/rest-api-error-handling/scenarios/level-1/request-validation-errors.yaml +59 -0
  153. package/courses/rest-api-error-handling/scenarios/level-1/server-error-handling.yaml +55 -0
  154. package/courses/rest-api-error-handling/scenarios/level-2/api-versioning-errors.yaml +66 -0
  155. package/courses/rest-api-error-handling/scenarios/level-2/batch-request-errors.yaml +61 -0
  156. package/courses/rest-api-error-handling/scenarios/level-2/circuit-breaker-pattern.yaml +52 -0
  157. package/courses/rest-api-error-handling/scenarios/level-2/error-code-taxonomy.yaml +62 -0
  158. package/courses/rest-api-error-handling/scenarios/level-2/error-monitoring-alerting.yaml +53 -0
  159. package/courses/rest-api-error-handling/scenarios/level-2/intermediate-error-shift.yaml +69 -0
  160. package/courses/rest-api-error-handling/scenarios/level-2/pagination-errors.yaml +66 -0
  161. package/courses/rest-api-error-handling/scenarios/level-2/retry-and-idempotency.yaml +60 -0
  162. package/courses/rest-api-error-handling/scenarios/level-2/rfc7807-problem-details.yaml +60 -0
  163. package/courses/rest-api-error-handling/scenarios/level-2/webhook-error-handling.yaml +55 -0
  164. package/courses/rest-api-error-handling/scenarios/level-3/advanced-error-shift.yaml +72 -0
  165. package/courses/rest-api-error-handling/scenarios/level-3/api-gateway-errors.yaml +71 -0
  166. package/courses/rest-api-error-handling/scenarios/level-3/async-api-errors.yaml +67 -0
  167. package/courses/rest-api-error-handling/scenarios/level-3/caching-error-scenarios.yaml +65 -0
  168. package/courses/rest-api-error-handling/scenarios/level-3/chaos-engineering-apis.yaml +62 -0
  169. package/courses/rest-api-error-handling/scenarios/level-3/database-error-handling.yaml +79 -0
  170. package/courses/rest-api-error-handling/scenarios/level-3/distributed-error-propagation.yaml +63 -0
  171. package/courses/rest-api-error-handling/scenarios/level-3/error-budgets-sre.yaml +61 -0
  172. package/courses/rest-api-error-handling/scenarios/level-3/error-correlation.yaml +58 -0
  173. package/courses/rest-api-error-handling/scenarios/level-3/graphql-vs-rest-errors.yaml +73 -0
  174. package/courses/rest-api-error-handling/scenarios/level-4/compliance-error-handling.yaml +65 -0
  175. package/courses/rest-api-error-handling/scenarios/level-4/enterprise-error-governance.yaml +62 -0
  176. package/courses/rest-api-error-handling/scenarios/level-4/error-analytics-platform.yaml +65 -0
  177. package/courses/rest-api-error-handling/scenarios/level-4/error-cost-optimization.yaml +63 -0
  178. package/courses/rest-api-error-handling/scenarios/level-4/error-executive-communication.yaml +60 -0
  179. package/courses/rest-api-error-handling/scenarios/level-4/error-handling-architecture.yaml +67 -0
  180. package/courses/rest-api-error-handling/scenarios/level-4/error-org-design.yaml +68 -0
  181. package/courses/rest-api-error-handling/scenarios/level-4/error-sla-design.yaml +65 -0
  182. package/courses/rest-api-error-handling/scenarios/level-4/error-training-program.yaml +61 -0
  183. package/courses/rest-api-error-handling/scenarios/level-4/expert-error-shift.yaml +63 -0
  184. package/courses/rest-api-error-handling/scenarios/level-5/comprehensive-error-system.yaml +68 -0
  185. package/courses/rest-api-error-handling/scenarios/level-5/error-ai-future.yaml +75 -0
  186. package/courses/rest-api-error-handling/scenarios/level-5/error-behavioral-science.yaml +73 -0
  187. package/courses/rest-api-error-handling/scenarios/level-5/error-board-strategy.yaml +60 -0
  188. package/courses/rest-api-error-handling/scenarios/level-5/error-consulting-engagement.yaml +58 -0
  189. package/courses/rest-api-error-handling/scenarios/level-5/error-industry-benchmarks.yaml +72 -0
  190. package/courses/rest-api-error-handling/scenarios/level-5/error-ma-integration.yaml +68 -0
  191. package/courses/rest-api-error-handling/scenarios/level-5/error-product-development.yaml +66 -0
  192. package/courses/rest-api-error-handling/scenarios/level-5/error-regulatory-landscape.yaml +80 -0
  193. package/courses/rest-api-error-handling/scenarios/level-5/master-error-shift.yaml +73 -0
  194. package/dist/cli/commands/add.d.ts.map +1 -1
  195. package/dist/cli/commands/add.js +6 -5
  196. package/dist/cli/commands/add.js.map +1 -1
  197. package/dist/cli/commands/generate.d.ts.map +1 -1
  198. package/dist/cli/commands/generate.js +4 -0
  199. package/dist/cli/commands/generate.js.map +1 -1
  200. package/dist/cli/commands/list.d.ts.map +1 -1
  201. package/dist/cli/commands/list.js +6 -18
  202. package/dist/cli/commands/list.js.map +1 -1
  203. package/dist/cli/commands/train.d.ts.map +1 -1
  204. package/dist/cli/commands/train.js +18 -18
  205. package/dist/cli/commands/train.js.map +1 -1
  206. package/dist/cli/index.js +93 -55
  207. package/dist/cli/index.js.map +1 -1
  208. package/dist/cli/run-demo.js +2 -1
  209. package/dist/cli/run-demo.js.map +1 -1
  210. package/dist/cli/setup.d.ts +18 -0
  211. package/dist/cli/setup.d.ts.map +1 -0
  212. package/dist/cli/setup.js +154 -0
  213. package/dist/cli/setup.js.map +1 -0
  214. package/dist/engine/agent-bridge.d.ts +5 -2
  215. package/dist/engine/agent-bridge.d.ts.map +1 -1
  216. package/dist/engine/agent-bridge.js +36 -9
  217. package/dist/engine/agent-bridge.js.map +1 -1
  218. package/dist/engine/loader.d.ts +21 -0
  219. package/dist/engine/loader.d.ts.map +1 -1
  220. package/dist/engine/loader.js +54 -1
  221. package/dist/engine/loader.js.map +1 -1
  222. package/dist/engine/training-loop.d.ts.map +1 -1
  223. package/dist/engine/training-loop.js +1 -0
  224. package/dist/engine/training-loop.js.map +1 -1
  225. package/dist/engine/training.d.ts.map +1 -1
  226. package/dist/engine/training.js +1 -0
  227. package/dist/engine/training.js.map +1 -1
  228. package/dist/generator/skill-generator.d.ts +1 -1
  229. package/dist/generator/skill-generator.d.ts.map +1 -1
  230. package/dist/generator/skill-generator.js +21 -2
  231. package/dist/generator/skill-generator.js.map +1 -1
  232. package/dist/mcp/server.d.ts.map +1 -1
  233. package/dist/mcp/server.js +11 -26
  234. package/dist/mcp/server.js.map +1 -1
  235. package/dist/mcp/session-manager.d.ts +3 -1
  236. package/dist/mcp/session-manager.d.ts.map +1 -1
  237. package/dist/mcp/session-manager.js +44 -22
  238. package/dist/mcp/session-manager.js.map +1 -1
  239. package/dist/types/schemas.d.ts +38 -13
  240. package/dist/types/schemas.d.ts.map +1 -1
  241. package/dist/types/schemas.js +9 -5
  242. package/dist/types/schemas.js.map +1 -1
  243. package/package.json +1 -1
@@ -0,0 +1,58 @@
+ meta:
+   id: error-correlation
+   level: 3
+   course: rest-api-error-handling
+   type: output
+   description: "Build error correlation across services — connect related errors across distributed systems to identify root causes"
+   tags: [REST, API, correlation, distributed-tracing, root-cause, advanced]
+
+ state: {}
+
+ trigger: |
+   Your platform has 25 microservices generating 50,000 errors per
+   day. Most are correlated (one root cause triggers errors across
+   many services), but your current monitoring treats each error as
+   independent. This creates noise and delays root cause analysis.
+
+   Recent incident example:
+   - 3:00 PM: Database connection pool exhausted on User Service
+   - 3:01 PM: 500 errors spike on Order Service (calls User Service)
+   - 3:01 PM: 500 errors spike on Cart Service (calls User Service)
+   - 3:02 PM: 500 errors spike on Payment Service (calls Order Service)
+   - 3:02 PM: Timeout errors on API Gateway (all downstream failing)
+   - 3:03 PM: 50 PagerDuty alerts fire simultaneously
+   - 3:15 PM: On-call engineer still investigating which service is
+     the root cause (15 minutes wasted)
+
+   The 50 alerts were all caused by ONE root cause (DB connection pool).
+   But the engineer had to manually trace through 5 service dashboards
+   to figure that out.
+
+   Your error data sources:
+   - Application logs (JSON, in Elasticsearch)
+   - Distributed traces (Jaeger, using OpenTelemetry)
+   - Metrics (Prometheus, Grafana)
+   - Error tracking (Sentry, per-service)
+   - Kubernetes events
+   - Cloud provider health (AWS CloudWatch)
+
+   Task: Design the error correlation system. Write: the correlation
+   algorithm (how to group related errors across services), the root
+   cause identification logic (dependency graph + error timing), the
+   unified error view (single dashboard showing correlated errors),
+   the alert deduplication strategy (50 alerts → 1 incident), and
+   the automated root cause suggestion system.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Correlation algorithm is sound — uses trace IDs to link errors across services, uses temporal proximity and dependency graphs to group errors without trace IDs, and correctly handles the case where one root cause generates errors in 5 dependent services"
+     weight: 0.35
+     description: "Sound correlation algorithm"
+   - type: llm_judge
+     criteria: "Root cause identification is effective — leverages the service dependency graph to identify upstream causes, uses error timing (first service to error is likely the cause), and the automated suggestion system would have identified the DB connection pool issue in minutes rather than 15+"
+     weight: 0.35
+     description: "Effective root cause identification"
+   - type: llm_judge
+     criteria: "Alert deduplication is practical — groups related alerts into a single incident, identifies the root service, suppresses downstream noise, and the unified dashboard shows the error cascade visually with the root cause highlighted"
+     weight: 0.30
+     description: "Practical alert deduplication"
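
Reviewer note: the scenario above asks for root cause identification from a dependency graph plus error timing. The sketch below is one possible reading of that idea, not the package's implementation. The `DEPENDS_ON` graph, the `ErrorEvent` type, the five-minute window, and the assumption that the API Gateway calls the Order, Cart, and Payment services are all hypothetical; only the incident timeline comes from the scenario text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    service: str
    minute: int  # minutes since midnight, kept simple for the sketch
    message: str


# Hypothetical dependency graph: caller -> services it calls.
DEPENDS_ON = {
    "api-gateway": {"order", "cart", "payment"},
    "payment": {"order"},
    "order": {"user"},
    "cart": {"user"},
    "user": set(),
}

# Timeline taken from the scenario's incident example (3:00 PM = minute 900).
INCIDENT_EVENTS = [
    ErrorEvent("payment", 902, "500 spike"),
    ErrorEvent("user", 900, "DB connection pool exhausted"),
    ErrorEvent("order", 901, "500 spike"),
    ErrorEvent("api-gateway", 902, "timeouts"),
    ErrorEvent("cart", 901, "500 spike"),
]


def group_incidents(events, window_minutes=5):
    """Group events that start within a time window of the group's first event."""
    incidents = []
    for event in sorted(events, key=lambda e: e.minute):
        if incidents and event.minute - incidents[-1][0].minute <= window_minutes:
            incidents[-1].append(event)
        else:
            incidents.append([event])
    return incidents


def suggest_root_cause(incident):
    """Pick the failing service with no failing dependency, earliest error first."""
    failing = {e.service for e in incident}
    first_seen = {
        s: min(e.minute for e in incident if e.service == s) for s in failing
    }
    # A failing service whose own dependencies are healthy is a likely origin.
    candidates = [s for s in failing if not (DEPENDS_ON.get(s, set()) & failing)]
    # Fall back to pure timing if the graph has a cycle and no candidate remains.
    return min(candidates or failing, key=lambda s: (first_seen[s], s))


incidents = group_incidents(INCIDENT_EVENTS)
print(len(incidents), suggest_root_cause(incidents[0]))
```

Grouping by time window alone would over-merge unrelated errors on a busy platform; the scenario's assertions expect trace IDs to be the primary link, with timing and the graph as the fallback used here.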
@@ -0,0 +1,73 @@
+ meta:
+   id: graphql-vs-rest-errors
+   level: 3
+   course: rest-api-error-handling
+   type: output
+   description: "Compare GraphQL and REST error handling — design error strategies for APIs that serve both protocols"
+   tags: [REST, API, GraphQL, error-comparison, dual-protocol, advanced]
+
+ state: {}
+
+ trigger: |
+   Your company exposes the same backend through both REST and GraphQL
+   APIs. The error handling is inconsistent between them, and teams
+   building both interfaces need a unified strategy.
+
+   Current inconsistencies:
+
+   1. Auth errors:
+      REST: 401 { "error": "Invalid token" }
+      GraphQL: 200 { "data": null, "errors": [{ "message": "Invalid
+      token" }] }
+      Problem: GraphQL always returns 200, making monitoring tools
+      think everything is fine.
+
+   2. Validation errors:
+      REST: 400 { "errors": [{ "field": "email", "message": "..." }] }
+      GraphQL: 200 { "data": null, "errors": [{ "message": "...",
+      "extensions": { "field": "email" } }] }
+      Problem: Different shapes for the same validation failure.
+
+   3. Partial failures:
+      REST aggregated endpoint: returns 500 if any sub-call fails
+      GraphQL: returns partial data with errors for failed fields
+      Problem: GraphQL handles partial failure better, but how to
+      achieve similar in REST?
+
+   4. Rate limiting:
+      REST: 429 with X-RateLimit headers
+      GraphQL: all queries hit one endpoint — per-field or per-query
+      rate limiting?
+      Problem: A single GraphQL query can be equivalent to 50 REST
+      requests.
+
+   5. Error codes:
+      REST: well-defined (HTTP status codes)
+      GraphQL: no standard error code system (just "message" strings)
+      Problem: Clients can't programmatically handle GraphQL errors.
+
+   6. Deprecation:
+      REST: version URL + Sunset header
+      GraphQL: field-level @deprecated directive
+      Problem: How to communicate the same deprecation through both?
+
+   Task: Design the unified error handling strategy for dual-protocol
+   APIs. Write: the error mapping between REST and GraphQL for each
+   scenario, the GraphQL error extension standard (to match REST's
+   machine-readability), the monitoring strategy that works for both
+   (catching errors when GraphQL always returns 200), and the shared
+   error catalog that both protocols use.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Error mapping is thorough — maps each REST error pattern (status code + body) to its GraphQL equivalent (200 + errors array), handles the 200-problem for monitoring (custom error extensions with severity/code), and addresses partial failure handling for both protocols"
+     weight: 0.35
+     description: "Thorough error mapping"
+   - type: llm_judge
+     criteria: "GraphQL error extensions are well-designed — adds machine-readable codes, severity levels, retry guidance, and field-level error attribution. The extensions make GraphQL errors as actionable as REST errors for programmatic handling. Addresses rate limiting for complex queries (query cost analysis)"
+     weight: 0.35
+     description: "Well-designed GraphQL extensions"
+   - type: llm_judge
+     criteria: "Shared error catalog is practical — defines errors once (shared between protocols), translates automatically to REST format or GraphQL format, and the monitoring strategy correctly identifies errors in both protocols (using error extensions, not HTTP status codes, for GraphQL)"
+     weight: 0.30
+     description: "Practical shared catalog"
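
Reviewer note: the shared error catalog this scenario asks for can be pictured as a single table of errors with one renderer per protocol. The sketch below is illustrative only. The catalog codes (`AUTH_INVALID_TOKEN` and so on), the `CatalogEntry` type, and the `httpStatus` and `retryable` extension keys are hypothetical names. The `errors` array and its `extensions` member follow the GraphQL response shapes quoted in the scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    code: str
    http_status: int
    message: str
    retryable: bool


# Hypothetical catalog: each error is defined once and rendered per protocol.
CATALOG = {
    "AUTH_INVALID_TOKEN": CatalogEntry("AUTH_INVALID_TOKEN", 401, "Invalid token", False),
    "VALIDATION_FAILED": CatalogEntry("VALIDATION_FAILED", 400, "Validation failed", False),
    "RATE_LIMITED": CatalogEntry("RATE_LIMITED", 429, "Too many requests", True),
}


def to_rest(code, field=None):
    """Render a catalog error as (HTTP status, JSON body) for the REST API."""
    entry = CATALOG[code]
    error = {"code": entry.code, "message": entry.message, "retryable": entry.retryable}
    if field is not None:
        error["field"] = field
    return entry.http_status, {"errors": [error]}


def to_graphql(code, field=None):
    """Render the same error as a GraphQL response with machine-readable extensions."""
    entry = CATALOG[code]
    extensions = {
        "code": entry.code,
        "httpStatus": entry.http_status,
        "retryable": entry.retryable,
    }
    if field is not None:
        extensions["field"] = field
    body = {"data": None, "errors": [{"message": entry.message, "extensions": extensions}]}
    return 200, body


def graphql_error_codes(body):
    """Monitoring hook: read error codes from the body, since the status is always 200."""
    return [e.get("extensions", {}).get("code", "UNKNOWN") for e in body.get("errors", [])]


rest_status, rest_body = to_rest("VALIDATION_FAILED", field="email")
gql_status, gql_body = to_graphql("VALIDATION_FAILED", field="email")
print(rest_status, gql_status, graphql_error_codes(gql_body))
```

Keeping the HTTP status inside the GraphQL extensions is one way to let a dashboard count both protocols with the same query; a design that returns non-200 statuses from the GraphQL endpoint for transport-level failures would be an equally valid answer to the scenario.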
@@ -0,0 +1,65 @@
+ meta:
+   id: compliance-error-handling
+   level: 4
+   course: rest-api-error-handling
+   type: output
+   description: "Design compliance-aware error handling — satisfy PCI DSS, HIPAA, GDPR, and SOC 2 requirements for API error responses and logging"
+   tags: [REST, API, compliance, PCI-DSS, HIPAA, GDPR, SOC-2, expert]
+
+ state: {}
+
+ trigger: |
+   Your healthcare fintech company processes both medical records
+   (HIPAA) and payment data (PCI DSS), serves European users (GDPR),
+   and undergoes annual SOC 2 audits. A compliance audit found 14
+   critical findings related to API error handling.
+
+   Critical findings:
+
+   PCI DSS violations:
+   1. API error logs contain full credit card numbers (not just last 4)
+   2. Error responses include cardholder name in validation messages
+   3. Debug mode was left on in production, exposing internal error
+      details including payment processor credentials
+   4. No audit trail for failed payment attempts
+
+   HIPAA violations:
+   5. API 500 errors return patient medical record IDs in stack traces
+   6. Error logs are stored unencrypted and accessible to all engineers
+   7. Patient-identifiable data appears in error correlation IDs
+      (using patient email as request ID)
+   8. No access logging for PHI-containing API endpoints
+
+   GDPR violations:
+   9. Error messages include user email addresses
+   10. Error logs retain personal data beyond the consent period
+   11. No mechanism to purge error logs when a user exercises right to
+       erasure (Article 17)
+   12. Cross-border error log storage (EU user data in US logs)
+
+   SOC 2 violations:
+   13. No evidence of error monitoring and response procedures
+   14. Incomplete audit trail — gaps in error logging during an outage
+
+   Remediation deadline: 90 days
+   Budget: $200K
+
+   Task: Design the compliance remediation plan for all 14 findings.
+   Write: the error response sanitization rules per regulation, the
+   error log data classification and handling policy, the audit trail
+   design that satisfies all 4 frameworks, the data residency solution
+   for error logs, and the ongoing compliance monitoring system.
+
+ assertions:
+   - type: llm_judge
+     criteria: "All 14 findings are remediated — PCI DSS: card data masked/removed from logs and responses, debug mode disabled, payment audit trails added. HIPAA: PHI removed from error outputs, logs encrypted and access-controlled, proper request IDs. GDPR: PII removed from errors, log retention aligned with consent, erasure mechanism built, data residency addressed. SOC 2: monitoring procedures documented, audit trail gaps prevented"
+     weight: 0.35
+     description: "All 14 findings remediated"
+   - type: llm_judge
+     criteria: "Unified compliance framework avoids conflicting rules — handles cases where regulations conflict (HIPAA requires audit trails but GDPR requires erasure), proposes a single error handling policy that satisfies all 4 frameworks simultaneously, and classifies data by sensitivity level"
+     weight: 0.35
+     description: "Unified compliance framework"
+   - type: llm_judge
+     criteria: "Implementation is feasible within 90 days and $200K — prioritizes critical findings, phases the work, and includes ongoing compliance monitoring (automated scans for PII/PHI in error logs, data residency checks, retention policy enforcement)"
+     weight: 0.30
+     description: "Feasible remediation plan"
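
Reviewer note: findings 1 and 9 above (full card numbers in logs, email addresses in error messages) both come down to sanitizing text before it is logged or returned. The sketch below shows that one step under simple assumptions. The regular expressions, the `sanitize` helper, and the `[redacted-email]` placeholder are hypothetical, and pattern matching of this kind is not by itself sufficient for PCI DSS or GDPR compliance.

```python
import re

# Rough pattern for 13 to 19 digit card numbers, allowing spaces or dashes
# between digits. A real scanner would likely add a Luhn check to cut
# false positives on long numeric IDs.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def _mask_card(match):
    """Keep only the last four digits of a matched card number."""
    digits = re.sub(r"\D", "", match.group())
    return "*" * (len(digits) - 4) + digits[-4:]


def sanitize(message):
    """Mask card numbers and redact email addresses in an error message."""
    message = CARD_RE.sub(_mask_card, message)
    return EMAIL_RE.sub("[redacted-email]", message)


print(sanitize("Card 4111111111111111 declined for jane@example.com"))
# prints: Card ************1111 declined for [redacted-email]
```

Running the sanitizer at the logging boundary, rather than at each call site, is one way to keep the rule enforceable; the scenario's remaining findings (encryption, retention, residency, audit trails) need controls outside the message text.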
@@ -0,0 +1,62 @@
+ meta:
+   id: enterprise-error-governance
+   level: 4
+   course: rest-api-error-handling
+   type: output
+   description: "Design enterprise API error governance — standardize error handling across a large organization with multiple API teams"
+   tags: [REST, API, governance, enterprise, standardization, expert]
+
+ state: {}
+
+ trigger: |
+   You're the VP of Platform Engineering at a company with 2,000
+   engineers across 40 API teams. An audit revealed catastrophic
+   inconsistency in error handling:
+
+   Audit findings:
+   - 40 teams use 23 different error response formats
+   - 15 teams expose internal stack traces in production
+   - 8 teams have no rate limiting
+   - Error codes are duplicated across teams (INVALID_INPUT means
+     different things in 12 services)
+   - No team logs errors the same way (impossible to build unified
+     dashboards)
+   - 6 teams return 200 with error bodies (breaking monitoring)
+   - Customer support can't debug cross-service issues (no correlation)
+   - API documentation lists errors inconsistently
+
+   Previous standardization attempts that failed:
+   - 2024: Published error handling guidelines (PDF) — 5% adoption
+   - 2024: Added error format to API style guide (Confluence) — 12%
+   - 2025: Mandatory review checklist item — reviewers don't check
+   - 2025: Error linting in CI — teams added bypass flags
+
+   The CEO is frustrated: "We've been trying to standardize for 2
+   years. Why can't 40 teams agree on error formats?"
+
+   Constraints:
+   - Can't break existing API consumers (backward compatibility)
+   - Teams have autonomy (can't mandate tools)
+   - Migration must be gradual (can't rewrite everything)
+   - Must show progress to the board quarterly
+
+   Task: Design the governance program that will actually succeed.
+   Write: the error handling standard (technical specification), the
+   adoption strategy (why previous attempts failed and how this one
+   differs), the enforcement mechanism (automated, not manual), the
+   migration path (per-team, backward-compatible), and the success
48
+ metrics (tracked quarterly for the board).
49
+
50
+ assertions:
51
+ - type: llm_judge
52
+ criteria: "Governance strategy addresses why previous attempts failed — identifies that PDFs, wikis, and checklists don't work, proposes automated enforcement (API gateway validation, CI/CD error format checks without bypass), and uses incentives rather than mandates (making compliance easier than non-compliance)"
53
+ weight: 0.35
54
+ description: "Strategy that overcomes past failures"
55
+ - type: llm_judge
56
+ criteria: "Technical standard is comprehensive but adoptable — defines the error format, error code registry, logging standard, and correlation requirements, but provides libraries/middleware that teams can drop in rather than implement from scratch. Backward compatibility is preserved via content negotiation or dual-format period"
57
+ weight: 0.35
58
+ description: "Comprehensive adoptable standard"
59
+ - type: llm_judge
60
+ criteria: "Success metrics are board-ready — tracks adoption percentage, error format compliance rate, cross-service debugging time, support ticket resolution time, and incident detection latency. Quarterly milestones show concrete progress toward full standardization"
61
+ weight: 0.30
62
+ description: "Board-ready success metrics"
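One way to picture the "libraries/middleware that teams can drop in" from the second criterion is a small shared helper that every service calls to build its error body. The sketch below assumes an RFC 7807 style payload (`type`, `title`, `status`, `detail`) plus two extension members, `code` and `correlation_id`. The function name `build_problem`, the `errors.example.com` URL, and the `service.CODE` namespacing scheme are all hypothetical choices made for illustration.

```python
import uuid
from typing import Optional


def build_problem(service: str, code: str, status: int, title: str,
                  detail: str, correlation_id: Optional[str] = None) -> dict:
    """Build a problem-details style error body with a namespaced code."""
    # Guard against the "200 with an error body" anti-pattern from the audit.
    if not 400 <= status <= 599:
        raise ValueError("error responses must use a 4xx or 5xx status")
    return {
        # Hypothetical registry URL; the real one would come from the standard.
        "type": f"https://errors.example.com/{service}/{code}",
        "title": title,
        "status": status,
        "detail": detail,
        # Prefixing with the service name keeps INVALID_INPUT unambiguous
        # across teams without forcing a rename of existing codes.
        "code": f"{service}.{code}",
        # Reuse the caller's ID when one is supplied, otherwise mint one.
        "correlation_id": correlation_id or str(uuid.uuid4()),
    }
```

A service would pass through the correlation ID it received on the inbound request, so that support can follow one identifier across every hop.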
@@ -0,0 +1,65 @@
1
+ meta:
2
+ id: error-analytics-platform
3
+ level: 4
4
+ course: rest-api-error-handling
5
+ type: output
6
+ description: "Build an API error analytics platform — design the data architecture for error pattern detection, anomaly identification, and trend analysis"
7
+ tags: [REST, API, analytics, error-patterns, anomaly-detection, expert]
8
+
9
+ state: {}
10
+
11
+ trigger: |
12
+ Your API platform processes 500M requests/day across 200 services.
13
+ The current error handling infrastructure can tell you THAT errors
14
+ are happening but not WHY patterns form or WHAT to do about them.
15
+
16
+ The CEO's questions you can't answer today:
17
+ 1. "Which errors cost us the most revenue?" (No link between errors
18
+ and business impact)
19
+ 2. "Are errors getting better or worse over time?" (No trend data)
20
+ 3. "Which teams need the most help with error handling?" (No
21
+ team-level aggregation)
22
+ 4. "Can we predict outages before they happen?" (No anomaly
23
+ detection)
24
+ 5. "What's our total cost of API errors?" (Engineering time + lost
25
+ revenue + support costs)
26
+
27
+ Data sources available:
28
+ - Application logs: 2TB/day (Elasticsearch, 30-day retention)
29
+ - Distributed traces: 500GB/day (Jaeger, 7-day retention)
30
+ - Metrics: Prometheus (90-day retention)
31
+ - Error tracking: Sentry (all services)
32
+ - Business metrics: Revenue per API call, conversion rates
33
+ - Support tickets: Zendesk (tagged by API endpoint)
34
+ - Incident records: PagerDuty
35
+
36
+ Desired capabilities:
37
+ - Error-to-revenue impact mapping
38
+ - Automatic error pattern classification
39
+ - Anomaly detection (detect novel error patterns)
40
+ - Team error budget scorecards
41
+ - Root cause suggestion engine
42
+ - Predictive alerting (predict errors before they spike)
43
+ - Error cost attribution (per team, per service, per endpoint)
44
+
45
+ Budget: $500K/year for the platform
46
+
47
+ Task: Design the error analytics platform. Write: the data
48
+ architecture (ingestion, storage, processing), the analytics models
49
+ (pattern classification, anomaly detection, prediction), the
50
+ business impact calculation model, the team scorecards design,
51
+ and the executive dashboard that answers the CEO's 5 questions.
52
+
53
+ assertions:
54
+ - type: llm_judge
55
+ criteria: "Data architecture handles the scale — 2TB+/day ingestion with appropriate storage tiers (hot/warm/cold), connects errors to traces, metrics, business data, and support tickets. The pipeline is cost-effective within the $500K budget"
56
+ weight: 0.35
57
+ description: "Scalable data architecture"
58
+ - type: llm_judge
59
+ criteria: "Analytics models are practical — error pattern classification groups similar errors automatically, anomaly detection identifies novel patterns beyond threshold-based alerting, and the prediction model uses leading indicators (latency trends, error rate derivatives) to forecast incidents"
60
+ weight: 0.35
61
+ description: "Practical analytics models"
62
+ - type: llm_judge
63
+ criteria: "Executive dashboard answers all 5 CEO questions — maps errors to revenue impact (lost transactions, support costs), shows trends over time, provides team-level scorecards with error budgets, includes anomaly/prediction alerts, and calculates total cost of API errors"
64
+ weight: 0.30
65
+ description: "CEO-answering dashboard"
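As a point of reference for the anomaly detection capability, the sketch below flags points in an error-rate series that deviate sharply from a trailing window, using a rolling z-score. This is only a simple statistical baseline and an assumption on my part, not the model the scenario expects: the criteria ask for detection of novel patterns beyond this kind of threshold logic. The function name, window size, and threshold are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def zscore_anomalies(rates: Sequence[float], window: int = 5,
                     threshold: float = 3.0) -> List[int]:
    """Return indexes whose value is far from the trailing-window mean."""
    flagged = []
    for i in range(window, len(rates)):
        history = rates[i - window:i]
        mu = mean(history)
        # Floor the deviation so a perfectly flat history does not divide by zero.
        sigma = max(stdev(history), 1e-9)
        if abs(rates[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged
```

Because the window trails the current point, a spike inflates the deviation for the next few points, which briefly makes the detector less sensitive right after an incident.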
@@ -0,0 +1,63 @@
1
+ meta:
2
+ id: error-cost-optimization
3
+ level: 4
4
+ course: rest-api-error-handling
5
+ type: output
6
+ description: "Optimize API error costs — reduce the financial impact of errors through prevention, faster resolution, and smarter handling"
7
+ tags: [REST, API, cost-optimization, ROI, economics, expert]
8
+
9
+ state: {}
10
+
11
+ trigger: |
12
+ Your CFO asks: "How much do API errors actually cost us?" You
13
+ conduct an analysis and discover the total is $8.5M per year.
14
+
15
+ Cost breakdown:
16
+
17
+ Direct costs — $3.2M/year:
18
+ - Failed transactions: $1.8M (declined payments that were
19
+ actually valid — errors in fraud detection API)
20
+ - Duplicate charges requiring refunds: $400K (retry without
21
+ idempotency)
22
+ - SLA penalty credits: $600K (error rate SLA breaches)
23
+ - Compliance fines: $400K (PII in error logs, audit trail gaps)
24
+
25
+ Indirect costs — $5.3M/year:
26
+ - Engineering time debugging: $2.1M (800 engineer-hours/month
27
+ × $220/hour)
28
+ - Customer support for API errors: $1.2M (40% of Tier 2 tickets)
29
+ - Customer churn attributed to reliability: $1.5M (3 enterprise
30
+ customers cited error handling in exit interviews)
31
+ - Delayed feature delivery: $500K (30% of sprint capacity goes
32
+ to error-related work)
33
+
34
+ The CFO's challenge: "Reduce this by 50% in 12 months. Show me
35
+ the investment required and the projected ROI."
36
+
37
+ Available levers:
38
+ - Error prevention (better validation, testing, monitoring)
39
+ - Error detection speed (faster alerts, better correlation)
40
+ - Error resolution speed (runbooks, automated remediation)
41
+ - Error handling quality (better messages, retry logic)
42
+ - Error architecture (circuit breakers, graceful degradation)
43
+
44
+ Task: Design the error cost optimization program. Write: the
45
+ prioritized investment plan (highest ROI items first), the cost
46
+ model (how to measure error cost reduction over time), the
47
+ specific initiatives for each cost category with projected
48
+ savings, the implementation roadmap (quarterly milestones), and
49
+ the executive report template for tracking progress.
50
+
51
+ assertions:
52
+ - type: llm_judge
53
+ criteria: "Investment plan prioritizes by ROI — identifies quick wins (idempotency for duplicate charges saves $400K with small investment), medium-term improvements (correlation and debugging tools reduce the $2.1M spent on engineering debugging time), and strategic initiatives (error architecture prevents churn). Each initiative has projected cost and savings"
54
+ weight: 0.35
55
+ description: "ROI-prioritized investment plan"
56
+ - type: llm_judge
57
+ criteria: "Cost model is measurable — defines how to track error cost reduction per category (failed transactions measured by payment success rate, engineering time measured by debugging hours, support costs measured by error ticket volume), and includes baseline measurements for before/after comparison"
58
+ weight: 0.35
59
+ description: "Measurable cost model"
60
+ - type: llm_judge
61
+ criteria: "Roadmap is realistic — phases the work quarterly over 12 months, identifies dependencies between initiatives, allocates team capacity realistically, and the projected 50% reduction ($4.25M) is justified by the specific initiative savings"
62
+ weight: 0.30
63
+ description: "Realistic 12-month roadmap"
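The figures in this scenario are internally consistent, which the short check below makes explicit. The dictionary keys and the `simple_roi` helper are hypothetical names. The dollar amounts come straight from the trigger text, and the ROI formula is the plain one-year definition (net savings divided by investment), which is an assumption since the scenario does not fix a formula.

```python
# Annual cost figures copied from the scenario's cost breakdown (USD).
DIRECT = {
    "failed_transactions": 1_800_000,
    "duplicate_charges": 400_000,
    "sla_penalty_credits": 600_000,
    "compliance_fines": 400_000,
}
INDIRECT = {
    "engineering_debugging": 2_100_000,
    "customer_support": 1_200_000,
    "customer_churn": 1_500_000,
    "delayed_features": 500_000,
}

# 800 engineer-hours per month at $220 per hour, over 12 months.
ENGINEERING_DEBUG_COST = 800 * 220 * 12


def total_cost() -> int:
    """Sum direct and indirect annual error costs."""
    return sum(DIRECT.values()) + sum(INDIRECT.values())


def simple_roi(annual_savings: float, investment: float) -> float:
    """Net first-year savings per dollar invested."""
    return (annual_savings - investment) / investment
```

The totals come to $3.2M direct and $5.3M indirect, so $8.5M overall, and half of that is the $4.25M target named in the roadmap criterion. The hourly calculation gives $2,112,000, which the scenario rounds to $2.1M. As a purely illustrative input, eliminating the $400K of duplicate charges with a hypothetical $100K investment would give `simple_roi(400_000, 100_000)` of 3.0.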
@@ -0,0 +1,60 @@
1
+ meta:
2
+ id: error-executive-communication
3
+ level: 4
4
+ course: rest-api-error-handling
5
+ type: output
6
+ description: "Communicate API errors to executives — translate technical error data into business impact language for leadership"
7
+ tags: [REST, API, executive, communication, board, business-impact, expert]
8
+
9
+ state: {}
10
+
11
+ trigger: |
12
+ You're preparing for the quarterly board meeting. The board wants
13
+ to understand the company's API reliability story after a
14
+ high-profile incident last quarter where a 4-hour API outage cost $2M
15
+ in revenue and made the tech press.
16
+
17
+ Board members and their concerns:
18
+ - CEO: "Is our API platform a competitive advantage or liability?"
19
+ - CFO: "What's the ROI of our reliability investments?"
20
+ - CTO (board member): "Are we architecturally sound or is this
21
+ held together with duct tape?"
22
+ - Board member (former CISO): "Are error handling gaps creating
23
+ compliance or security risk?"
24
+ - Board member (investor representative): "How does our reliability
25
+ compare to competitors?"
26
+
27
+ Data you have:
28
+ - DORA metrics: deployment frequency 15/day, lead time 2 hours,
29
+ MTTR 45 minutes, change failure rate 8%
30
+ - Error rate trend: 0.5% → 0.3% over 6 months
31
+ - Revenue impact of errors: $8.5M/year (down from $12M)
32
+ - Uptime: 99.95% (target: 99.99%)
33
+ - Customer NPS for API: 42 (industry avg: 38)
34
+ - Reliability investment: $3M this year
35
+ - Competitor comparison: Stripe (99.999%), Twilio (99.95%),
36
+ Plaid (99.9%)
37
+
38
+ You need to present this in 15 minutes max, with 5 slides, in
39
+ language that non-technical board members understand.
40
+
41
+ Task: Write the board presentation. Include: the 5-slide structure
42
+ (with specific content per slide), the narrative arc (from problem
43
+ to progress to plan), how to translate DORA metrics and error
44
+ rates into business language, how to handle tough questions from
45
+ each board member, and the specific asks from the board (budget,
46
+ support, decisions).
47
+
48
+ assertions:
49
+ - type: llm_judge
50
+ criteria: "Board presentation translates technical data to business impact — error rates become revenue impact dollars, MTTR becomes customer experience minutes, DORA metrics become competitive positioning. Non-technical board members can understand every slide without technical background"
51
+ weight: 0.35
52
+ description: "Business-language translation"
53
+ - type: llm_judge
54
+ criteria: "Narrative arc is compelling — acknowledges the incident honestly, shows concrete progress (error cost down from $12M to $8.5M), presents a clear plan to reach the target, and positions reliability as competitive advantage (not just risk mitigation). The 5-slide structure is tight and focused"
55
+ weight: 0.35
56
+ description: "Compelling narrative arc"
57
+ - type: llm_judge
58
+ criteria: "Handles each board member's concerns — has prepared answers for the CEO (competitive positioning), CFO (ROI of $3M investment), CTO (architectural soundness), CISO (compliance risk), and investor (competitor comparison). Includes specific asks (budget, headcount, decisions)"
59
+ weight: 0.30
60
+ description: "Addresses all board concerns"
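Translating uptime percentages into minutes of downtime is one concrete way to put the figures above into board language. The sketch below assumes a 365-day year; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Convert an uptime percentage into allowed downtime minutes per year."""
    return (1 - uptime_percent / 100) * MINUTES_PER_YEAR
```

Under that assumption, 99.95% uptime allows roughly 263 minutes of downtime a year (about 4.4 hours), while the 99.99% target allows roughly 53 minutes. The single 4-hour outage therefore used up nearly the whole annual allowance at the current level and more than four times the allowance at the target level.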
@@ -0,0 +1,67 @@
1
+ meta:
2
+ id: error-handling-architecture
3
+ level: 4
4
+ course: rest-api-error-handling
5
+ type: output
6
+ description: "Architect enterprise error handling — design the 4-layer error handling architecture for a Fortune 500 API platform"
7
+ tags: [REST, API, architecture, enterprise, Fortune-500, layers, expert]
8
+
9
+ state: {}
10
+
11
+ trigger: |
12
+ You're the Chief Architect designing the error handling
13
+ architecture for a Fortune 500 company's API platform. The platform
14
+ serves 1B requests/day across 300 microservices, with $500M in
15
+ annual transaction volume flowing through the APIs.
16
+
17
+ The architecture must support:
18
+ - 300 microservices across 60 teams
19
+ - 5 compliance frameworks (PCI DSS, HIPAA, SOC 2, GDPR, FedRAMP)
20
+ - 10,000 API consumers (external)
21
+ - 99.99% availability SLA
22
+ - Multi-region (US-East, US-West, EU-West, AP-Southeast)
23
+ - Multi-cloud (AWS primary, GCP secondary)
24
+
25
+ Design the 4-layer error handling architecture:
26
+
27
+ Layer 1 — Error Generation:
28
+ How errors are created, classified, and enriched at the service
29
+ level. Includes error types, severity, retryability, and context.
30
+
31
+ Layer 2 — Error Propagation:
32
+ How errors flow through service chains, API gateways, and load
33
+ balancers. Includes translation, aggregation, and enrichment.
34
+
35
+ Layer 3 — Error Observation:
36
+ How errors are logged, traced, alerted, and analyzed. Includes
37
+ the data pipeline, storage, and analytics.
38
+
39
+ Layer 4 — Error Communication:
40
+ How errors are presented to different audiences (API consumers,
41
+ developers, ops, executives, regulators). Includes formatting,
42
+ redaction, and documentation.
43
+
44
+ Cross-cutting concerns:
45
+ - Multi-region consistency (same error in US and EU)
46
+ - Compliance (data never crosses wrong borders)
47
+ - Performance (error handling adds <1ms latency)
48
+ - Cost (within $2M/year budget for the platform)
49
+
50
+ Task: Design all 4 layers with their interactions. For each layer,
51
+ write: the detailed design, the technology choices, the failure
52
+ modes (what happens when the error handling itself fails), and the
53
+ interfaces between layers.
54
+
55
+ assertions:
56
+ - type: llm_judge
57
+ criteria: "4-layer design is coherent — layers build on each other, interfaces are well-defined, and each layer handles the scale (1B requests/day, 300 services). Error generation is standardized via shared libraries, propagation handles multi-hop chains, observation handles the data volume, and communication tailors output by audience"
58
+ weight: 0.35
59
+ description: "Coherent 4-layer design"
60
+ - type: llm_judge
61
+ criteria: "Cross-cutting concerns are addressed — multi-region error consistency (same format, same codes globally), compliance-aware error handling (data residency for logs, PII redaction), performance budget (<1ms added latency), and the error handling system itself has failure modes and fallbacks"
62
+ weight: 0.35
63
+ description: "Cross-cutting concerns addressed"
64
+ - type: llm_judge
65
+ criteria: "Technology choices are justified — selects specific tools for each layer (error libraries, tracing systems, log platforms, API gateway config) with build-vs-buy rationale, and the total cost fits within the $2M/year budget"
66
+ weight: 0.30
67
+ description: "Justified technology choices"
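Layer 1 asks for errors that carry a type, severity, retryability, and context. A shared-library version of that might look like the sketch below. Everything here is a hypothetical illustration: the `ServiceError` and `Severity` names, the set of retryable statuses, and the rule that maps 5xx to high severity are assumptions, not part of the scenario.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    LOW = "low"
    HIGH = "high"


# Assumed policy: throttling and transient upstream failures are retryable.
RETRYABLE_STATUSES = frozenset({429, 502, 503, 504})


@dataclass(frozen=True)
class ServiceError:
    """Error record created at the service level (Layer 1)."""
    code: str
    message: str
    http_status: int
    severity: Severity
    retryable: bool
    context: dict = field(default_factory=dict)


def classify(code: str, message: str, http_status: int, **context) -> ServiceError:
    """Derive severity and retryability from the HTTP status."""
    severity = Severity.HIGH if http_status >= 500 else Severity.LOW
    return ServiceError(
        code=code,
        message=message,
        http_status=http_status,
        severity=severity,
        retryable=http_status in RETRYABLE_STATUSES,
        context=dict(context),  # e.g. region or upstream service name
    )
```

Keeping classification in one function means the propagation layer can trust the `retryable` flag instead of re-deriving it at every hop.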
@@ -0,0 +1,68 @@
1
+ meta:
2
+ id: error-org-design
3
+ level: 4
4
+ course: rest-api-error-handling
5
+ type: output
6
+ description: "Design the API reliability organization — structure teams, roles, and processes for enterprise-scale error management"
7
+ tags: [REST, API, org-design, SRE, platform-engineering, expert]
8
+
9
+ state: {}
10
+
11
+ trigger: |
12
+ Your company has grown from 200 to 2,000 engineers in 3 years.
13
+ API error handling was manageable when everyone knew each other,
14
+ but now it's chaos. The VP of Engineering asks you to design the
15
+ organizational structure for API reliability.
16
+
17
+ Current state:
18
+ - No dedicated reliability team (every team does their own thing)
19
+ - On-call rotation is dreaded (engineers lack error handling skills)
20
+ - Incident response is ad-hoc (whoever's online)
21
+ - Error handling standards exist but nobody enforces them
22
+ - Knowledge is siloed (team A's error patterns are invisible to B)
23
+ - New hires take 3 months before they can diagnose API errors
24
+
25
+ Options to consider:
26
+
27
+ Option A — Centralized SRE team:
28
+ One 20-person SRE team owns all API reliability.
29
+ Pro: Consistency, expertise concentration
30
+ Con: Bottleneck, teams don't learn, "throw it over the wall"
31
+
32
+ Option B — Embedded SREs:
33
+ Each of the 40 teams gets 0.5 SRE (20 SREs distributed).
34
+ Pro: Close to product teams, context-aware
35
+ Con: Inconsistency, isolation, career path unclear
36
+
37
+ Option C — Platform + embedded hybrid:
38
+ Central platform team (10) builds tools and standards.
39
+ Embedded reliability champions (1 per team, 40) enforce and adapt.
40
+ Pro: Best of both worlds
41
+ Con: Complex coordination, champion role ambiguity
42
+
43
+ Option D — Full team ownership (you build it, you run it):
44
+ No dedicated SREs. Every team owns their error handling end-to-end.
45
+ Pro: Full ownership, fast iteration
46
+ Con: Inconsistency, no expertise concentration, on-call burden
47
+
48
+ Budget: 20 headcount for reliability roles
49
+
50
+ Task: Recommend and design the organizational model. Write: the
51
+ team structure (with specific roles and responsibilities), the
52
+ interaction model between platform and product teams, the career
53
+ path for reliability engineers, the on-call rotation design, and
54
+ the knowledge sharing system that prevents siloing.
55
+
56
+ assertions:
57
+ - type: llm_judge
58
+ criteria: "Organizational model is well-reasoned — analyzes all 4 options with trade-offs, recommends one (likely the hybrid) with clear justification, and the 20 headcount is allocated across roles with specific responsibilities. Addresses why pure centralized and pure distributed models fail at this scale"
59
+ weight: 0.35
60
+ description: "Well-reasoned org model"
61
+ - type: llm_judge
62
+ criteria: "Interaction model is practical — defines how the platform team and product teams collaborate on error handling (shared standards, tooling, escalation paths), how reliability champions bridge the gap, and how knowledge flows between teams. The on-call rotation is sustainable (not just dumping on SREs)"
63
+ weight: 0.35
64
+ description: "Practical interaction model"
65
+ - type: llm_judge
66
+ criteria: "Career path and knowledge sharing are addressed — reliability engineers have growth opportunities (IC and management tracks), the champion role has recognition and development, and the knowledge sharing system captures error patterns, runbooks, and incident learnings across teams"
67
+ weight: 0.30
68
+ description: "Career path and knowledge sharing"
@@ -0,0 +1,65 @@
1
+ meta:
2
+ id: error-sla-design
3
+ level: 4
4
+ course: rest-api-error-handling
5
+ type: output
6
+ description: "Design API error SLAs — create enforceable service level agreements for error rates, response times, and error resolution"
7
+ tags: [REST, API, SLA, contracts, reliability, penalties, expert]
8
+
9
+ state: {}
10
+
11
+ trigger: |
12
+ Your API platform serves 500 enterprise customers. Three major
13
+ customers are threatening to leave because of error-related
14
+ issues, and they're demanding formal SLAs.
15
+
16
+ Customer A (payments processor, $5M ARR):
17
+ "We need guarantees on error rates. Last month your payment
18
+ endpoint had a 2% error rate for 4 hours. That cost us $200K
19
+ in failed transactions. We want financial penalties if errors
20
+ exceed thresholds."
21
+
22
+ Customer B (healthcare platform, $3M ARR):
23
+ "When errors occur, your error messages don't help us debug.
24
+ We need SLAs on error message quality — every error must include
25
+ an actionable message and a correlation ID we can reference with
26
+ your support team."
27
+
28
+ Customer C (e-commerce aggregator, $2M ARR):
29
+ "Your error response times are inconsistent. Sometimes errors
30
+ return in 50ms, sometimes in 30 seconds (timeout). We need
31
+ SLAs on error response latency so our circuit breakers work
32
+ predictably."
33
+
34
+ Your sales team wants to create three SLA tiers:
35
+ - Platinum ($50K+/mo): strictest SLAs, dedicated support
36
+ - Gold ($10K+/mo): standard SLAs, priority support
37
+ - Silver (<$10K/mo): best-effort, standard support
38
+
39
+ Questions from legal:
40
+ 1. "How do we measure error rate? Does a client-caused 400 count?"
41
+ 2. "What are appropriate financial penalties? Credits or cash?"
42
+ 3. "How do we handle force majeure (upstream provider outages)?"
43
+ 4. "How do we prevent SLA gaming (customers triggering errors to
44
+ claim credits)?"
45
+
46
+ Task: Design the complete SLA framework. Write: the SLA tiers
47
+ with specific metrics and thresholds for each customer concern
48
+ (error rate, error quality, error response time), the measurement
49
+ methodology (how to calculate without disputes), the penalty
50
+ structure (credits, escalation), the exclusions (what doesn't
51
+ count), and the internal engineering requirements to meet the SLAs.
52
+
53
+ assertions:
54
+ - type: llm_judge
55
+ criteria: "SLA metrics are precisely defined — error rate excludes client errors (4xx) and counts only server errors (5xx) and timeouts, error quality is measurable (correlation ID present, actionable message, correct status code), and error response latency has clear P99 targets per tier. All three customer concerns are addressed"
56
+ weight: 0.35
57
+ description: "Precisely defined SLA metrics"
58
+ - type: llm_judge
59
+ criteria: "Penalty structure is fair and enforceable — service credits are proportional to impact, excludes force majeure and client-caused errors, has anti-gaming provisions, and the measurement methodology prevents disputes (shared monitoring dashboard, agreed-upon measurement points)"
60
+ weight: 0.35
61
+ description: "Fair enforceable penalties"
62
+ - type: llm_judge
63
+ criteria: "Internal engineering requirements are realistic — identifies what the platform team must build to meet the SLAs (per-customer error tracking, error quality validation, guaranteed timeout behavior), and the SLA tiers are achievable without over-committing"
64
+ weight: 0.30
65
+ description: "Realistic engineering requirements"
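Legal's first question (whether a client-caused 400 counts) can be made concrete with a small calculation. The sketch below follows the first criterion by counting only 5xx responses and timeouts as errors. It additionally assumes that 4xx responses are dropped from the denominator as well, which is one possible reading and would need to be agreed in the contract. The function name and input shape are hypothetical.

```python
from typing import Dict


def sla_error_rate(status_counts: Dict[int, int], timeouts: int = 0) -> float:
    """Server-side error rate: (5xx + timeouts) / requests excluding 4xx."""
    total = sum(status_counts.values()) + timeouts
    client_errors = sum(n for s, n in status_counts.items() if 400 <= s <= 499)
    server_errors = sum(n for s, n in status_counts.items() if 500 <= s <= 599)
    # Assumption: client-caused requests are removed from the denominator too,
    # so a flood of bad requests cannot dilute or inflate the measured rate.
    eligible = total - client_errors
    if eligible == 0:
        return 0.0
    return (server_errors + timeouts) / eligible
```

For example, with 980 successes, 1,000 client-caused 404s, 15 server errors, and 5 timeouts, the eligible traffic is 1,000 requests and the rate is 2%, the same level Customer A reported. Excluding 4xx on both sides also speaks to the anti-gaming question, since deliberately malformed requests cannot move the metric.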