dojo.md 0.2.0 → 0.2.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (225)
  1. package/courses/GENERATION_LOG.md +45 -0
  2. package/courses/aws-lambda-debugging/course.yaml +11 -0
  3. package/courses/aws-lambda-debugging/scenarios/level-1/api-gateway-integration.yaml +71 -0
  4. package/courses/aws-lambda-debugging/scenarios/level-1/cloudwatch-logs-basics.yaml +64 -0
  5. package/courses/aws-lambda-debugging/scenarios/level-1/cold-start-basics.yaml +70 -0
  6. package/courses/aws-lambda-debugging/scenarios/level-1/environment-variable-issues.yaml +72 -0
  7. package/courses/aws-lambda-debugging/scenarios/level-1/first-debugging-shift.yaml +73 -0
  8. package/courses/aws-lambda-debugging/scenarios/level-1/handler-import-errors.yaml +71 -0
  9. package/courses/aws-lambda-debugging/scenarios/level-1/iam-permission-errors.yaml +68 -0
  10. package/courses/aws-lambda-debugging/scenarios/level-1/invocation-errors.yaml +72 -0
  11. package/courses/aws-lambda-debugging/scenarios/level-1/lambda-timeout-errors.yaml +65 -0
  12. package/courses/aws-lambda-debugging/scenarios/level-1/memory-and-oom.yaml +70 -0
  13. package/courses/aws-lambda-debugging/scenarios/level-2/async-invocation-failures.yaml +72 -0
  14. package/courses/aws-lambda-debugging/scenarios/level-2/cold-start-optimization.yaml +76 -0
  15. package/courses/aws-lambda-debugging/scenarios/level-2/dynamodb-streams-debugging.yaml +70 -0
  16. package/courses/aws-lambda-debugging/scenarios/level-2/intermediate-debugging-shift.yaml +71 -0
  17. package/courses/aws-lambda-debugging/scenarios/level-2/lambda-concurrency-management.yaml +70 -0
  18. package/courses/aws-lambda-debugging/scenarios/level-2/lambda-layers-debugging.yaml +76 -0
  19. package/courses/aws-lambda-debugging/scenarios/level-2/sam-local-debugging.yaml +74 -0
  20. package/courses/aws-lambda-debugging/scenarios/level-2/sqs-event-source.yaml +72 -0
  21. package/courses/aws-lambda-debugging/scenarios/level-2/vpc-networking-issues.yaml +71 -0
  22. package/courses/aws-lambda-debugging/scenarios/level-2/xray-tracing.yaml +62 -0
  23. package/courses/aws-lambda-debugging/scenarios/level-3/advanced-debugging-shift.yaml +72 -0
  24. package/courses/aws-lambda-debugging/scenarios/level-3/container-image-lambda.yaml +79 -0
  25. package/courses/aws-lambda-debugging/scenarios/level-3/cross-account-invocation.yaml +72 -0
  26. package/courses/aws-lambda-debugging/scenarios/level-3/eventbridge-patterns.yaml +79 -0
  27. package/courses/aws-lambda-debugging/scenarios/level-3/iac-deployment-debugging.yaml +68 -0
  28. package/courses/aws-lambda-debugging/scenarios/level-3/kinesis-stream-processing.yaml +64 -0
  29. package/courses/aws-lambda-debugging/scenarios/level-3/lambda-at-edge.yaml +64 -0
  30. package/courses/aws-lambda-debugging/scenarios/level-3/lambda-extensions-debugging.yaml +67 -0
  31. package/courses/aws-lambda-debugging/scenarios/level-3/powertools-observability.yaml +79 -0
  32. package/courses/aws-lambda-debugging/scenarios/level-3/step-functions-debugging.yaml +80 -0
  33. package/courses/aws-lambda-debugging/scenarios/level-4/cost-optimization-strategy.yaml +67 -0
  34. package/courses/aws-lambda-debugging/scenarios/level-4/expert-debugging-shift.yaml +62 -0
  35. package/courses/aws-lambda-debugging/scenarios/level-4/incident-management-serverless.yaml +61 -0
  36. package/courses/aws-lambda-debugging/scenarios/level-4/multi-region-serverless.yaml +67 -0
  37. package/courses/aws-lambda-debugging/scenarios/level-4/observability-platform-design.yaml +71 -0
  38. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-architecture-design.yaml +64 -0
  39. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-data-architecture.yaml +66 -0
  40. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-migration-strategy.yaml +65 -0
  41. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-security-design.yaml +60 -0
  42. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-testing-strategy.yaml +62 -0
  43. package/courses/aws-lambda-debugging/scenarios/level-5/board-serverless-strategy.yaml +63 -0
  44. package/courses/aws-lambda-debugging/scenarios/level-5/consulting-serverless-adoption.yaml +57 -0
  45. package/courses/aws-lambda-debugging/scenarios/level-5/industry-serverless-patterns.yaml +62 -0
  46. package/courses/aws-lambda-debugging/scenarios/level-5/ma-serverless-integration.yaml +75 -0
  47. package/courses/aws-lambda-debugging/scenarios/level-5/master-debugging-shift.yaml +61 -0
  48. package/courses/aws-lambda-debugging/scenarios/level-5/organizational-serverless-transformation.yaml +65 -0
  49. package/courses/aws-lambda-debugging/scenarios/level-5/regulatory-serverless.yaml +61 -0
  50. package/courses/aws-lambda-debugging/scenarios/level-5/serverless-economics.yaml +65 -0
  51. package/courses/aws-lambda-debugging/scenarios/level-5/serverless-future-technology.yaml +66 -0
  52. package/courses/aws-lambda-debugging/scenarios/level-5/serverless-platform-design.yaml +71 -0
  53. package/courses/docker-container-debugging/course.yaml +11 -0
  54. package/courses/docker-container-debugging/scenarios/level-1/container-exit-codes.yaml +59 -0
  55. package/courses/docker-container-debugging/scenarios/level-1/container-networking-basics.yaml +69 -0
  56. package/courses/docker-container-debugging/scenarios/level-1/docker-logs-debugging.yaml +67 -0
  57. package/courses/docker-container-debugging/scenarios/level-1/dockerfile-build-failures.yaml +71 -0
  58. package/courses/docker-container-debugging/scenarios/level-1/environment-variable-issues.yaml +74 -0
  59. package/courses/docker-container-debugging/scenarios/level-1/first-debugging-shift.yaml +70 -0
  60. package/courses/docker-container-debugging/scenarios/level-1/image-pull-failures.yaml +68 -0
  61. package/courses/docker-container-debugging/scenarios/level-1/port-mapping-issues.yaml +67 -0
  62. package/courses/docker-container-debugging/scenarios/level-1/resource-limits-oom.yaml +70 -0
  63. package/courses/docker-container-debugging/scenarios/level-1/volume-mount-problems.yaml +66 -0
  64. package/courses/docker-container-debugging/scenarios/level-2/container-health-checks.yaml +73 -0
  65. package/courses/docker-container-debugging/scenarios/level-2/docker-compose-debugging.yaml +66 -0
  66. package/courses/docker-container-debugging/scenarios/level-2/docker-exec-debugging.yaml +71 -0
  67. package/courses/docker-container-debugging/scenarios/level-2/image-layer-optimization.yaml +81 -0
  68. package/courses/docker-container-debugging/scenarios/level-2/intermediate-debugging-shift.yaml +73 -0
  69. package/courses/docker-container-debugging/scenarios/level-2/logging-and-log-rotation.yaml +76 -0
  70. package/courses/docker-container-debugging/scenarios/level-2/multi-stage-build-debugging.yaml +76 -0
  71. package/courses/docker-container-debugging/scenarios/level-2/network-debugging-tools.yaml +67 -0
  72. package/courses/docker-container-debugging/scenarios/level-2/pid1-signal-handling.yaml +71 -0
  73. package/courses/docker-container-debugging/scenarios/level-2/security-scanning-basics.yaml +67 -0
  74. package/courses/docker-container-debugging/scenarios/level-3/advanced-debugging-shift.yaml +77 -0
  75. package/courses/docker-container-debugging/scenarios/level-3/buildkit-optimization.yaml +67 -0
  76. package/courses/docker-container-debugging/scenarios/level-3/container-filesystem-debugging.yaml +70 -0
  77. package/courses/docker-container-debugging/scenarios/level-3/container-security-hardening.yaml +74 -0
  78. package/courses/docker-container-debugging/scenarios/level-3/disk-space-management.yaml +74 -0
  79. package/courses/docker-container-debugging/scenarios/level-3/docker-api-automation.yaml +72 -0
  80. package/courses/docker-container-debugging/scenarios/level-3/docker-daemon-issues.yaml +73 -0
  81. package/courses/docker-container-debugging/scenarios/level-3/docker-in-docker-ci.yaml +69 -0
  82. package/courses/docker-container-debugging/scenarios/level-3/overlay-network-debugging.yaml +70 -0
  83. package/courses/docker-container-debugging/scenarios/level-3/production-container-ops.yaml +71 -0
  84. package/courses/docker-container-debugging/scenarios/level-4/cicd-pipeline-design.yaml +66 -0
  85. package/courses/docker-container-debugging/scenarios/level-4/container-monitoring-observability.yaml +63 -0
  86. package/courses/docker-container-debugging/scenarios/level-4/container-orchestration-strategy.yaml +62 -0
  87. package/courses/docker-container-debugging/scenarios/level-4/container-performance-engineering.yaml +64 -0
  88. package/courses/docker-container-debugging/scenarios/level-4/container-security-architecture.yaml +66 -0
  89. package/courses/docker-container-debugging/scenarios/level-4/enterprise-image-management.yaml +58 -0
  90. package/courses/docker-container-debugging/scenarios/level-4/expert-debugging-shift.yaml +63 -0
  91. package/courses/docker-container-debugging/scenarios/level-4/incident-response-containers.yaml +70 -0
  92. package/courses/docker-container-debugging/scenarios/level-4/multi-environment-management.yaml +65 -0
  93. package/courses/docker-container-debugging/scenarios/level-4/stateful-service-containers.yaml +65 -0
  94. package/courses/docker-container-debugging/scenarios/level-5/board-infrastructure-strategy.yaml +58 -0
  95. package/courses/docker-container-debugging/scenarios/level-5/consulting-container-strategy.yaml +61 -0
  96. package/courses/docker-container-debugging/scenarios/level-5/container-platform-architecture.yaml +67 -0
  97. package/courses/docker-container-debugging/scenarios/level-5/container-platform-economics.yaml +67 -0
  98. package/courses/docker-container-debugging/scenarios/level-5/container-technology-evolution.yaml +67 -0
  99. package/courses/docker-container-debugging/scenarios/level-5/disaster-recovery-containers.yaml +66 -0
  100. package/courses/docker-container-debugging/scenarios/level-5/industry-container-patterns.yaml +71 -0
  101. package/courses/docker-container-debugging/scenarios/level-5/master-debugging-shift.yaml +62 -0
  102. package/courses/docker-container-debugging/scenarios/level-5/organizational-transformation.yaml +67 -0
  103. package/courses/docker-container-debugging/scenarios/level-5/regulatory-compliance-containers.yaml +61 -0
  104. package/courses/kubernetes-deployment-troubleshooting/course.yaml +12 -0
  105. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/configmap-secret-issues.yaml +69 -0
  106. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/crashloopbackoff.yaml +68 -0
  107. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/deployment-rollout.yaml +56 -0
  108. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/first-troubleshooting-shift.yaml +65 -0
  109. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/health-probe-failures.yaml +70 -0
  110. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/imagepullbackoff.yaml +57 -0
  111. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/kubectl-debugging-basics.yaml +56 -0
  112. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/oomkilled.yaml +70 -0
  113. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/pending-pods.yaml +68 -0
  114. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/service-not-reachable.yaml +66 -0
  115. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/dns-resolution-failures.yaml +63 -0
  116. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/helm-deployment-failures.yaml +63 -0
  117. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/hpa-scaling-issues.yaml +62 -0
  118. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/ingress-routing-issues.yaml +63 -0
  119. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/init-container-failures.yaml +63 -0
  120. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/intermediate-troubleshooting-shift.yaml +66 -0
  121. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/network-policy-blocking.yaml +67 -0
  122. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/persistent-volume-issues.yaml +69 -0
  123. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/rbac-permission-denied.yaml +57 -0
  124. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/resource-quota-limits.yaml +64 -0
  125. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/advanced-troubleshooting-shift.yaml +69 -0
  126. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/cluster-upgrade-failures.yaml +71 -0
  127. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/gitops-drift-detection.yaml +62 -0
  128. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/job-cronjob-failures.yaml +67 -0
  129. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/monitoring-alerting-gaps.yaml +64 -0
  130. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/multi-container-debugging.yaml +68 -0
  131. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/node-pressure-evictions.yaml +70 -0
  132. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/pod-disruption-budgets.yaml +59 -0
  133. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/service-mesh-debugging.yaml +64 -0
  134. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/statefulset-troubleshooting.yaml +69 -0
  135. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/capacity-planning.yaml +65 -0
  136. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/cost-optimization.yaml +57 -0
  137. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/disaster-recovery-design.yaml +56 -0
  138. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/executive-communication.yaml +62 -0
  139. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/expert-troubleshooting-shift.yaml +65 -0
  140. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/incident-management-process.yaml +59 -0
  141. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-cluster-operations.yaml +62 -0
  142. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-tenancy-design.yaml +55 -0
  143. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/platform-engineering.yaml +59 -0
  144. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/security-hardening.yaml +58 -0
  145. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/behavioral-science.yaml +62 -0
  146. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/board-strategy.yaml +61 -0
  147. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/cloud-native-future.yaml +65 -0
  148. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/comprehensive-platform.yaml +57 -0
  149. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/consulting-engagement.yaml +62 -0
  150. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/industry-benchmarks.yaml +58 -0
  151. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/ma-integration.yaml +62 -0
  152. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/master-troubleshooting-shift.yaml +73 -0
  153. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/product-development.yaml +65 -0
  154. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/regulatory-compliance.yaml +76 -0
  155. package/courses/mysql-query-optimization/course.yaml +11 -0
  156. package/courses/mysql-query-optimization/scenarios/level-1/buffer-pool-basics.yaml +65 -0
  157. package/courses/mysql-query-optimization/scenarios/level-1/explain-basics.yaml +66 -0
  158. package/courses/mysql-query-optimization/scenarios/level-1/first-optimization-shift.yaml +78 -0
  159. package/courses/mysql-query-optimization/scenarios/level-1/innodb-index-fundamentals.yaml +68 -0
  160. package/courses/mysql-query-optimization/scenarios/level-1/join-basics.yaml +66 -0
  161. package/courses/mysql-query-optimization/scenarios/level-1/n-plus-one-queries.yaml +67 -0
  162. package/courses/mysql-query-optimization/scenarios/level-1/query-rewriting-basics.yaml +66 -0
  163. package/courses/mysql-query-optimization/scenarios/level-1/select-star-problems.yaml +68 -0
  164. package/courses/mysql-query-optimization/scenarios/level-1/slow-query-diagnosis.yaml +65 -0
  165. package/courses/mysql-query-optimization/scenarios/level-1/where-clause-optimization.yaml +65 -0
  166. package/courses/mysql-query-optimization/scenarios/level-2/buffer-pool-tuning.yaml +64 -0
  167. package/courses/mysql-query-optimization/scenarios/level-2/composite-index-design.yaml +71 -0
  168. package/courses/mysql-query-optimization/scenarios/level-2/covering-and-invisible-indexes.yaml +69 -0
  169. package/courses/mysql-query-optimization/scenarios/level-2/cte-and-window-functions.yaml +78 -0
  170. package/courses/mysql-query-optimization/scenarios/level-2/intermediate-optimization-shift.yaml +68 -0
  171. package/courses/mysql-query-optimization/scenarios/level-2/join-optimization.yaml +67 -0
  172. package/courses/mysql-query-optimization/scenarios/level-2/performance-schema-analysis.yaml +69 -0
  173. package/courses/mysql-query-optimization/scenarios/level-2/query-optimizer-hints.yaml +74 -0
  174. package/courses/mysql-query-optimization/scenarios/level-2/subquery-optimization.yaml +70 -0
  175. package/courses/mysql-query-optimization/scenarios/level-2/write-optimization.yaml +63 -0
  176. package/courses/mysql-query-optimization/scenarios/level-3/advanced-optimization-shift.yaml +71 -0
  177. package/courses/mysql-query-optimization/scenarios/level-3/connection-management.yaml +67 -0
  178. package/courses/mysql-query-optimization/scenarios/level-3/full-text-search.yaml +77 -0
  179. package/courses/mysql-query-optimization/scenarios/level-3/json-optimization.yaml +87 -0
  180. package/courses/mysql-query-optimization/scenarios/level-3/lock-contention-analysis.yaml +68 -0
  181. package/courses/mysql-query-optimization/scenarios/level-3/monitoring-alerting.yaml +63 -0
  182. package/courses/mysql-query-optimization/scenarios/level-3/online-schema-changes.yaml +79 -0
  183. package/courses/mysql-query-optimization/scenarios/level-3/partitioning-strategies.yaml +83 -0
  184. package/courses/mysql-query-optimization/scenarios/level-3/query-profiling-deep-dive.yaml +84 -0
  185. package/courses/mysql-query-optimization/scenarios/level-3/replication-optimization.yaml +66 -0
  186. package/courses/mysql-query-optimization/scenarios/level-4/aurora-vs-rds-evaluation.yaml +61 -0
  187. package/courses/mysql-query-optimization/scenarios/level-4/data-architecture.yaml +62 -0
  188. package/courses/mysql-query-optimization/scenarios/level-4/database-migration-planning.yaml +59 -0
  189. package/courses/mysql-query-optimization/scenarios/level-4/enterprise-governance.yaml +50 -0
  190. package/courses/mysql-query-optimization/scenarios/level-4/executive-communication.yaml +54 -0
  191. package/courses/mysql-query-optimization/scenarios/level-4/expert-optimization-shift.yaml +67 -0
  192. package/courses/mysql-query-optimization/scenarios/level-4/high-availability-architecture.yaml +60 -0
  193. package/courses/mysql-query-optimization/scenarios/level-4/optimizer-internals.yaml +62 -0
  194. package/courses/mysql-query-optimization/scenarios/level-4/performance-sla-design.yaml +52 -0
  195. package/courses/mysql-query-optimization/scenarios/level-4/read-replica-scaling.yaml +51 -0
  196. package/courses/mysql-query-optimization/scenarios/level-5/ai-database-future.yaml +45 -0
  197. package/courses/mysql-query-optimization/scenarios/level-5/behavioral-science.yaml +44 -0
  198. package/courses/mysql-query-optimization/scenarios/level-5/benchmark-design.yaml +47 -0
  199. package/courses/mysql-query-optimization/scenarios/level-5/board-strategy.yaml +48 -0
  200. package/courses/mysql-query-optimization/scenarios/level-5/comprehensive-platform.yaml +49 -0
  201. package/courses/mysql-query-optimization/scenarios/level-5/consulting-engagement.yaml +52 -0
  202. package/courses/mysql-query-optimization/scenarios/level-5/ma-database-integration.yaml +47 -0
  203. package/courses/mysql-query-optimization/scenarios/level-5/master-optimization-shift.yaml +56 -0
  204. package/courses/mysql-query-optimization/scenarios/level-5/product-development.yaml +48 -0
  205. package/courses/mysql-query-optimization/scenarios/level-5/regulatory-compliance.yaml +48 -0
  206. package/courses/postgresql-query-optimization/scenarios/level-5/comprehensive-database-system.yaml +70 -0
  207. package/courses/postgresql-query-optimization/scenarios/level-5/database-ai-future.yaml +81 -0
  208. package/courses/postgresql-query-optimization/scenarios/level-5/database-behavioral-science.yaml +63 -0
  209. package/courses/postgresql-query-optimization/scenarios/level-5/database-board-strategy.yaml +77 -0
  210. package/courses/postgresql-query-optimization/scenarios/level-5/database-consulting-engagement.yaml +61 -0
  211. package/courses/postgresql-query-optimization/scenarios/level-5/database-industry-benchmarks.yaml +64 -0
  212. package/courses/postgresql-query-optimization/scenarios/level-5/database-ma-integration.yaml +71 -0
  213. package/courses/postgresql-query-optimization/scenarios/level-5/database-product-development.yaml +72 -0
  214. package/courses/postgresql-query-optimization/scenarios/level-5/database-regulatory-landscape.yaml +76 -0
  215. package/courses/postgresql-query-optimization/scenarios/level-5/master-optimization-shift.yaml +66 -0
  216. package/courses/terraform-infrastructure-setup/course.yaml +11 -0
  217. package/courses/terraform-infrastructure-setup/scenarios/level-1/hcl-syntax-errors.yaml +65 -0
  218. package/courses/terraform-infrastructure-setup/scenarios/level-1/provider-configuration.yaml +62 -0
  219. package/courses/terraform-infrastructure-setup/scenarios/level-1/terraform-init-errors.yaml +72 -0
  220. package/courses/terraform-infrastructure-setup/scenarios/level-1/variable-and-output-errors.yaml +78 -0
  221. package/dist/mcp/session-manager.d.ts +7 -4
  222. package/dist/mcp/session-manager.d.ts.map +1 -1
  223. package/dist/mcp/session-manager.js +23 -8
  224. package/dist/mcp/session-manager.js.map +1 -1
  225. package/package.json +3 -2
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/incident-management-process.yaml
@@ -0,0 +1,59 @@
+ meta:
+   id: incident-management-process
+   level: 4
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Design Kubernetes incident management process — escalation procedures, communication templates, postmortem culture, and SLA/SLO framework"
+   tags: [Kubernetes, incident-management, SRE, SLO, postmortem, process, expert]
+
+ state: {}
+
+ trigger: |
+   Your organization experienced 3 major Kubernetes incidents in the
+   past month, each lasting 2+ hours. Post-incident reviews revealed
+   systemic issues:
+
+   Incident 1 — Database Outage (3.5 hours):
+   - Root cause: Accidental ConfigMap update pushed to production
+   - Detection: Customer complaint after 45 minutes
+   - Impact: 100% of write operations failed
+   - Issue: No one knew who to page, no runbook existed
+
+   Incident 2 — Memory Leak Causing Cascading Failures (2.5 hours):
+   - Root cause: New deployment had memory leak, OOMKilled repeatedly
+   - Detection: Prometheus alert fired but went to wrong Slack channel
+   - Impact: 3 dependent services degraded
+   - Issue: No circuit breaker, no isolation between services
+
+   Incident 3 — Node Failure During Cluster Upgrade (2 hours):
+   - Root cause: Drain failed due to PDB, manual override corrupted state
+   - Detection: Immediate (during planned maintenance)
+   - Impact: 30% of pods evicted uncontrollably
+   - Issue: No change management process, no rollback plan
+
+   Common themes:
+   - No standardized severity levels
+   - No on-call rotation for Kubernetes platform
+   - No incident commander role defined
+   - Postmortems are blame-focused, findings not actioned
+   - No SLOs defined for the platform
+
+   Task: Design a Kubernetes incident management process. Write: severity
+   classification (SEV1-4), on-call structure and escalation paths,
+   incident response playbook for common Kubernetes failures, SLO/SLI
+   framework for the platform, blameless postmortem template, and how
+   to build a reliability culture.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Severity classification is defined — SEV1: full production outage affecting all users (15 min response, incident commander, all-hands), SEV2: partial outage affecting subset of users (30 min response, on-call lead), SEV3: degraded performance but functional (2h response, next business day), SEV4: minor issue no user impact (ticket-based). Clear criteria for each level based on user impact, revenue impact, and data integrity"
+     weight: 0.35
+     description: "Severity classification"
+   - type: llm_judge
+     criteria: "Incident response process is structured — on-call rotation for platform team (primary + secondary), incident commander role (coordinates communication, not debugging), dedicated incident Slack channel per SEV1/2, status page updates every 15 minutes, stakeholder communication templates. Playbooks for common Kubernetes failures: CrashLoopBackOff, node failure, storage issues, network partitions — with decision trees and kubectl commands"
+     weight: 0.35
+     description: "Incident response"
+   - type: llm_judge
+     criteria: "SLOs and postmortems are practical — SLIs: availability (% of successful requests), latency (P99 < Xms), error rate. SLOs: 99.9% availability = 43.8 min/month downtime budget. Error budget policy: when budget exhausted, freeze deployments until reliability improves. Blameless postmortem template: timeline, root cause, contributing factors, action items with owners and deadlines. Schedule postmortem within 48 hours, track action items to completion"
+     weight: 0.30
+     description: "SLOs and postmortems"
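
The misrouted alert in Incident 2 maps directly onto Alertmanager's routing tree, which the scenario's response-process assertion implies. A minimal sketch, assuming a `severity` label convention and hypothetical receiver names that are not part of this package:

```yaml
# Hypothetical Alertmanager routing sketch — receiver names and the
# severity label values are assumptions, not defined by the course.
route:
  receiver: platform-oncall          # safe default: never drop an alert
  routes:
    - matchers:
        - severity="sev1"
      receiver: pagerduty-incident-commander
      group_wait: 0s                 # page immediately for a full outage
    - matchers:
        - severity="sev2"
      receiver: pagerduty-oncall-lead
    - matchers:
        - severity=~"sev3|sev4"
      receiver: slack-platform-alerts   # ticket / next-business-day follow-up
receivers:
  - name: platform-oncall
  - name: pagerduty-incident-commander
  - name: pagerduty-oncall-lead
  - name: slack-platform-alerts
```

An explicit catch-all receiver plus per-severity routes is what prevents a Prometheus alert from landing in the wrong (or no) Slack channel.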
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-cluster-operations.yaml
@@ -0,0 +1,62 @@
+ meta:
+   id: multi-cluster-operations
+   level: 4
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Design multi-cluster troubleshooting strategy — cross-cluster service discovery, federated monitoring, and disaster recovery across regions"
+   tags: [Kubernetes, multi-cluster, federation, disaster-recovery, cross-region, expert]
+
+ state: {}
+
+ trigger: |
+   Your organization runs 5 Kubernetes clusters across 3 regions for a
+   global e-commerce platform:
+
+   | Cluster     | Region      | Purpose           | Nodes | Workloads |
+   |-------------|-------------|-------------------|-------|-----------|
+   | prod-us-1   | us-east-1   | Primary US        | 50    | 200+      |
+   | prod-us-2   | us-west-2   | Secondary US      | 30    | 150+      |
+   | prod-eu-1   | eu-west-1   | EU primary        | 40    | 180+      |
+   | prod-apac-1 | ap-south-1  | APAC primary      | 25    | 100+      |
+   | staging     | us-east-1   | Pre-production    | 15    | 200+      |
+
+   Current challenges:
+   1. Cross-cluster service discovery — services in prod-us-1 need to
+      call services in prod-eu-1 for GDPR compliance (EU user data stays
+      in EU). Currently using hardcoded LoadBalancer IPs.
+
+   2. Monitoring fragmentation — each cluster has its own Prometheus,
+      but no unified view. Alert fatigue from 5 separate Alertmanagers.
+
+   3. Disaster recovery — when prod-us-1 had a control plane issue last
+      month, failover to prod-us-2 took 45 minutes (manual DNS switch,
+      database failover, config updates). Target RTO is 5 minutes.
+
+   4. Configuration drift — clusters have diverged. Same app deployed
+      with different resource limits, different RBAC policies, different
+      network policies across clusters.
+
+   5. Upgrade coordination — upgrading all clusters takes weeks of
+      sequential work, and version skew between clusters causes API
+      compatibility issues.
+
+   Task: Design a multi-cluster troubleshooting and operations strategy.
+   Write: cross-cluster service discovery approaches (Istio multi-cluster,
+   Submariner, DNS-based), federated monitoring architecture (Thanos/
+   Cortex for global Prometheus view), automated failover design, GitOps
+   for multi-cluster consistency, and the operational runbook for
+   cross-cluster incidents.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Cross-cluster connectivity is addressed — options: Istio multi-cluster mesh (automatic cross-cluster service discovery via shared control plane or remote clusters), Submariner (L3 connectivity between clusters), DNS-based with external-dns + global load balancer, or service export/import API. Trade-offs between complexity, latency, and feature richness. Istio provides mTLS across clusters but adds operational overhead"
+     weight: 0.35
+     description: "Cross-cluster connectivity"
+   - type: llm_judge
+     criteria: "Federated monitoring is designed — use Thanos or Cortex for global Prometheus query layer across all clusters. Each cluster runs local Prometheus, Thanos sidecar uploads to object storage, Thanos Query aggregates across clusters. Unified Alertmanager with deduplication to prevent alert fatigue. Grafana dashboards with cluster selector for unified view"
+     weight: 0.35
+     description: "Federated monitoring"
+   - type: llm_judge
+     criteria: "Failover and consistency are practical — automated failover: global load balancer (Route53/CloudFlare) with health checks, pre-provisioned standby capacity, database replication with automatic promotion. GitOps: single Git repo with cluster-specific overlays (Kustomize/Helm), ArgoCD ApplicationSets for multi-cluster sync, policy-as-code (OPA/Kyverno) for consistency enforcement. Upgrade strategy: rolling cluster upgrades starting with staging"
+     weight: 0.30
+     description: "Failover and consistency"
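
The GitOps pattern named in the last assertion (Argo CD ApplicationSets fanning one app definition out across clusters with per-cluster overlays) can be sketched as below; the app name, repo URL, and overlay layout are hypothetical:

```yaml
# Hypothetical ApplicationSet sketch — 'storefront', the repo URL, and the
# overlays/ path convention are assumptions illustrating the pattern.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: storefront
  namespace: argocd
spec:
  generators:
    - clusters: {}              # one Application per cluster registered in Argo CD
  template:
    metadata:
      name: 'storefront-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://example.com/platform/deploy.git
        targetRevision: main
        path: 'overlays/{{name}}'   # cluster-specific Kustomize overlay
      destination:
        server: '{{server}}'
        namespace: storefront
      syncPolicy:
        automated:
          prune: true
          selfHeal: true        # reverts manual edits, countering config drift
```

Because every cluster syncs from the same repo and differs only in its overlay, challenge 4 (drift in resource limits, RBAC, and network policies) becomes a reviewable Git diff instead of an investigation.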
@@ -0,0 +1,55 @@
+ meta:
+ id: multi-tenancy-design
+ level: 4
+ course: kubernetes-deployment-troubleshooting
+ type: output
+ description: "Design Kubernetes multi-tenancy strategy — namespace isolation, RBAC, NetworkPolicy, ResourceQuota, and tenant troubleshooting workflows"
+ tags: [Kubernetes, multi-tenancy, isolation, RBAC, namespace, security, expert]
+
+ state: {}
+
+ trigger: |
+ Your company is consolidating from 12 separate Kubernetes clusters
+ (one per team) to 3 shared clusters. Each team currently has full
+ cluster-admin access. Leadership wants to reduce infrastructure costs
+ by 40% through better utilization.
+
+ Teams and their requirements:
+ | Team       | Services | CPU/Mem  | Compliance    | Access Level  |
+ |------------|----------|----------|---------------|---------------|
+ | Platform   | 15       | 32c/64G  | SOC2          | Cluster admin |
+ | Payments   | 8        | 16c/32G  | PCI-DSS       | Namespace     |
+ | ML/Data    | 6        | 64c/256G | None          | Namespace+GPU |
+ | Frontend   | 12       | 8c/16G   | None          | Namespace     |
+ | Backend    | 20       | 24c/48G  | None          | Namespace     |
+ | Mobile API | 10       | 16c/32G  | None          | Namespace     |
+
+ Concerns raised by teams:
+ 1. Payments team: "PCI-DSS requires network isolation. How do we
+ ensure no other team's pods can reach our services?"
+ 2. ML team: "We need GPU nodes. Other teams shouldn't schedule on them."
+ 3. All teams: "We need to deploy without waiting for a platform team
+ ticket. We currently have cluster-admin."
+ 4. Platform team: "How do we prevent one team from consuming all
+ cluster resources during a traffic spike?"
+ 5. Security: "How do we audit who did what across all tenants?"
+
+ Task: Design a multi-tenancy strategy addressing all concerns. Write:
+ namespace isolation architecture (RBAC, NetworkPolicy, ResourceQuota
+ per team), PCI-DSS compliance approach for payments namespace, GPU
+ node isolation with taints/tolerations, self-service deployment model
+ (GitOps per team), resource fairness and priority, and audit logging.
+
+ assertions:
+ - type: llm_judge
+ criteria: "Namespace isolation is comprehensive — each team gets dedicated namespace(s) with: RBAC roles scoped to their namespace (not cluster-admin), NetworkPolicy default-deny with allow rules only for intra-team and approved cross-team traffic, ResourceQuotas matching their allocation, LimitRanges for per-pod defaults. Platform team retains cluster-level access for infrastructure management"
+ weight: 0.35
+ description: "Namespace isolation"
+ - type: llm_judge
+ criteria: "Specific requirements are addressed — PCI-DSS: dedicated node pool with taints, strict NetworkPolicy (no ingress from non-payment namespaces), encrypted secrets, audit logging on all API access. GPU isolation: taints on GPU nodes (dedicated=gpu:NoSchedule), only ML namespace pods have tolerations. Self-service: ArgoCD AppProjects per team (can deploy to their namespace only). Priority classes for critical workloads"
+ weight: 0.35
+ description: "Specific requirements"
+ - type: llm_judge
+ criteria: "Operational model is practical — GitOps per team: each team owns their manifests in Git, ArgoCD syncs to their namespace. Resource fairness: ResourceQuotas prevent resource hogging, PriorityClasses ensure critical workloads (payments) aren't preempted. Audit: enable Kubernetes audit logging, ship to central SIEM, configure audit policy to log all write operations. Consider vCluster for teams needing CRDs or more isolation"
+ weight: 0.30
+ description: "Operational model"
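The isolation primitives this scenario asks candidates to design can be sketched as two manifests. Namespace names are taken from the scenario; the quota values mirror the payments row in the table above and are otherwise assumptions:

```yaml
# Illustrative sketch of per-team isolation, not a reference answer.
# Default-deny: all ingress and egress blocked until explicit allow rules exist.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: payments
spec:
  podSelector: {}                 # selects every pod in the namespace
  policyTypes: [Ingress, Egress]  # no rules listed, so both directions denied
---
# Quota matching the team's allocation (16c/32G for payments).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payments-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 32Gi
    limits.cpu: "16"
    limits.memory: 32Gi
```

Allow rules for approved intra-team and cross-team traffic would then be layered on top of the default-deny policy.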
@@ -0,0 +1,59 @@
+ meta:
+ id: platform-engineering
+ level: 4
+ course: kubernetes-deployment-troubleshooting
+ type: output
+ description: "Design an internal developer platform on Kubernetes — self-service deployment, golden paths, and developer experience optimization"
+ tags: [Kubernetes, platform-engineering, developer-experience, IDP, self-service, expert]
+
+ state: {}
+
+ trigger: |
+ Your VP of Engineering asks: "Developers spend 40% of their time on
+ Kubernetes YAML, debugging deployments, and waiting for platform
+ tickets. How do we fix this?"
+
+ Current developer experience survey results:
+ - "Writing Kubernetes manifests is my least favorite part of the job"
+ - "I don't understand why my deployment failed, the error messages are cryptic"
+ - "I need to wait 3 days for a platform team ticket to create a new namespace"
+ - "Every team has different deployment patterns, no consistency"
+ - "I can't run my service locally in a way that mimics production"
+
+ Current state:
+ - 50 developers, 12 teams
+ - Each team writes raw Kubernetes YAML
+ - No standardized CI/CD pipeline
+ - Manual namespace provisioning (platform team ticket)
+ - No local development environment that mirrors Kubernetes
+ - Average time from code merge to production: 4 hours
+ - 30% of deployments require rollback
+
+ Target state:
+ - Developers focus on code, not Kubernetes primitives
+ - Self-service for common operations (new namespace, new service)
+ - Standardized deployment pipeline with guardrails
+ - Local development parity with production
+ - Merge to production in 15 minutes
+ - Rollback rate < 5%
+
+ Task: Design an internal developer platform (IDP) on Kubernetes.
+ Write: the abstraction layers (what developers see vs what Kubernetes
+ sees), golden paths for common patterns (web service, worker, cron job),
+ self-service capabilities (Backstage/Port catalog), CI/CD standardization,
+ local development approach (Tilt/Skaffold/DevSpace), and how to measure
+ developer productivity improvements.
+
+ assertions:
+ - type: llm_judge
+ criteria: "Abstraction layers are defined — developers interact with simplified service definitions (not raw YAML), platform team maintains templates/operators that translate to Kubernetes resources. Options: Backstage service catalog with templates, Crossplane compositions, custom CRDs with operators, or Helm chart library. Developer writes: service name, image, port, resources, env vars. Platform generates: Deployment, Service, Ingress, HPA, PDB, NetworkPolicy, ServiceMonitor"
+ weight: 0.35
+ description: "Abstraction layers"
+ - type: llm_judge
+ criteria: "Golden paths and CI/CD are practical — standardized templates for web service (Deployment + HPA + Ingress), worker (Deployment without Ingress), cron job (CronJob with monitoring), stateful service (StatefulSet + PVC). CI/CD pipeline: build → test → security scan → deploy to staging → automated tests → deploy to production. Progressive delivery with canary deployments. Rollback automation on error rate increase"
+ weight: 0.35
+ description: "Golden paths and CI/CD"
+ - type: llm_judge
+ criteria: "Local development and metrics are covered — local development: Tilt (live reload Kubernetes dev), DevSpace, or Telepresence (route cluster traffic to local machine). Developer productivity metrics: DORA metrics (deployment frequency, lead time, change failure rate, MTTR), developer satisfaction surveys, time-to-first-deployment for new services, percentage of deployments using golden paths"
+ weight: 0.30
+ description: "Local dev and metrics"
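The "what developers see" half of the abstraction layer could look like a minimal service definition. The `WebService` kind and every field below are hypothetical, standing in for whichever template or CRD approach a design chooses:

```yaml
# Hypothetical developer-facing spec (not a real API): service name, image,
# port, resources, env vars — nothing else.
apiVersion: platform.example.com/v1alpha1
kind: WebService
metadata:
  name: checkout
spec:
  image: registry.example.com/checkout:1.4.2
  port: 8080
  replicas: 3
  resources:
    cpu: 500m
    memory: 256Mi
  env:
    - name: LOG_LEVEL
      value: info
# A platform-owned operator or template would expand this into the full set
# of Kubernetes resources (Deployment, Service, Ingress, HPA, PDB,
# NetworkPolicy, ServiceMonitor) with guardrails baked in.
```

The point of the sketch is the surface area: roughly seven fields for the developer versus seven generated resources owned by the platform team.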
@@ -0,0 +1,58 @@
+ meta:
+ id: security-hardening
+ level: 4
+ course: kubernetes-deployment-troubleshooting
+ type: output
+ description: "Design Kubernetes security hardening strategy — Pod Security Standards, supply chain security, runtime protection, and compliance"
+ tags: [Kubernetes, security, Pod-Security-Standards, supply-chain, compliance, expert]
+
+ state: {}
+
+ trigger: |
+ A security audit of your Kubernetes platform revealed critical findings:
+
+ CRITICAL findings:
+ 1. 65% of pods run as root
+ 2. 40% of containers have privileged=true
+ 3. No pod security admission is enforced
+ 4. Container images pulled from public registries without verification
+ 5. Secrets stored as base64 in etcd (not encrypted at rest)
+
+ HIGH findings:
+ 6. No network segmentation between namespaces
+ 7. ServiceAccount tokens auto-mounted in all pods
+ 8. No image vulnerability scanning in CI/CD pipeline
+ 9. Kubernetes API server accessible from all pods
+ 10. No audit logging enabled
+
+ MEDIUM findings:
+ 11. No resource limits on 30% of pods (DoS risk)
+ 12. ConfigMaps contain credentials that should be in Secrets
+ 13. Helm charts pulled from untrusted repositories
+ 14. No RBAC review process (stale permissions accumulate)
+ 15. Kubernetes dashboard exposed without authentication
+
+ The security team requires all CRITICAL and HIGH findings remediated
+ within 90 days for SOC2 compliance. You can't just enforce everything
+ at once — that would break 200+ running services.
+
+ Task: Design a phased security hardening plan. Write: the remediation
+ priority and timeline, Pod Security Standards rollout strategy (audit
+ → warn → enforce), supply chain security (image signing, scanning,
+ admission control), secrets management (external vault, encryption at
+ rest), network segmentation approach, and how to balance security
+ with developer velocity.
+
+ assertions:
+ - type: llm_judge
+ criteria: "Phased rollout is realistic — Phase 1 (weeks 1-4): enable audit logging, encrypt secrets at rest, disable auto-mounting SA tokens, scan existing images. Phase 2 (weeks 5-8): Pod Security Standards in audit mode, add NetworkPolicies in monitoring mode, image signing. Phase 3 (weeks 9-12): enforce Pod Security Standards (warn → enforce), block unsigned images, RBAC review. The progression audit → warn → enforce prevents breaking running services"
+ weight: 0.35
+ description: "Phased rollout"
+ - type: llm_judge
+ criteria: "Supply chain security is addressed — image scanning in CI/CD (Trivy/Grype), admission control (Kyverno/OPA Gatekeeper) to reject vulnerable or unsigned images, private registry as proxy cache for public images, image signing with Cosign/Notation, SBOM generation, base image standardization. Supply chain attacks are a top Kubernetes risk — images are the primary attack vector"
+ weight: 0.35
+ description: "Supply chain security"
+ - type: llm_judge
+ criteria: "Operational security balance is practical — developer velocity: provide pre-hardened base images (non-root, minimal), self-service exception process for privileged containers (approval workflow), automated RBAC provisioning tied to team membership, policy-as-code in Git (Kyverno/OPA policies reviewed via PR). Monitoring: runtime security (Falco) for detecting anomalous behavior, regular penetration testing"
+ weight: 0.30
+ description: "Security-velocity balance"
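The audit → warn → enforce progression maps directly onto Pod Security admission namespace labels. A sketch for one namespace mid-rollout (the namespace name is illustrative; the label keys are the standard ones):

```yaml
# Illustrative mid-rollout state: baseline is enforced so existing workloads
# keep running, while restricted violations are surfaced via audit and warn
# before restricted becomes the enforced level in a later phase.
apiVersion: v1
kind: Namespace
metadata:
  name: team-backend
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

Audit violations land in the audit log and warn violations appear in kubectl output, so teams see exactly which pods will break before enforcement is tightened.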
@@ -0,0 +1,62 @@
+ meta:
+ id: behavioral-science
+ level: 5
+ course: kubernetes-deployment-troubleshooting
+ type: output
+ description: "Behavioral science of Kubernetes troubleshooting — cognitive biases in incident response, team dynamics under pressure, and building a learning organization"
+ tags: [Kubernetes, behavioral-science, cognitive-bias, incident-response, learning-organization, master]
+
+ state: {}
+
+ trigger: |
+ Your post-incident reviews reveal recurring human factors that
+ contribute to incident severity and duration:
+
+ Pattern 1 — Anchoring Bias:
+ During a cascading failure, the on-call engineer focused on a
+ CrashLoopBackOff pod (the most visible symptom) for 45 minutes,
+ ignoring the root cause: a network partition isolating the database.
+ They anchored on the first symptom they saw.
+
+ Pattern 2 — Escalation Commitment:
+ An engineer spent 2 hours manually patching pods during a cluster
+ upgrade failure instead of rolling back. They had already invested
+ time and didn't want to "lose progress." The rollback would have
+ taken 5 minutes.
+
+ Pattern 3 — Diffusion of Responsibility:
+ With 8 people in the incident channel, nobody took ownership.
+ Everyone assumed someone else was investigating. The MTTR was 3x
+ longer than incidents with a clear owner.
+
+ Pattern 4 — Confirmation Bias:
+ The team suspected a recent deployment caused an outage. They spent
+ an hour analyzing the deployment diff. The actual cause was a node
+ hardware failure unrelated to any deployment. They only looked for
+ evidence that confirmed their initial hypothesis.
+
+ Pattern 5 — Normalcy Bias:
+ Warning alerts fired for disk pressure for 3 days before the node
+ went down. The team ignored them because "it always warns and never
+ actually fails." Until it did.
+
+ Task: Analyze the behavioral science of Kubernetes troubleshooting.
+ Write: the cognitive biases most common in incident response, how
+ each manifests in Kubernetes contexts, countermeasures for each bias
+ (checklists, structured investigation, time-boxing), how to build a
+ blameless learning culture, and how incident command structure
+ mitigates human factors.
+
+ assertions:
+ - type: llm_judge
+ criteria: "Cognitive biases are identified with K8s context — anchoring (focus on first symptom not root cause), escalation commitment (continuing failing approach instead of rollback), diffusion of responsibility (no clear owner in group incidents), confirmation bias (seeking evidence for initial hypothesis only), normalcy bias (ignoring recurring warnings). Each pattern maps to specific Kubernetes troubleshooting scenarios"
+ weight: 0.35
+ description: "Biases identified"
+ - type: llm_judge
+ criteria: "Countermeasures are practical and specific — anchoring: use structured investigation checklist (always check events, logs, nodes, network before deep-diving), time-box symptom investigation (15 min max before stepping back). Escalation commitment: establish rollback thresholds upfront ('if not fixed in 20 min, rollback'). Diffusion: incident commander assigns explicit owners. Confirmation bias: devil's advocate role, check alternative hypotheses. Normalcy: auto-escalate warnings that persist > threshold"
+ weight: 0.35
+ description: "Countermeasures"
+ - type: llm_judge
+ criteria: "Learning organization culture is addressed — blameless postmortems (focus on systemic factors, not individual blame), share incident learnings broadly (incident review meetings), celebrate good catches (not just response), invest in simulation and game days (practice under controlled stress), create psychological safety so engineers escalate early rather than hiding problems. Incident command: clear roles reduce cognitive load during high-stress situations"
+ weight: 0.30
+ description: "Learning culture"
@@ -0,0 +1,61 @@
+ meta:
+ id: board-strategy
+ level: 5
+ course: kubernetes-deployment-troubleshooting
+ type: output
+ description: "Board-level Kubernetes strategy — present cloud-native infrastructure as competitive advantage to board of directors"
+ tags: [Kubernetes, board-strategy, cloud-native, competitive-advantage, leadership, master]
+
+ state: {}
+
+ trigger: |
+ You're the CTO preparing a board presentation on the company's
+ cloud-native infrastructure strategy. The board includes a mix of
+ technical (2 former CTOs) and non-technical (3 finance/operations)
+ members.
+
+ Context:
+ - Company: B2B SaaS, $500M ARR, 2000 employees
+ - Current platform: Kubernetes across 3 cloud providers
+ - Annual cloud spend: $18M (3.6% of revenue)
+ - Platform team: 25 engineers
+ - Supporting 200+ microservices, 150 engineers
+
+ Board questions from pre-read:
+ 1. "Cloud spend grew 40% last year but revenue grew 25%. When does
+ the infrastructure investment start paying dividends?"
+
+ 2. "A competitor claims they run on serverless and their infrastructure
+ team is 5 people. Are we over-engineering?"
+
+ 3. "If we acquired a company running on Azure, how hard is it to
+ integrate their services with our platform?"
+
+ 4. "What's our risk exposure if AWS has a major outage? We had 2 hours
+ of downtime last quarter from a cloud provider issue."
+
+ 5. "The engineering team is asking for $5M to build an 'internal
+ developer platform.' What does that even mean and why should we
+ fund it?"
+
+ Task: Prepare the board presentation. Write: the infrastructure as
+ competitive advantage narrative (speed-to-market, reliability,
+ multi-cloud flexibility), the cloud spend efficiency analysis (cost per
+ transaction, cost as % of revenue trends), serverless vs Kubernetes
+ trade-off for the board, M&A integration capabilities, the $5M IDP
+ investment business case, and a risk framework for infrastructure
+ decisions.
+
+ assertions:
+ - type: llm_judge
+ criteria: "Infrastructure as competitive advantage is framed for board — cloud-native platform enables: 50x more frequent deployments (speed-to-market), 99.95% availability (customer trust), multi-cloud capability (negotiation leverage, M&A readiness). Cloud spend at 3.6% of revenue is below SaaS median (5-8%). The 40% cost growth funded a 3x increase in deployment frequency — cost per deployment actually decreased 60%"
+ weight: 0.35
+ description: "Competitive advantage"
+ - type: llm_judge
+ criteria: "Board questions are answered directly — serverless competitor: they likely have hidden costs (Lambda pricing at scale, vendor lock-in premium, limited workload types). Their 5-person team works because they outsource to the cloud provider — we trade cost for control and flexibility. M&A integration: Kubernetes standardizes deployment regardless of cloud — integration timeline weeks vs months for non-Kubernetes. AWS outage: multi-cloud strategy limits blast radius, DR investment reduces RTO"
+ weight: 0.35
+ description: "Board answers"
+ - type: llm_judge
+ criteria: "IDP investment case is compelling — $5M IDP investment yields: 35% developer productivity gain (equivalent to hiring 50 engineers at $150K = $7.5M value), reduce change failure rate from 12% to 5% (fewer customer-impacting incidents), enable self-service reducing platform team ticket backlog 80%. ROI: 150%+ in year 1. Frame as 'investing in engineering leverage' not 'building infrastructure.' Show payback period and comparison to hiring equivalent headcount"
+ weight: 0.30
+ description: "IDP business case"
@@ -0,0 +1,65 @@
+ meta:
+ id: cloud-native-future
+ level: 5
+ course: kubernetes-deployment-troubleshooting
+ type: output
+ description: "Future of Kubernetes and cloud-native — WebAssembly, AI/ML workloads, edge computing, eBPF, and the evolution of container orchestration"
+ tags: [Kubernetes, future, WebAssembly, AI-ML, edge, eBPF, cloud-native, master]
+
+ state: {}
+
+ trigger: |
+ Your VP of Engineering asks you to present a technology radar for the
+ next 3 years: "Where is Kubernetes headed, and how should we position
+ our platform for the future?"
+
+ Current technology landscape (2026):
+
+ Emerging trends:
+ 1. WebAssembly (Wasm) on Kubernetes — SpinKube and containerd-wasm
+ allow running Wasm workloads alongside containers. Cold start: <1ms
+ vs 1-5s for containers. Memory: 1-10MB vs 50-500MB per container.
+ Use case: edge computing, serverless functions, plugin systems.
+
+ 2. AI/ML workload orchestration — GPU scheduling, model serving
+ (KServe, vLLM), training pipelines (Kubeflow). GPU sharing and
+ fractional GPU allocation. Multi-tenant GPU clusters.
+
+ 3. eBPF for networking and observability — Cilium replacing kube-proxy
+ and iptables. Kernel-level observability without sidecars. Tetragon
+ for runtime security. 40% less network overhead than traditional CNI.
+
+ 4. Edge Kubernetes — K3s, KubeEdge for running Kubernetes at the edge
+ (retail stores, IoT gateways, vehicles). Challenges: intermittent
+ connectivity, limited resources, fleet management.
+
+ 5. Platform engineering maturity — Backstage, Kratix, Crossplane
+ abstracting Kubernetes for developers. The trend toward "Kubernetes
+ disappears" — developers never see YAML.
+
+ 6. Sustainable computing — carbon-aware scheduling, right-sizing for
+ energy efficiency, cloud carbon footprint tracking.
+
+ Your organization's exposure:
+ - Currently: no Wasm, no AI/ML on K8s, basic CNI, no edge deployments
+ - Planned: AI feature development, potential retail edge deployment
+ - Risk: falling behind competitors on AI inference speed
+
+ Task: Write the technology radar and positioning strategy. Include:
+ assessment of each trend (adopt/trial/assess/hold), impact on current
+ platform architecture, investment recommendations, skills development
+ plan, migration paths from current state, and risk of not adopting.
+
+ assertions:
+ - type: llm_judge
+ criteria: "Technology assessment is balanced — Adopt: eBPF/Cilium (proven, measurable benefits, drop-in replacement). Trial: AI/ML workloads on K8s (growing need, evaluate KServe/vLLM). Assess: WebAssembly (promising but ecosystem still maturing), edge Kubernetes (depends on business use case). Hold: nothing to abandon, but evaluate sidecar-based service mesh vs eBPF alternatives. Each assessment includes reasoning and timeline"
+ weight: 0.35
+ description: "Technology assessment"
+ - type: llm_judge
+ criteria: "Platform impact analysis is specific — eBPF: replace kube-proxy with Cilium, remove service mesh sidecars (reduce resource overhead 30-40%), gain kernel-level observability. AI/ML: add GPU node pools, implement fractional GPU sharing, deploy model serving infrastructure (KServe), integrate with training pipelines. Wasm: start with edge/serverless experiments, evaluate for latency-sensitive workloads. Each has clear migration path"
+ weight: 0.35
+ description: "Platform impact"
+ - type: llm_judge
+ criteria: "Investment and skills recommendations are actionable — prioritize eBPF (immediate ROI, Q1), then AI/ML infrastructure (business demand, Q2-Q3), then evaluate Wasm for edge (Q4). Skills: send 2 engineers for Cilium certification, hire ML platform engineer, establish innovation lab for Wasm experimentation. Risk of not adopting: competitors with faster AI inference will win deals, edge deployment delays lose retail opportunity"
+ weight: 0.30
+ description: "Investment recommendations"
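The "replace kube-proxy with Cilium" recommendation can be sketched as Helm values. Exact keys vary by Cilium version, and the API server host/port are placeholders, so treat this as an assumption-laden sketch rather than a tested config:

```yaml
# Illustrative Cilium Helm values for kube-proxy replacement plus
# sidecar-free observability via Hubble. Host/port are hypothetical.
kubeProxyReplacement: true
k8sServiceHost: api.cluster.example.com   # direct kube-apiserver access,
k8sServicePort: 6443                      # since kube-proxy no longer proxies it
hubble:
  enabled: true            # kernel-level flow visibility without sidecars
  relay:
    enabled: true          # aggregates flows cluster-wide for the CLI/UI
```

A staged migration (Cilium alongside kube-proxy first, then full replacement) is the usual path to the drop-in claim in the assessment.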
@@ -0,0 +1,57 @@
+ meta:
+ id: comprehensive-platform
+ level: 5
+ course: kubernetes-deployment-troubleshooting
+ type: output
+ description: "Design a comprehensive Kubernetes platform — architect an enterprise-grade platform from scratch covering all operational concerns"
+ tags: [Kubernetes, platform-architecture, enterprise, comprehensive, design, master]
+
+ state: {}
+
+ trigger: |
+ You're hired as the founding platform architect for a rapidly growing
+ startup that's about to hit $100M ARR. They have 100 engineers, 40
+ microservices running on Heroku, and are hitting Heroku's limits
+ (cost, performance, flexibility).
+
+ The CEO says: "We're migrating to Kubernetes. Design the platform
+ that will carry us to $1B ARR."
+
+ Requirements gathered from stakeholders:
+ - CTO: "Multi-region, 99.99% availability, sub-100ms latency globally"
+ - VP Engineering: "Developers should deploy in < 10 minutes with zero
+ Kubernetes knowledge"
+ - CISO: "SOC2, GDPR, PCI-DSS compliance. Zero-trust networking."
+ - CFO: "Cloud costs should be < 5% of revenue at scale"
+ - Head of Data: "Support ML training workloads alongside production"
+ - VP Product: "A/B testing infrastructure, feature flags, canary
+ deployments as first-class features"
+
+ Constraints:
+ - 6-month timeline to migrate first 10 services
+ - 18 months to complete full migration
+ - Platform team budget: 8 engineers (hiring from 0)
+ - Prefer AWS but must be portable enough for multi-cloud if needed
+ - Must support Python, Go, Java, Node.js, and Rust services
+
+ Task: Design the comprehensive platform architecture. Write: the
+ infrastructure architecture (multi-region EKS, networking, storage),
+ developer experience layer (IDP, CI/CD, local dev), operational
+ excellence (monitoring, alerting, incident response), security and
+ compliance layer, data platform integration (ML workloads, GPU
+ scheduling), the migration strategy from Heroku, hiring plan for
+ the platform team, and the 18-month execution roadmap.
+
+ assertions:
+ - type: llm_judge
+ criteria: "Infrastructure architecture is production-grade — multi-region EKS (us-east-1, eu-west-1, ap-southeast-1) with global load balancing. Networking: VPC per region with transit gateway, Cilium CNI with eBPF, Istio or Cilium service mesh. Storage: EBS for databases, EFS for shared storage, S3 for objects. Cluster design: separate clusters for production, staging, data/ML workloads. Karpenter for node autoscaling, spot instances for non-critical workloads"
+ weight: 0.35
+ description: "Infrastructure architecture"
+ - type: llm_judge
+ criteria: "Developer experience and operational layers are comprehensive — IDP: Backstage catalog with service templates (golden paths), ArgoCD for GitOps, automated CI/CD (build → test → security scan → canary deploy → promote). Local dev: Tilt or DevSpace. Monitoring: Prometheus + Thanos (cross-cluster), Grafana dashboards, Loki for logs, Tempo for traces. Alerting: Alertmanager with PagerDuty + Slack, SLO-based alerting. Incident response: documented runbooks, game day program"
+ weight: 0.35
+ description: "DevEx and operations"
+ - type: llm_judge
+ criteria: "Migration and hiring are realistic — Heroku migration: parallel-run strategy (run on both Heroku and K8s, shift traffic gradually). Start with stateless services, migrate stateful last. Hiring plan: hire platform lead month 1, then 2 engineers/month. Skills needed: Kubernetes, Terraform, Go, SRE. 18-month roadmap: months 1-3 (foundation: clusters, CI/CD, first service), months 4-9 (migrate 50% services, build IDP), months 10-15 (complete migration, ML platform), months 16-18 (optimization, multi-region active-active)"
+ weight: 0.30
+ description: "Migration and hiring"
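The "Karpenter with spot for non-critical workloads" element of the architecture might look like the NodePool below. The values, limits, and `EC2NodeClass` name are assumptions for illustration:

```yaml
# Illustrative Karpenter NodePool: prefer Spot capacity, fall back to
# on-demand, and consolidate underutilized nodes. Not a tested config.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # Karpenter prefers spot when available
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                     # hypothetical EC2NodeClass
  limits:
    cpu: "500"                            # cap total provisioned CPU
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```

Critical workloads would land on a separate on-demand-only NodePool so Spot reclamation never touches them.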
@@ -0,0 +1,62 @@
+ meta:
+ id: consulting-engagement
+ level: 5
+ course: kubernetes-deployment-troubleshooting
+ type: output
+ description: "Kubernetes consulting engagement — assess a client's troubled Kubernetes migration, design remediation roadmap, and present recommendations to their leadership"
+ tags: [Kubernetes, consulting, migration, assessment, remediation, master]
+
+ state: {}
+
+ trigger: |
+ You're a Kubernetes platform consultant hired by a mid-size fintech
+ company (500 engineers, $200M revenue). Their Kubernetes migration
+ started 18 months ago and is "failing." The CTO says: "We spent $3M
+ on this migration and we're worse off than before."
+
+ Your assessment findings after 2 weeks:
+
+ Technical:
+ - 30% of services migrated to Kubernetes, 70% still on EC2
+ - The 30% on Kubernetes have MORE downtime than EC2 services
+ - 15 critical incidents in the last quarter (vs 3 for EC2 services)
+ - No centralized monitoring (each team runs their own Prometheus)
+ - No standardized deployment pipeline (each team deploys differently)
+ - RBAC is cluster-admin for everyone
+ - No network policies, no pod security enforcement
+ - 85% of pods have no resource limits
+ - Persistent storage on local SSDs (no replication, no backup)
+
+ Organizational:
+ - No platform team exists — "everyone owns Kubernetes"
+ - Each of 8 dev teams learned Kubernetes independently
+ - No internal documentation or runbooks
+ - Engineers spend ~30% of time on Kubernetes ops instead of features
+ - Leadership measured success by "% of services migrated" not outcomes
+ - The migration was mandated top-down without training plan
+
+ Cultural:
+ - Engineers are frustrated and some are leaving
+ - "Kubernetes is too complex" sentiment is growing
+ - Shadow IT: some teams secretly running new services on EC2
+
+ Task: Write the consulting report and remediation roadmap. Include:
+ root cause analysis (why the migration is failing), the immediate
+ stabilization plan (first 30 days), the 90-day remediation roadmap,
+ the 12-month platform maturity plan, organizational changes needed
+ (form platform team, training program), success metrics, and the
+ executive presentation framework.
+
+ assertions:
+ - type: llm_judge
+ criteria: "Root cause analysis is accurate — the migration failed because of organizational and process gaps, not Kubernetes itself. Key failures: no platform team to provide expertise, no standardization (each team reinvented), no training (mandated without enablement), wrong success metric (% migrated vs reliability), no shared infrastructure (monitoring, CI/CD, security). Technical debt compounded because nobody owned the platform"
+ weight: 0.35
+ description: "Root cause analysis"
+ - type: llm_judge
+ criteria: "Stabilization and remediation are phased — 30-day stabilization: fix the worst reliability issues (add resource limits, deploy centralized monitoring, create incident response process, pause migration of new services). 90-day remediation: form platform team (3-4 engineers), deploy standardized CI/CD, implement RBAC and network policies, centralize logging. 12-month maturity: complete migration with golden paths, implement DR, achieve SLO targets"
+ weight: 0.35
+ description: "Phased remediation"
+ - type: llm_judge
+ criteria: "Organizational and metrics recommendations are practical — form a dedicated platform team (hire or reassign), invest in training program (certifications, workshops, pair programming with platform team), redefine success metrics (availability, MTTR, developer satisfaction, deployment frequency — not % migrated), create internal documentation and runbooks, establish architecture review board. Executive framing: this is a recovery, not a restart — the $3M invested built valuable experience"
+ weight: 0.30
+ description: "Organizational recommendations"
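The first stabilization step in the remediation criteria (address "85% of pods have no resource limits") can be sketched as a per-namespace LimitRange that injects defaults without touching any manifests. The namespace and values are assumptions:

```yaml
# Illustrative 30-day stabilization sketch: every container created in this
# namespace without explicit requests/limits receives these defaults, which
# stops unlimited pods from starving nodes. Values are placeholders to tune.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a          # hypothetical team namespace; apply per namespace
spec:
  limits:
    - type: Container
      defaultRequest:        # applied as requests when a container sets none
        cpu: 100m
        memory: 128Mi
      default:               # applied as limits when a container sets none
        cpu: 500m
        memory: 512Mi
```

Because LimitRange only affects pods created after it exists, a rolling restart per namespace picks up the defaults gradually instead of all at once.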