dojo.md 0.2.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (222)
  1. package/courses/GENERATION_LOG.md +45 -0
  2. package/courses/aws-lambda-debugging/course.yaml +11 -0
  3. package/courses/aws-lambda-debugging/scenarios/level-1/api-gateway-integration.yaml +71 -0
  4. package/courses/aws-lambda-debugging/scenarios/level-1/cloudwatch-logs-basics.yaml +64 -0
  5. package/courses/aws-lambda-debugging/scenarios/level-1/cold-start-basics.yaml +70 -0
  6. package/courses/aws-lambda-debugging/scenarios/level-1/environment-variable-issues.yaml +72 -0
  7. package/courses/aws-lambda-debugging/scenarios/level-1/first-debugging-shift.yaml +73 -0
  8. package/courses/aws-lambda-debugging/scenarios/level-1/handler-import-errors.yaml +71 -0
  9. package/courses/aws-lambda-debugging/scenarios/level-1/iam-permission-errors.yaml +68 -0
  10. package/courses/aws-lambda-debugging/scenarios/level-1/invocation-errors.yaml +72 -0
  11. package/courses/aws-lambda-debugging/scenarios/level-1/lambda-timeout-errors.yaml +65 -0
  12. package/courses/aws-lambda-debugging/scenarios/level-1/memory-and-oom.yaml +70 -0
  13. package/courses/aws-lambda-debugging/scenarios/level-2/async-invocation-failures.yaml +72 -0
  14. package/courses/aws-lambda-debugging/scenarios/level-2/cold-start-optimization.yaml +76 -0
  15. package/courses/aws-lambda-debugging/scenarios/level-2/dynamodb-streams-debugging.yaml +70 -0
  16. package/courses/aws-lambda-debugging/scenarios/level-2/intermediate-debugging-shift.yaml +71 -0
  17. package/courses/aws-lambda-debugging/scenarios/level-2/lambda-concurrency-management.yaml +70 -0
  18. package/courses/aws-lambda-debugging/scenarios/level-2/lambda-layers-debugging.yaml +76 -0
  19. package/courses/aws-lambda-debugging/scenarios/level-2/sam-local-debugging.yaml +74 -0
  20. package/courses/aws-lambda-debugging/scenarios/level-2/sqs-event-source.yaml +72 -0
  21. package/courses/aws-lambda-debugging/scenarios/level-2/vpc-networking-issues.yaml +71 -0
  22. package/courses/aws-lambda-debugging/scenarios/level-2/xray-tracing.yaml +62 -0
  23. package/courses/aws-lambda-debugging/scenarios/level-3/advanced-debugging-shift.yaml +72 -0
  24. package/courses/aws-lambda-debugging/scenarios/level-3/container-image-lambda.yaml +79 -0
  25. package/courses/aws-lambda-debugging/scenarios/level-3/cross-account-invocation.yaml +72 -0
  26. package/courses/aws-lambda-debugging/scenarios/level-3/eventbridge-patterns.yaml +79 -0
  27. package/courses/aws-lambda-debugging/scenarios/level-3/iac-deployment-debugging.yaml +68 -0
  28. package/courses/aws-lambda-debugging/scenarios/level-3/kinesis-stream-processing.yaml +64 -0
  29. package/courses/aws-lambda-debugging/scenarios/level-3/lambda-at-edge.yaml +64 -0
  30. package/courses/aws-lambda-debugging/scenarios/level-3/lambda-extensions-debugging.yaml +67 -0
  31. package/courses/aws-lambda-debugging/scenarios/level-3/powertools-observability.yaml +79 -0
  32. package/courses/aws-lambda-debugging/scenarios/level-3/step-functions-debugging.yaml +80 -0
  33. package/courses/aws-lambda-debugging/scenarios/level-4/cost-optimization-strategy.yaml +67 -0
  34. package/courses/aws-lambda-debugging/scenarios/level-4/expert-debugging-shift.yaml +62 -0
  35. package/courses/aws-lambda-debugging/scenarios/level-4/incident-management-serverless.yaml +61 -0
  36. package/courses/aws-lambda-debugging/scenarios/level-4/multi-region-serverless.yaml +67 -0
  37. package/courses/aws-lambda-debugging/scenarios/level-4/observability-platform-design.yaml +71 -0
  38. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-architecture-design.yaml +64 -0
  39. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-data-architecture.yaml +66 -0
  40. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-migration-strategy.yaml +65 -0
  41. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-security-design.yaml +60 -0
  42. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-testing-strategy.yaml +62 -0
  43. package/courses/aws-lambda-debugging/scenarios/level-5/board-serverless-strategy.yaml +63 -0
  44. package/courses/aws-lambda-debugging/scenarios/level-5/consulting-serverless-adoption.yaml +57 -0
  45. package/courses/aws-lambda-debugging/scenarios/level-5/industry-serverless-patterns.yaml +62 -0
  46. package/courses/aws-lambda-debugging/scenarios/level-5/ma-serverless-integration.yaml +75 -0
  47. package/courses/aws-lambda-debugging/scenarios/level-5/master-debugging-shift.yaml +61 -0
  48. package/courses/aws-lambda-debugging/scenarios/level-5/organizational-serverless-transformation.yaml +65 -0
  49. package/courses/aws-lambda-debugging/scenarios/level-5/regulatory-serverless.yaml +61 -0
  50. package/courses/aws-lambda-debugging/scenarios/level-5/serverless-economics.yaml +65 -0
  51. package/courses/aws-lambda-debugging/scenarios/level-5/serverless-future-technology.yaml +66 -0
  52. package/courses/aws-lambda-debugging/scenarios/level-5/serverless-platform-design.yaml +71 -0
  53. package/courses/docker-container-debugging/course.yaml +11 -0
  54. package/courses/docker-container-debugging/scenarios/level-1/container-exit-codes.yaml +59 -0
  55. package/courses/docker-container-debugging/scenarios/level-1/container-networking-basics.yaml +69 -0
  56. package/courses/docker-container-debugging/scenarios/level-1/docker-logs-debugging.yaml +67 -0
  57. package/courses/docker-container-debugging/scenarios/level-1/dockerfile-build-failures.yaml +71 -0
  58. package/courses/docker-container-debugging/scenarios/level-1/environment-variable-issues.yaml +74 -0
  59. package/courses/docker-container-debugging/scenarios/level-1/first-debugging-shift.yaml +70 -0
  60. package/courses/docker-container-debugging/scenarios/level-1/image-pull-failures.yaml +68 -0
  61. package/courses/docker-container-debugging/scenarios/level-1/port-mapping-issues.yaml +67 -0
  62. package/courses/docker-container-debugging/scenarios/level-1/resource-limits-oom.yaml +70 -0
  63. package/courses/docker-container-debugging/scenarios/level-1/volume-mount-problems.yaml +66 -0
  64. package/courses/docker-container-debugging/scenarios/level-2/container-health-checks.yaml +73 -0
  65. package/courses/docker-container-debugging/scenarios/level-2/docker-compose-debugging.yaml +66 -0
  66. package/courses/docker-container-debugging/scenarios/level-2/docker-exec-debugging.yaml +71 -0
  67. package/courses/docker-container-debugging/scenarios/level-2/image-layer-optimization.yaml +81 -0
  68. package/courses/docker-container-debugging/scenarios/level-2/intermediate-debugging-shift.yaml +73 -0
  69. package/courses/docker-container-debugging/scenarios/level-2/logging-and-log-rotation.yaml +76 -0
  70. package/courses/docker-container-debugging/scenarios/level-2/multi-stage-build-debugging.yaml +76 -0
  71. package/courses/docker-container-debugging/scenarios/level-2/network-debugging-tools.yaml +67 -0
  72. package/courses/docker-container-debugging/scenarios/level-2/pid1-signal-handling.yaml +71 -0
  73. package/courses/docker-container-debugging/scenarios/level-2/security-scanning-basics.yaml +67 -0
  74. package/courses/docker-container-debugging/scenarios/level-3/advanced-debugging-shift.yaml +77 -0
  75. package/courses/docker-container-debugging/scenarios/level-3/buildkit-optimization.yaml +67 -0
  76. package/courses/docker-container-debugging/scenarios/level-3/container-filesystem-debugging.yaml +70 -0
  77. package/courses/docker-container-debugging/scenarios/level-3/container-security-hardening.yaml +74 -0
  78. package/courses/docker-container-debugging/scenarios/level-3/disk-space-management.yaml +74 -0
  79. package/courses/docker-container-debugging/scenarios/level-3/docker-api-automation.yaml +72 -0
  80. package/courses/docker-container-debugging/scenarios/level-3/docker-daemon-issues.yaml +73 -0
  81. package/courses/docker-container-debugging/scenarios/level-3/docker-in-docker-ci.yaml +69 -0
  82. package/courses/docker-container-debugging/scenarios/level-3/overlay-network-debugging.yaml +70 -0
  83. package/courses/docker-container-debugging/scenarios/level-3/production-container-ops.yaml +71 -0
  84. package/courses/docker-container-debugging/scenarios/level-4/cicd-pipeline-design.yaml +66 -0
  85. package/courses/docker-container-debugging/scenarios/level-4/container-monitoring-observability.yaml +63 -0
  86. package/courses/docker-container-debugging/scenarios/level-4/container-orchestration-strategy.yaml +62 -0
  87. package/courses/docker-container-debugging/scenarios/level-4/container-performance-engineering.yaml +64 -0
  88. package/courses/docker-container-debugging/scenarios/level-4/container-security-architecture.yaml +66 -0
  89. package/courses/docker-container-debugging/scenarios/level-4/enterprise-image-management.yaml +58 -0
  90. package/courses/docker-container-debugging/scenarios/level-4/expert-debugging-shift.yaml +63 -0
  91. package/courses/docker-container-debugging/scenarios/level-4/incident-response-containers.yaml +70 -0
  92. package/courses/docker-container-debugging/scenarios/level-4/multi-environment-management.yaml +65 -0
  93. package/courses/docker-container-debugging/scenarios/level-4/stateful-service-containers.yaml +65 -0
  94. package/courses/docker-container-debugging/scenarios/level-5/board-infrastructure-strategy.yaml +58 -0
  95. package/courses/docker-container-debugging/scenarios/level-5/consulting-container-strategy.yaml +61 -0
  96. package/courses/docker-container-debugging/scenarios/level-5/container-platform-architecture.yaml +67 -0
  97. package/courses/docker-container-debugging/scenarios/level-5/container-platform-economics.yaml +67 -0
  98. package/courses/docker-container-debugging/scenarios/level-5/container-technology-evolution.yaml +67 -0
  99. package/courses/docker-container-debugging/scenarios/level-5/disaster-recovery-containers.yaml +66 -0
  100. package/courses/docker-container-debugging/scenarios/level-5/industry-container-patterns.yaml +71 -0
  101. package/courses/docker-container-debugging/scenarios/level-5/master-debugging-shift.yaml +62 -0
  102. package/courses/docker-container-debugging/scenarios/level-5/organizational-transformation.yaml +67 -0
  103. package/courses/docker-container-debugging/scenarios/level-5/regulatory-compliance-containers.yaml +61 -0
  104. package/courses/kubernetes-deployment-troubleshooting/course.yaml +12 -0
  105. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/configmap-secret-issues.yaml +69 -0
  106. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/crashloopbackoff.yaml +68 -0
  107. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/deployment-rollout.yaml +56 -0
  108. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/first-troubleshooting-shift.yaml +65 -0
  109. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/health-probe-failures.yaml +70 -0
  110. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/imagepullbackoff.yaml +57 -0
  111. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/kubectl-debugging-basics.yaml +56 -0
  112. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/oomkilled.yaml +70 -0
  113. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/pending-pods.yaml +68 -0
  114. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/service-not-reachable.yaml +66 -0
  115. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/dns-resolution-failures.yaml +63 -0
  116. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/helm-deployment-failures.yaml +63 -0
  117. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/hpa-scaling-issues.yaml +62 -0
  118. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/ingress-routing-issues.yaml +63 -0
  119. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/init-container-failures.yaml +63 -0
  120. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/intermediate-troubleshooting-shift.yaml +66 -0
  121. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/network-policy-blocking.yaml +67 -0
  122. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/persistent-volume-issues.yaml +69 -0
  123. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/rbac-permission-denied.yaml +57 -0
  124. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/resource-quota-limits.yaml +64 -0
  125. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/advanced-troubleshooting-shift.yaml +69 -0
  126. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/cluster-upgrade-failures.yaml +71 -0
  127. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/gitops-drift-detection.yaml +62 -0
  128. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/job-cronjob-failures.yaml +67 -0
  129. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/monitoring-alerting-gaps.yaml +64 -0
  130. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/multi-container-debugging.yaml +68 -0
  131. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/node-pressure-evictions.yaml +70 -0
  132. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/pod-disruption-budgets.yaml +59 -0
  133. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/service-mesh-debugging.yaml +64 -0
  134. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/statefulset-troubleshooting.yaml +69 -0
  135. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/capacity-planning.yaml +65 -0
  136. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/cost-optimization.yaml +57 -0
  137. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/disaster-recovery-design.yaml +56 -0
  138. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/executive-communication.yaml +62 -0
  139. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/expert-troubleshooting-shift.yaml +65 -0
  140. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/incident-management-process.yaml +59 -0
  141. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-cluster-operations.yaml +62 -0
  142. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-tenancy-design.yaml +55 -0
  143. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/platform-engineering.yaml +59 -0
  144. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/security-hardening.yaml +58 -0
  145. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/behavioral-science.yaml +62 -0
  146. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/board-strategy.yaml +61 -0
  147. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/cloud-native-future.yaml +65 -0
  148. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/comprehensive-platform.yaml +57 -0
  149. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/consulting-engagement.yaml +62 -0
  150. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/industry-benchmarks.yaml +58 -0
  151. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/ma-integration.yaml +62 -0
  152. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/master-troubleshooting-shift.yaml +73 -0
  153. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/product-development.yaml +65 -0
  154. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/regulatory-compliance.yaml +76 -0
  155. package/courses/mysql-query-optimization/course.yaml +11 -0
  156. package/courses/mysql-query-optimization/scenarios/level-1/buffer-pool-basics.yaml +65 -0
  157. package/courses/mysql-query-optimization/scenarios/level-1/explain-basics.yaml +66 -0
  158. package/courses/mysql-query-optimization/scenarios/level-1/first-optimization-shift.yaml +78 -0
  159. package/courses/mysql-query-optimization/scenarios/level-1/innodb-index-fundamentals.yaml +68 -0
  160. package/courses/mysql-query-optimization/scenarios/level-1/join-basics.yaml +66 -0
  161. package/courses/mysql-query-optimization/scenarios/level-1/n-plus-one-queries.yaml +67 -0
  162. package/courses/mysql-query-optimization/scenarios/level-1/query-rewriting-basics.yaml +66 -0
  163. package/courses/mysql-query-optimization/scenarios/level-1/select-star-problems.yaml +68 -0
  164. package/courses/mysql-query-optimization/scenarios/level-1/slow-query-diagnosis.yaml +65 -0
  165. package/courses/mysql-query-optimization/scenarios/level-1/where-clause-optimization.yaml +65 -0
  166. package/courses/mysql-query-optimization/scenarios/level-2/buffer-pool-tuning.yaml +64 -0
  167. package/courses/mysql-query-optimization/scenarios/level-2/composite-index-design.yaml +71 -0
  168. package/courses/mysql-query-optimization/scenarios/level-2/covering-and-invisible-indexes.yaml +69 -0
  169. package/courses/mysql-query-optimization/scenarios/level-2/cte-and-window-functions.yaml +78 -0
  170. package/courses/mysql-query-optimization/scenarios/level-2/intermediate-optimization-shift.yaml +68 -0
  171. package/courses/mysql-query-optimization/scenarios/level-2/join-optimization.yaml +67 -0
  172. package/courses/mysql-query-optimization/scenarios/level-2/performance-schema-analysis.yaml +69 -0
  173. package/courses/mysql-query-optimization/scenarios/level-2/query-optimizer-hints.yaml +74 -0
  174. package/courses/mysql-query-optimization/scenarios/level-2/subquery-optimization.yaml +70 -0
  175. package/courses/mysql-query-optimization/scenarios/level-2/write-optimization.yaml +63 -0
  176. package/courses/mysql-query-optimization/scenarios/level-3/advanced-optimization-shift.yaml +71 -0
  177. package/courses/mysql-query-optimization/scenarios/level-3/connection-management.yaml +67 -0
  178. package/courses/mysql-query-optimization/scenarios/level-3/full-text-search.yaml +77 -0
  179. package/courses/mysql-query-optimization/scenarios/level-3/json-optimization.yaml +87 -0
  180. package/courses/mysql-query-optimization/scenarios/level-3/lock-contention-analysis.yaml +68 -0
  181. package/courses/mysql-query-optimization/scenarios/level-3/monitoring-alerting.yaml +63 -0
  182. package/courses/mysql-query-optimization/scenarios/level-3/online-schema-changes.yaml +79 -0
  183. package/courses/mysql-query-optimization/scenarios/level-3/partitioning-strategies.yaml +83 -0
  184. package/courses/mysql-query-optimization/scenarios/level-3/query-profiling-deep-dive.yaml +84 -0
  185. package/courses/mysql-query-optimization/scenarios/level-3/replication-optimization.yaml +66 -0
  186. package/courses/mysql-query-optimization/scenarios/level-4/aurora-vs-rds-evaluation.yaml +61 -0
  187. package/courses/mysql-query-optimization/scenarios/level-4/data-architecture.yaml +62 -0
  188. package/courses/mysql-query-optimization/scenarios/level-4/database-migration-planning.yaml +59 -0
  189. package/courses/mysql-query-optimization/scenarios/level-4/enterprise-governance.yaml +50 -0
  190. package/courses/mysql-query-optimization/scenarios/level-4/executive-communication.yaml +54 -0
  191. package/courses/mysql-query-optimization/scenarios/level-4/expert-optimization-shift.yaml +67 -0
  192. package/courses/mysql-query-optimization/scenarios/level-4/high-availability-architecture.yaml +60 -0
  193. package/courses/mysql-query-optimization/scenarios/level-4/optimizer-internals.yaml +62 -0
  194. package/courses/mysql-query-optimization/scenarios/level-4/performance-sla-design.yaml +52 -0
  195. package/courses/mysql-query-optimization/scenarios/level-4/read-replica-scaling.yaml +51 -0
  196. package/courses/mysql-query-optimization/scenarios/level-5/ai-database-future.yaml +45 -0
  197. package/courses/mysql-query-optimization/scenarios/level-5/behavioral-science.yaml +44 -0
  198. package/courses/mysql-query-optimization/scenarios/level-5/benchmark-design.yaml +47 -0
  199. package/courses/mysql-query-optimization/scenarios/level-5/board-strategy.yaml +48 -0
  200. package/courses/mysql-query-optimization/scenarios/level-5/comprehensive-platform.yaml +49 -0
  201. package/courses/mysql-query-optimization/scenarios/level-5/consulting-engagement.yaml +52 -0
  202. package/courses/mysql-query-optimization/scenarios/level-5/ma-database-integration.yaml +47 -0
  203. package/courses/mysql-query-optimization/scenarios/level-5/master-optimization-shift.yaml +56 -0
  204. package/courses/mysql-query-optimization/scenarios/level-5/product-development.yaml +48 -0
  205. package/courses/mysql-query-optimization/scenarios/level-5/regulatory-compliance.yaml +48 -0
  206. package/courses/postgresql-query-optimization/scenarios/level-5/comprehensive-database-system.yaml +70 -0
  207. package/courses/postgresql-query-optimization/scenarios/level-5/database-ai-future.yaml +81 -0
  208. package/courses/postgresql-query-optimization/scenarios/level-5/database-behavioral-science.yaml +63 -0
  209. package/courses/postgresql-query-optimization/scenarios/level-5/database-board-strategy.yaml +77 -0
  210. package/courses/postgresql-query-optimization/scenarios/level-5/database-consulting-engagement.yaml +61 -0
  211. package/courses/postgresql-query-optimization/scenarios/level-5/database-industry-benchmarks.yaml +64 -0
  212. package/courses/postgresql-query-optimization/scenarios/level-5/database-ma-integration.yaml +71 -0
  213. package/courses/postgresql-query-optimization/scenarios/level-5/database-product-development.yaml +72 -0
  214. package/courses/postgresql-query-optimization/scenarios/level-5/database-regulatory-landscape.yaml +76 -0
  215. package/courses/postgresql-query-optimization/scenarios/level-5/master-optimization-shift.yaml +66 -0
  216. package/courses/terraform-infrastructure-setup/course.yaml +11 -0
  217. package/courses/terraform-infrastructure-setup/scenarios/level-1/terraform-init-errors.yaml +72 -0
  218. package/dist/mcp/session-manager.d.ts +7 -4
  219. package/dist/mcp/session-manager.d.ts.map +1 -1
  220. package/dist/mcp/session-manager.js +23 -8
  221. package/dist/mcp/session-manager.js.map +1 -1
  222. package/package.json +1 -1
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/multi-container-debugging.yaml
@@ -0,0 +1,68 @@
+ meta:
+   id: multi-container-debugging
+   level: 3
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug multi-container pods — diagnose sidecar issues, shared volume problems, and inter-container communication failures"
+   tags: [Kubernetes, multi-container, sidecar, init-container, volumes, advanced]
+
+ state: {}
+
+ trigger: |
+   Your application pod has 3 containers: the main app, a log-shipper
+   sidecar, and a config-reloader sidecar. The pod shows 2/3 Ready:
+
+   $ kubectl get pods
+   NAME                 READY   STATUS    RESTARTS   AGE
+   webapp-5a6b7c-d8e9   2/3     Running   0          10m
+
+   $ kubectl describe pod webapp-5a6b7c-d8e9
+   Containers:
+     webapp:
+       State:   Running
+       Ready:   True
+
+     log-shipper:
+       State:   Running
+       Ready:   True
+
+     config-reloader:
+       State:     Running
+       Ready:     False
+       Readiness: http-get http://:9091/healthz
+       Message:   Readiness probe failed: connection refused
+
+   The config-reloader container is Running but not Ready. Its readiness
+   probe targets port 9091 but the container is actually listening on
+   9090 (a port mismatch in the probe config).
+
+   Because one container isn't Ready, the pod as a whole is not Ready,
+   so the Service drops it from its endpoints; a pod receives traffic
+   only when every container passes its readiness probe.
+
+   Additional issues discovered:
+   - log-shipper can't read app logs because the shared volume mount
+     path is wrong (/var/log/app vs /var/log/webapp)
+   - config-reloader watches a ConfigMap volume but the ConfigMap update
+     propagation delay (up to 60s) causes stale config reads
+
+   Task: Explain multi-container pod debugging. Write: how containers
+   in a pod share resources (network, storage, but NOT filesystem by
+   default), sidecar container patterns (logging, config reload, proxy),
+   how pod readiness is determined (ALL containers must be Ready), how
+   shared volumes work between containers, troubleshooting techniques
+   for each container (-c flag), and native sidecar containers (K8s 1.29+).
+
+ assertions:
+   - type: llm_judge
+     criteria: "Multi-container pod model is explained — containers in a pod share network namespace (same IP, localhost communication), can share volumes via volumeMounts, but have separate filesystems otherwise. Pod is Ready only when ALL containers pass readiness probes. Each container has independent lifecycle, restart policy, and resource limits. Logs and exec require -c <container> flag"
+     weight: 0.35
+     description: "Pod model explained"
+   - type: llm_judge
+     criteria: "Sidecar patterns and debugging are covered — logging sidecar reads from shared volume (both containers must mount the SAME path), config-reloader watches ConfigMap volumes (updates propagate every kubelet sync period ~60s), proxy sidecar shares network namespace. Debug: kubectl logs <pod> -c <container>, kubectl exec <pod> -c <container>, check shared volume mount paths match, verify port configurations per container"
+     weight: 0.35
+     description: "Sidecar debugging"
+   - type: llm_judge
+     criteria: "Native sidecars and best practices are covered — Kubernetes 1.29+ native sidecar containers (init containers with restartPolicy: Always) start before and stop after the main container, solving lifecycle ordering issues. Best practices: use emptyDir for shared volumes, ensure consistent mount paths, set resource limits per container, use separate health endpoints per container, consider if you really need sidecars vs a separate deployment"
+     weight: 0.30
+     description: "Native sidecars and practices"
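For reference, a minimal sketch of the pattern this scenario teaches: a log-shipper sidecar sharing an emptyDir with the app, written as a native sidecar (Kubernetes 1.29+). Names and images are hypothetical, not taken from the course files:

  apiVersion: v1
  kind: Pod
  metadata:
    name: webapp
  spec:
    initContainers:
      - name: log-shipper
        image: fluent/fluent-bit:latest   # hypothetical shipper image
        restartPolicy: Always             # init container + Always = native sidecar (1.29+)
        volumeMounts:
          - name: logs
            mountPath: /var/log/webapp    # must match the app's mount path exactly
    containers:
      - name: webapp
        image: registry.example.com/webapp:latest
        volumeMounts:
          - name: logs
            mountPath: /var/log/webapp
    volumes:
      - name: logs
        emptyDir: {}                      # shared scratch volume; filesystems are otherwise separate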
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/node-pressure-evictions.yaml
@@ -0,0 +1,70 @@
+ meta:
+   id: node-pressure-evictions
+   level: 3
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug node pressure and pod evictions — diagnose DiskPressure, MemoryPressure, and PIDPressure conditions causing pod disruptions"
+   tags: [Kubernetes, node-pressure, eviction, DiskPressure, MemoryPressure, advanced]
+
+ state: {}
+
+ trigger: |
+   Pods are being evicted from nodes across your cluster, causing
+   intermittent service disruptions:
+
+   $ kubectl get pods --field-selector=status.phase=Failed
+   NAME                    STATUS   REASON    AGE
+   logger-svc-5a6b-c7d8    Failed   Evicted   15m
+   metrics-agg-9e0f-g1h2   Failed   Evicted   15m
+   cache-warm-3i4j-k5l6    Failed   Evicted   14m
+
+   $ kubectl describe node worker-3
+   Conditions:
+     Type             Status   Reason
+     MemoryPressure   True     KubeletHasInsufficientMemory
+     DiskPressure     True     KubeletHasDiskPressure
+     PIDPressure      False    KubeletHasSufficientPID
+     Ready            True     KubeletReady
+
+   Taints:
+     node.kubernetes.io/memory-pressure:NoSchedule
+     node.kubernetes.io/disk-pressure:NoSchedule
+
+   Allocated resources:
+     CPU Requests: 7200m (90%), Memory Requests: 28Gi (87%)
+     CPU Limits: 14000m (175%), Memory Limits: 56Gi (175%)
+
+   The node has memory and disk pressure. Kubernetes automatically added
+   taints to prevent new pods from scheduling. The kubelet is evicting
+   pods based on QoS class:
+   - BestEffort pods evicted first
+   - Then Burstable pods exceeding requests
+   - Guaranteed pods only if exceeding limits
+
+   The overcommitment ratio is concerning: limits are 175% of node
+   capacity. If all pods try to use their limits simultaneously, the
+   node cannot satisfy the demand.
+
+   $ kubectl top node worker-3
+   NAME       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
+   worker-3   6800m        85%    30Gi            94%
+
+   Task: Explain node pressure conditions and eviction. Write: the three
+   pressure types (memory, disk, PID), how kubelet eviction thresholds
+   work (soft vs hard), eviction order by QoS class, how Kubernetes
+   automatically taints pressured nodes, the relationship between
+   requests/limits and overcommitment, and strategies to prevent evictions.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Node pressure types are explained — MemoryPressure (memory usage exceeds threshold), DiskPressure (disk usage exceeds threshold, includes container images and logs), PIDPressure (process IDs exhausted). Kubelet monitors these and sets node conditions. Automatic taints are applied: node.kubernetes.io/<condition>:NoSchedule prevents new pods from scheduling on pressured nodes"
+     weight: 0.35
+     description: "Pressure types explained"
+   - type: llm_judge
+     criteria: "Eviction thresholds and order are explained — soft evictions: kubelet waits eviction-soft-grace-period before evicting (e.g., memory.available < 100Mi for 30s). Hard evictions: immediate eviction when threshold crossed (e.g., memory.available < 50Mi). Eviction order: BestEffort first, then Burstable pods exceeding requests sorted by usage, then Guaranteed only if exceeding limits. Pods using more than requested are evicted before those within requests"
+     weight: 0.35
+     description: "Eviction mechanics"
+   - type: llm_judge
+     criteria: "Prevention strategies are practical — set appropriate resource requests and limits (avoid overcommitment), use Guaranteed QoS for critical workloads, implement PodDisruptionBudgets to limit simultaneous evictions, monitor node capacity (kubectl top node, Prometheus), use cluster autoscaler to add nodes before pressure, set ResourceQuotas per namespace, configure kubelet eviction thresholds appropriately"
+     weight: 0.30
+     description: "Prevention strategies"
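The soft/hard thresholds in the second assertion map onto kubelet settings; a sketch, assuming the kubelet reads a KubeletConfiguration file and with illustrative values:

  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  evictionSoft:
    memory.available: "500Mi"    # soft: evict only if sustained...
  evictionSoftGracePeriod:
    memory.available: "30s"      # ...below the threshold for 30s
  evictionHard:
    memory.available: "100Mi"    # hard: evict immediately when crossed
    nodefs.available: "10%"      # disk-pressure threshold (images, logs)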
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/pod-disruption-budgets.yaml
@@ -0,0 +1,59 @@
+ meta:
+   id: pod-disruption-budgets
+   level: 3
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug PodDisruptionBudget issues — diagnose why node drains hang, upgrades stall, and voluntary disruptions are blocked"
+   tags: [Kubernetes, PDB, PodDisruptionBudget, node-drain, upgrades, advanced]
+
+ state: {}
+
+ trigger: |
+   You need to drain a node for maintenance but the drain command hangs:
+
+   $ kubectl drain worker-2 --ignore-daemonsets --delete-emptydir-data
+   evicting pod app/critical-svc-7a8b-c9d0
+   evicting pod app/critical-svc-1e2f-g3h4
+   error when evicting pods/"critical-svc-7a8b-c9d0" -n "app":
+   Cannot evict pod as it would violate the pod's disruption budget.
+
+   $ kubectl get pdb -n app
+   NAME           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
+   critical-pdb   3               N/A               0                     7d
+
+   $ kubectl get pods -l app=critical-svc -n app
+   NAME                     READY   STATUS    RESTARTS   AGE
+   critical-svc-7a8b-c9d0   1/1     Running   0          2d   (worker-2)
+   critical-svc-1e2f-g3h4   1/1     Running   0          2d   (worker-2)
+   critical-svc-5i6j-k7l8   1/1     Running   0          2d   (worker-3)
+
+   The PDB requires minAvailable=3 but there are only 3 pods total,
+   two of which are on the node being drained. Evicting either would
+   drop below the minimum.
+
+   This also blocks cluster autoscaler from removing underutilized nodes
+   and blocks Kubernetes version upgrades that require node rotation.
+
+   Compounding the issue: the HPA has maxReplicas=3, so it can't scale
+   up additional pods to create room for the drain.
+
+   Task: Explain PodDisruptionBudgets and how to debug drain issues.
+   Write: what PDBs protect against (voluntary disruptions), minAvailable
+   vs maxUnavailable, how PDBs interact with node drains and cluster
+   upgrades, the difference between voluntary and involuntary disruptions,
+   common PDB misconfigurations, and how to properly configure PDBs for
+   maintenance windows.
+
+ assertions:
+   - type: llm_judge
+     criteria: "PDB behavior is explained — PDBs limit voluntary disruptions (node drain, cluster upgrade, pod eviction) but NOT involuntary disruptions (node crash, OOM kill, hardware failure). minAvailable: minimum pods that must be running. maxUnavailable: maximum pods that can be down simultaneously. ALLOWED DISRUPTIONS shows how many pods can currently be evicted without violating the budget"
+     weight: 0.35
+     description: "PDB behavior"
+   - type: llm_judge
+     criteria: "The misconfiguration is diagnosed — minAvailable=3 with 3 total pods means ALLOWED DISRUPTIONS=0 (can never evict any pod). Fix: use maxUnavailable=1 instead (allows 1 pod down at a time), or set minAvailable to N-1 (e.g., 2 for 3 replicas). Must also ensure HPA maxReplicas allows scaling up to create headroom for drains. PDB blocks kubectl drain, cluster autoscaler scale-down, and rolling node upgrades"
+     weight: 0.35
+     description: "Misconfiguration diagnosed"
+   - type: llm_judge
+     criteria: "Best practices are practical — use maxUnavailable instead of minAvailable for easier reasoning, ensure PDB allows at least 1 disruption at all times, coordinate PDB with HPA (HPA should scale up before drain to maintain minimum), use --timeout on kubectl drain to detect stuck drains, temporary PDB modification for emergency maintenance, test PDBs with kubectl drain --dry-run"
+     weight: 0.30
+     description: "Best practices"
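A minimal sketch of the fix the second assertion points to, swapping minAvailable for maxUnavailable so a drain can always evict one pod:

  apiVersion: policy/v1
  kind: PodDisruptionBudget
  metadata:
    name: critical-pdb
    namespace: app
  spec:
    maxUnavailable: 1       # always leaves room for one voluntary eviction
    selector:
      matchLabels:
        app: critical-svc   # assumed to match the critical-svc pods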
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/service-mesh-debugging.yaml
@@ -0,0 +1,64 @@
+ meta:
+   id: service-mesh-debugging
+   level: 3
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug service mesh issues — diagnose Istio/Linkerd sidecar injection failures, mTLS errors, and traffic routing problems"
+   tags: [Kubernetes, service-mesh, Istio, Linkerd, sidecar, mTLS, advanced]
+
+ state: {}
+
+ trigger: |
+   After enabling Istio service mesh, several services are broken:
+
+   $ kubectl get pods -n bookstore
+   NAME                      READY   STATUS    RESTARTS   AGE
+   catalog-svc-7a8b9c-d0e1   2/2     Running   0          10m
+   order-svc-2f3g4h-i5j6     1/1     Running   0          10m
+   payment-svc-7k8l9m-n0p1   2/2     Running   0          10m
+
+   Notice: catalog-svc and payment-svc show 2/2 (app + istio-proxy
+   sidecar), but order-svc shows 1/1 — the sidecar wasn't injected.
+
+   $ kubectl get namespace bookstore --show-labels
+   NAME        STATUS   AGE   LABELS
+   bookstore   Active   1d    istio-injection=enabled
+
+   The namespace has automatic injection enabled, so why didn't order-svc
+   get a sidecar?
+
+   $ kubectl get deployment order-svc -o yaml | grep -A2 annotations
+   annotations:
+     sidecar.istio.io/inject: "false"
+
+   Someone explicitly disabled injection for order-svc. Now there's a
+   connectivity problem:
+
+   $ kubectl logs catalog-svc-7a8b9c-d0e1 -c istio-proxy
+   upstream connect error or disconnect/reset before headers. reset
+   reason: connection failure, transport failure reason: TLS error:
+   268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
+
+   catalog-svc (with mTLS via sidecar) can't talk to order-svc (no
+   sidecar, no mTLS). Strict mTLS mode requires both sides to have the
+   proxy.
+
+   Task: Explain service mesh troubleshooting. Write: how sidecar
+   injection works (namespace label, annotation override), mTLS between
+   services (what happens when one side lacks a sidecar), traffic routing
+   with VirtualService/DestinationRule, how to debug with istioctl
+   analyze and proxy logs, and common service mesh issues.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Sidecar injection is explained — automatic injection via namespace label (istio-injection=enabled), per-pod override via annotation (sidecar.istio.io/inject: true/false). The mutating webhook intercepts pod creation and adds the sidecar container. Pods created before labeling the namespace need restart. Injection can fail if webhook is misconfigured or the annotation explicitly disables it"
+     weight: 0.35
+     description: "Sidecar injection"
+   - type: llm_judge
+     criteria: "mTLS issues are diagnosed — strict mTLS mode requires both client and server to have Istio sidecar proxies for mutual TLS authentication. If one pod lacks a sidecar, the TLS handshake fails (CERTIFICATE_VERIFY_FAILED). Solutions: enable sidecar on all services, or use permissive mode (PeerAuthentication) to allow both plaintext and mTLS during migration"
+     weight: 0.35
+     description: "mTLS issues"
+   - type: llm_judge
+     criteria: "Debugging tools are covered — istioctl analyze checks configuration for issues, istioctl proxy-status shows sync status between control plane and proxies, istioctl proxy-config shows proxy configuration, kubectl logs -c istio-proxy for proxy errors. VirtualService/DestinationRule for traffic routing, retries, timeouts. Common issues: port naming conventions (http-, grpc-), protocol detection failures"
+     weight: 0.30
+     description: "Debugging tools"
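A minimal sketch of the permissive-mode escape hatch named in the mTLS assertion, scoped to the bookstore namespace from the trigger:

  apiVersion: security.istio.io/v1beta1
  kind: PeerAuthentication
  metadata:
    name: default
    namespace: bookstore
  spec:
    mtls:
      mode: PERMISSIVE   # accept plaintext and mTLS while sidecars roll out everywhere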
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/statefulset-troubleshooting.yaml
@@ -0,0 +1,69 @@
+ meta:
+   id: statefulset-troubleshooting
+   level: 3
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug StatefulSet issues — diagnose ordered startup failures, persistent volume problems, and headless service misconfigurations"
+   tags: [Kubernetes, StatefulSet, ordered-deployment, PVC, headless-service, advanced]
+
+ state: {}
+
+ trigger: |
+   Your Kafka cluster's StatefulSet has partially failed. Pod kafka-0
+   runs fine but kafka-1 and kafka-2 won't start:
+
+   $ kubectl get pods -l app=kafka
+   NAME      READY   STATUS    RESTARTS   AGE
+   kafka-0   1/1     Running   0          1h
+   kafka-1   0/1     Pending   0          1h
+   kafka-2   0/1     Pending   0          1h
+
+   StatefulSets use OrderedReady pod management by default — kafka-1
+   won't even be attempted until kafka-0 is Running and Ready, and
+   kafka-2 waits for kafka-1. But kafka-0 IS Running...
+
+   $ kubectl describe pod kafka-1
+   Events:
+     Warning  FailedScheduling  5m  default-scheduler
+       0/5 nodes are available: 5 node(s) had volume node affinity conflict
+
+   $ kubectl get pvc -l app=kafka
+   NAME           STATUS    VOLUME          CAPACITY   STORAGECLASS
+   data-kafka-0   Bound     pv-us-east-1a   50Gi       gp3
+   data-kafka-1   Bound     pv-us-east-1b   50Gi       gp3
+   data-kafka-2   Pending                              gp3
+
+   PVC data-kafka-1 is bound to a PV in us-east-1b, but there are no
+   nodes in that AZ anymore (a node was decommissioned). PVC data-kafka-2
+   can't provision at all.
+
+   Additionally, the headless Service for Kafka is misconfigured:
+
+   $ kubectl get svc kafka-headless
+   NAME             TYPE        CLUSTER-IP   PORT(S)
+   kafka-headless   ClusterIP   10.96.1.50   9092/TCP
+
+   It has a ClusterIP! Headless services must have clusterIP: None to
+   return individual pod IPs for StatefulSet DNS entries like
+   kafka-0.kafka-headless.default.svc.cluster.local.
+
+   Task: Explain StatefulSet-specific troubleshooting. Write: how
+   StatefulSets differ from Deployments (ordered, stable identity,
+   stable storage), OrderedReady vs Parallel pod management, PVC
+   lifecycle (PVCs persist across pod restarts and aren't auto-deleted),
+   headless Service requirements, volume node affinity issues, and
+   common StatefulSet failure patterns.
+
+ assertions:
+   - type: llm_judge
+     criteria: "StatefulSet semantics are explained — ordered pod creation/deletion (pod-0 before pod-1), stable network identity (pod-name.headless-svc.namespace.svc.cluster.local), stable persistent storage (PVCs bound to specific pods and persist across restarts/rescheduling). PVCs are NOT deleted when StatefulSet is scaled down — must be manually cleaned up. OrderedReady: wait for Ready before next pod. Parallel: all pods simultaneously"
+     weight: 0.35
+     description: "StatefulSet semantics"
+   - type: llm_judge
+     criteria: "Volume and headless Service issues are diagnosed — PVCs bound to specific PVs may have zone affinity, preventing scheduling if no nodes exist in that zone. Headless Service must have clusterIP: None to enable DNS A records for individual pods. With a ClusterIP, DNS returns the cluster IP instead of pod IPs, breaking StatefulSet peer discovery (Kafka, etcd, ZooKeeper need direct pod addressing)"
+     weight: 0.35
+     description: "Volume and DNS issues"
+   - type: llm_judge
+     criteria: "Fixes and patterns are practical — for zone affinity: use WaitForFirstConsumer volumeBindingMode to co-locate PV with pod's node, or ensure nodes exist in all zones. For stuck PVCs: delete PVC and let StatefulSet recreate it (data loss!). For headless Service: set clusterIP: None. For rolling updates: use partition field to do canary-style updates. Monitor StatefulSet with kubectl rollout status"
+     weight: 0.30
+     description: "Fixes and patterns"
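A minimal sketch of the two fixes this scenario converges on: a truly headless Service, plus a StorageClass that delays volume binding until the pod is scheduled. The EBS CSI provisioner is an assumption:

  apiVersion: v1
  kind: Service
  metadata:
    name: kafka-headless
  spec:
    clusterIP: None           # headless: DNS returns individual pod IPs
    selector:
      app: kafka
    ports:
      - name: tcp-kafka
        port: 9092
  ---
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: gp3
  provisioner: ebs.csi.aws.com            # assumption: AWS EBS CSI driver
  volumeBindingMode: WaitForFirstConsumer # bind the PV only where the pod can schedule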
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/capacity-planning.yaml
@@ -0,0 +1,65 @@
+ meta:
+   id: capacity-planning
+   level: 4
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Design Kubernetes capacity planning strategy — right-sizing workloads, cost optimization, and scaling architecture for growth"
+   tags: [Kubernetes, capacity-planning, cost-optimization, right-sizing, scaling, expert]
+
+ state: {}
+
+ trigger: |
+   Your CTO wants a capacity planning review. The Kubernetes platform
+   serves 50 microservices with monthly cloud costs of $180,000. Key
+   metrics from the past quarter:
+
+   Resource Utilization Summary:
+   | Metric                  | Current | Target |
+   |-------------------------|---------|--------|
+   | Avg CPU utilization     | 23%     | 65%    |
+   | Avg Memory utilization  | 34%     | 70%    |
+   | Node count              | 85      | ???    |
+   | Monthly cost            | $180K   | $120K  |
+   | Overprovisioned pods    | 72%     | <20%   |
+
+   Analysis reveals:
+   1. 72% of pods request 3-5x more CPU/memory than they actually use.
+      Example: order-svc requests 2 CPU, 4Gi memory but averages 200m
+      CPU and 800Mi memory.
+
+   2. No VPA is configured — all requests were set by developers during
+      initial deployment and never adjusted.
+
+   3. HPA min/max are set too conservatively — several services have
+      minReplicas=5 but traffic analysis shows they need only 2 during
+      off-peak (midnight-6am).
+
+   4. Spot/preemptible nodes are not used at all — everything runs on
+      on-demand instances.
+
+   5. No cluster autoscaler — nodes are manually provisioned. Some nodes
+      run at 90%+ while others sit at 15%.
+
+   6. The team is planning for 3x growth in the next year (Black Friday
+      peak expected to be 10x normal traffic).
+
+   Task: Design a comprehensive capacity planning strategy. Write:
+   right-sizing methodology (VPA recommendations, historical analysis),
+   cost optimization techniques (spot instances, bin-packing, reserved
+   instances), autoscaling architecture (HPA + cluster autoscaler +
+   Karpenter), capacity modeling for growth, and how to present the
+   business case for optimization investment.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Right-sizing methodology is explained — use VPA in recommendation mode to analyze actual usage vs requests, review P95/P99 usage (not average) for requests, set limits at 2x requests for burstable workloads. Start with non-critical services, measure for 2 weeks minimum. Expected savings: reducing from 3-5x overprovisioning to 1.5x can cut costs 40-60%"
+     weight: 0.35
+     description: "Right-sizing methodology"
+   - type: llm_judge
+     criteria: "Cost optimization techniques are comprehensive — spot/preemptible instances for stateless workloads (60-90% savings), Karpenter for intelligent node provisioning (right-sizes node types automatically), bin-packing optimization (consolidate workloads to fewer nodes), reserved instances for baseline capacity, scheduled scaling for predictable traffic patterns, namespace-level budgets with ResourceQuotas"
+     weight: 0.35
+     description: "Cost optimization"
+   - type: llm_judge
+     criteria: "Growth planning is practical — capacity model: current baseline × growth factor × headroom (1.5x) for peak. Autoscaling architecture: HPA for pod-level scaling, cluster autoscaler/Karpenter for node-level. Load testing to validate scaling limits. Black Friday planning: pre-scale 24h before, warm caches, increase HPA limits, pre-provision spot capacity. Business case: $60K/month savings pays for 2 FTE platform engineers"
+     weight: 0.30
+     description: "Growth planning"
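A minimal sketch of the recommendation-only VPA the first assertion starts from, pointed at the over-requested order-svc in the trigger:

  apiVersion: autoscaling.k8s.io/v1
  kind: VerticalPodAutoscaler
  metadata:
    name: order-svc-vpa
  spec:
    targetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: order-svc
    updatePolicy:
      updateMode: "Off"   # recommend only; never evicts or restarts pods

kubectl describe vpa order-svc-vpa then reports target and bound recommendations to compare against the 2 CPU / 4Gi requests.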
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/cost-optimization.yaml
@@ -0,0 +1,57 @@
+ meta:
+   id: cost-optimization
+   level: 4
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Design Kubernetes cost optimization strategy — FinOps practices, spot instances, Karpenter, and cost attribution for multi-team clusters"
+   tags: [Kubernetes, cost-optimization, FinOps, spot-instances, Karpenter, expert]
+
+ state: {}
+
+ trigger: |
+   The CFO has flagged Kubernetes cloud costs as growing 3x faster than
+   revenue. Your monthly bill breakdown:
+
+   | Category                | Monthly Cost | % of Total |
+   |-------------------------|--------------|------------|
+   | EC2 compute (on-demand) | $95,000      | 55%        |
+   | EBS storage             | $25,000      | 15%        |
+   | Data transfer           | $20,000      | 12%        |
+   | Load balancers          | $12,000      | 7%         |
+   | NAT gateways            | $10,000      | 6%         |
+   | Other (ECR, Route53)    | $8,000       | 5%         |
+   | **Total**               | **$170,000** |            |
+
+   Cost attribution analysis:
+   - 4 of 12 teams account for 70% of compute costs
+   - The ML team runs GPU instances 24/7 but only uses them 8 hours/day
+   - 3 staging environments run full replicas of production (identical
+     node count) but get 5% of the traffic
+   - 40% of pods are overprovisioned by 3x or more
+   - No spot instances are used anywhere
+   - Each service has its own LoadBalancer ($18/mo each × 50 services)
+   - NAT gateway costs are high because pods pull images from public
+     registries on every deployment
+
+   Target: Reduce monthly costs from $170K to $100K within 6 months
+   without reducing availability.
+
+   Task: Design the cost optimization strategy. Write: the quick wins
+   (what can save money this month), medium-term optimizations (1-3
+   months), structural changes (3-6 months), cost attribution and
+   showback model for teams, FinOps culture practices, and how to
+   maintain cost discipline as the platform grows.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Quick wins are specific with estimated savings — right-size staging (reduce to 20% of prod capacity: save ~$20K/mo), schedule GPU instances off-hours (save ~$8K/mo), consolidate LoadBalancers into shared Ingress controller (save ~$7K/mo), set up ECR pull-through cache for public images (save ~$3K/mo on NAT). Each quick win has clear implementation steps and expected savings"
+     weight: 0.35
+     description: "Quick wins"
+   - type: llm_judge
+     criteria: "Medium and structural optimizations are comprehensive — medium: implement Karpenter for intelligent node provisioning (right-size instances automatically, mix instance types), use spot instances for stateless workloads (60-90% savings), implement VPA for right-sizing pod requests. Structural: reserved instances or savings plans for baseline capacity, implement cost attribution tags per team/service, automated idle resource detection and cleanup, data transfer optimization (VPC endpoints, regional caching)"
+     weight: 0.35
+     description: "Optimization strategy"
+   - type: llm_judge
+     criteria: "FinOps culture and sustainability are addressed — cost attribution model: tag all resources by team/service, monthly cost reports per team, set team-level budgets with alerts. FinOps practices: cost reviews in sprint retrospectives, 'cost of change' in PR reviews, engineering KPI for cost efficiency (cost per request). Governance: require cost estimates for new services, auto-scale-down for non-production during off-hours, regular quarterly cost optimization sprints"
+     weight: 0.30
+     description: "FinOps culture"
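A minimal sketch of the LoadBalancer-consolidation quick win: one shared Ingress (one cloud load balancer) fanning out to many Services. Hosts and the ingress class are assumptions:

  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: shared-ingress
  spec:
    ingressClassName: nginx          # assumption: an NGINX ingress controller is installed
    rules:
      - host: orders.example.com
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: order-svc
                  port:
                    number: 80
      - host: payments.example.com   # each extra service adds a rule, not a load balancer
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: payment-svc
                  port:
                    number: 80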
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/disaster-recovery-design.yaml
@@ -0,0 +1,56 @@
+ meta:
+   id: disaster-recovery-design
+   level: 4
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Design Kubernetes disaster recovery strategy — backup architecture, recovery procedures, RPO/RTO targets, and DR testing"
+   tags: [Kubernetes, disaster-recovery, backup, Velero, RPO, RTO, expert]
+
+ state: {}
+
+ trigger: |
+   After a near-miss incident where an engineer accidentally deleted a
+   production namespace, leadership wants a comprehensive disaster
+   recovery strategy. Currently there are no cluster backups.
+
+   Requirements from leadership:
+   - RPO: Maximum 1 hour of data loss for critical services
+   - RTO: Maximum 15 minutes to restore critical services
+   - Scope: 3 production clusters, 2 staging, 150+ namespaces
+   - Compliance: SOC2 requires documented DR procedures and annual testing
+   - Budget: $15K/month for DR infrastructure
+
+   Current gaps:
+   1. No etcd backups — complete cluster loss means rebuilding from scratch
+   2. No PV snapshots — database data would be lost entirely
+   3. Manifests are in Git but not all (some resources created via kubectl)
+   4. No DR runbook exists — team has never practiced recovery
+   5. Secrets are stored only in the cluster (not in an external vault)
+   6. No cross-region replication for any stateful services
+
+   Disaster scenarios to plan for:
+   A. Accidental namespace deletion (most common)
+   B. etcd corruption or loss
+   C. Complete cluster failure
+   D. Regional outage (entire AZ/region down)
+   E. Ransomware/security breach requiring clean rebuild
+
+   Task: Design a comprehensive Kubernetes DR strategy. Write: the backup
+   architecture (Velero + etcd snapshots + PV snapshots), recovery
+   procedures for each scenario, RPO/RTO analysis and trade-offs, DR
+   testing plan (chaos engineering, game days), secrets management for
+   DR, and the cost analysis for the proposed solution.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Backup architecture is comprehensive — Velero for Kubernetes resource backup (scheduled, namespace-scoped), etcd snapshots for cluster state (every 30 min for 1-hour RPO), CSI volume snapshots for persistent data, cross-region backup storage (S3 cross-region replication). GitOps as source of truth for declarative resources. External secrets manager (HashiCorp Vault) so secrets survive cluster loss"
+     weight: 0.35
+     description: "Backup architecture"
+   - type: llm_judge
+     criteria: "Recovery procedures per scenario are defined — (A) Namespace deletion: Velero restore with namespace filter (< 5 min RTO), (B) etcd corruption: restore from etcd snapshot (10-15 min), (C) Cluster failure: provision new cluster + Velero full restore (30-60 min), (D) Regional outage: failover to DR cluster with pre-provisioned capacity (< 15 min with warm standby), (E) Security breach: clean cluster from IaC + restore verified backups (2-4 hours)"
+     weight: 0.35
+     description: "Recovery procedures"
+   - type: llm_judge
+     criteria: "Testing and cost analysis are practical — DR testing: quarterly game days simulating each scenario, automated backup verification (restore to test cluster nightly), chaos engineering with Litmus/Chaos Mesh for ongoing resilience. Cost breakdown: Velero + S3 storage ($2-3K/mo), cross-region replication ($3-5K/mo), warm standby cluster ($5-7K/mo). Justification: 1 hour of downtime costs more than a year of DR infrastructure"
+     weight: 0.30
+     description: "Testing and cost"
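A minimal sketch of an hourly Velero schedule implied by the 1-hour RPO; the namespace list is hypothetical:

  apiVersion: velero.io/v1
  kind: Schedule
  metadata:
    name: critical-hourly
    namespace: velero
  spec:
    schedule: "0 * * * *"       # hourly, matching the 1-hour RPO
    template:
      includedNamespaces:
        - payments              # hypothetical critical namespaces
        - orders
      snapshotVolumes: true     # capture PV snapshots alongside resources
      ttl: 720h0m0s             # retain backups for 30 days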
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/executive-communication.yaml
@@ -0,0 +1,62 @@
+ meta:
+   id: executive-communication
+   level: 4
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Communicate Kubernetes platform value to executives — translate technical metrics into business outcomes and justify infrastructure investment"
+   tags: [Kubernetes, executive-communication, ROI, business-case, leadership, expert]
+
+ state: {}
+
+ trigger: |
+   The CFO questions the Kubernetes platform cost during the quarterly
+   budget review: "We're spending $2.4M annually on Kubernetes
+   infrastructure and a 6-person platform team. What's the ROI? Can we
+   just use serverless instead?"
+
+   You need to prepare a business case. Current data:
+
+   Platform metrics:
+   - 50 microservices across 5 clusters
+   - 200 deployments per week (up from 4/week pre-Kubernetes)
+   - 99.95% availability (up from 99.2%)
+   - Mean time to recovery: 8 minutes (down from 4 hours)
+   - 12 development teams using the platform
+
+   Cost breakdown:
+   - Cloud infrastructure: $1.8M/year
+   - Platform team (6 engineers): $900K/year
+   - Tooling licenses: $120K/year
+   - Total: $2.82M/year
+
+   Business impact (estimated):
+   - Each hour of downtime costs $50K in revenue
+   - Pre-Kubernetes: 52 hours of downtime/year = $2.6M revenue impact
+   - Post-Kubernetes: 4.4 hours of downtime/year = $220K revenue impact
+   - Faster deployments enabled features that generated $5M in new revenue
+   - Developer productivity improved 35% (measured by DORA metrics)
+
+   The CFO also asks:
+   - "Why can't we use AWS Lambda for everything?"
+   - "What would happen if we cut the platform team to 3 people?"
+   - "How does this compare to industry benchmarks?"
+
+   Task: Prepare an executive-level business case for the Kubernetes
+   platform. Write: the ROI calculation, comparison with serverless
+   alternatives (where Kubernetes wins vs where serverless wins), risk
+   analysis of reducing the platform team, industry benchmarks (DORA
+   metrics comparison), and a 3-year cost projection with scaling plans.
+
+ assertions:
+   - type: llm_judge
+     criteria: "ROI calculation is clear — revenue saved from reduced downtime: $2.38M/year. Revenue generated from faster feature delivery: $5M attributed. Developer productivity gain: 35% across 50 developers ≈ 17.5 FTE equivalent ($2.6M value). Total value: ~$10M. Cost: $2.82M. ROI: ~255%. Present in business terms, not technical jargon"
+     weight: 0.35
+     description: "ROI calculation"
+   - type: llm_judge
+     criteria: "Serverless comparison is balanced — Lambda wins for: event-driven workloads, unpredictable traffic, simple functions. Kubernetes wins for: long-running services, complex networking, stateful workloads, predictable high-traffic services (cheaper at scale), multi-cloud portability. Hybrid approach: use Lambda for event processing, Kubernetes for core services. Migration cost and vendor lock-in considered"
+     weight: 0.35
+     description: "Serverless comparison"
+   - type: llm_judge
+     criteria: "Risk and benchmarks are practical — cutting platform team to 3: increased MTTR (8 min → 30+ min), slower developer onboarding, security audit gaps, higher incident rate. Industry benchmarks: DORA Elite performers deploy multiple times/day, < 1 hour lead time, < 5% change failure rate, < 1 hour MTTR. 3-year projection: costs grow 15%/year with infrastructure, but revenue scales 40%/year. Cost per deployment drops as volume increases"
+     weight: 0.30
+     description: "Risk and benchmarks"
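The ROI figure in the first assertion follows from the trigger's numbers; spelled out:

  downtime avoided:  $2.6M - $0.22M                 = $2.38M/year
  feature revenue:   attributed new revenue         = $5.00M/year
  productivity:      35% × 50 devs ≈ 17.5 FTE       ≈ $2.60M/year
  total value:                                      ≈ $9.98M/year
  ROI:               ($9.98M - $2.82M) / $2.82M     ≈ 254% (the assertion's ~255%)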
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/expert-troubleshooting-shift.yaml
@@ -0,0 +1,65 @@
+ meta:
+   id: expert-troubleshooting-shift
+   level: 4
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Combined expert troubleshooting shift — manage a multi-cluster incident requiring cross-team coordination, executive communication, and architectural decisions"
+   tags: [Kubernetes, troubleshooting, combined, shift-simulation, multi-cluster, expert]
+
+ state: {}
+
+ trigger: |
+   You're the platform engineering lead. A regional cloud provider outage
+   affects your primary US cluster (prod-us-1). You must coordinate
+   response across multiple teams and communicate to leadership.
+
+   9:00 AM — Cloud provider status page: "Investigating increased error
+   rates in us-east-1"
+
+   Cluster impact:
+   - prod-us-1 (us-east-1): 40% of API calls failing, 3 nodes NotReady
+   - prod-us-2 (us-west-2): Healthy, running at 60% capacity
+   - prod-eu-1 (eu-west-1): Healthy
+
+   Your decision matrix:
+   1. Do you fail over US traffic to prod-us-2?
+      - Pro: Restores service immediately
+      - Con: prod-us-2 may not handle 100% US traffic
+      - Risk: If us-west-2 also fails, complete US outage
+
+   2. Do you split traffic between prod-us-2 and prod-eu-1?
+      - Pro: Distributes load
+      - Con: EU latency for US users, GDPR implications for EU processing
+
+   3. Do you wait for the cloud provider to resolve?
+      - Pro: Minimal risk of making things worse
+      - Con: Extended downtime, SLO violation
+
+   Complicating factors:
+   - The database is in us-east-1 with read replicas in us-west-2
+   - Database writes will fail if the primary is affected
+   - 3 teams have critical releases scheduled for today
+   - Board meeting at 2 PM expects a platform stability update
+   - Customer support is reporting 500+ tickets in the last hour
+   - Media coverage of the cloud provider outage
+
+   Task: Walk through managing this multi-cluster incident. Write: the
+   decision framework for failover (when to fail over vs when to wait),
+   the communication plan (technical teams, leadership, customers),
+   traffic management across clusters, database write handling during
+   partial outage, post-incident analysis, and how this shapes future DR
+   architecture investments.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Decision framework is structured — assess severity and duration estimate from cloud provider status page. If degradation < 30 min expected: wait with monitoring. If 30+ min or getting worse: failover to prod-us-2 for reads, queue writes or fail gracefully. If 2+ hours: full failover including database promotion in us-west-2. Decision criteria: SLO budget remaining, customer impact, risk of action vs inaction"
+     weight: 0.35
+     description: "Decision framework"
+   - type: llm_judge
+     criteria: "Communication plan is multi-layered — incident commander coordinates all communication. Technical teams: dedicated Slack channel, 15-min sync calls, clear ownership assignments. Leadership: executive summary every 30 min (impact, actions, ETA). Customers: status page update within 15 min, customer support talking points. Board meeting: prepare brief on incident response demonstrating platform resilience and DR investment need"
+     weight: 0.35
+     description: "Communication plan"
+   - type: llm_judge
+     criteria: "Technical response and post-incident are practical — traffic management: Route53 weighted routing to shift traffic, scale up prod-us-2 HPA limits and node count. Database: promote read replica if primary down > 30 min, accept brief data inconsistency. Freeze all non-critical deployments. Post-incident: review single-region database as SPOF, justify multi-region active-active investment, update DR runbooks with this scenario, reduce RTO target"
+     weight: 0.30
+     description: "Technical response"
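For the pre-scale step in the third assertion, a minimal sketch of raising prod-us-2 headroom before shifting traffic; the workload name and replica numbers are hypothetical:

  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: api-gateway
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: api-gateway
    minReplicas: 10     # pre-scale above the normal baseline
    maxReplicas: 60     # raised ceiling for the failover window
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 60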