dojo.md 0.2.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (222)
  1. package/courses/GENERATION_LOG.md +45 -0
  2. package/courses/aws-lambda-debugging/course.yaml +11 -0
  3. package/courses/aws-lambda-debugging/scenarios/level-1/api-gateway-integration.yaml +71 -0
  4. package/courses/aws-lambda-debugging/scenarios/level-1/cloudwatch-logs-basics.yaml +64 -0
  5. package/courses/aws-lambda-debugging/scenarios/level-1/cold-start-basics.yaml +70 -0
  6. package/courses/aws-lambda-debugging/scenarios/level-1/environment-variable-issues.yaml +72 -0
  7. package/courses/aws-lambda-debugging/scenarios/level-1/first-debugging-shift.yaml +73 -0
  8. package/courses/aws-lambda-debugging/scenarios/level-1/handler-import-errors.yaml +71 -0
  9. package/courses/aws-lambda-debugging/scenarios/level-1/iam-permission-errors.yaml +68 -0
  10. package/courses/aws-lambda-debugging/scenarios/level-1/invocation-errors.yaml +72 -0
  11. package/courses/aws-lambda-debugging/scenarios/level-1/lambda-timeout-errors.yaml +65 -0
  12. package/courses/aws-lambda-debugging/scenarios/level-1/memory-and-oom.yaml +70 -0
  13. package/courses/aws-lambda-debugging/scenarios/level-2/async-invocation-failures.yaml +72 -0
  14. package/courses/aws-lambda-debugging/scenarios/level-2/cold-start-optimization.yaml +76 -0
  15. package/courses/aws-lambda-debugging/scenarios/level-2/dynamodb-streams-debugging.yaml +70 -0
  16. package/courses/aws-lambda-debugging/scenarios/level-2/intermediate-debugging-shift.yaml +71 -0
  17. package/courses/aws-lambda-debugging/scenarios/level-2/lambda-concurrency-management.yaml +70 -0
  18. package/courses/aws-lambda-debugging/scenarios/level-2/lambda-layers-debugging.yaml +76 -0
  19. package/courses/aws-lambda-debugging/scenarios/level-2/sam-local-debugging.yaml +74 -0
  20. package/courses/aws-lambda-debugging/scenarios/level-2/sqs-event-source.yaml +72 -0
  21. package/courses/aws-lambda-debugging/scenarios/level-2/vpc-networking-issues.yaml +71 -0
  22. package/courses/aws-lambda-debugging/scenarios/level-2/xray-tracing.yaml +62 -0
  23. package/courses/aws-lambda-debugging/scenarios/level-3/advanced-debugging-shift.yaml +72 -0
  24. package/courses/aws-lambda-debugging/scenarios/level-3/container-image-lambda.yaml +79 -0
  25. package/courses/aws-lambda-debugging/scenarios/level-3/cross-account-invocation.yaml +72 -0
  26. package/courses/aws-lambda-debugging/scenarios/level-3/eventbridge-patterns.yaml +79 -0
  27. package/courses/aws-lambda-debugging/scenarios/level-3/iac-deployment-debugging.yaml +68 -0
  28. package/courses/aws-lambda-debugging/scenarios/level-3/kinesis-stream-processing.yaml +64 -0
  29. package/courses/aws-lambda-debugging/scenarios/level-3/lambda-at-edge.yaml +64 -0
  30. package/courses/aws-lambda-debugging/scenarios/level-3/lambda-extensions-debugging.yaml +67 -0
  31. package/courses/aws-lambda-debugging/scenarios/level-3/powertools-observability.yaml +79 -0
  32. package/courses/aws-lambda-debugging/scenarios/level-3/step-functions-debugging.yaml +80 -0
  33. package/courses/aws-lambda-debugging/scenarios/level-4/cost-optimization-strategy.yaml +67 -0
  34. package/courses/aws-lambda-debugging/scenarios/level-4/expert-debugging-shift.yaml +62 -0
  35. package/courses/aws-lambda-debugging/scenarios/level-4/incident-management-serverless.yaml +61 -0
  36. package/courses/aws-lambda-debugging/scenarios/level-4/multi-region-serverless.yaml +67 -0
  37. package/courses/aws-lambda-debugging/scenarios/level-4/observability-platform-design.yaml +71 -0
  38. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-architecture-design.yaml +64 -0
  39. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-data-architecture.yaml +66 -0
  40. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-migration-strategy.yaml +65 -0
  41. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-security-design.yaml +60 -0
  42. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-testing-strategy.yaml +62 -0
  43. package/courses/aws-lambda-debugging/scenarios/level-5/board-serverless-strategy.yaml +63 -0
  44. package/courses/aws-lambda-debugging/scenarios/level-5/consulting-serverless-adoption.yaml +57 -0
  45. package/courses/aws-lambda-debugging/scenarios/level-5/industry-serverless-patterns.yaml +62 -0
  46. package/courses/aws-lambda-debugging/scenarios/level-5/ma-serverless-integration.yaml +75 -0
  47. package/courses/aws-lambda-debugging/scenarios/level-5/master-debugging-shift.yaml +61 -0
  48. package/courses/aws-lambda-debugging/scenarios/level-5/organizational-serverless-transformation.yaml +65 -0
  49. package/courses/aws-lambda-debugging/scenarios/level-5/regulatory-serverless.yaml +61 -0
  50. package/courses/aws-lambda-debugging/scenarios/level-5/serverless-economics.yaml +65 -0
  51. package/courses/aws-lambda-debugging/scenarios/level-5/serverless-future-technology.yaml +66 -0
  52. package/courses/aws-lambda-debugging/scenarios/level-5/serverless-platform-design.yaml +71 -0
  53. package/courses/docker-container-debugging/course.yaml +11 -0
  54. package/courses/docker-container-debugging/scenarios/level-1/container-exit-codes.yaml +59 -0
  55. package/courses/docker-container-debugging/scenarios/level-1/container-networking-basics.yaml +69 -0
  56. package/courses/docker-container-debugging/scenarios/level-1/docker-logs-debugging.yaml +67 -0
  57. package/courses/docker-container-debugging/scenarios/level-1/dockerfile-build-failures.yaml +71 -0
  58. package/courses/docker-container-debugging/scenarios/level-1/environment-variable-issues.yaml +74 -0
  59. package/courses/docker-container-debugging/scenarios/level-1/first-debugging-shift.yaml +70 -0
  60. package/courses/docker-container-debugging/scenarios/level-1/image-pull-failures.yaml +68 -0
  61. package/courses/docker-container-debugging/scenarios/level-1/port-mapping-issues.yaml +67 -0
  62. package/courses/docker-container-debugging/scenarios/level-1/resource-limits-oom.yaml +70 -0
  63. package/courses/docker-container-debugging/scenarios/level-1/volume-mount-problems.yaml +66 -0
  64. package/courses/docker-container-debugging/scenarios/level-2/container-health-checks.yaml +73 -0
  65. package/courses/docker-container-debugging/scenarios/level-2/docker-compose-debugging.yaml +66 -0
  66. package/courses/docker-container-debugging/scenarios/level-2/docker-exec-debugging.yaml +71 -0
  67. package/courses/docker-container-debugging/scenarios/level-2/image-layer-optimization.yaml +81 -0
  68. package/courses/docker-container-debugging/scenarios/level-2/intermediate-debugging-shift.yaml +73 -0
  69. package/courses/docker-container-debugging/scenarios/level-2/logging-and-log-rotation.yaml +76 -0
  70. package/courses/docker-container-debugging/scenarios/level-2/multi-stage-build-debugging.yaml +76 -0
  71. package/courses/docker-container-debugging/scenarios/level-2/network-debugging-tools.yaml +67 -0
  72. package/courses/docker-container-debugging/scenarios/level-2/pid1-signal-handling.yaml +71 -0
  73. package/courses/docker-container-debugging/scenarios/level-2/security-scanning-basics.yaml +67 -0
  74. package/courses/docker-container-debugging/scenarios/level-3/advanced-debugging-shift.yaml +77 -0
  75. package/courses/docker-container-debugging/scenarios/level-3/buildkit-optimization.yaml +67 -0
  76. package/courses/docker-container-debugging/scenarios/level-3/container-filesystem-debugging.yaml +70 -0
  77. package/courses/docker-container-debugging/scenarios/level-3/container-security-hardening.yaml +74 -0
  78. package/courses/docker-container-debugging/scenarios/level-3/disk-space-management.yaml +74 -0
  79. package/courses/docker-container-debugging/scenarios/level-3/docker-api-automation.yaml +72 -0
  80. package/courses/docker-container-debugging/scenarios/level-3/docker-daemon-issues.yaml +73 -0
  81. package/courses/docker-container-debugging/scenarios/level-3/docker-in-docker-ci.yaml +69 -0
  82. package/courses/docker-container-debugging/scenarios/level-3/overlay-network-debugging.yaml +70 -0
  83. package/courses/docker-container-debugging/scenarios/level-3/production-container-ops.yaml +71 -0
  84. package/courses/docker-container-debugging/scenarios/level-4/cicd-pipeline-design.yaml +66 -0
  85. package/courses/docker-container-debugging/scenarios/level-4/container-monitoring-observability.yaml +63 -0
  86. package/courses/docker-container-debugging/scenarios/level-4/container-orchestration-strategy.yaml +62 -0
  87. package/courses/docker-container-debugging/scenarios/level-4/container-performance-engineering.yaml +64 -0
  88. package/courses/docker-container-debugging/scenarios/level-4/container-security-architecture.yaml +66 -0
  89. package/courses/docker-container-debugging/scenarios/level-4/enterprise-image-management.yaml +58 -0
  90. package/courses/docker-container-debugging/scenarios/level-4/expert-debugging-shift.yaml +63 -0
  91. package/courses/docker-container-debugging/scenarios/level-4/incident-response-containers.yaml +70 -0
  92. package/courses/docker-container-debugging/scenarios/level-4/multi-environment-management.yaml +65 -0
  93. package/courses/docker-container-debugging/scenarios/level-4/stateful-service-containers.yaml +65 -0
  94. package/courses/docker-container-debugging/scenarios/level-5/board-infrastructure-strategy.yaml +58 -0
  95. package/courses/docker-container-debugging/scenarios/level-5/consulting-container-strategy.yaml +61 -0
  96. package/courses/docker-container-debugging/scenarios/level-5/container-platform-architecture.yaml +67 -0
  97. package/courses/docker-container-debugging/scenarios/level-5/container-platform-economics.yaml +67 -0
  98. package/courses/docker-container-debugging/scenarios/level-5/container-technology-evolution.yaml +67 -0
  99. package/courses/docker-container-debugging/scenarios/level-5/disaster-recovery-containers.yaml +66 -0
  100. package/courses/docker-container-debugging/scenarios/level-5/industry-container-patterns.yaml +71 -0
  101. package/courses/docker-container-debugging/scenarios/level-5/master-debugging-shift.yaml +62 -0
  102. package/courses/docker-container-debugging/scenarios/level-5/organizational-transformation.yaml +67 -0
  103. package/courses/docker-container-debugging/scenarios/level-5/regulatory-compliance-containers.yaml +61 -0
  104. package/courses/kubernetes-deployment-troubleshooting/course.yaml +12 -0
  105. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/configmap-secret-issues.yaml +69 -0
  106. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/crashloopbackoff.yaml +68 -0
  107. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/deployment-rollout.yaml +56 -0
  108. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/first-troubleshooting-shift.yaml +65 -0
  109. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/health-probe-failures.yaml +70 -0
  110. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/imagepullbackoff.yaml +57 -0
  111. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/kubectl-debugging-basics.yaml +56 -0
  112. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/oomkilled.yaml +70 -0
  113. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/pending-pods.yaml +68 -0
  114. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/service-not-reachable.yaml +66 -0
  115. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/dns-resolution-failures.yaml +63 -0
  116. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/helm-deployment-failures.yaml +63 -0
  117. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/hpa-scaling-issues.yaml +62 -0
  118. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/ingress-routing-issues.yaml +63 -0
  119. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/init-container-failures.yaml +63 -0
  120. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/intermediate-troubleshooting-shift.yaml +66 -0
  121. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/network-policy-blocking.yaml +67 -0
  122. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/persistent-volume-issues.yaml +69 -0
  123. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/rbac-permission-denied.yaml +57 -0
  124. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/resource-quota-limits.yaml +64 -0
  125. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/advanced-troubleshooting-shift.yaml +69 -0
  126. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/cluster-upgrade-failures.yaml +71 -0
  127. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/gitops-drift-detection.yaml +62 -0
  128. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/job-cronjob-failures.yaml +67 -0
  129. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/monitoring-alerting-gaps.yaml +64 -0
  130. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/multi-container-debugging.yaml +68 -0
  131. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/node-pressure-evictions.yaml +70 -0
  132. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/pod-disruption-budgets.yaml +59 -0
  133. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/service-mesh-debugging.yaml +64 -0
  134. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/statefulset-troubleshooting.yaml +69 -0
  135. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/capacity-planning.yaml +65 -0
  136. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/cost-optimization.yaml +57 -0
  137. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/disaster-recovery-design.yaml +56 -0
  138. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/executive-communication.yaml +62 -0
  139. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/expert-troubleshooting-shift.yaml +65 -0
  140. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/incident-management-process.yaml +59 -0
  141. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-cluster-operations.yaml +62 -0
  142. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-tenancy-design.yaml +55 -0
  143. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/platform-engineering.yaml +59 -0
  144. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/security-hardening.yaml +58 -0
  145. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/behavioral-science.yaml +62 -0
  146. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/board-strategy.yaml +61 -0
  147. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/cloud-native-future.yaml +65 -0
  148. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/comprehensive-platform.yaml +57 -0
  149. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/consulting-engagement.yaml +62 -0
  150. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/industry-benchmarks.yaml +58 -0
  151. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/ma-integration.yaml +62 -0
  152. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/master-troubleshooting-shift.yaml +73 -0
  153. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/product-development.yaml +65 -0
  154. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/regulatory-compliance.yaml +76 -0
  155. package/courses/mysql-query-optimization/course.yaml +11 -0
  156. package/courses/mysql-query-optimization/scenarios/level-1/buffer-pool-basics.yaml +65 -0
  157. package/courses/mysql-query-optimization/scenarios/level-1/explain-basics.yaml +66 -0
  158. package/courses/mysql-query-optimization/scenarios/level-1/first-optimization-shift.yaml +78 -0
  159. package/courses/mysql-query-optimization/scenarios/level-1/innodb-index-fundamentals.yaml +68 -0
  160. package/courses/mysql-query-optimization/scenarios/level-1/join-basics.yaml +66 -0
  161. package/courses/mysql-query-optimization/scenarios/level-1/n-plus-one-queries.yaml +67 -0
  162. package/courses/mysql-query-optimization/scenarios/level-1/query-rewriting-basics.yaml +66 -0
  163. package/courses/mysql-query-optimization/scenarios/level-1/select-star-problems.yaml +68 -0
  164. package/courses/mysql-query-optimization/scenarios/level-1/slow-query-diagnosis.yaml +65 -0
  165. package/courses/mysql-query-optimization/scenarios/level-1/where-clause-optimization.yaml +65 -0
  166. package/courses/mysql-query-optimization/scenarios/level-2/buffer-pool-tuning.yaml +64 -0
  167. package/courses/mysql-query-optimization/scenarios/level-2/composite-index-design.yaml +71 -0
  168. package/courses/mysql-query-optimization/scenarios/level-2/covering-and-invisible-indexes.yaml +69 -0
  169. package/courses/mysql-query-optimization/scenarios/level-2/cte-and-window-functions.yaml +78 -0
  170. package/courses/mysql-query-optimization/scenarios/level-2/intermediate-optimization-shift.yaml +68 -0
  171. package/courses/mysql-query-optimization/scenarios/level-2/join-optimization.yaml +67 -0
  172. package/courses/mysql-query-optimization/scenarios/level-2/performance-schema-analysis.yaml +69 -0
  173. package/courses/mysql-query-optimization/scenarios/level-2/query-optimizer-hints.yaml +74 -0
  174. package/courses/mysql-query-optimization/scenarios/level-2/subquery-optimization.yaml +70 -0
  175. package/courses/mysql-query-optimization/scenarios/level-2/write-optimization.yaml +63 -0
  176. package/courses/mysql-query-optimization/scenarios/level-3/advanced-optimization-shift.yaml +71 -0
  177. package/courses/mysql-query-optimization/scenarios/level-3/connection-management.yaml +67 -0
  178. package/courses/mysql-query-optimization/scenarios/level-3/full-text-search.yaml +77 -0
  179. package/courses/mysql-query-optimization/scenarios/level-3/json-optimization.yaml +87 -0
  180. package/courses/mysql-query-optimization/scenarios/level-3/lock-contention-analysis.yaml +68 -0
  181. package/courses/mysql-query-optimization/scenarios/level-3/monitoring-alerting.yaml +63 -0
  182. package/courses/mysql-query-optimization/scenarios/level-3/online-schema-changes.yaml +79 -0
  183. package/courses/mysql-query-optimization/scenarios/level-3/partitioning-strategies.yaml +83 -0
  184. package/courses/mysql-query-optimization/scenarios/level-3/query-profiling-deep-dive.yaml +84 -0
  185. package/courses/mysql-query-optimization/scenarios/level-3/replication-optimization.yaml +66 -0
  186. package/courses/mysql-query-optimization/scenarios/level-4/aurora-vs-rds-evaluation.yaml +61 -0
  187. package/courses/mysql-query-optimization/scenarios/level-4/data-architecture.yaml +62 -0
  188. package/courses/mysql-query-optimization/scenarios/level-4/database-migration-planning.yaml +59 -0
  189. package/courses/mysql-query-optimization/scenarios/level-4/enterprise-governance.yaml +50 -0
  190. package/courses/mysql-query-optimization/scenarios/level-4/executive-communication.yaml +54 -0
  191. package/courses/mysql-query-optimization/scenarios/level-4/expert-optimization-shift.yaml +67 -0
  192. package/courses/mysql-query-optimization/scenarios/level-4/high-availability-architecture.yaml +60 -0
  193. package/courses/mysql-query-optimization/scenarios/level-4/optimizer-internals.yaml +62 -0
  194. package/courses/mysql-query-optimization/scenarios/level-4/performance-sla-design.yaml +52 -0
  195. package/courses/mysql-query-optimization/scenarios/level-4/read-replica-scaling.yaml +51 -0
  196. package/courses/mysql-query-optimization/scenarios/level-5/ai-database-future.yaml +45 -0
  197. package/courses/mysql-query-optimization/scenarios/level-5/behavioral-science.yaml +44 -0
  198. package/courses/mysql-query-optimization/scenarios/level-5/benchmark-design.yaml +47 -0
  199. package/courses/mysql-query-optimization/scenarios/level-5/board-strategy.yaml +48 -0
  200. package/courses/mysql-query-optimization/scenarios/level-5/comprehensive-platform.yaml +49 -0
  201. package/courses/mysql-query-optimization/scenarios/level-5/consulting-engagement.yaml +52 -0
  202. package/courses/mysql-query-optimization/scenarios/level-5/ma-database-integration.yaml +47 -0
  203. package/courses/mysql-query-optimization/scenarios/level-5/master-optimization-shift.yaml +56 -0
  204. package/courses/mysql-query-optimization/scenarios/level-5/product-development.yaml +48 -0
  205. package/courses/mysql-query-optimization/scenarios/level-5/regulatory-compliance.yaml +48 -0
  206. package/courses/postgresql-query-optimization/scenarios/level-5/comprehensive-database-system.yaml +70 -0
  207. package/courses/postgresql-query-optimization/scenarios/level-5/database-ai-future.yaml +81 -0
  208. package/courses/postgresql-query-optimization/scenarios/level-5/database-behavioral-science.yaml +63 -0
  209. package/courses/postgresql-query-optimization/scenarios/level-5/database-board-strategy.yaml +77 -0
  210. package/courses/postgresql-query-optimization/scenarios/level-5/database-consulting-engagement.yaml +61 -0
  211. package/courses/postgresql-query-optimization/scenarios/level-5/database-industry-benchmarks.yaml +64 -0
  212. package/courses/postgresql-query-optimization/scenarios/level-5/database-ma-integration.yaml +71 -0
  213. package/courses/postgresql-query-optimization/scenarios/level-5/database-product-development.yaml +72 -0
  214. package/courses/postgresql-query-optimization/scenarios/level-5/database-regulatory-landscape.yaml +76 -0
  215. package/courses/postgresql-query-optimization/scenarios/level-5/master-optimization-shift.yaml +66 -0
  216. package/courses/terraform-infrastructure-setup/course.yaml +11 -0
  217. package/courses/terraform-infrastructure-setup/scenarios/level-1/terraform-init-errors.yaml +72 -0
  218. package/dist/mcp/session-manager.d.ts +7 -4
  219. package/dist/mcp/session-manager.d.ts.map +1 -1
  220. package/dist/mcp/session-manager.js +23 -8
  221. package/dist/mcp/session-manager.js.map +1 -1
  222. package/package.json +1 -1
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/network-policy-blocking.yaml
@@ -0,0 +1,67 @@
+ meta:
+   id: network-policy-blocking
+   level: 2
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug NetworkPolicy issues — diagnose why pods can't communicate when network policies restrict traffic"
+   tags: [Kubernetes, NetworkPolicy, networking, security, traffic, intermediate]
+
+ state: {}
+
+ trigger: |
+   After a security team applied NetworkPolicies, your microservices
+   stopped communicating:
+
+   $ kubectl exec -it frontend-pod -- curl -s http://api-service:8080/health
+   curl: (7) Failed to connect to api-service port 8080: Connection timed out
+
+   Before the NetworkPolicies, this worked fine. The services are in the
+   same namespace.
+
+   $ kubectl get networkpolicy -n app
+   NAME             POD-SELECTOR   AGE
+   deny-all         <none>         5m
+   allow-frontend   app=frontend   5m
+
+   $ kubectl describe networkpolicy deny-all
+   Spec:
+     PodSelector: <none> (applies to all pods)
+     Allowing ingress traffic: <none> (deny all ingress)
+     Allowing egress traffic: <none> (deny all egress)
+
+   $ kubectl describe networkpolicy allow-frontend
+   Spec:
+     PodSelector: app=frontend
+     Allowing ingress traffic:
+       From: <any> (allow all ingress to frontend)
+     Allowing egress traffic: <none> (not specified)
+
+   The deny-all policy blocks all ingress AND egress for every pod.
+   The allow-frontend policy allows ingress TO frontend but doesn't
+   allow egress FROM frontend. And there's no policy allowing ingress
+   to the api-service.
+
+   Problems:
+   1. Frontend can't send requests (egress blocked by deny-all)
+   2. API service can't receive requests (ingress blocked by deny-all)
+   3. DNS resolution also blocked (egress to kube-dns on port 53)
+
+   Task: Explain NetworkPolicies and how to debug them. Write: how
+   NetworkPolicies work (default allow-all, additive when applied),
+   ingress vs egress rules, how to read policy selectors (podSelector,
+   namespaceSelector, ipBlock), why DNS breaks when egress is blocked,
+   how to write a working zero-trust policy set, and debugging techniques.
+
+ assertions:
+   - type: llm_judge
+     criteria: "NetworkPolicy behavior is explained — by default all traffic allowed, applying any policy to a pod makes it deny-all for that direction (ingress/egress), then rules are additive (allow specific traffic). A deny-all policy with empty podSelector applies to ALL pods in namespace. Policies are namespace-scoped. Must explicitly allow both directions — if frontend needs to talk to API, frontend needs egress rule AND API needs ingress rule"
+     weight: 0.35
+     description: "NetworkPolicy behavior"
+   - type: llm_judge
+     criteria: "DNS and common pitfalls are addressed — when egress is denied, DNS resolution breaks (CoreDNS runs in kube-system on port 53 UDP/TCP). Must add egress rule allowing port 53 to kube-system namespace. Common pitfall: forget to allow DNS, forget both directions needed, label selectors don't match pods, missing namespaceSelector for cross-namespace policies"
+     weight: 0.35
+     description: "DNS and pitfalls"
+   - type: llm_judge
+     criteria: "Debugging workflow is practical — check what policies apply to a pod (kubectl get netpol, match podSelector to pod labels), verify traffic is blocked vs application error (test with kubectl exec curl/nc), check if CNI plugin supports NetworkPolicy (not all do — e.g., Flannel doesn't by default, Calico does), use packet capture or logging to trace blocked connections"
+     weight: 0.30
+     description: "Debugging workflow"
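The zero-trust policy set that this scenario asks learners to write could be sketched as follows. This is a hedged illustration, not part of the package: the namespace `app`, frontend label, and port 8080 come from the scenario text, while the `app: api` label on the API pods and the `kubernetes.io/metadata.name` namespace label (present by default on Kubernetes 1.22+) are assumptions.

```yaml
# Sketch: zero-trust set for the scenario's app namespace.
# Assumes api-service pods carry the label app=api (not shown in the scenario).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-egress
  namespace: app
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes: [Egress]
  egress:
    # frontend -> api on 8080
    - to:
        - podSelector:
            matchLabels:
              app: api
      ports:
        - { protocol: TCP, port: 8080 }
    # frontend -> CoreDNS in kube-system, or name resolution breaks
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-ingress
  namespace: app
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - { protocol: TCP, port: 8080 }
```

Note that both policies are needed: egress from the frontend and ingress to the API are evaluated independently, which is exactly the "both directions" pitfall the first assertion checks for.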
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/persistent-volume-issues.yaml
@@ -0,0 +1,69 @@
+ meta:
+   id: persistent-volume-issues
+   level: 2
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug PersistentVolume issues — diagnose PVC binding failures, mount errors, and storage class misconfigurations"
+   tags: [Kubernetes, PersistentVolume, PVC, storage, StorageClass, intermediate]
+
+ state: {}
+
+ trigger: |
+   Your StatefulSet for PostgreSQL won't start. The pods are stuck in
+   Pending because their PVCs can't bind:
+
+   $ kubectl get pods
+   NAME         READY   STATUS    RESTARTS   AGE
+   postgres-0   0/1     Pending   0          10m
+
+   $ kubectl get pvc
+   NAME              STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
+   data-postgres-0   Pending                                      fast-ssd       10m
+
+   $ kubectl describe pvc data-postgres-0
+   Events:
+     Warning  ProvisioningFailed  2m  persistentvolume-controller
+       storageclass.storage.k8s.io "fast-ssd" not found
+
+   The StorageClass "fast-ssd" doesn't exist in this cluster. But there's
+   more — even after creating the StorageClass, a second issue appears:
+
+   $ kubectl get pvc
+   NAME              STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
+   data-postgres-0   Pending                                      fast-ssd       15m
+
+   $ kubectl describe pvc data-postgres-0
+   Events:
+     Warning  ProvisioningFailed  1m  ebs.csi.aws.com_ebs-csi-controller
+       could not create volume in zone "us-east-1c": volume limit of 25
+       per node reached
+
+   The CSI driver can't provision because the node has hit its volume
+   attachment limit.
+
+   Key concepts:
+   - PV/PVC binding: PVCs request storage, PVs provide it
+   - StorageClass: defines how storage is provisioned dynamically
+   - Access modes: RWO, ROX, RWX — must match between PV and PVC
+   - volumeBindingMode: Immediate vs WaitForFirstConsumer
+   - CSI drivers: external storage providers
+
+   Task: Explain how Kubernetes persistent storage works and how to
+   debug issues. Write: the PV/PVC/StorageClass relationship, dynamic
+   vs static provisioning, access modes and their constraints, common
+   binding failures (missing StorageClass, capacity, access mode
+   mismatch, zone affinity), CSI driver issues, and volumeBindingMode.
+
+ assertions:
+   - type: llm_judge
+     criteria: "PV/PVC/StorageClass relationship is explained — PVCs are requests for storage (size, access mode, StorageClass), PVs are the actual storage resources, StorageClass defines the provisioner and parameters for dynamic provisioning. When a PVC is created, the StorageClass provisioner automatically creates a PV. Static provisioning means pre-creating PVs for PVCs to bind to"
+     weight: 0.35
+     description: "Storage relationship explained"
+   - type: llm_judge
+     criteria: "Common binding failures are covered — missing StorageClass, insufficient capacity, access mode mismatch (requesting RWX when storage only supports RWO), zone affinity (PV in different zone than pod), CSI driver issues (volume attachment limits, driver not installed), volumeBindingMode Immediate creates PV immediately vs WaitForFirstConsumer waits until pod is scheduled (better for zone-aware provisioning)"
+     weight: 0.35
+     description: "Binding failures covered"
+   - type: llm_judge
+     criteria: "Debugging workflow is practical — check PVC status and events (kubectl describe pvc), check StorageClass exists (kubectl get sc), verify CSI driver pods running (kubectl get pods -n kube-system), check node volume limits, verify access modes match. For StatefulSets: PVCs persist even after pod deletion, must manually delete PVC to re-provision"
+     weight: 0.30
+     description: "Debugging workflow"
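A minimal sketch of the missing StorageClass for this scenario, using the `ebs.csi.aws.com` provisioner named in the error output. The `gp3` volume type and the 10Gi request are assumptions for illustration; only the `fast-ssd` name and the EBS CSI driver come from the scenario.

```yaml
# Sketch: StorageClass matching the scenario's "fast-ssd" name.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3                    # volume type is an assumption
# Delay provisioning until the pod is scheduled, so the volume is
# created in the zone where the pod actually lands (avoids zone affinity
# failures like the us-east-1c error above).
volumeBindingMode: WaitForFirstConsumer
---
# The PVC a StatefulSet volumeClaimTemplate would generate for postgres-0.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-0
spec:
  accessModes: [ReadWriteOnce]  # EBS volumes only support RWO
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 10Gi             # requested size is an assumption
```

The node-level volume attachment limit in the second error is a separate constraint: it is fixed by spreading pods across nodes or adding nodes, not by any StorageClass setting.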
@@ -0,0 +1,57 @@
+ meta:
+   id: rbac-permission-denied
+   level: 2
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug RBAC permission issues — diagnose Forbidden errors when pods or users can't access Kubernetes API resources"
+   tags: [Kubernetes, RBAC, ServiceAccount, permissions, security, intermediate]
+
+ state: {}
+
+ trigger: |
+   Your application pod needs to list other pods for service discovery,
+   but it's getting Forbidden errors:
+
+   $ kubectl logs discovery-agent-pod
+   Error: pods is forbidden: User "system:serviceaccount:app:default"
+   cannot list resource "pods" in API group "" in the namespace "app"
+
+   The application uses the Kubernetes API client to discover peer pods
+   for cluster coordination. It's using the default ServiceAccount, which
+   has no permissions beyond the basics.
+
+   $ kubectl auth can-i list pods --as=system:serviceaccount:app:default -n app
+   no
+
+   You need to create proper RBAC resources. But which combination?
+
+   Current state of RBAC resources:
+   - A Role "pod-reader" exists but only grants "get" (not "list" or "watch")
+   - A RoleBinding exists but binds to ServiceAccount "discovery-agent"
+     (which doesn't exist) instead of "default"
+   - The pod is using the "default" ServiceAccount
+
+   Multiple issues:
+   1. Role needs "list" and "watch" verbs, not just "get"
+   2. RoleBinding references the wrong ServiceAccount name
+   3. Should create a dedicated ServiceAccount instead of using default
+
+   Task: Explain Kubernetes RBAC and how to debug permission issues.
+   Write: the RBAC model (Role, ClusterRole, RoleBinding, ClusterRoleBinding),
+   how ServiceAccounts work (pods authenticate as their SA), how to
+   check permissions (kubectl auth can-i), common RBAC mistakes, and
+   the principle of least privilege.
+
+ assertions:
+   - type: llm_judge
+     criteria: "RBAC model is explained — Role defines permissions (verbs on resources) within a namespace, ClusterRole defines cluster-wide permissions, RoleBinding grants a Role to a subject (User, Group, ServiceAccount) within a namespace, ClusterRoleBinding grants cluster-wide. Pods authenticate to the API server using their ServiceAccount token mounted at /var/run/secrets/kubernetes.io/serviceaccount/"
+     weight: 0.35
+     description: "RBAC model explained"
+   - type: llm_judge
+     criteria: "Debugging permission issues is systematic — use kubectl auth can-i to test specific permissions, check which ServiceAccount the pod uses (kubectl get pod -o yaml | grep serviceAccountName), verify Role has correct verbs and resources, verify RoleBinding references correct Role and ServiceAccount, check namespace scope (Role vs ClusterRole). The 'system:serviceaccount:<namespace>:<name>' format identifies ServiceAccounts"
+     weight: 0.35
+     description: "Permission debugging"
+   - type: llm_judge
+     criteria: "Best practices and fixes are covered — create dedicated ServiceAccounts per application (don't use default), grant minimum required permissions (least privilege), use Role/RoleBinding for namespace-scoped access (prefer over ClusterRole), audit permissions regularly (kubectl auth can-i --list), never use cluster-admin for application ServiceAccounts"
+     weight: 0.30
+     description: "Best practices"
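One way the corrected resources for this scenario could look — a sketch reusing the names from the trigger text (`pod-reader`, `discovery-agent`, namespace `app`); the course's official answer may differ:

```yaml
# Fixes all three issues: dedicated SA, list/watch verbs, correct binding subject.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: discovery-agent          # dedicated SA instead of "default"
  namespace: app
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: app
rules:
  - apiGroups: [""]              # "" is the core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]   # "get" alone was not enough
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: app
subjects:
  - kind: ServiceAccount
    name: discovery-agent        # must reference an SA that actually exists
    namespace: app
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

After setting `serviceAccountName: discovery-agent` in the pod spec, the check should flip: `kubectl auth can-i list pods --as=system:serviceaccount:app:discovery-agent -n app` returns `yes`.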
@@ -0,0 +1,64 @@
+ meta:
+   id: resource-quota-limits
+   level: 2
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug ResourceQuota and LimitRange issues — diagnose why deployments fail due to namespace resource constraints"
+   tags: [Kubernetes, ResourceQuota, LimitRange, QoS, resource-management, intermediate]
+
+ state: {}
+
+ trigger: |
+   Your team's deployment suddenly fails to create new pods:
+
+   $ kubectl apply -f deployment.yaml
+   Error from server (Forbidden): error when creating "deployment.yaml":
+   pods "cache-svc-7a8b9c0d-e1f2" is forbidden: exceeded quota:
+   team-alpha-quota, requested: cpu=500m,memory=512Mi,
+   used: cpu=3700m,memory=7Gi, limited: cpu=4,memory=8Gi
+
+   $ kubectl get resourcequota -n team-alpha
+   NAME              AGE   REQUEST                         LIMIT
+   team-alpha-quota  30d   cpu: 3700m/4, memory: 7Gi/8Gi   cpu: 7/8, memory: 14Gi/16Gi
+
+   The namespace has a ResourceQuota and you're hitting the ceiling.
+   But wait — your deployment doesn't specify resource requests:
+
+   $ kubectl get deployment cache-svc -o yaml | grep -A5 resources
+   resources: {}
+
+   Yet the error says it's requesting 500m CPU and 512Mi memory. Why?
+
+   $ kubectl get limitrange -n team-alpha
+   NAME             CREATED AT
+   default-limits   2025-11-01T00:00:00Z
+
+   $ kubectl describe limitrange default-limits
+   Type       Resource  Min    Max   Default Request  Default Limit
+   Container  cpu       100m   2     500m             1
+   Container  memory    128Mi  4Gi   512Mi            1Gi
+
+   The LimitRange is automatically injecting default requests! The pod
+   gets 500m CPU and 512Mi memory even though the deployment doesn't
+   specify them.
+
+   Task: Explain ResourceQuotas and LimitRanges. Write: how ResourceQuotas
+   enforce namespace-level limits, how LimitRanges set per-container
+   defaults and min/max, how they interact (when a quota exists, all pods
+   must have requests — LimitRange provides defaults), QoS classes
+   (Guaranteed, Burstable, BestEffort) and eviction priority, and how
+   to right-size resource settings.
+
+ assertions:
+   - type: llm_judge
+     criteria: "ResourceQuota and LimitRange interaction is explained — ResourceQuota sets total namespace limits (aggregate CPU, memory, pod count). When ResourceQuota exists, every pod MUST specify resource requests — if not set, LimitRange provides defaults. LimitRange also enforces min/max per container. If no LimitRange default and no request specified, pod creation fails with quota enabled"
+     weight: 0.35
+     description: "Quota and LimitRange interaction"
+   - type: llm_judge
+     criteria: "QoS classes are explained — Guaranteed (requests=limits for all containers, highest priority, last evicted), Burstable (requests<limits or partial specification, medium priority), BestEffort (no requests or limits, first evicted under pressure). Understanding QoS helps prioritize which pods survive during resource contention"
+     weight: 0.35
+     description: "QoS classes explained"
+   - type: llm_judge
+     criteria: "Right-sizing and debugging are practical — check quota usage (kubectl describe quota), check LimitRange defaults (kubectl describe limitrange), use kubectl top pod for actual usage vs requests, review QoS class with kubectl get pod -o yaml. Right-sizing: set requests to P95 usage, limits to peak + buffer. Avoid BestEffort for production workloads"
+     weight: 0.30
+     description: "Right-sizing and debugging"
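As context for the right-sizing criteria above, an explicitly specified `resources` block avoids depending on LimitRange injection entirely. A sketch for the scenario's `cache-svc` deployment; the numbers are hypothetical and should really come from observed usage (`kubectl top pod`):

```yaml
# Hypothetical right-sized settings: requests ~ P95 usage, limits ~ peak + buffer.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cache-svc
  namespace: team-alpha
spec:
  replicas: 2
  selector:
    matchLabels: { app: cache-svc }
  template:
    metadata:
      labels: { app: cache-svc }
    spec:
      containers:
        - name: cache
          image: redis:7             # placeholder image
          resources:
            requests:
              cpu: 250m              # counted against the ResourceQuota REQUEST column
              memory: 256Mi
            limits:
              cpu: 500m              # requests < limits → Burstable QoS class
              memory: 512Mi
```

Setting `requests` equal to `limits` instead would make the pod Guaranteed, trading burst headroom for eviction protection.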
@@ -0,0 +1,69 @@
+ meta:
+   id: advanced-troubleshooting-shift
+   level: 3
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Combined advanced troubleshooting shift — diagnose a production incident involving node failures, storage issues, cascading service outages, and monitoring gaps"
+   tags: [Kubernetes, troubleshooting, combined, shift-simulation, production-incident, advanced]
+
+ state: {}
+
+ trigger: |
+   3:00 AM alert: "Multiple services degraded in production." You're the
+   on-call SRE. The incident cascaded from a single root cause:
+
+   Timeline:
+   - 2:30 AM: worker-5 hits MemoryPressure, starts evicting pods
+   - 2:35 AM: Evicted pods reschedule to worker-2 and worker-3
+   - 2:40 AM: worker-2 runs out of disk space from container logs
+   - 2:45 AM: StatefulSet database pods fail to reschedule (PVC zone affinity)
+   - 2:50 AM: Services depending on the database start failing
+   - 2:55 AM: Alertmanager didn't page because the Slack webhook expired
+   - 3:00 AM: Customer reports trigger manual investigation
+
+   Current state:
+   $ kubectl get nodes
+   NAME      STATUS                        ROLES    VERSION
+   worker-1  Ready                         worker   v1.29.0
+   worker-2  Ready,SchedulingDisabled      worker   v1.29.0
+   worker-3  Ready                         worker   v1.29.0
+   worker-4  Ready                         worker   v1.29.0
+   worker-5  NotReady,SchedulingDisabled   worker   v1.29.0
+
+   $ kubectl get pods -n critical --field-selector=status.phase!=Running
+   NAME                       STATUS    RESTARTS   AGE
+   postgres-1                 Pending   0          25m
+   search-indexer-7a8b-c9d0   Failed    0          25m
+   api-cache-1e2f-g3h4        Failed    0          25m
+
+   Issues to address:
+   1. worker-5 MemoryPressure — identify the memory-hungry pod, assess if
+      the node can recover or needs replacement
+   2. worker-2 DiskPressure — container log rotation not configured,
+      /var/lib/docker full
+   3. postgres-1 Pending — PVC bound to a volume in worker-5's AZ, can't
+      reschedule to another AZ
+   4. Cascading failures — services failing because the database is unavailable
+   5. Monitoring gap — Alertmanager webhook expired, no backup channel
+   6. No PDB on critical services — evictions were uncontrolled
+
+   Task: Walk through this cascading incident. Write: the root cause
+   analysis (chain of events), immediate remediation steps for each
+   issue, how to restore service for the database (PVC zone affinity
+   options), why the monitoring gap let this escalate for 30 minutes,
+   and the post-incident improvements (PDBs, log rotation, monitoring
+   redundancy, capacity planning).
+
+ assertions:
+   - type: llm_judge
+     criteria: "Cascading failure chain is explained — initial trigger was MemoryPressure on worker-5 causing evictions. Evicted pods redistributed, overloading worker-2 which then hit DiskPressure. StatefulSet pods couldn't reschedule due to PVC zone affinity. Dependent services failed. Monitoring gap delayed response by 30 minutes. Each failure amplified the next"
+     weight: 0.35
+     description: "Cascade chain explained"
+   - type: llm_judge
+     criteria: "Immediate remediation is practical — (1) free disk on worker-2 (truncate/rotate logs, clear unused images: crictl rmi --prune), (2) for postgres-1: either bring worker-5 back online or create new PV in available AZ and restore from backup, (3) restart failed pods once dependencies are back, (4) fix Alertmanager config immediately. Prioritize database recovery as it's the dependency for other services"
+     weight: 0.35
+     description: "Immediate remediation"
+   - type: llm_judge
+     criteria: "Post-incident improvements are comprehensive — add PodDisruptionBudgets for critical services, configure container log rotation (logrotate or container runtime config), add backup alerting channels (PagerDuty + Slack), implement capacity alerts before pressure (warn at 80%), use WaitForFirstConsumer for PVCs, run regular backup/restore tests, add pod anti-affinity to spread critical pods across nodes"
+     weight: 0.30
+     description: "Post-incident improvements"
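The first post-incident improvement in the criteria above, a PodDisruptionBudget, is a short manifest. A sketch for the scenario's database; the selector and threshold are assumptions, not part of the course files:

```yaml
# With minAvailable: 2, voluntary evictions (drains, rollouts) are refused
# whenever they would leave fewer than 2 postgres pods running.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb
  namespace: critical
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: postgres      # hypothetical label on the StatefulSet pods
```

Note that a PDB only governs voluntary disruptions; node-pressure evictions like the one that started this incident bypass it, which is why capacity alerts are listed alongside it.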
@@ -0,0 +1,71 @@
+ meta:
+   id: cluster-upgrade-failures
+   level: 3
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug Kubernetes cluster upgrade failures — diagnose API deprecations, node rotation issues, and workload disruptions during upgrades"
+   tags: [Kubernetes, cluster-upgrade, API-deprecation, node-rotation, advanced]
+
+ state: {}
+
+ trigger: |
+   Your team is upgrading a Kubernetes cluster from 1.28 to 1.30 and
+   encountering multiple failures:
+
+   Phase 1 — Control plane upgrade:
+   Pre-upgrade validation shows deprecated API usage:
+
+   $ kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
+   apiserver_requested_deprecated_apis{group="extensions",version="v1beta1",
+     resource="ingresses"} 47
+   apiserver_requested_deprecated_apis{group="policy",version="v1beta1",
+     resource="podsecuritypolicies"} 12
+
+   47 Ingress resources still use extensions/v1beta1 (removed in 1.22)
+   and 12 PodSecurityPolicy resources use policy/v1beta1 (removed in 1.25).
+
+   Phase 2 — Node rotation:
+   After the control plane upgrade, worker node drain fails:
+
+   $ kubectl drain worker-1 --ignore-daemonsets
+   error: unable to drain node "worker-1": cannot evict pod
+   "critical-db-0": disruption budget prevents eviction
+
+   The PDB blocks eviction. After increasing replicas to allow the drain:
+
+   $ kubectl get nodes
+   NAME      STATUS                     VERSION
+   master-1  Ready                      v1.30.0
+   worker-1  Ready,SchedulingDisabled   v1.28.5
+   worker-2  Ready                      v1.30.0
+   worker-3  Ready                      v1.28.5
+
+   A mixed-version cluster. The skew policy lets kubelets lag the API
+   server by up to three minor versions (two, before 1.28), so the 1.28
+   workers remain supported, but finish rotating them promptly.
+
+   Phase 3 — Workload issues post-upgrade:
+   After all nodes are upgraded, some DaemonSets fail because they use
+   privileged containers and the new Pod Security Standards enforce the
+   restricted profile on certain namespaces.
+
+   Task: Explain the Kubernetes cluster upgrade process and troubleshooting.
+   Write: the upgrade sequence (control plane first, then nodes), API
+   deprecation checking and migration, the version skew policy (kubelet may
+   lag the API server by up to three minor versions, never lead it),
+   PDB considerations during node drain, Pod Security Standards/Admission
+   replacing PodSecurityPolicy, and upgrade planning best practices.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Upgrade sequence is explained — always upgrade control plane first (API server, controller manager, scheduler, etcd), then worker nodes one by one. Version skew policy: kubelet may be up to three minor versions older than the API server (two, before 1.28) and never newer; control-plane components stay within one minor version. Cannot skip minor versions (must go 1.28→1.29→1.30). Check deprecated API usage before upgrade with metrics or kubectl commands"
+     weight: 0.35
+     description: "Upgrade sequence"
+   - type: llm_judge
+     criteria: "API deprecation and migration are covered — deprecated APIs continue working until removed. Track deprecations with apiserver_requested_deprecated_apis metric. Use kubectl convert or manual manifest updates to migrate (e.g., extensions/v1beta1 Ingress → networking.k8s.io/v1). PodSecurityPolicy removed in 1.25, replaced by Pod Security Standards/Admission. Plan migration before upgrading"
+     weight: 0.35
+     description: "API deprecation"
+   - type: llm_judge
+     criteria: "Operational considerations are practical — pre-upgrade checklist: check API deprecations, review PDBs for drain compatibility, backup etcd, test in staging first. During upgrade: drain nodes one at a time, monitor workload health, use surge capacity for zero-downtime. Post-upgrade: verify all nodes at new version, run integration tests, check for Pod Security Standard violations. Use managed Kubernetes (EKS/GKE/AKS) to simplify"
+     weight: 0.30
+     description: "Operational considerations"
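The Ingress migration mentioned in the criteria (`extensions/v1beta1` → `networking.k8s.io/v1`) mostly comes down to the backend schema and the new required `pathType`. A sketch with a hypothetical host and service name:

```yaml
# networking.k8s.io/v1 form. The old extensions/v1beta1 form used
# serviceName/servicePort directly under backend and had no pathType field.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  rules:
    - host: example.internal     # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix     # required in v1
            backend:
              service:
                name: web        # was: serviceName: web
                port:
                  number: 80     # was: servicePort: 80
```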
@@ -0,0 +1,62 @@
+ meta:
+   id: gitops-drift-detection
+   level: 3
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug GitOps drift — diagnose when cluster state diverges from Git and how ArgoCD/Flux detect and reconcile drift"
+   tags: [Kubernetes, GitOps, ArgoCD, Flux, drift-detection, reconciliation, advanced]
+
+ state: {}
+
+ trigger: |
+   Your ArgoCD dashboard shows several applications as "OutOfSync" and
+   one as "Degraded":
+
+   ArgoCD Applications:
+   | NAME          | SYNC STATUS | HEALTH STATUS | MESSAGE                |
+   |---------------|-------------|---------------|------------------------|
+   | api-gateway   | OutOfSync   | Healthy       | live != desired        |
+   | user-service  | Synced      | Degraded      | 0/3 pods available     |
+   | order-service | OutOfSync   | Healthy       | manual override        |
+   | monitoring    | Unknown     | Unknown       | ComparisonError        |
+
+   Investigation:
+
+   1. api-gateway (OutOfSync, Healthy) — someone used kubectl edit to
+      change the replica count from 2 to 5 directly in the cluster.
+      ArgoCD detects the drift but auto-sync is disabled, so it shows
+      OutOfSync without correcting it.
+
+   2. user-service (Synced, Degraded) — Git has the correct manifest but
+      the pods are failing. ArgoCD shows Synced because the desired state
+      matches Git, but the health check shows Degraded because the pods
+      aren't running. The issue is in the application code, not GitOps.
+
+   3. order-service (OutOfSync, Healthy) — a Helm values override was
+      applied directly, bypassing the Git repo. The rendered manifests
+      differ from what Git produces.
+
+   4. monitoring (Unknown, ComparisonError) — ArgoCD can't compare the
+      desired state because a CRD was deleted from the cluster, making
+      the custom resources unresolvable.
+
+   Task: Explain GitOps drift detection and reconciliation. Write: what
+   drift means (cluster state differs from the Git source of truth), how
+   ArgoCD detects drift (periodic comparison), sync vs health status,
+   auto-sync vs manual sync, why manual kubectl changes are problematic
+   in GitOps, how to handle legitimate emergency changes, and CRD
+   dependency management.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Drift detection is explained — ArgoCD periodically compares live cluster state with the desired state from Git. OutOfSync means live != Git. Sync status and health status are independent: an app can be Synced but Degraded (code bug), or OutOfSync but Healthy (manual scale change). Auto-sync automatically corrects drift by applying Git state. Manual sync requires human approval"
+     weight: 0.35
+     description: "Drift detection"
+   - type: llm_judge
+     criteria: "Manual changes problem is addressed — kubectl edit/apply bypasses Git, creating drift that ArgoCD flags. In GitOps, ALL changes should go through Git (PR → merge → sync). For emergencies: make the change directly BUT immediately commit the same change to Git so the source of truth is updated. ArgoCD self-heal option auto-reverts manual changes. Flux uses similar reconciliation loop"
+     weight: 0.35
+     description: "Manual changes problem"
+   - type: llm_judge
+     criteria: "CRD and operational issues are covered — CRD deletion causes ComparisonError because ArgoCD can't parse custom resources without the CRD. Fix: ensure CRDs are managed by ArgoCD with proper sync waves (CRDs before resources). Use app-of-apps pattern for dependency ordering. Monitor ArgoCD itself for health. Use argocd app diff to see what changed, argocd app sync --dry-run to preview"
+     weight: 0.30
+     description: "CRD and operations"
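The self-heal behavior the criteria describe is a per-Application setting. A sketch of an Argo CD Application with drift auto-correction enabled; the repo URL and paths are placeholders, not part of the course files:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-gateway
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/manifests.git  # placeholder
    targetRevision: main
    path: api-gateway
  destination:
    server: https://kubernetes.default.svc
    namespace: app
  syncPolicy:
    automated:
      prune: true      # delete cluster resources that were removed from Git
      selfHeal: true   # automatically revert manual kubectl changes
```

With `selfHeal: true`, the api-gateway scenario above plays out differently: the manual `kubectl edit` to 5 replicas would be reverted to Git's 2 on the next reconciliation instead of lingering as OutOfSync.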
@@ -0,0 +1,67 @@
+ meta:
+   id: job-cronjob-failures
+   level: 3
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug Job and CronJob failures — diagnose stuck jobs, missed schedules, concurrent execution issues, and backoff limits"
+   tags: [Kubernetes, Job, CronJob, batch-processing, scheduling, advanced]
+
+ state: {}
+
+ trigger: |
+   Your data pipeline CronJobs are failing silently — reports aren't
+   being generated but nobody noticed for 3 days:
+
+   $ kubectl get cronjobs -n data
+   NAME            SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
+   daily-report    0 2 * * *   False     1        3d              30d
+   hourly-etl      0 * * * *   False     3        5m              30d
+   weekly-cleanup  0 0 * * 0   True      0        7d              30d
+
+   Issues discovered:
+
+   1. daily-report — last scheduled 3 days ago, no recent Jobs:
+      $ kubectl get jobs -n data --sort-by=.status.startTime | grep daily-report
+      NAME                    COMPLETIONS   DURATION   AGE
+      daily-report-28930560   0/1           3d         3d
+
+      The Job is still running (stuck) from 3 days ago. The CronJob's
+      concurrencyPolicy is "Forbid", so no new Jobs are created while the
+      old one is active. The Job's pod is stuck in Init:0/1 waiting for
+      a database migration init container that will never complete.
+
+   2. hourly-etl — 3 active Jobs running concurrently:
+      concurrencyPolicy is "Allow" (the default), so every hour a new Job
+      starts even if previous ones haven't finished. Jobs are piling up,
+      consuming resources and causing database contention.
+
+   3. weekly-cleanup — SUSPEND=True:
+      Someone suspended it during debugging and forgot to re-enable it.
+
+   Additionally:
+   $ kubectl get jobs -n data | grep -c "0/1"
+   47
+
+   47 incomplete Jobs sitting in the namespace, never cleaned up: history
+     limits only prune finished Jobs, and no activeDeadlineSeconds or
+     ttlSecondsAfterFinished was set to clean them up.
+
+   Task: Explain Job and CronJob troubleshooting. Write: how Jobs work
+   (completions, parallelism, backoffLimit), CronJob scheduling (schedule
+   syntax, concurrencyPolicy, startingDeadlineSeconds), common failure
+   modes (stuck jobs blocking schedules, missed schedules, resource
+   accumulation), cleanup and history limits, and monitoring Jobs
+   effectively.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Job mechanics are explained — completions: how many pod completions needed, parallelism: how many pods run simultaneously, backoffLimit: retries before marking as Failed (default 6), activeDeadlineSeconds: maximum runtime. Jobs create pods that run to completion. Failed pods are retried with exponential backoff. A Job stuck in active state blocks CronJobs with Forbid concurrency"
+     weight: 0.35
+     description: "Job mechanics"
+   - type: llm_judge
+     criteria: "CronJob-specific issues are covered — concurrencyPolicy: Allow (multiple jobs simultaneously, risk of pile-up), Forbid (skip if previous still running, risk of stuck blocking), Replace (kill previous, start new). startingDeadlineSeconds: if a schedule is missed by more than this, skip it. successfulJobsHistoryLimit and failedJobsHistoryLimit control cleanup (default 3/1). Suspended field pauses scheduling. Always check for suspended CronJobs during debugging"
+     weight: 0.35
+     description: "CronJob issues"
+   - type: llm_judge
+     criteria: "Debugging and monitoring are practical — check CronJob last schedule time, list active Jobs, check Job pod status and logs. For stuck Jobs: delete the Job or set activeDeadlineSeconds. Monitor: alert on CronJobs that haven't run within expected window, alert on failed Job count, use ttlSecondsAfterFinished (K8s 1.23+) for automatic cleanup. Set appropriate history limits to prevent namespace resource accumulation"
+     weight: 0.30
+     description: "Debugging and monitoring"
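The knobs discussed in the criteria above fit into a single CronJob spec. A sketch of a defensively configured `daily-report`; the schedule matches the scenario, while the image and the timeout values are illustrative assumptions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-report
  namespace: data
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid        # skip a run while the previous Job is active
  startingDeadlineSeconds: 3600    # give up on a missed run after 1h
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 3                 # pod retries before the Job is marked Failed
      activeDeadlineSeconds: 7200     # kill stuck Jobs so Forbid can't block forever
      ttlSecondsAfterFinished: 86400  # auto-delete finished Jobs after a day
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: registry.example.com/report:latest  # placeholder
```

`activeDeadlineSeconds` plus `ttlSecondsAfterFinished` together prevent both failure modes in the scenario: a hung Job blocking the Forbid schedule, and dead Jobs accumulating in the namespace.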
@@ -0,0 +1,64 @@
+ meta:
+   id: monitoring-alerting-gaps
+   level: 3
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug monitoring and alerting gaps — diagnose why Prometheus isn't scraping metrics, Grafana dashboards show no data, and alerts don't fire"
+   tags: [Kubernetes, Prometheus, Grafana, monitoring, alerting, advanced]
+
+ state: {}
+
+ trigger: |
+   Your team deployed a new service but it doesn't appear in Grafana
+   dashboards and no alerts are configured for it:
+
+   $ kubectl get pods -n monitoring
+   NAME                    READY   STATUS    RESTARTS   AGE
+   prometheus-server-0     2/2     Running   0          30d
+   grafana-7a8b9c0d-e1f2   1/1     Running   0          30d
+   alertmanager-0          1/1     Running   0          30d
+
+   Prometheus is running but not scraping the new service:
+
+   The Prometheus targets UI shows the new service is not listed.
+
+   $ kubectl get servicemonitor -n app
+   No resources found in app namespace.
+
+   The team expected Prometheus to auto-discover the service, but:
+   1. No ServiceMonitor resource was created for the new service
+   2. The Prometheus instance is configured to only watch the "monitoring"
+      namespace for ServiceMonitors (serviceMonitorNamespaceSelector)
+   3. The application exposes metrics at /metrics on port 9090, but the
+      ServiceMonitor the team drafted targets port 8080
+
+   Additionally, existing alerts aren't firing during outages:
+
+   $ kubectl get prometheusrules -n monitoring
+   NAME            AGE
+   default-rules   30d
+
+   The alert rules exist but Alertmanager isn't sending notifications:
+   - The Alertmanager config has a Slack webhook URL that expired
+   - Alert routing doesn't match the team's namespace labels
+   - Inhibition rules (inhibit_rules) are suppressing lower-severity alerts
+
+   Task: Explain Kubernetes monitoring and alerting setup. Write: how
+   Prometheus discovers targets (ServiceMonitor, PodMonitor, annotations),
+   how to debug missing metrics (targets page, scrape config), how
+   Grafana connects to Prometheus, the alert pipeline (PrometheusRule →
+   Alertmanager → notification channel), and common monitoring gaps.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Prometheus service discovery is explained — ServiceMonitor CRD tells Prometheus which Services to scrape (port, path, interval). PodMonitor for pod-level scraping. Prometheus must be configured to watch the namespace where ServiceMonitors exist (serviceMonitorNamespaceSelector). Legacy method: annotations (prometheus.io/scrape=true). Targets page shows what's being scraped and any errors"
+     weight: 0.35
+     description: "Prometheus service discovery"
+   - type: llm_judge
+     criteria: "Alert pipeline is explained — PrometheusRule defines alerting conditions (PromQL expressions with for duration), Prometheus evaluates rules and sends firing alerts to Alertmanager, Alertmanager routes alerts to receivers (Slack, PagerDuty, email) based on labels, inhibitRules can suppress alerts. Debug: check Prometheus /alerts page, Alertmanager UI for silences and routing, verify receiver config"
+     weight: 0.35
+     description: "Alert pipeline"
+   - type: llm_judge
+     criteria: "Common gaps and fixes are covered — missing ServiceMonitor for new services (should be part of deployment template), Prometheus not watching correct namespaces, wrong port/path in ServiceMonitor, Alertmanager webhook URLs expiring, alert routing not matching labels, no alerts defined for new services. Fix: include ServiceMonitor in Helm charts, use namespace-wide selectors, test alerts regularly"
+     weight: 0.30
+     description: "Common gaps and fixes"
+ description: "Common gaps and fixes"