dojo.md 0.2.0 → 0.2.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/courses/GENERATION_LOG.md +45 -0
- package/courses/aws-lambda-debugging/course.yaml +11 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/api-gateway-integration.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/cloudwatch-logs-basics.yaml +64 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/cold-start-basics.yaml +70 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/environment-variable-issues.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/first-debugging-shift.yaml +73 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/handler-import-errors.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/iam-permission-errors.yaml +68 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/invocation-errors.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/lambda-timeout-errors.yaml +65 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/memory-and-oom.yaml +70 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/async-invocation-failures.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/cold-start-optimization.yaml +76 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/dynamodb-streams-debugging.yaml +70 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/intermediate-debugging-shift.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/lambda-concurrency-management.yaml +70 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/lambda-layers-debugging.yaml +76 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/sam-local-debugging.yaml +74 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/sqs-event-source.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/vpc-networking-issues.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/xray-tracing.yaml +62 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/advanced-debugging-shift.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/container-image-lambda.yaml +79 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/cross-account-invocation.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/eventbridge-patterns.yaml +79 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/iac-deployment-debugging.yaml +68 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/kinesis-stream-processing.yaml +64 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/lambda-at-edge.yaml +64 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/lambda-extensions-debugging.yaml +67 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/powertools-observability.yaml +79 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/step-functions-debugging.yaml +80 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/cost-optimization-strategy.yaml +67 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/expert-debugging-shift.yaml +62 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/incident-management-serverless.yaml +61 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/multi-region-serverless.yaml +67 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/observability-platform-design.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-architecture-design.yaml +64 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-data-architecture.yaml +66 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-migration-strategy.yaml +65 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-security-design.yaml +60 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-testing-strategy.yaml +62 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/board-serverless-strategy.yaml +63 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/consulting-serverless-adoption.yaml +57 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/industry-serverless-patterns.yaml +62 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/ma-serverless-integration.yaml +75 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/master-debugging-shift.yaml +61 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/organizational-serverless-transformation.yaml +65 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/regulatory-serverless.yaml +61 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/serverless-economics.yaml +65 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/serverless-future-technology.yaml +66 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/serverless-platform-design.yaml +71 -0
- package/courses/docker-container-debugging/course.yaml +11 -0
- package/courses/docker-container-debugging/scenarios/level-1/container-exit-codes.yaml +59 -0
- package/courses/docker-container-debugging/scenarios/level-1/container-networking-basics.yaml +69 -0
- package/courses/docker-container-debugging/scenarios/level-1/docker-logs-debugging.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-1/dockerfile-build-failures.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-1/environment-variable-issues.yaml +74 -0
- package/courses/docker-container-debugging/scenarios/level-1/first-debugging-shift.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-1/image-pull-failures.yaml +68 -0
- package/courses/docker-container-debugging/scenarios/level-1/port-mapping-issues.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-1/resource-limits-oom.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-1/volume-mount-problems.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-2/container-health-checks.yaml +73 -0
- package/courses/docker-container-debugging/scenarios/level-2/docker-compose-debugging.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-2/docker-exec-debugging.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-2/image-layer-optimization.yaml +81 -0
- package/courses/docker-container-debugging/scenarios/level-2/intermediate-debugging-shift.yaml +73 -0
- package/courses/docker-container-debugging/scenarios/level-2/logging-and-log-rotation.yaml +76 -0
- package/courses/docker-container-debugging/scenarios/level-2/multi-stage-build-debugging.yaml +76 -0
- package/courses/docker-container-debugging/scenarios/level-2/network-debugging-tools.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-2/pid1-signal-handling.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-2/security-scanning-basics.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-3/advanced-debugging-shift.yaml +77 -0
- package/courses/docker-container-debugging/scenarios/level-3/buildkit-optimization.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-3/container-filesystem-debugging.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-3/container-security-hardening.yaml +74 -0
- package/courses/docker-container-debugging/scenarios/level-3/disk-space-management.yaml +74 -0
- package/courses/docker-container-debugging/scenarios/level-3/docker-api-automation.yaml +72 -0
- package/courses/docker-container-debugging/scenarios/level-3/docker-daemon-issues.yaml +73 -0
- package/courses/docker-container-debugging/scenarios/level-3/docker-in-docker-ci.yaml +69 -0
- package/courses/docker-container-debugging/scenarios/level-3/overlay-network-debugging.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-3/production-container-ops.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-4/cicd-pipeline-design.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-4/container-monitoring-observability.yaml +63 -0
- package/courses/docker-container-debugging/scenarios/level-4/container-orchestration-strategy.yaml +62 -0
- package/courses/docker-container-debugging/scenarios/level-4/container-performance-engineering.yaml +64 -0
- package/courses/docker-container-debugging/scenarios/level-4/container-security-architecture.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-4/enterprise-image-management.yaml +58 -0
- package/courses/docker-container-debugging/scenarios/level-4/expert-debugging-shift.yaml +63 -0
- package/courses/docker-container-debugging/scenarios/level-4/incident-response-containers.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-4/multi-environment-management.yaml +65 -0
- package/courses/docker-container-debugging/scenarios/level-4/stateful-service-containers.yaml +65 -0
- package/courses/docker-container-debugging/scenarios/level-5/board-infrastructure-strategy.yaml +58 -0
- package/courses/docker-container-debugging/scenarios/level-5/consulting-container-strategy.yaml +61 -0
- package/courses/docker-container-debugging/scenarios/level-5/container-platform-architecture.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-5/container-platform-economics.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-5/container-technology-evolution.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-5/disaster-recovery-containers.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-5/industry-container-patterns.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-5/master-debugging-shift.yaml +62 -0
- package/courses/docker-container-debugging/scenarios/level-5/organizational-transformation.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-5/regulatory-compliance-containers.yaml +61 -0
- package/courses/kubernetes-deployment-troubleshooting/course.yaml +12 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/configmap-secret-issues.yaml +69 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/crashloopbackoff.yaml +68 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/deployment-rollout.yaml +56 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/first-troubleshooting-shift.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/health-probe-failures.yaml +70 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/imagepullbackoff.yaml +57 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/kubectl-debugging-basics.yaml +56 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/oomkilled.yaml +70 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/pending-pods.yaml +68 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/service-not-reachable.yaml +66 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/dns-resolution-failures.yaml +63 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/helm-deployment-failures.yaml +63 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/hpa-scaling-issues.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/ingress-routing-issues.yaml +63 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/init-container-failures.yaml +63 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/intermediate-troubleshooting-shift.yaml +66 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/network-policy-blocking.yaml +67 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/persistent-volume-issues.yaml +69 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/rbac-permission-denied.yaml +57 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/resource-quota-limits.yaml +64 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/advanced-troubleshooting-shift.yaml +69 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/cluster-upgrade-failures.yaml +71 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/gitops-drift-detection.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/job-cronjob-failures.yaml +67 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/monitoring-alerting-gaps.yaml +64 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/multi-container-debugging.yaml +68 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/node-pressure-evictions.yaml +70 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/pod-disruption-budgets.yaml +59 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/service-mesh-debugging.yaml +64 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/statefulset-troubleshooting.yaml +69 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/capacity-planning.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/cost-optimization.yaml +57 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/disaster-recovery-design.yaml +56 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/executive-communication.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/expert-troubleshooting-shift.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/incident-management-process.yaml +59 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-cluster-operations.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-tenancy-design.yaml +55 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/platform-engineering.yaml +59 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/security-hardening.yaml +58 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/behavioral-science.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/board-strategy.yaml +61 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/cloud-native-future.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/comprehensive-platform.yaml +57 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/consulting-engagement.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/industry-benchmarks.yaml +58 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/ma-integration.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/master-troubleshooting-shift.yaml +73 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/product-development.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/regulatory-compliance.yaml +76 -0
- package/courses/mysql-query-optimization/course.yaml +11 -0
- package/courses/mysql-query-optimization/scenarios/level-1/buffer-pool-basics.yaml +65 -0
- package/courses/mysql-query-optimization/scenarios/level-1/explain-basics.yaml +66 -0
- package/courses/mysql-query-optimization/scenarios/level-1/first-optimization-shift.yaml +78 -0
- package/courses/mysql-query-optimization/scenarios/level-1/innodb-index-fundamentals.yaml +68 -0
- package/courses/mysql-query-optimization/scenarios/level-1/join-basics.yaml +66 -0
- package/courses/mysql-query-optimization/scenarios/level-1/n-plus-one-queries.yaml +67 -0
- package/courses/mysql-query-optimization/scenarios/level-1/query-rewriting-basics.yaml +66 -0
- package/courses/mysql-query-optimization/scenarios/level-1/select-star-problems.yaml +68 -0
- package/courses/mysql-query-optimization/scenarios/level-1/slow-query-diagnosis.yaml +65 -0
- package/courses/mysql-query-optimization/scenarios/level-1/where-clause-optimization.yaml +65 -0
- package/courses/mysql-query-optimization/scenarios/level-2/buffer-pool-tuning.yaml +64 -0
- package/courses/mysql-query-optimization/scenarios/level-2/composite-index-design.yaml +71 -0
- package/courses/mysql-query-optimization/scenarios/level-2/covering-and-invisible-indexes.yaml +69 -0
- package/courses/mysql-query-optimization/scenarios/level-2/cte-and-window-functions.yaml +78 -0
- package/courses/mysql-query-optimization/scenarios/level-2/intermediate-optimization-shift.yaml +68 -0
- package/courses/mysql-query-optimization/scenarios/level-2/join-optimization.yaml +67 -0
- package/courses/mysql-query-optimization/scenarios/level-2/performance-schema-analysis.yaml +69 -0
- package/courses/mysql-query-optimization/scenarios/level-2/query-optimizer-hints.yaml +74 -0
- package/courses/mysql-query-optimization/scenarios/level-2/subquery-optimization.yaml +70 -0
- package/courses/mysql-query-optimization/scenarios/level-2/write-optimization.yaml +63 -0
- package/courses/mysql-query-optimization/scenarios/level-3/advanced-optimization-shift.yaml +71 -0
- package/courses/mysql-query-optimization/scenarios/level-3/connection-management.yaml +67 -0
- package/courses/mysql-query-optimization/scenarios/level-3/full-text-search.yaml +77 -0
- package/courses/mysql-query-optimization/scenarios/level-3/json-optimization.yaml +87 -0
- package/courses/mysql-query-optimization/scenarios/level-3/lock-contention-analysis.yaml +68 -0
- package/courses/mysql-query-optimization/scenarios/level-3/monitoring-alerting.yaml +63 -0
- package/courses/mysql-query-optimization/scenarios/level-3/online-schema-changes.yaml +79 -0
- package/courses/mysql-query-optimization/scenarios/level-3/partitioning-strategies.yaml +83 -0
- package/courses/mysql-query-optimization/scenarios/level-3/query-profiling-deep-dive.yaml +84 -0
- package/courses/mysql-query-optimization/scenarios/level-3/replication-optimization.yaml +66 -0
- package/courses/mysql-query-optimization/scenarios/level-4/aurora-vs-rds-evaluation.yaml +61 -0
- package/courses/mysql-query-optimization/scenarios/level-4/data-architecture.yaml +62 -0
- package/courses/mysql-query-optimization/scenarios/level-4/database-migration-planning.yaml +59 -0
- package/courses/mysql-query-optimization/scenarios/level-4/enterprise-governance.yaml +50 -0
- package/courses/mysql-query-optimization/scenarios/level-4/executive-communication.yaml +54 -0
- package/courses/mysql-query-optimization/scenarios/level-4/expert-optimization-shift.yaml +67 -0
- package/courses/mysql-query-optimization/scenarios/level-4/high-availability-architecture.yaml +60 -0
- package/courses/mysql-query-optimization/scenarios/level-4/optimizer-internals.yaml +62 -0
- package/courses/mysql-query-optimization/scenarios/level-4/performance-sla-design.yaml +52 -0
- package/courses/mysql-query-optimization/scenarios/level-4/read-replica-scaling.yaml +51 -0
- package/courses/mysql-query-optimization/scenarios/level-5/ai-database-future.yaml +45 -0
- package/courses/mysql-query-optimization/scenarios/level-5/behavioral-science.yaml +44 -0
- package/courses/mysql-query-optimization/scenarios/level-5/benchmark-design.yaml +47 -0
- package/courses/mysql-query-optimization/scenarios/level-5/board-strategy.yaml +48 -0
- package/courses/mysql-query-optimization/scenarios/level-5/comprehensive-platform.yaml +49 -0
- package/courses/mysql-query-optimization/scenarios/level-5/consulting-engagement.yaml +52 -0
- package/courses/mysql-query-optimization/scenarios/level-5/ma-database-integration.yaml +47 -0
- package/courses/mysql-query-optimization/scenarios/level-5/master-optimization-shift.yaml +56 -0
- package/courses/mysql-query-optimization/scenarios/level-5/product-development.yaml +48 -0
- package/courses/mysql-query-optimization/scenarios/level-5/regulatory-compliance.yaml +48 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/comprehensive-database-system.yaml +70 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-ai-future.yaml +81 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-behavioral-science.yaml +63 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-board-strategy.yaml +77 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-consulting-engagement.yaml +61 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-industry-benchmarks.yaml +64 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-ma-integration.yaml +71 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-product-development.yaml +72 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-regulatory-landscape.yaml +76 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/master-optimization-shift.yaml +66 -0
- package/courses/terraform-infrastructure-setup/course.yaml +11 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-1/terraform-init-errors.yaml +72 -0
- package/dist/mcp/session-manager.d.ts +7 -4
- package/dist/mcp/session-manager.d.ts.map +1 -1
- package/dist/mcp/session-manager.js +23 -8
- package/dist/mcp/session-manager.js.map +1 -1
- package/package.json +1 -1
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/kubectl-debugging-basics.yaml
ADDED

@@ -0,0 +1,56 @@
+meta:
+  id: kubectl-debugging-basics
+  level: 1
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Learn essential kubectl debugging commands — logs, describe, exec, port-forward, and events for systematic troubleshooting"
+  tags: [Kubernetes, kubectl, debugging, logs, describe, exec, beginner]
+
+state: {}
+
+trigger: |
+  A teammate asks for help debugging a misbehaving pod. You need to
+  walk them through the essential kubectl debugging toolkit.
+
+  Here's what they see:
+
+  $ kubectl get pods -n production
+  NAME                        READY   STATUS    RESTARTS   AGE
+  inventory-svc-6f7g8h9i-j0   1/1     Running   3          45m
+  inventory-svc-6f7g8h9i-k1   1/1     Running   0          45m
+
+  One pod has 3 restarts, the other has 0. Both show Running now.
+
+  They need to figure out:
+  1. Why did the first pod restart 3 times?
+  2. What's different about the two pods?
+  3. Is the application actually healthy?
+
+  Key commands to teach:
+  - kubectl describe pod: events, conditions, container state, restart reason
+  - kubectl logs: current and previous container output
+  - kubectl logs --previous: see logs from the crashed container
+  - kubectl exec: run commands inside the container
+  - kubectl port-forward: test the application locally
+  - kubectl get events: cluster-wide event timeline
+  - kubectl top pod: resource usage (requires metrics-server)
+
+  Task: Explain the essential kubectl debugging workflow. Write: the
+  purpose and usage of each debugging command (describe, logs, logs
+  --previous, exec, port-forward, events, top), the order to use them
+  for systematic debugging, what information each command reveals, and
+  how to debug multi-container pods (using the -c flag).
+
+assertions:
+  - type: llm_judge
+    criteria: "Core debugging commands are explained with purpose — kubectl describe pod shows events/conditions/state (first command to run), kubectl logs shows stdout/stderr output, kubectl logs --previous shows logs from the crashed container instance, kubectl exec -it allows interactive debugging inside the container, kubectl port-forward tunnels to test the app locally, kubectl get events shows cluster-wide timeline"
+    weight: 0.35
+    description: "Core debugging commands"
+  - type: llm_judge
+    criteria: "Systematic debugging order is presented — recommended flow: (1) kubectl get pods to see status/restarts, (2) kubectl describe pod for events and exit codes, (3) kubectl logs / logs --previous for application errors, (4) kubectl exec to inspect container state (env vars, files, connectivity), (5) kubectl top pod for resource usage. For multi-container pods: use -c <container-name> flag"
+    weight: 0.35
+    description: "Systematic debugging order"
+  - type: llm_judge
+    criteria: "Practical usage patterns are shown — how to stream logs (kubectl logs -f), follow multiple pods (kubectl logs -l app=inventory-svc), filter events by type (kubectl get events --field-selector type=Warning), test connectivity from inside a pod (kubectl exec -- curl, nslookup), and use ephemeral debug containers (kubectl debug) for distroless images that lack shell access"
+    weight: 0.30
+    description: "Practical usage patterns"
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/oomkilled.yaml
ADDED

@@ -0,0 +1,70 @@
+meta:
+  id: oomkilled
+  level: 1
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Debug OOMKilled pods — understand memory limits, diagnose memory leaks, and configure appropriate resource limits"
+  tags: [Kubernetes, OOMKilled, memory, resource-limits, debugging, beginner]
+
+state: {}
+
+trigger: |
+  Your Java application pod keeps getting OOMKilled every few hours:
+
+  $ kubectl get pods
+  NAME                     READY   STATUS    RESTARTS   AGE
+  order-svc-8a9b0c1d-xyz   1/1     Running   4          6h
+
+  $ kubectl describe pod order-svc-8a9b0c1d-xyz
+  Last State:   Terminated
+    Reason:     OOMKilled
+    Exit Code:  137
+    Started:    2025-12-01T10:00:00Z
+    Finished:   2025-12-01T11:30:00Z
+
+  Containers:
+    order-svc:
+      Limits:
+        memory: 512Mi
+      Requests:
+        memory: 256Mi
+
+  $ kubectl top pod order-svc-8a9b0c1d-xyz
+  NAME                     CPU(cores)   MEMORY(bytes)
+  order-svc-8a9b0c1d-xyz   150m         498Mi
+
+  The pod is using 498Mi of its 512Mi limit — about to be killed again.
+
+  The application is a Java Spring Boot service with:
+  - JVM heap: not explicitly configured (defaults to 25% of container RAM)
+  - No -XX:MaxRAMPercentage set
+  - Container limit: 512Mi
+  - JVM sees 512Mi and sets max heap to ~128Mi
+  - But JVM total memory (heap + metaspace + threads + native) exceeds 512Mi
+
+  Questions:
+  1. What does OOMKilled mean and why is exit code 137?
+  2. Why does the JVM exceed the container memory limit?
+  3. How should you configure JVM memory in containers?
+  4. How do Kubernetes memory requests vs limits work?
+  5. How to monitor memory to prevent OOMKilled?
+
+  Task: Explain OOMKilled and how to fix it. Write: what OOMKilled
+  means (kernel OOM killer, exit code 137), why JVM apps often get
+  OOMKilled in containers, the correct JVM memory configuration for
+  containers, how requests and limits work (QoS classes), and memory
+  monitoring approach.
+
+assertions:
+  - type: llm_judge
+    criteria: "OOMKilled is correctly explained — the Linux kernel's OOM killer terminates the process when it exceeds cgroup memory limit. Exit code 137 = 128 + 9 (SIGKILL). This is different from the application running out of heap space (which throws OutOfMemoryError). Kubernetes sets the cgroup limit based on the pod's memory limit"
+    weight: 0.35
+    description: "OOMKilled explained"
+  - type: llm_judge
+    criteria: "JVM container issue is addressed — JVM total memory = heap + metaspace + thread stacks + direct buffers + native memory. Even if heap is within limit, total can exceed. Fix: set -XX:MaxRAMPercentage=75 (leave 25% for non-heap), or explicitly set -Xmx to 75% of container limit. Modern JVMs (11+) are container-aware but still need tuning"
+    weight: 0.35
+    description: "JVM container memory"
+  - type: llm_judge
+    criteria: "Requests, limits, and QoS are explained — requests: minimum guaranteed memory (used for scheduling), limits: maximum allowed (enforced by cgroup). QoS classes: Guaranteed (requests=limits), Burstable (requests<limits), BestEffort (no requests/limits). Guaranteed pods are last to be evicted. Monitoring: kubectl top, Prometheus metrics, set up alerts before hitting limits"
+    weight: 0.30
+    description: "Requests, limits, QoS"
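The arithmetic behind this scenario is worth making explicit. A minimal Python sketch, using the scenario's own numbers (512Mi limit, the JVM's default 25% MaxRAMPercentage, and the commonly recommended 75%); the function names are illustrative, not an API:

```python
# Sketch of the JVM-in-container memory math from the oomkilled scenario.

CONTAINER_LIMIT_MI = 512  # pod memory limit from the scenario

def default_max_heap(limit_mi: int) -> int:
    # Container-aware JVMs default MaxRAMPercentage to 25%
    return limit_mi * 25 // 100

def recommended_max_heap(limit_mi: int, ram_percentage: int = 75) -> int:
    # Leave the remainder for metaspace, thread stacks, and native memory
    return limit_mi * ram_percentage // 100

# Exit code 137 = 128 + signal number 9 (SIGKILL from the kernel OOM killer)
OOM_EXIT_CODE = 128 + 9

print(default_max_heap(CONTAINER_LIMIT_MI))      # 128, the ~128Mi heap the scenario mentions
print(recommended_max_heap(CONTAINER_LIMIT_MI))  # 384, with 128Mi left for non-heap memory
print(OOM_EXIT_CODE)                             # 137
```

The point the assertions grade: the heap can be fine while heap plus metaspace, thread stacks, and native memory still crosses the cgroup limit, so the kernel, not the JVM, kills the process.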
@@ -0,0 +1,68 @@
|
|
|
1
|
+
meta:
|
|
2
|
+
id: pending-pods
|
|
3
|
+
level: 1
|
|
4
|
+
course: kubernetes-deployment-troubleshooting
|
|
5
|
+
type: output
|
|
6
|
+
description: "Debug Pending pods — diagnose why pods won't schedule, from insufficient resources to node affinity mismatches"
|
|
7
|
+
tags: [Kubernetes, Pending, scheduling, resources, node-affinity, beginner]
|
|
8
|
+
|
|
9
|
+
state: {}
|
|
10
|
+
|
|
11
|
+
trigger: |
|
|
12
|
+
You deployed a new service but the pods are stuck in Pending state
|
|
13
|
+
for over 10 minutes:
|
|
14
|
+
|
|
15
|
+
$ kubectl get pods
|
|
16
|
+
NAME READY STATUS RESTARTS AGE
|
|
17
|
+
analytics-svc-9d8e7f6g-a1 0/1 Pending 0 10m
|
|
18
|
+
analytics-svc-9d8e7f6g-b2 0/1 Pending 0 10m
|
|
19
|
+
  analytics-svc-9d8e7f6g-c3   0/1   Pending   0   10m

  $ kubectl describe pod analytics-svc-9d8e7f6g-a1
  Events:
    Warning FailedScheduling 10m default-scheduler
      0/5 nodes are available: 2 Insufficient cpu, 3 node(s) had
      untolerated taint {dedicated: gpu-workload}

  The deployment requests:
    resources:
      requests:
        cpu: "4"
        memory: "8Gi"

  Cluster nodes:
  - node-1: 8 CPU, 32GB — 6.5 CPU already allocated
  - node-2: 8 CPU, 32GB — 7 CPU already allocated
  - node-3: 16 CPU, 64GB — taint: dedicated=gpu-workload:NoSchedule
  - node-4: 16 CPU, 64GB — taint: dedicated=gpu-workload:NoSchedule
  - node-5: 16 CPU, 64GB — taint: dedicated=gpu-workload:NoSchedule

  So: nodes 1-2 don't have enough CPU available, and nodes 3-5 have
  taints that the pod doesn't tolerate.

  Common Pending causes:
  1. Insufficient CPU or memory on available nodes
  2. Node taints without matching tolerations
  3. Node affinity/anti-affinity rules not satisfied
  4. PVC not bound (waiting for volume)
  5. ResourceQuota exceeded for namespace
  6. Too many pods (max pods per node reached)

  Task: Explain how to debug Pending pods. Write: what Pending means
  (the scheduler can't place the pod), how to read the FailedScheduling
  event message, common causes and their fixes, how taints and
  tolerations work, and how to check cluster capacity.

assertions:
  - type: llm_judge
    criteria: "Pending state is explained — the pod is in the scheduling queue but no node satisfies all constraints (resource requests, taints, affinity rules, PVC availability). The FailedScheduling event message tells you exactly why — it lists how many nodes failed each check"
    weight: 0.35
    description: "Pending state explained"
  - type: llm_judge
    criteria: "Taints and tolerations are explained — taints on nodes repel pods unless the pod has a matching toleration. NoSchedule prevents scheduling, PreferNoSchedule is soft, NoExecute evicts existing pods. In this case, nodes 3-5 have dedicated=gpu-workload:NoSchedule, so non-GPU pods can't schedule there"
    weight: 0.35
    description: "Taints and tolerations"
  - type: llm_judge
    criteria: "Fixes are practical — options: reduce resource requests, add capacity (new nodes), add a toleration to the pod spec (if appropriate), check if allocated resources can be reclaimed (pods using less than requested), or use the cluster autoscaler. Shows kubectl commands to check node capacity (kubectl describe node, kubectl top node)"
    weight: 0.30
    description: "Practical fixes"
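The two spec-level fixes named in the assertions can be sketched as pod-spec fragments. This is illustrative only: the reduced CPU value and the choice between the two options are assumptions, not part of the scenario.

```yaml
# Option A (sketch): shrink the request so nodes 1-2 can fit the pod.
# The "1" is an illustrative value; node-1 has only 1.5 CPU free.
resources:
  requests:
    cpu: "1"
    memory: "8Gi"

# Option B (sketch): tolerate the taint. Appropriate only if this
# workload is actually meant to share the dedicated GPU nodes.
tolerations:
  - key: dedicated
    operator: Equal
    value: gpu-workload
    effect: NoSchedule
```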
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/service-not-reachable.yaml
ADDED
@@ -0,0 +1,66 @@

meta:
  id: service-not-reachable
  level: 1
  course: kubernetes-deployment-troubleshooting
  type: output
  description: "Debug Service connectivity — diagnose why a Kubernetes Service can't reach its backend pods"
  tags: [Kubernetes, Service, networking, endpoints, DNS, beginner]

state: {}

trigger: |
  Your frontend app can't connect to the backend API service. The
  frontend logs show:

  Error: connect ECONNREFUSED api-service:8080
  Error: getaddrinfo ENOTFOUND api-service

  Both the frontend and backend are running in the same namespace.

  $ kubectl get pods -n app
  NAME                      READY   STATUS    RESTARTS   AGE
  frontend-6a7b8c9d-p1      1/1     Running   0          10m
  backend-api-3e4f5g6h-q2   1/1     Running   0          10m

  $ kubectl get svc -n app
  NAME          TYPE        CLUSTER-IP    PORT(S)   AGE
  api-service   ClusterIP   10.96.45.12   80/TCP    10m

  $ kubectl get endpoints api-service -n app
  NAME          ENDPOINTS   AGE
  api-service   <none>      10m

  The endpoints are empty! The Service exists and the pod is Running,
  but they're not connected.

  Investigation:
  - Service selector: app=backend-api
  - Pod labels: app=api-backend (label mismatch!)
  - The Service uses port 80 but the container listens on 8080
    (targetPort should be 8080)

  Two problems:
  1. Label selector mismatch: Service looks for app=backend-api but
     pods have app=api-backend
  2. Port mismatch: Service port 80 → targetPort not set (defaults
     to 80, but container listens on 8080)

  Task: Explain how Kubernetes Services connect to pods. Write: how
  selectors match pods to Services (label-based), how to verify
  endpoints, the port vs targetPort distinction, common Service
  connectivity issues, and how to test connectivity from inside the
  cluster.

assertions:
  - type: llm_judge
    criteria: "Service-to-pod connection is explained — Services use label selectors to find matching pods and create endpoints. If no pods match the selector, endpoints list is empty and the Service has nothing to route to. The selector must exactly match the pod labels"
    weight: 0.35
    description: "Service-pod connection"
  - type: llm_judge
    criteria: "Debugging steps identify both issues — check endpoints (kubectl get endpoints), compare Service selector with pod labels (kubectl get svc -o yaml vs kubectl get pod --show-labels), verify port configuration (Service port vs container port vs targetPort). Shows how to test with kubectl exec and curl/wget from within the cluster"
    weight: 0.35
    description: "Debugging identifies issues"
  - type: llm_judge
    criteria: "Port configuration is clearly explained — Service port: what clients connect to, targetPort: the container port to forward to (defaults to port if not specified), containerPort: informational only (doesn't actually restrict access). DNS resolution: <service-name>.<namespace>.svc.cluster.local"
    weight: 0.30
    description: "Port configuration explained"
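Both problems land in the Service manifest. A minimal sketch of the corrected resource, with the selector and targetPort values taken from the scenario's investigation notes:

```yaml
# Sketch of the fixed Service: selector now matches the pod labels,
# and targetPort forwards to the container's actual listening port.
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: app
spec:
  selector:
    app: api-backend    # must exactly match the pod labels
  ports:
    - port: 80          # what clients connect to
      targetPort: 8080  # the port the container listens on
```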
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/dns-resolution-failures.yaml
ADDED
@@ -0,0 +1,63 @@

meta:
  id: dns-resolution-failures
  level: 2
  course: kubernetes-deployment-troubleshooting
  type: output
  description: "Debug DNS resolution failures — diagnose CoreDNS issues, service discovery problems, and cross-namespace resolution"
  tags: [Kubernetes, DNS, CoreDNS, service-discovery, networking, intermediate]

state: {}

trigger: |
  Your microservices can't discover each other. Applications report DNS
  resolution failures:

  $ kubectl exec -it frontend-pod -- nslookup api-service
  ;; connection timed out; no servers could be reached

  $ kubectl exec -it frontend-pod -- cat /etc/resolv.conf
  nameserver 10.96.0.10
  search default.svc.cluster.local svc.cluster.local cluster.local
  options ndots:5

  $ kubectl get pods -n kube-system -l k8s-app=kube-dns
  NAME                       READY   STATUS             RESTARTS   AGE
  coredns-5d78c9869d-abc12   0/1     CrashLoopBackOff   8          1h
  coredns-5d78c9869d-def34   0/1     CrashLoopBackOff   8          1h

  Both CoreDNS pods are crashing! Without DNS, no service can resolve
  any other service by name.

  $ kubectl logs coredns-5d78c9869d-abc12 -n kube-system
  [FATAL] plugin/loop: Loop detected for zone ".", forwarding to
  "10.96.0.10", aborting.

  CoreDNS detected a DNS loop — it's forwarding queries to itself.
  The node's /etc/resolv.conf points to the cluster DNS IP (10.96.0.10),
  so CoreDNS tries to forward to itself, creating an infinite loop.

  Additionally, even after fixing CoreDNS, a service in the "backend"
  namespace can't be reached from the "frontend" namespace using just
  the service name — cross-namespace resolution requires the full DNS
  name: <service>.<namespace>.svc.cluster.local

  Task: Explain Kubernetes DNS and how to debug resolution failures.
  Write: how CoreDNS provides service discovery, the DNS name format
  (<svc>.<ns>.svc.cluster.local), the search domains in resolv.conf
  and ndots setting, common CoreDNS failures (loop detection, resource
  exhaustion, network policy blocking), cross-namespace resolution,
  and how to test DNS from inside pods.

assertions:
  - type: llm_judge
    criteria: "Kubernetes DNS architecture is explained — CoreDNS runs as a Deployment in kube-system, provides DNS for all services. DNS format: <service>.<namespace>.svc.cluster.local. The search domains in /etc/resolv.conf allow short names within the same namespace. ndots:5 means names with fewer than 5 dots get the search domains appended first. Cross-namespace requires at least <service>.<namespace>"
    weight: 0.35
    description: "DNS architecture explained"
  - type: llm_judge
    criteria: "CoreDNS failure modes are covered — loop detection (forwarding to itself, common when node resolv.conf points to cluster DNS), OOMKilled (too many DNS queries), CrashLoopBackOff (configuration errors), NetworkPolicy blocking DNS traffic on port 53. The loop detection fix: edit CoreDNS ConfigMap to forward to upstream DNS (e.g., 8.8.8.8) instead of /etc/resolv.conf"
    weight: 0.35
    description: "CoreDNS failures covered"
  - type: llm_judge
    criteria: "Debugging and testing are practical — test with kubectl exec nslookup/dig, check CoreDNS pods and logs, inspect CoreDNS ConfigMap (kubectl get cm coredns -n kube-system), verify the kube-dns Service exists (kubectl get svc kube-dns -n kube-system), use a debug pod with networking tools (nicolaka/netshoot). Headless services (clusterIP: None) return pod IPs directly instead of cluster IP"
    weight: 0.30
    description: "Debugging and testing"
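The loop fix the scenario points at is a one-line Corefile change: forward to an explicit upstream instead of the node's resolv.conf. A sketch of the relevant stanza (edited via kubectl -n kube-system edit configmap coredns); the surrounding plugin list mirrors a typical default Corefile and the 8.8.8.8/8.8.4.4 upstreams are illustrative, since any resolver outside the cluster breaks the loop:

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa
    forward . 8.8.8.8 8.8.4.4   # was: forward . /etc/resolv.conf
    cache 30
    loop
    reload
}
```

After saving, restart the CoreDNS pods so they pick up the new ConfigMap.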
@@ -0,0 +1,63 @@
meta:
  id: helm-deployment-failures
  level: 2
  course: kubernetes-deployment-troubleshooting
  type: output
  description: "Debug Helm deployment failures — diagnose template rendering errors, stuck releases, and rollback issues"
  tags: [Kubernetes, Helm, deployment, chart, rollback, intermediate]

state: {}

trigger: |
  Your Helm upgrade failed and now the release is stuck in a bad state:

  $ helm upgrade myapp ./chart --values prod-values.yaml
  Error: UPGRADE FAILED: template: chart/templates/deployment.yaml:42:
  function "toJson" not defined

  The chart uses a custom template function that doesn't exist. After
  fixing the template, you try again:

  $ helm upgrade myapp ./chart --values prod-values.yaml
  Error: UPGRADE FAILED: another operation (install/upgrade/rollback)
  is in progress

  The release is stuck in "pending-upgrade" status from the failed attempt:

  $ helm list
  NAME    NAMESPACE   REVISION   STATUS            CHART
  myapp   default     5          pending-upgrade   myapp-2.3.0

  $ helm history myapp
  REVISION   STATUS            CHART         DESCRIPTION
  3          superseded        myapp-2.1.0   Upgrade complete
  4          superseded        myapp-2.2.0   Upgrade complete
  5          pending-upgrade   myapp-2.3.0   Preparing upgrade

  The release is stuck because Helm's operation tracking thinks an
  upgrade is still in progress. You need to rollback first:

  $ helm rollback myapp 4
  Rollback was a success! Happy Helming!

  Now the upgrade can proceed with the fixed template.

  Task: Explain Helm troubleshooting. Write: how Helm manages releases
  (revision history, status tracking), common template errors and how
  to debug them (helm template, helm lint, --dry-run), stuck release
  states and how to fix them, the helm rollback workflow, helm get commands
  for inspecting releases, and best practices for safe Helm deployments.

assertions:
  - type: llm_judge
    criteria: "Helm release management is explained — Helm stores release history as Secrets in the namespace, each upgrade creates a new revision. Release statuses: deployed, pending-upgrade, pending-install, pending-rollback, failed, superseded. A stuck pending state means the previous operation didn't complete cleanly. Fix with helm rollback to last good revision"
    weight: 0.35
    description: "Release management"
  - type: llm_judge
    criteria: "Template debugging is covered — helm template renders templates locally without installing (catches syntax errors), helm lint validates chart structure and values, --dry-run --debug shows rendered manifests before applying. helm get manifest shows what was actually deployed, helm get values shows values used. Common errors: undefined functions, missing values, YAML indentation"
    weight: 0.35
    description: "Template debugging"
  - type: llm_judge
    criteria: "Best practices are practical — always use --dry-run before production upgrades, set --timeout and --wait for reliable status tracking, use --atomic flag (auto-rollback on failure), keep revision history (--history-max), version pin chart dependencies, use helm diff plugin to preview changes before applying"
    weight: 0.30
    description: "Best practices"
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/hpa-scaling-issues.yaml
ADDED
@@ -0,0 +1,62 @@

meta:
  id: hpa-scaling-issues
  level: 2
  course: kubernetes-deployment-troubleshooting
  type: output
  description: "Debug HPA scaling issues — diagnose why autoscaling isn't working, from missing metrics to thrashing"
  tags: [Kubernetes, HPA, autoscaling, metrics-server, scaling, intermediate]

state: {}

trigger: |
  Your HPA isn't scaling despite high CPU usage on your pods:

  $ kubectl get hpa
  NAME      REFERENCE            TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
  web-app   Deployment/web-app   <unknown>/70%   2         10        2          30m

  The TARGETS column shows <unknown> — the HPA can't read metrics.

  $ kubectl describe hpa web-app
  Conditions:
    Type            Status   Reason
    AbleToScale     True     SucceededGetScale
    ScalingActive   False    FailedGetResourceMetric
  Events:
    Warning FailedComputeMetricsReplicas 1m horizontal-pod-autoscaler
    invalid metrics (1 invalid out of 1), first error is: failed to get
    cpu utilization: missing request for cpu in container "web-app"

  Two issues found:
  1. The pods don't have CPU requests set — HPA calculates utilization
     as a percentage of the request, so without requests it can't compute
     utilization
  2. After fixing requests, metrics-server isn't installed:

  $ kubectl top pods
  error: Metrics API not available

  $ kubectl get deployment metrics-server -n kube-system
  Error from server (NotFound): deployments.apps "metrics-server" not found

  After installing metrics-server and setting CPU requests, the HPA
  works but thrashes — scaling up and down rapidly every minute.

  Task: Explain HPA and how to debug scaling issues. Write: how HPA
  works (metrics → desired replicas calculation), why CPU requests are
  required, the metrics-server role, the stabilization window to prevent
  thrashing, custom metrics with the Prometheus adapter, and how to tune
  HPA behavior.

assertions:
  - type: llm_judge
    criteria: "HPA mechanics are explained — HPA queries metrics API every 15s (default), calculates desired replicas as ceil(currentReplicas * (currentMetric/targetMetric)). CPU utilization is percentage of CPU request, so requests MUST be set. metrics-server collects resource metrics from kubelets and exposes them via the Metrics API. Without metrics-server or without requests, HPA shows <unknown>"
    weight: 0.35
    description: "HPA mechanics"
  - type: llm_judge
    criteria: "Debugging steps are systematic — check TARGETS column (unknown = no metrics), kubectl describe hpa for conditions and events, verify metrics-server running, verify pods have resource requests, check if ScalingActive condition is True. Use kubectl top pod to verify metrics are available. Check HPA events for error messages"
    weight: 0.35
    description: "Debugging steps"
  - type: llm_judge
    criteria: "Thrashing prevention and tuning are covered — stabilization window (scaleDown stabilizationWindowSeconds defaults to 300s, prevents rapid scale-down), scaling policies (pods-per-minute or percent-per-minute limits), behavior field in HPA spec for fine-grained control. Custom metrics via Prometheus adapter for application-specific scaling (requests per second, queue depth)"
    weight: 0.30
    description: "Thrashing and tuning"
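The anti-thrashing tuning described in the assertions lives in the autoscaling/v2 behavior field. A sketch against the scenario's web-app HPA; the window and policy numbers are illustrative (300s is the documented scaleDown default), not values the course prescribes:

```yaml
# Sketch: HPA with explicit scale-down damping to stop thrashing.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 min of low usage before shrinking
      policies:
        - type: Pods
          value: 1                     # remove at most 1 pod...
          periodSeconds: 60            # ...per minute
```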
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/ingress-routing-issues.yaml
ADDED
@@ -0,0 +1,63 @@

meta:
  id: ingress-routing-issues
  level: 2
  course: kubernetes-deployment-troubleshooting
  type: output
  description: "Debug Ingress routing issues — diagnose 404s, TLS errors, and misconfigured path routing in Ingress resources"
  tags: [Kubernetes, Ingress, routing, TLS, nginx-ingress, intermediate]

state: {}

trigger: |
  Your application is returning 404 for all API routes through the
  Ingress, but the app works fine when accessed via port-forward:

  $ curl https://app.example.com/api/users
  <html><body><h1>404 Not Found</h1></body></html>

  $ kubectl port-forward svc/api-service 8080:80
  $ curl http://localhost:8080/api/users
  {"users": [...]}  # Works!

  $ kubectl get ingress
  NAME      CLASS   HOSTS             ADDRESS       PORTS     AGE
  app-ing   nginx   app.example.com   10.0.50.100   80, 443   30m

  $ kubectl describe ingress app-ing
  Rules:
    Host              Path   Backends
    app.example.com
                      /api   api-service:80
  Annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /

  The problem: the rewrite-target annotation rewrites /api/users to
  just / — the backend receives / instead of /api/users. The app needs
  the full path.

  Additionally, HTTPS isn't working:
  $ curl https://app.example.com/
  curl: (60) SSL: certificate subject name does not match target host name

  The TLS secret has a certificate for "*.internal.com", not
  "app.example.com".

  Task: Explain how Kubernetes Ingress works and how to debug routing
  issues. Write: how Ingress controllers route traffic (host-based and
  path-based), path types (Exact, Prefix, ImplementationSpecific),
  the rewrite-target annotation and when to use it, TLS configuration
  and certificate management, and debugging techniques for 404/502/503.

assertions:
  - type: llm_judge
    criteria: "Ingress routing is explained — Ingress resource defines rules mapping hostnames and paths to backend Services. Ingress controller (nginx, traefik, etc.) implements the routing. Path types: Exact (exact match only), Prefix (prefix match, / matches everything), ImplementationSpecific (controller decides). The rewrite-target annotation modifies the URL path before forwarding to the backend"
    weight: 0.35
    description: "Ingress routing explained"
  - type: llm_judge
    criteria: "TLS and common errors are covered — TLS configured via tls section referencing a Kubernetes Secret containing tls.crt and tls.key. Certificate must match the hostname in the Ingress rules. Common errors: 404 (path mismatch, wrong backend, rewrite issues), 502 (backend pod not running or port wrong), 503 (no endpoints, readiness probe failing). cert-manager for automatic certificate management"
    weight: 0.35
    description: "TLS and errors"
  - type: llm_judge
    criteria: "Debugging workflow is practical — check Ingress controller logs (kubectl logs -n ingress-nginx <controller-pod>), verify backend Service exists and has endpoints, test directly via port-forward to isolate Ingress vs app issue, check Ingress controller config (kubectl exec into controller and inspect nginx.conf), verify DNS resolves correctly, check annotations match the controller type"
    weight: 0.30
    description: "Debugging workflow"
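Since the backend needs the full path, the simplest fix is to drop the rewrite-target annotation entirely (capture-group rewrites only make sense when you want to strip a prefix). A sketch of the corrected Ingress; the secretName is an assumed placeholder for a Secret holding a certificate actually issued for app.example.com:

```yaml
# Sketch: no rewrite annotation, so the backend receives /api/users
# unmodified; TLS references a cert that matches the host.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ing
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80
  tls:
    - hosts: [app.example.com]
      secretName: app-example-com-tls  # assumed name; must contain a cert for this host
```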
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/init-container-failures.yaml
ADDED
@@ -0,0 +1,63 @@

meta:
  id: init-container-failures
  level: 2
  course: kubernetes-deployment-troubleshooting
  type: output
  description: "Debug init container failures — diagnose why pods are stuck in Init state and how init containers affect pod startup"
  tags: [Kubernetes, init-containers, pod-startup, dependencies, intermediate]

state: {}

trigger: |
  Your application pod is stuck in Init:0/2 state and never starts:

  $ kubectl get pods
  NAME                   READY   STATUS     RESTARTS   AGE
  webapp-4a5b6c7d-e8f9   0/1     Init:0/2   0          10m

  $ kubectl describe pod webapp-4a5b6c7d-e8f9
  Init Containers:
    wait-for-db:
      Image:    busybox
      Command:  ["sh", "-c", "until nc -z postgres-svc 5432; do echo
                waiting for postgres; sleep 2; done"]
      State:    Running
      Started:  2025-12-01T10:00:00Z
    run-migrations:
      Image:    webapp:v2.0
      Command:  ["./migrate", "--up"]
      State:    Waiting
      Reason:   PodInitializing

  Events:
    Normal Pulled  10m kubelet Container image "busybox" pulled
    Normal Created 10m kubelet Created container wait-for-db
    Normal Started 10m kubelet Started container wait-for-db

  The first init container (wait-for-db) is running forever because
  postgres-svc doesn't exist in this namespace — the database is in the
  "data" namespace and should be referenced as postgres-svc.data.

  The second init container (run-migrations) can't start because init
  containers run sequentially — it waits for wait-for-db to complete.

  Task: Explain init containers and how to debug them. Write: what init
  containers are (run-to-completion before the main container starts), how
  they run sequentially, how to read Init:X/Y status, how to view init
  container logs (kubectl logs <pod> -c <init-container>), common init
  container patterns (wait for dependency, run migrations, download
  config), and how to fix stuck init containers.

assertions:
  - type: llm_judge
    criteria: "Init containers are explained — specialized containers that run to completion before the main container starts. They run sequentially (init-1 must succeed before init-2 starts). Init:0/2 means 0 of 2 init containers have completed. If an init container fails, the pod restarts (applying the restartPolicy). Init containers share volumes with the main container but have their own image and command"
    weight: 0.35
    description: "Init containers explained"
  - type: llm_judge
    criteria: "Debugging workflow is clear — read Init status (Init:X/Y shows progress), kubectl describe pod shows each init container state and events, kubectl logs <pod> -c <init-container-name> shows init container output (critical for debugging), identify if init container is stuck running (infinite wait loop) vs failing (exit code). In this case, the DNS name is wrong — should use cross-namespace DNS"
    weight: 0.35
    description: "Init container debugging"
  - type: llm_judge
    criteria: "Common patterns and fixes covered — patterns: wait-for-dependency (check service availability), run-migrations (database schema changes), download-config (fetch from external source), set-permissions (chown/chmod on shared volumes). Fixes: add timeouts to wait loops, use correct DNS names for cross-namespace services (<svc>.<namespace>.svc.cluster.local), add resource limits to init containers"
    weight: 0.30
    description: "Patterns and fixes"
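The two fixes (cross-namespace DNS name, bounded wait) can be sketched as a corrected init container spec. The 60-retry cap is an illustrative choice, not a value from the scenario:

```yaml
# Sketch: wait-for-db with the cross-namespace service name and a
# timeout so a missing database fails the pod instead of hanging forever.
initContainers:
  - name: wait-for-db
    image: busybox
    command:
      - sh
      - -c
      - |
        for i in $(seq 1 60); do
          nc -z postgres-svc.data 5432 && exit 0
          echo "waiting for postgres ($i/60)"; sleep 2
        done
        echo "timed out waiting for postgres" >&2
        exit 1
```

Exiting non-zero surfaces the failure as Init:Error / Init:CrashLoopBackOff, which is far easier to spot than a container stuck in Running.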
@@ -0,0 +1,66 @@
meta:
  id: intermediate-troubleshooting-shift
  level: 2
  course: kubernetes-deployment-troubleshooting
  type: output
  description: "Combined intermediate troubleshooting shift — diagnose interconnected failures involving storage, networking, RBAC, and scaling"
  tags: [Kubernetes, troubleshooting, combined, shift-simulation, intermediate]

state: {}

trigger: |
  You're investigating a complex outage in the "ecommerce" namespace.
  A Helm upgrade was deployed 30 minutes ago and multiple things broke:

  $ kubectl get pods -n ecommerce
  NAME                        READY   STATUS             RESTARTS   AGE
  catalog-svc-new-7a8b-c9d0   0/1     Pending            0          30m
  catalog-svc-old-1e2f-g3h4   1/1     Running            0          2d
  cart-svc-5i6j7k8l-m9n0      0/1     CrashLoopBackOff   8          30m
  search-svc-1o2p3q4r-s5t6    1/1     Running            0          30m
  payment-svc-7u8v9w0x-y1z2   1/1     Running            0          2d

  $ kubectl get hpa -n ecommerce
  NAME          TARGETS         MINPODS   MAXPODS   REPLICAS
  catalog-svc   <unknown>/80%   2         20        1

  $ kubectl get ingress -n ecommerce
  NAME        HOSTS              ADDRESS   PORTS     AGE
  ecommerce   shop.example.com             80, 443   30m

  Investigation reveals four interconnected issues:

  1. catalog-svc — new pod Pending, PVC can't bind. The Helm upgrade
     changed the StorageClass from "standard" to "premium-ssd", which
     doesn't exist. The old pod is still running because the Deployment
     strategy is RollingUpdate with maxUnavailable=0.

  2. cart-svc — CrashLoopBackOff, logs show "FORBIDDEN: cannot list
     endpoints in namespace ecommerce." The new version uses the
     Kubernetes API for service discovery but the ServiceAccount lacks
     RBAC permissions.

  3. HPA — showing <unknown> targets because the Helm upgrade removed
     resource requests from the catalog-svc Deployment spec.

  4. Ingress — no ADDRESS assigned. The Ingress resource was recreated
     with an invalid ingressClassName that doesn't match any controller.

  Task: Walk through diagnosing and fixing all four issues. Write:
  the triage approach for a Helm-triggered multi-failure incident,
  how to identify the Helm upgrade as the common cause, the fix for
  each issue, whether to roll back the entire Helm release or fix
  forward, and the post-incident review process.

assertions:
  - type: llm_judge
    criteria: "All four issues are diagnosed — (1) PVC binding failure from non-existent StorageClass, (2) RBAC Forbidden error for cart-svc ServiceAccount, (3) HPA <unknown> from missing CPU requests, (4) Ingress no address from wrong ingressClassName. The root cause is identified as the Helm upgrade introducing multiple configuration errors"
    weight: 0.35
    description: "All issues diagnosed"
  - type: llm_judge
    criteria: "Rollback vs fix-forward decision is analyzed — helm rollback is faster and safer when multiple issues exist (reverts all changes at once), but may lose legitimate improvements in the new version. Fix-forward makes sense for single, well-understood issues. In this case with 4 issues, rollback to last good revision is recommended, then fix the chart and re-deploy"
    weight: 0.35
    description: "Rollback decision"
  - type: llm_judge
    criteria: "Post-incident process is covered — review the Helm chart diff (helm diff), add validation checks (helm lint, --dry-run, OPA policies), implement staging environment that mirrors production, use helm test for post-deploy verification, add monitoring alerts for HPA issues and Ingress health, consider GitOps with ArgoCD for change review"
    weight: 0.30
    description: "Post-incident process"
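Issue 2's RBAC error maps directly onto a Role/RoleBinding pair. A sketch, granting only what the Forbidden message names; the ServiceAccount name "cart-svc" is an assumption, since the scenario never states it:

```yaml
# Sketch: allow cart-svc's ServiceAccount to read endpoints in its namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: endpoint-reader
  namespace: ecommerce
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cart-svc-endpoint-reader
  namespace: ecommerce
subjects:
  - kind: ServiceAccount
    name: cart-svc          # assumed ServiceAccount name
    namespace: ecommerce
roleRef:
  kind: Role
  name: endpoint-reader
  apiGroup: rbac.authorization.k8s.io
```

Note that if the whole release is rolled back instead, this manifest becomes unnecessary until the new version ships again.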