dojo.md 0.2.0 → 0.2.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/courses/GENERATION_LOG.md +45 -0
- package/courses/aws-lambda-debugging/course.yaml +11 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/api-gateway-integration.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/cloudwatch-logs-basics.yaml +64 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/cold-start-basics.yaml +70 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/environment-variable-issues.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/first-debugging-shift.yaml +73 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/handler-import-errors.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/iam-permission-errors.yaml +68 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/invocation-errors.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/lambda-timeout-errors.yaml +65 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/memory-and-oom.yaml +70 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/async-invocation-failures.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/cold-start-optimization.yaml +76 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/dynamodb-streams-debugging.yaml +70 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/intermediate-debugging-shift.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/lambda-concurrency-management.yaml +70 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/lambda-layers-debugging.yaml +76 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/sam-local-debugging.yaml +74 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/sqs-event-source.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/vpc-networking-issues.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/xray-tracing.yaml +62 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/advanced-debugging-shift.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/container-image-lambda.yaml +79 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/cross-account-invocation.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/eventbridge-patterns.yaml +79 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/iac-deployment-debugging.yaml +68 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/kinesis-stream-processing.yaml +64 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/lambda-at-edge.yaml +64 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/lambda-extensions-debugging.yaml +67 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/powertools-observability.yaml +79 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/step-functions-debugging.yaml +80 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/cost-optimization-strategy.yaml +67 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/expert-debugging-shift.yaml +62 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/incident-management-serverless.yaml +61 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/multi-region-serverless.yaml +67 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/observability-platform-design.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-architecture-design.yaml +64 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-data-architecture.yaml +66 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-migration-strategy.yaml +65 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-security-design.yaml +60 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-testing-strategy.yaml +62 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/board-serverless-strategy.yaml +63 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/consulting-serverless-adoption.yaml +57 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/industry-serverless-patterns.yaml +62 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/ma-serverless-integration.yaml +75 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/master-debugging-shift.yaml +61 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/organizational-serverless-transformation.yaml +65 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/regulatory-serverless.yaml +61 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/serverless-economics.yaml +65 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/serverless-future-technology.yaml +66 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/serverless-platform-design.yaml +71 -0
- package/courses/docker-container-debugging/course.yaml +11 -0
- package/courses/docker-container-debugging/scenarios/level-1/container-exit-codes.yaml +59 -0
- package/courses/docker-container-debugging/scenarios/level-1/container-networking-basics.yaml +69 -0
- package/courses/docker-container-debugging/scenarios/level-1/docker-logs-debugging.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-1/dockerfile-build-failures.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-1/environment-variable-issues.yaml +74 -0
- package/courses/docker-container-debugging/scenarios/level-1/first-debugging-shift.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-1/image-pull-failures.yaml +68 -0
- package/courses/docker-container-debugging/scenarios/level-1/port-mapping-issues.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-1/resource-limits-oom.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-1/volume-mount-problems.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-2/container-health-checks.yaml +73 -0
- package/courses/docker-container-debugging/scenarios/level-2/docker-compose-debugging.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-2/docker-exec-debugging.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-2/image-layer-optimization.yaml +81 -0
- package/courses/docker-container-debugging/scenarios/level-2/intermediate-debugging-shift.yaml +73 -0
- package/courses/docker-container-debugging/scenarios/level-2/logging-and-log-rotation.yaml +76 -0
- package/courses/docker-container-debugging/scenarios/level-2/multi-stage-build-debugging.yaml +76 -0
- package/courses/docker-container-debugging/scenarios/level-2/network-debugging-tools.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-2/pid1-signal-handling.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-2/security-scanning-basics.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-3/advanced-debugging-shift.yaml +77 -0
- package/courses/docker-container-debugging/scenarios/level-3/buildkit-optimization.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-3/container-filesystem-debugging.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-3/container-security-hardening.yaml +74 -0
- package/courses/docker-container-debugging/scenarios/level-3/disk-space-management.yaml +74 -0
- package/courses/docker-container-debugging/scenarios/level-3/docker-api-automation.yaml +72 -0
- package/courses/docker-container-debugging/scenarios/level-3/docker-daemon-issues.yaml +73 -0
- package/courses/docker-container-debugging/scenarios/level-3/docker-in-docker-ci.yaml +69 -0
- package/courses/docker-container-debugging/scenarios/level-3/overlay-network-debugging.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-3/production-container-ops.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-4/cicd-pipeline-design.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-4/container-monitoring-observability.yaml +63 -0
- package/courses/docker-container-debugging/scenarios/level-4/container-orchestration-strategy.yaml +62 -0
- package/courses/docker-container-debugging/scenarios/level-4/container-performance-engineering.yaml +64 -0
- package/courses/docker-container-debugging/scenarios/level-4/container-security-architecture.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-4/enterprise-image-management.yaml +58 -0
- package/courses/docker-container-debugging/scenarios/level-4/expert-debugging-shift.yaml +63 -0
- package/courses/docker-container-debugging/scenarios/level-4/incident-response-containers.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-4/multi-environment-management.yaml +65 -0
- package/courses/docker-container-debugging/scenarios/level-4/stateful-service-containers.yaml +65 -0
- package/courses/docker-container-debugging/scenarios/level-5/board-infrastructure-strategy.yaml +58 -0
- package/courses/docker-container-debugging/scenarios/level-5/consulting-container-strategy.yaml +61 -0
- package/courses/docker-container-debugging/scenarios/level-5/container-platform-architecture.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-5/container-platform-economics.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-5/container-technology-evolution.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-5/disaster-recovery-containers.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-5/industry-container-patterns.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-5/master-debugging-shift.yaml +62 -0
- package/courses/docker-container-debugging/scenarios/level-5/organizational-transformation.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-5/regulatory-compliance-containers.yaml +61 -0
- package/courses/kubernetes-deployment-troubleshooting/course.yaml +12 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/configmap-secret-issues.yaml +69 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/crashloopbackoff.yaml +68 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/deployment-rollout.yaml +56 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/first-troubleshooting-shift.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/health-probe-failures.yaml +70 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/imagepullbackoff.yaml +57 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/kubectl-debugging-basics.yaml +56 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/oomkilled.yaml +70 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/pending-pods.yaml +68 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/service-not-reachable.yaml +66 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/dns-resolution-failures.yaml +63 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/helm-deployment-failures.yaml +63 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/hpa-scaling-issues.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/ingress-routing-issues.yaml +63 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/init-container-failures.yaml +63 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/intermediate-troubleshooting-shift.yaml +66 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/network-policy-blocking.yaml +67 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/persistent-volume-issues.yaml +69 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/rbac-permission-denied.yaml +57 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/resource-quota-limits.yaml +64 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/advanced-troubleshooting-shift.yaml +69 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/cluster-upgrade-failures.yaml +71 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/gitops-drift-detection.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/job-cronjob-failures.yaml +67 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/monitoring-alerting-gaps.yaml +64 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/multi-container-debugging.yaml +68 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/node-pressure-evictions.yaml +70 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/pod-disruption-budgets.yaml +59 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/service-mesh-debugging.yaml +64 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/statefulset-troubleshooting.yaml +69 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/capacity-planning.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/cost-optimization.yaml +57 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/disaster-recovery-design.yaml +56 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/executive-communication.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/expert-troubleshooting-shift.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/incident-management-process.yaml +59 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-cluster-operations.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-tenancy-design.yaml +55 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/platform-engineering.yaml +59 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/security-hardening.yaml +58 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/behavioral-science.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/board-strategy.yaml +61 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/cloud-native-future.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/comprehensive-platform.yaml +57 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/consulting-engagement.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/industry-benchmarks.yaml +58 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/ma-integration.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/master-troubleshooting-shift.yaml +73 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/product-development.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/regulatory-compliance.yaml +76 -0
- package/courses/mysql-query-optimization/course.yaml +11 -0
- package/courses/mysql-query-optimization/scenarios/level-1/buffer-pool-basics.yaml +65 -0
- package/courses/mysql-query-optimization/scenarios/level-1/explain-basics.yaml +66 -0
- package/courses/mysql-query-optimization/scenarios/level-1/first-optimization-shift.yaml +78 -0
- package/courses/mysql-query-optimization/scenarios/level-1/innodb-index-fundamentals.yaml +68 -0
- package/courses/mysql-query-optimization/scenarios/level-1/join-basics.yaml +66 -0
- package/courses/mysql-query-optimization/scenarios/level-1/n-plus-one-queries.yaml +67 -0
- package/courses/mysql-query-optimization/scenarios/level-1/query-rewriting-basics.yaml +66 -0
- package/courses/mysql-query-optimization/scenarios/level-1/select-star-problems.yaml +68 -0
- package/courses/mysql-query-optimization/scenarios/level-1/slow-query-diagnosis.yaml +65 -0
- package/courses/mysql-query-optimization/scenarios/level-1/where-clause-optimization.yaml +65 -0
- package/courses/mysql-query-optimization/scenarios/level-2/buffer-pool-tuning.yaml +64 -0
- package/courses/mysql-query-optimization/scenarios/level-2/composite-index-design.yaml +71 -0
- package/courses/mysql-query-optimization/scenarios/level-2/covering-and-invisible-indexes.yaml +69 -0
- package/courses/mysql-query-optimization/scenarios/level-2/cte-and-window-functions.yaml +78 -0
- package/courses/mysql-query-optimization/scenarios/level-2/intermediate-optimization-shift.yaml +68 -0
- package/courses/mysql-query-optimization/scenarios/level-2/join-optimization.yaml +67 -0
- package/courses/mysql-query-optimization/scenarios/level-2/performance-schema-analysis.yaml +69 -0
- package/courses/mysql-query-optimization/scenarios/level-2/query-optimizer-hints.yaml +74 -0
- package/courses/mysql-query-optimization/scenarios/level-2/subquery-optimization.yaml +70 -0
- package/courses/mysql-query-optimization/scenarios/level-2/write-optimization.yaml +63 -0
- package/courses/mysql-query-optimization/scenarios/level-3/advanced-optimization-shift.yaml +71 -0
- package/courses/mysql-query-optimization/scenarios/level-3/connection-management.yaml +67 -0
- package/courses/mysql-query-optimization/scenarios/level-3/full-text-search.yaml +77 -0
- package/courses/mysql-query-optimization/scenarios/level-3/json-optimization.yaml +87 -0
- package/courses/mysql-query-optimization/scenarios/level-3/lock-contention-analysis.yaml +68 -0
- package/courses/mysql-query-optimization/scenarios/level-3/monitoring-alerting.yaml +63 -0
- package/courses/mysql-query-optimization/scenarios/level-3/online-schema-changes.yaml +79 -0
- package/courses/mysql-query-optimization/scenarios/level-3/partitioning-strategies.yaml +83 -0
- package/courses/mysql-query-optimization/scenarios/level-3/query-profiling-deep-dive.yaml +84 -0
- package/courses/mysql-query-optimization/scenarios/level-3/replication-optimization.yaml +66 -0
- package/courses/mysql-query-optimization/scenarios/level-4/aurora-vs-rds-evaluation.yaml +61 -0
- package/courses/mysql-query-optimization/scenarios/level-4/data-architecture.yaml +62 -0
- package/courses/mysql-query-optimization/scenarios/level-4/database-migration-planning.yaml +59 -0
- package/courses/mysql-query-optimization/scenarios/level-4/enterprise-governance.yaml +50 -0
- package/courses/mysql-query-optimization/scenarios/level-4/executive-communication.yaml +54 -0
- package/courses/mysql-query-optimization/scenarios/level-4/expert-optimization-shift.yaml +67 -0
- package/courses/mysql-query-optimization/scenarios/level-4/high-availability-architecture.yaml +60 -0
- package/courses/mysql-query-optimization/scenarios/level-4/optimizer-internals.yaml +62 -0
- package/courses/mysql-query-optimization/scenarios/level-4/performance-sla-design.yaml +52 -0
- package/courses/mysql-query-optimization/scenarios/level-4/read-replica-scaling.yaml +51 -0
- package/courses/mysql-query-optimization/scenarios/level-5/ai-database-future.yaml +45 -0
- package/courses/mysql-query-optimization/scenarios/level-5/behavioral-science.yaml +44 -0
- package/courses/mysql-query-optimization/scenarios/level-5/benchmark-design.yaml +47 -0
- package/courses/mysql-query-optimization/scenarios/level-5/board-strategy.yaml +48 -0
- package/courses/mysql-query-optimization/scenarios/level-5/comprehensive-platform.yaml +49 -0
- package/courses/mysql-query-optimization/scenarios/level-5/consulting-engagement.yaml +52 -0
- package/courses/mysql-query-optimization/scenarios/level-5/ma-database-integration.yaml +47 -0
- package/courses/mysql-query-optimization/scenarios/level-5/master-optimization-shift.yaml +56 -0
- package/courses/mysql-query-optimization/scenarios/level-5/product-development.yaml +48 -0
- package/courses/mysql-query-optimization/scenarios/level-5/regulatory-compliance.yaml +48 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/comprehensive-database-system.yaml +70 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-ai-future.yaml +81 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-behavioral-science.yaml +63 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-board-strategy.yaml +77 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-consulting-engagement.yaml +61 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-industry-benchmarks.yaml +64 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-ma-integration.yaml +71 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-product-development.yaml +72 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-regulatory-landscape.yaml +76 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/master-optimization-shift.yaml +66 -0
- package/courses/terraform-infrastructure-setup/course.yaml +11 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-1/terraform-init-errors.yaml +72 -0
- package/dist/mcp/session-manager.d.ts +7 -4
- package/dist/mcp/session-manager.d.ts.map +1 -1
- package/dist/mcp/session-manager.js +23 -8
- package/dist/mcp/session-manager.js.map +1 -1
- package/package.json +1 -1

package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/multi-container-debugging.yaml
ADDED
@@ -0,0 +1,68 @@
+meta:
+  id: multi-container-debugging
+  level: 3
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Debug multi-container pods — diagnose sidecar issues, shared volume problems, and inter-container communication failures"
+  tags: [Kubernetes, multi-container, sidecar, init-container, volumes, advanced]
+
+state: {}
+
+trigger: |
+  Your application pod has 3 containers: the main app, a log-shipper
+  sidecar, and a config-reloader sidecar. The pod shows 2/3 Ready:
+
+  $ kubectl get pods
+  NAME                 READY   STATUS    RESTARTS   AGE
+  webapp-5a6b7c-d8e9   2/3     Running   0          10m
+
+  $ kubectl describe pod webapp-5a6b7c-d8e9
+  Containers:
+    webapp:
+      State:    Running
+      Ready:    True
+
+    log-shipper:
+      State:    Running
+      Ready:    True
+
+    config-reloader:
+      State:      Running
+      Ready:      False
+      Readiness:  http-get http://:9091/healthz
+      Message:    Readiness probe failed: connection refused
+
+  The config-reloader container is Running but not Ready. Its readiness
+  probe targets port 9091 but the container is actually listening on
+  9090 (port mismatch in the probe config).
+
+  Because one container isn't Ready, the entire pod is not fully Ready,
+  which means the Service might remove it from endpoints if the Service
+  requires all containers Ready.
+
+  Additional issues discovered:
+  - log-shipper can't read app logs because the shared volume mount
+    path is wrong (/var/log/app vs /var/log/webapp)
+  - config-reloader watches a ConfigMap volume but the ConfigMap update
+    propagation delay (up to 60s) causes stale config reads
+
+  Task: Explain multi-container pod debugging. Write: how containers
+  in a pod share resources (network, storage, but NOT filesystem by
+  default), sidecar container patterns (logging, config reload, proxy),
+  how pod readiness is determined (ALL containers must be Ready), how
+  shared volumes work between containers, troubleshooting techniques
+  for each container (-c flag), and native sidecar containers (K8s 1.29+).
+
+assertions:
+  - type: llm_judge
+    criteria: "Multi-container pod model is explained — containers in a pod share network namespace (same IP, localhost communication), can share volumes via volumeMounts, but have separate filesystems otherwise. Pod is Ready only when ALL containers pass readiness probes. Each container has independent lifecycle, restart policy, and resource limits. Logs and exec require -c <container> flag"
+    weight: 0.35
+    description: "Pod model explained"
+  - type: llm_judge
+    criteria: "Sidecar patterns and debugging are covered — logging sidecar reads from shared volume (both containers must mount the SAME path), config-reloader watches ConfigMap volumes (updates propagate every kubelet sync period ~60s), proxy sidecar shares network namespace. Debug: kubectl logs <pod> -c <container>, kubectl exec <pod> -c <container>, check shared volume mount paths match, verify port configurations per container"
+    weight: 0.35
+    description: "Sidecar debugging"
+  - type: llm_judge
+    criteria: "Native sidecars and best practices are covered — Kubernetes 1.29+ native sidecar containers (init containers with restartPolicy: Always) start before and stop after the main container, solving lifecycle ordering issues. Best practices: use emptyDir for shared volumes, ensure consistent mount paths, set resource limits per container, use separate health endpoints per container, consider if you really need sidecars vs a separate deployment"
+    weight: 0.30
+    description: "Native sidecars and practices"
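
The native-sidecar assertion above is easiest to see in a manifest. A minimal sketch, assuming Kubernetes 1.29+ with the SidecarContainers feature; the images and mount path are hypothetical, chosen to mirror the scenario's log-shipper:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webapp
spec:
  initContainers:
    - name: log-shipper
      image: fluent/fluent-bit:2.2        # hypothetical shipper image
      restartPolicy: Always               # this field turns an init container into a native sidecar
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/webapp      # must match the path the app container writes to
  containers:
    - name: webapp
      image: example/webapp:1.0           # hypothetical app image
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/webapp
  volumes:
    - name: app-logs
      emptyDir: {}                        # shared scratch volume, as the assertions recommend
```

The sidecar starts before webapp and stops after it, which is the lifecycle-ordering fix the third assertion describes.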
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/node-pressure-evictions.yaml
ADDED
@@ -0,0 +1,70 @@
+meta:
+  id: node-pressure-evictions
+  level: 3
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Debug node pressure and pod evictions — diagnose DiskPressure, MemoryPressure, and PIDPressure conditions causing pod disruptions"
+  tags: [Kubernetes, node-pressure, eviction, DiskPressure, MemoryPressure, advanced]
+
+state: {}
+
+trigger: |
+  Pods are being evicted from nodes across your cluster, causing
+  intermittent service disruptions:
+
+  $ kubectl get pods --field-selector=status.phase=Failed
+  NAME                    STATUS   REASON    AGE
+  logger-svc-5a6b-c7d8    Failed   Evicted   15m
+  metrics-agg-9e0f-g1h2   Failed   Evicted   15m
+  cache-warm-3i4j-k5l6    Failed   Evicted   14m
+
+  $ kubectl describe node worker-3
+  Conditions:
+    Type             Status   Reason
+    MemoryPressure   True     KubeletHasInsufficientMemory
+    DiskPressure     True     KubeletHasDiskPressure
+    PIDPressure      False    KubeletHasSufficientPID
+    Ready            True     KubeletReady
+
+  Taints:
+    node.kubernetes.io/memory-pressure:NoSchedule
+    node.kubernetes.io/disk-pressure:NoSchedule
+
+  Allocated resources:
+    CPU Requests: 7200m (90%), Memory Requests: 28Gi (87%)
+    CPU Limits: 14000m (175%), Memory Limits: 56Gi (175%)
+
+  The node has memory and disk pressure. Kubernetes automatically added
+  taints to prevent new pods from scheduling. The kubelet is evicting
+  pods based on QoS class:
+  - BestEffort pods evicted first
+  - Then Burstable pods exceeding requests
+  - Guaranteed pods only if exceeding limits
+
+  The overcommitment ratio is concerning: limits are 175% of node
+  capacity. If all pods try to use their limits simultaneously, the
+  node will be severely overcommitted.
+
+  $ kubectl top node worker-3
+  NAME       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
+  worker-3   6800m        85%    30Gi            94%
+
+  Task: Explain node pressure conditions and eviction. Write: the three
+  pressure types (memory, disk, PID), how kubelet eviction thresholds
+  work (soft vs hard), eviction order by QoS class, how Kubernetes
+  automatically taints pressured nodes, the relationship between
+  requests/limits and overcommitment, and strategies to prevent evictions.
+
+assertions:
+  - type: llm_judge
+    criteria: "Node pressure types are explained — MemoryPressure (memory usage exceeds threshold), DiskPressure (disk usage exceeds threshold, includes container images and logs), PIDPressure (process IDs exhausted). Kubelet monitors these and sets node conditions. Automatic taints are applied: node.kubernetes.io/<condition>:NoSchedule prevents new pods from scheduling on pressured nodes"
+    weight: 0.35
+    description: "Pressure types explained"
+  - type: llm_judge
+    criteria: "Eviction thresholds and order are explained — soft evictions: kubelet waits eviction-soft-grace-period before evicting (e.g., memory.available < 100Mi for 30s). Hard evictions: immediate eviction when threshold crossed (e.g., memory.available < 50Mi). Eviction order: BestEffort first, then Burstable pods exceeding requests sorted by usage, then Guaranteed only if exceeding limits. Pods using more than requested are evicted before those within requests"
+    weight: 0.35
+    description: "Eviction mechanics"
+  - type: llm_judge
+    criteria: "Prevention strategies are practical — set appropriate resource requests and limits (avoid overcommitment), use Guaranteed QoS for critical workloads, implement PodDisruptionBudgets to limit simultaneous evictions, monitor node capacity (kubectl top node, Prometheus), use cluster autoscaler to add nodes before pressure, set ResourceQuotas per namespace, configure kubelet eviction thresholds appropriately"
+    weight: 0.30
+    description: "Prevention strategies"
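
For the soft-vs-hard threshold distinction in the second assertion, a KubeletConfiguration fragment makes the mechanics concrete. A sketch only; the threshold values are illustrative, not recommendations:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:                      # evict only after the matching grace period below
  memory.available: "500Mi"
  nodefs.available: "15%"
evictionSoftGracePeriod:
  memory.available: "1m30s"
  nodefs.available: "2m"
evictionHard:                      # evict immediately when crossed
  memory.available: "200Mi"
  nodefs.available: "10%"
```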
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/pod-disruption-budgets.yaml
ADDED
@@ -0,0 +1,59 @@
+meta:
+  id: pod-disruption-budgets
+  level: 3
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Debug PodDisruptionBudget issues — diagnose why node drains hang, upgrades stall, and voluntary disruptions are blocked"
+  tags: [Kubernetes, PDB, PodDisruptionBudget, node-drain, upgrades, advanced]
+
+state: {}
+
+trigger: |
+  You need to drain a node for maintenance but the drain command hangs:
+
+  $ kubectl drain worker-2 --ignore-daemonsets --delete-emptydir-data
+  evicting pod app/critical-svc-7a8b-c9d0
+  evicting pod app/critical-svc-1e2f-g3h4
+  error when evicting pods/"critical-svc-7a8b-c9d0" -n "app":
+  Cannot evict pod as it would violate the pod's disruption budget.
+
+  $ kubectl get pdb -n app
+  NAME           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
+  critical-pdb   3               N/A               0                     7d
+
+  $ kubectl get pods -l app=critical-svc -n app
+  NAME                     READY   STATUS    RESTARTS   AGE
+  critical-svc-7a8b-c9d0   1/1     Running   0          2d   (worker-2)
+  critical-svc-1e2f-g3h4   1/1     Running   0          2d   (worker-2)
+  critical-svc-5i6j-k7l8   1/1     Running   0          2d   (worker-3)
+
+  The PDB requires minAvailable=3 but there are only 3 pods total,
+  two of which are on the node being drained. Evicting either would
+  drop below the minimum.
+
+  This also blocks cluster autoscaler from removing underutilized nodes
+  and blocks Kubernetes version upgrades that require node rotation.
+
+  Compounding the issue: the HPA has minReplicas=3, so it can't scale
+  up additional pods to create room for the drain.
+
+  Task: Explain PodDisruptionBudgets and how to debug drain issues.
+  Write: what PDBs protect against (voluntary disruptions), minAvailable
+  vs maxUnavailable, how PDBs interact with node drains and cluster
+  upgrades, the difference between voluntary and involuntary disruptions,
+  common PDB misconfigurations, and how to properly configure PDBs for
+  maintenance windows.
+
+assertions:
+  - type: llm_judge
+    criteria: "PDB behavior is explained — PDBs limit voluntary disruptions (node drain, cluster upgrade, pod eviction) but NOT involuntary disruptions (node crash, OOM kill, hardware failure). minAvailable: minimum pods that must be running. maxUnavailable: maximum pods that can be down simultaneously. ALLOWED DISRUPTIONS shows how many pods can currently be evicted without violating the budget"
+    weight: 0.35
+    description: "PDB behavior"
+  - type: llm_judge
+    criteria: "The misconfiguration is diagnosed — minAvailable=3 with 3 total pods means ALLOWED DISRUPTIONS=0 (can never evict any pod). Fix: use maxUnavailable=1 instead (allows 1 pod down at a time), or set minAvailable to N-1 (e.g., 2 for 3 replicas). Must also ensure HPA maxReplicas allows scaling up to create headroom for drains. PDB blocks kubectl drain, cluster autoscaler scale-down, and rolling node upgrades"
+    weight: 0.35
+    description: "Misconfiguration diagnosed"
+  - type: llm_judge
+    criteria: "Best practices are practical — use maxUnavailable instead of minAvailable for easier reasoning, ensure PDB allows at least 1 disruption at all times, coordinate PDB with HPA (HPA should scale up before drain to maintain minimum), use --timeout on kubectl drain to detect stuck drains, temporary PDB modification for emergency maintenance, test PDBs with kubectl drain --dry-run"
+    weight: 0.30
+    description: "Best practices"
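
The fix the second assertion points to is a one-line change in the PDB. A sketch of the corrected manifest for the scenario's critical-svc:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-pdb
  namespace: app
spec:
  maxUnavailable: 1        # instead of minAvailable: 3; always leaves room for one eviction
  selector:
    matchLabels:
      app: critical-svc
```

With 3 replicas, ALLOWED DISRUPTIONS becomes 1 and the drain can proceed one pod at a time.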
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/service-mesh-debugging.yaml
ADDED
@@ -0,0 +1,64 @@
+meta:
+  id: service-mesh-debugging
+  level: 3
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Debug service mesh issues — diagnose Istio/Linkerd sidecar injection failures, mTLS errors, and traffic routing problems"
+  tags: [Kubernetes, service-mesh, Istio, Linkerd, sidecar, mTLS, advanced]
+
+state: {}
+
+trigger: |
+  After enabling Istio service mesh, several services are broken:
+
+  $ kubectl get pods -n bookstore
+  NAME                      READY   STATUS    RESTARTS   AGE
+  catalog-svc-7a8b9c-d0e1   2/2     Running   0          10m
+  order-svc-2f3g4h-i5j6     1/1     Running   0          10m
+  payment-svc-7k8l9m-n0p1   2/2     Running   0          10m
+
+  Notice: catalog-svc and payment-svc show 2/2 (app + istio-proxy
+  sidecar), but order-svc shows 1/1 — the sidecar wasn't injected.
+
+  $ kubectl get namespace bookstore --show-labels
+  NAME        STATUS   AGE   LABELS
+  bookstore   Active   1d    istio-injection=enabled
+
+  The namespace has automatic injection enabled, so why didn't order-svc
+  get a sidecar?
+
+  $ kubectl get deployment order-svc -o yaml | grep -A2 annotations
+  annotations:
+    sidecar.istio.io/inject: "false"
+
+  Someone explicitly disabled injection for order-svc. Now there's a
+  connectivity problem:
+
+  $ kubectl logs catalog-svc-7a8b9c-d0e1 -c istio-proxy
+  upstream connect error or disconnect/reset before headers. reset
+  reason: connection failure, transport failure reason: TLS error:
+  268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
+
+  catalog-svc (with mTLS via sidecar) can't talk to order-svc (no
+  sidecar, no mTLS). Strict mTLS mode requires both sides to have the
+  proxy.
+
+  Task: Explain service mesh troubleshooting. Write: how sidecar
+  injection works (namespace label, annotation override), mTLS between
+  services (what happens when one side lacks a sidecar), traffic routing
+  with VirtualService/DestinationRule, how to debug with istioctl
+  analyze and proxy logs, and common service mesh issues.
+
+assertions:
+  - type: llm_judge
+    criteria: "Sidecar injection is explained — automatic injection via namespace label (istio-injection=enabled), per-pod override via annotation (sidecar.istio.io/inject: true/false). The mutating webhook intercepts pod creation and adds the sidecar container. Pods created before labeling the namespace need restart. Injection can fail if webhook is misconfigured or the annotation explicitly disables it"
+    weight: 0.35
+    description: "Sidecar injection"
+  - type: llm_judge
+    criteria: "mTLS issues are diagnosed — strict mTLS mode requires both client and server to have Istio sidecar proxies for mutual TLS authentication. If one pod lacks a sidecar, the TLS handshake fails (CERTIFICATE_VERIFY_FAILED). Solutions: enable sidecar on all services, or use permissive mode (PeerAuthentication) to allow both plaintext and mTLS during migration"
+    weight: 0.35
+    description: "mTLS issues"
+  - type: llm_judge
+    criteria: "Debugging tools are covered — istioctl analyze checks configuration for issues, istioctl proxy-status shows sync status between control plane and proxies, istioctl proxy-config shows proxy configuration, kubectl logs -c istio-proxy for proxy errors. VirtualService/DestinationRule for traffic routing, retries, timeouts. Common issues: port naming conventions (http-, grpc-), protocol detection failures"
+    weight: 0.30
+    description: "Debugging tools"
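
The permissive-mode escape hatch mentioned in the mTLS assertion looks like this in manifest form; a sketch assuming Istio's security.istio.io/v1beta1 API:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: bookstore
spec:
  mtls:
    mode: PERMISSIVE    # accept both mTLS and plaintext while order-svc lacks a sidecar
```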

package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/statefulset-troubleshooting.yaml
ADDED
@@ -0,0 +1,69 @@
+meta:
+  id: statefulset-troubleshooting
+  level: 3
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Debug StatefulSet issues — diagnose ordered startup failures, persistent volume problems, and headless service misconfigurations"
+  tags: [Kubernetes, StatefulSet, ordered-deployment, PVC, headless-service, advanced]
+
+state: {}
+
+trigger: |
+  Your Kafka cluster StatefulSet is partially failed. Pod kafka-0 runs
+  fine but kafka-1 and kafka-2 won't start:
+
+  $ kubectl get pods -l app=kafka
+  NAME      READY   STATUS    RESTARTS   AGE
+  kafka-0   1/1     Running   0          1h
+  kafka-1   0/1     Pending   0          1h
+  kafka-2   0/1     Pending   0          1h
+
+  StatefulSets use OrderedReady pod management by default — kafka-1
+  won't even be attempted until kafka-0 is Running and Ready, and
+  kafka-2 waits for kafka-1. But kafka-0 IS Running...
+
+  $ kubectl describe pod kafka-1
+  Events:
+    Warning  FailedScheduling  5m  default-scheduler
+    0/5 nodes are available: 5 node(s) had volume node affinity conflict
+
+  $ kubectl get pvc -l app=kafka
+  NAME           STATUS    VOLUME          CAPACITY   STORAGECLASS
+  data-kafka-0   Bound     pv-us-east-1a   50Gi       gp3
+  data-kafka-1   Bound     pv-us-east-1b   50Gi       gp3
+  data-kafka-2   Pending                              gp3
+
+  PVC data-kafka-1 is bound to a PV in us-east-1b, but there are no
+  nodes in that AZ anymore (a node was decommissioned). PVC data-kafka-2
+  can't provision at all.
+
+  Additionally, the headless Service for Kafka is misconfigured:
+
+  $ kubectl get svc kafka-headless
+  NAME             TYPE        CLUSTER-IP   PORT(S)
+  kafka-headless   ClusterIP   10.96.1.50   9092/TCP
+
+  It has a ClusterIP! Headless services must have clusterIP: None to
+  return individual pod IPs for StatefulSet DNS entries like
+  kafka-0.kafka-headless.default.svc.cluster.local.
+
+  Task: Explain StatefulSet-specific troubleshooting. Write: how
+  StatefulSets differ from Deployments (ordered, stable identity,
+  stable storage), OrderedReady vs Parallel pod management, PVC
+  lifecycle (PVCs persist across pod restarts and aren't auto-deleted),
+  headless Service requirements, volume node affinity issues, and
+  common StatefulSet failure patterns.
+
+assertions:
+  - type: llm_judge
+    criteria: "StatefulSet semantics are explained — ordered pod creation/deletion (pod-0 before pod-1), stable network identity (pod-name.headless-svc.namespace.svc.cluster.local), stable persistent storage (PVCs bound to specific pods and persist across restarts/rescheduling). PVCs are NOT deleted when StatefulSet is scaled down — must be manually cleaned up. OrderedReady: wait for Ready before next pod. Parallel: all pods simultaneously"
+    weight: 0.35
+    description: "StatefulSet semantics"
+  - type: llm_judge
+    criteria: "Volume and headless Service issues are diagnosed — PVCs bound to specific PVs may have zone affinity, preventing scheduling if no nodes exist in that zone. Headless Service must have clusterIP: None to enable DNS A records for individual pods. With a ClusterIP, DNS returns the cluster IP instead of pod IPs, breaking StatefulSet peer discovery (Kafka, etcd, ZooKeeper need direct pod addressing)"
+    weight: 0.35
+    description: "Volume and DNS issues"
+  - type: llm_judge
+    criteria: "Fixes and patterns are practical — for zone affinity: use WaitForFirstConsumer volumeBindingMode to co-locate PV with pod's node, or ensure nodes exist in all zones. For stuck PVCs: delete PVC and let StatefulSet recreate it (data loss!). For headless Service: set clusterIP: None. For rolling updates: use partition field to do canary-style updates. Monitor StatefulSet with kubectl rollout status"
+    weight: 0.30
+    description: "Fixes and patterns"
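
Both fixes from the third assertion, sketched as manifests. The StorageClass assumes the AWS EBS CSI driver (ebs.csi.aws.com); swap the provisioner for your platform:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kafka-headless
spec:
  clusterIP: None          # headless: DNS returns per-pod A records, not a virtual IP
  selector:
    app: kafka
  ports:
    - port: 9092
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wait
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer   # bind the PV only after the pod is scheduled
```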
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/capacity-planning.yaml
ADDED
@@ -0,0 +1,65 @@
+meta:
+  id: capacity-planning
+  level: 4
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Design Kubernetes capacity planning strategy — right-sizing workloads, cost optimization, and scaling architecture for growth"
+  tags: [Kubernetes, capacity-planning, cost-optimization, right-sizing, scaling, expert]
+
+state: {}
+
+trigger: |
+  Your CTO wants a capacity planning review. The Kubernetes platform
+  serves 50 microservices with monthly cloud costs of $180,000. Key
+  metrics from the past quarter:
+
+  Resource Utilization Summary:
+  | Metric                  | Current | Target |
+  |-------------------------|---------|--------|
+  | Avg CPU utilization     | 23%     | 65%    |
+  | Avg Memory utilization  | 34%     | 70%    |
+  | Node count              | 85      | ???    |
+  | Monthly cost            | $180K   | $120K  |
+  | Overprovisioned pods    | 72%     | <20%   |
+
+  Analysis reveals:
+  1. 72% of pods request 3-5x more CPU/memory than they actually use.
+     Example: order-svc requests 2 CPU, 4Gi memory but averages 200m
+     CPU and 800Mi memory.
+
+  2. No VPA is configured — all requests were set by developers during
+     initial deployment and never adjusted.
+
+  3. HPA min/max are set too conservatively — several services have
+     minReplicas=5 but traffic analysis shows they need only 2 during
+     off-peak (midnight-6am).
+
+  4. Spot/preemptible nodes are not used at all — everything runs on
+     on-demand instances.
+
+  5. No cluster autoscaler — nodes are manually provisioned. Some nodes
+     run at 90%+ while others sit at 15%.
+
+  6. The team is planning for 3x growth in the next year (Black Friday
+     peak expected to be 10x normal traffic).
+
+  Task: Design a comprehensive capacity planning strategy. Write:
+  right-sizing methodology (VPA recommendations, historical analysis),
+  cost optimization techniques (spot instances, bin-packing, reserved
+  instances), autoscaling architecture (HPA + cluster autoscaler +
+  Karpenter), capacity modeling for growth, and how to present the
+  business case for optimization investment.
+
+assertions:
+  - type: llm_judge
+    criteria: "Right-sizing methodology is explained — use VPA in recommendation mode to analyze actual usage vs requests, review P95/P99 usage (not average) for requests, set limits at 2x requests for burstable workloads. Start with non-critical services, measure for 2 weeks minimum. Expected savings: reducing from 3-5x overprovisioning to 1.5x can cut costs 40-60%"
+    weight: 0.35
+    description: "Right-sizing methodology"
+  - type: llm_judge
+    criteria: "Cost optimization techniques are comprehensive — spot/preemptible instances for stateless workloads (60-90% savings), Karpenter for intelligent node provisioning (right-sizes node types automatically), bin-packing optimization (consolidate workloads to fewer nodes), reserved instances for baseline capacity, scheduled scaling for predictable traffic patterns, namespace-level budgets with ResourceQuotas"
+    weight: 0.35
+    description: "Cost optimization"
+  - type: llm_judge
+    criteria: "Growth planning is practical — capacity model: current baseline × growth factor × headroom (1.5x) for peak. Autoscaling architecture: HPA for pod-level scaling, cluster autoscaler/Karpenter for node-level. Load testing to validate scaling limits. Black Friday planning: pre-scale 24h before, warm caches, increase HPA limits, pre-provision spot capacity. Business case: $60K/month savings pays for 2 FTE platform engineers"
+    weight: 0.30
+    description: "Growth planning"
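
The right-sizing methodology in the first assertion starts with VPA in recommendation-only mode. A sketch, assuming the VPA CRDs are installed, targeting the scenario's over-provisioned order-svc:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: order-svc-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-svc
  updatePolicy:
    updateMode: "Off"    # recommend only; never evicts pods to apply changes
```

After a couple of weeks of observation, `kubectl describe vpa order-svc-vpa` shows target and upper-bound recommendations to compare against the current 2 CPU / 4Gi requests.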
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/cost-optimization.yaml
ADDED
@@ -0,0 +1,57 @@
+meta:
+  id: cost-optimization
+  level: 4
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Design Kubernetes cost optimization strategy — FinOps practices, spot instances, Karpenter, and cost attribution for multi-team clusters"
+  tags: [Kubernetes, cost-optimization, FinOps, spot-instances, Karpenter, expert]
+
+state: {}
+
+trigger: |
+  The CFO has flagged Kubernetes cloud costs as growing 3x faster than
+  revenue. Your monthly bill breakdown:
+
+  | Category                | Monthly Cost | % of Total |
+  |-------------------------|--------------|------------|
+  | EC2 compute (on-demand) | $95,000      | 55%        |
+  | EBS storage             | $25,000      | 15%        |
+  | Data transfer           | $20,000      | 12%        |
+  | Load balancers          | $12,000      | 7%         |
+  | NAT gateways            | $10,000      | 6%         |
+  | Other (ECR, Route53)    | $8,000       | 5%         |
+  | **Total**               | **$170,000** |            |
+
+  Cost attribution analysis:
+  - 4 of 12 teams account for 70% of compute costs
+  - The ML team runs GPU instances 24/7 but only uses them 8 hours/day
+  - 3 staging environments run full replicas of production (identical
+    node count) but get 5% of the traffic
+  - 40% of pods are overprovisioned by 3x or more
+  - No spot instances are used anywhere
+  - Each service has its own LoadBalancer ($18/mo each × 50 services)
+  - NAT gateway costs are high because pods pull images from public
+    registries on every deployment
+
+  Target: Reduce monthly costs from $170K to $100K within 6 months
+  without reducing availability.
+
+  Task: Design the cost optimization strategy. Write: the quick wins
+  (what can save money this month), medium-term optimizations (1-3
+  months), structural changes (3-6 months), cost attribution and
+  showback model for teams, FinOps culture practices, and how to
+  maintain cost discipline as the platform grows.
+
+assertions:
+  - type: llm_judge
+    criteria: "Quick wins are specific with estimated savings — right-size staging (reduce to 20% of prod capacity: save ~$20K/mo), schedule GPU instances off-hours (save ~$8K/mo), consolidate LoadBalancers into shared Ingress controller (save ~$7K/mo), set up ECR pull-through cache for public images (save ~$3K/mo on NAT). Each quick win has clear implementation steps and expected savings"
+    weight: 0.35
+    description: "Quick wins"
+  - type: llm_judge
+    criteria: "Medium and structural optimizations are comprehensive — medium: implement Karpenter for intelligent node provisioning (right-size instances automatically, mix instance types), use spot instances for stateless workloads (60-90% savings), implement VPA for right-sizing pod requests. Structural: reserved instances or savings plans for baseline capacity, implement cost attribution tags per team/service, automated idle resource detection and cleanup, data transfer optimization (VPC endpoints, regional caching)"
+    weight: 0.35
+    description: "Optimization strategy"
+  - type: llm_judge
+    criteria: "FinOps culture and sustainability are addressed — cost attribution model: tag all resources by team/service, monthly cost reports per team, set team-level budgets with alerts. FinOps practices: cost reviews in sprint retrospectives, 'cost of change' in PR reviews, engineering KPI for cost efficiency (cost per request). Governance: require cost estimates for new services, auto-scale-down for non-production during off-hours, regular quarterly cost optimization sprints"
+    weight: 0.30
+    description: "FinOps culture"
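
One concrete form of the spot-instance move from the medium-term list: pin a stateless Deployment to spot capacity via a nodeSelector. A sketch; the karpenter.sh/capacity-type label assumes Karpenter-provisioned nodes (EKS managed node groups use eks.amazonaws.com/capacityType: SPOT instead), and the image is hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-svc
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-svc
  template:
    metadata:
      labels:
        app: order-svc
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot   # stateless, so spot interruption is acceptable
      containers:
        - name: order-svc
          image: example/order-svc:1.0     # hypothetical image
          resources:
            requests:                      # right-sized to observed usage, not guesses
              cpu: 200m
              memory: 800Mi
            limits:
              cpu: 400m
              memory: 1600Mi
```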

package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/disaster-recovery-design.yaml
ADDED
@@ -0,0 +1,56 @@
+meta:
+  id: disaster-recovery-design
+  level: 4
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Design Kubernetes disaster recovery strategy — backup architecture, recovery procedures, RPO/RTO targets, and DR testing"
+  tags: [Kubernetes, disaster-recovery, backup, Velero, RPO, RTO, expert]
+
+state: {}
+
+trigger: |
+  After a near-miss incident where an engineer accidentally deleted a
+  production namespace, leadership wants a comprehensive disaster
+  recovery strategy. Currently there are no cluster backups.
+
+  Requirements from leadership:
+  - RPO: Maximum 1 hour of data loss for critical services
+  - RTO: Maximum 15 minutes to restore critical services
+  - Scope: 3 production clusters, 2 staging, 150+ namespaces
+  - Compliance: SOC2 requires documented DR procedures and annual testing
+  - Budget: $15K/month for DR infrastructure
+
+  Current gaps:
+  1. No etcd backups — complete cluster loss means rebuilding from scratch
+  2. No PV snapshots — database data would be lost entirely
+  3. Manifests are in Git but not all (some resources created via kubectl)
+  4. No DR runbook exists — team has never practiced recovery
+  5. Secrets are stored only in the cluster (not in external vault)
+  6. No cross-region replication for any stateful services
+
+  Disaster scenarios to plan for:
+  A. Accidental namespace deletion (most common)
+  B. etcd corruption or loss
+  C. Complete cluster failure
+  D. Regional outage (entire AZ/region down)
+  E. Ransomware/security breach requiring clean rebuild
+
+  Task: Design a comprehensive Kubernetes DR strategy. Write: the backup
+  architecture (Velero + etcd snapshots + PV snapshots), recovery
+  procedures for each scenario, RPO/RTO analysis and trade-offs, DR
+  testing plan (chaos engineering, game days), secrets management for
+  DR, and the cost analysis for the proposed solution.
+
+assertions:
+  - type: llm_judge
+    criteria: "Backup architecture is comprehensive — Velero for Kubernetes resource backup (scheduled, namespace-scoped), etcd snapshots for cluster state (every 30 min for 1-hour RPO), CSI volume snapshots for persistent data, cross-region backup storage (S3 cross-region replication). GitOps as source of truth for declarative resources. External secrets manager (HashiCorp Vault) so secrets survive cluster loss"
+    weight: 0.35
+    description: "Backup architecture"
+  - type: llm_judge
+    criteria: "Recovery procedures per scenario are defined — (A) Namespace deletion: Velero restore with namespace filter (< 5 min RTO), (B) etcd corruption: restore from etcd snapshot (10-15 min), (C) Cluster failure: provision new cluster + Velero full restore (30-60 min), (D) Regional outage: failover to DR cluster with pre-provisioned capacity (< 15 min with warm standby), (E) Security breach: clean cluster from IaC + restore verified backups (2-4 hours)"
+    weight: 0.35
+    description: "Recovery procedures"
+  - type: llm_judge
+    criteria: "Testing and cost analysis are practical — DR testing: quarterly game days simulating each scenario, automated backup verification (restore to test cluster nightly), chaos engineering with Litmus/Chaos Mesh for ongoing resilience. Cost breakdown: Velero + S3 storage ($2-3K/mo), cross-region replication ($3-5K/mo), warm standby cluster ($5-7K/mo). Justification: 1 hour of downtime costs more than a year of DR infrastructure"
+    weight: 0.30
+    description: "Testing and cost"
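
The hourly-RPO backup in the first assertion maps directly to a Velero Schedule. A sketch assuming Velero is installed with a CSI snapshot plugin; the namespace names are placeholders:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-critical
  namespace: velero
spec:
  schedule: "0 * * * *"          # hourly, matching the 1-hour RPO target
  template:
    includedNamespaces:
      - payments                 # placeholder critical namespaces
      - orders
    snapshotVolumes: true        # snapshot PVs alongside Kubernetes resources
    ttl: 720h0m0s                # retain 30 days of backups
```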
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/executive-communication.yaml
ADDED
@@ -0,0 +1,62 @@
+meta:
+  id: executive-communication
+  level: 4
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Communicate Kubernetes platform value to executives — translate technical metrics into business outcomes and justify infrastructure investment"
+  tags: [Kubernetes, executive-communication, ROI, business-case, leadership, expert]
+
+state: {}
+
+trigger: |
+  The CFO questions the Kubernetes platform cost during quarterly budget
+  review: "We're spending $2.4M annually on Kubernetes infrastructure
+  and a 6-person platform team. What's the ROI? Can we just use
+  serverless instead?"
+
+  You need to prepare a business case. Current data:
+
+  Platform metrics:
+  - 50 microservices across 5 clusters
+  - 200 deployments per week (up from 4/week pre-Kubernetes)
+  - 99.95% availability (up from 99.2%)
+  - Mean time to recovery: 8 minutes (down from 4 hours)
+  - 12 development teams using the platform
+
+  Cost breakdown:
+  - Cloud infrastructure: $1.8M/year
+  - Platform team (6 engineers): $900K/year
+  - Tooling licenses: $120K/year
+  - Total: $2.82M/year
+
+  Business impact (estimated):
+  - Each hour of downtime costs $50K in revenue
+  - Pre-Kubernetes: 52 hours of downtime/year = $2.6M revenue impact
+  - Post-Kubernetes: 4.4 hours of downtime/year = $220K revenue impact
+  - Faster deployments enabled features that generated $5M in new revenue
+  - Developer productivity improved 35% (measured by DORA metrics)
+
+  The CFO also asks:
+  - "Why can't we use AWS Lambda for everything?"
+  - "What would happen if we cut the platform team to 3 people?"
+  - "How does this compare to industry benchmarks?"
+
+  Task: Prepare an executive-level business case for the Kubernetes
+  platform. Write: the ROI calculation, comparison with serverless
+  alternatives (where Kubernetes wins vs where serverless wins), risk
+  analysis of reducing the platform team, industry benchmarks (DORA
+  metrics comparison), and a 3-year cost projection with scaling plans.
+
+assertions:
+  - type: llm_judge
+    criteria: "ROI calculation is clear — revenue saved from reduced downtime: $2.38M/year. Revenue generated from faster feature delivery: $5M attributed. Developer productivity gain: 35% across 50 developers ≈ 17.5 FTE equivalent ($2.6M value). Total value: ~$10M. Cost: $2.82M. ROI: ~255%. Present in business terms, not technical jargon"
+    weight: 0.35
+    description: "ROI calculation"
+  - type: llm_judge
+    criteria: "Serverless comparison is balanced — Lambda wins for: event-driven workloads, unpredictable traffic, simple functions. Kubernetes wins for: long-running services, complex networking, stateful workloads, predictable high-traffic services (cheaper at scale), multi-cloud portability. Hybrid approach: use Lambda for event processing, Kubernetes for core services. Migration cost and vendor lock-in considered"
+    weight: 0.35
+    description: "Serverless comparison"
+  - type: llm_judge
+    criteria: "Risk and benchmarks are practical — cutting platform team to 3: increased MTTR (8 min → 30+ min), slower developer onboarding, security audit gaps, higher incident rate. Industry benchmarks: DORA Elite performers deploy multiple times/day, < 1 hour lead time, < 5% change failure rate, < 1 hour MTTR. 3-year projection: costs grow 15%/year with infrastructure, but revenue scales 40%/year. Cost per deployment drops as volume increases"
+    weight: 0.30
+    description: "Risk and benchmarks"
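
The ~255% figure in the first assertion follows from the trigger's numbers; the only input not given directly is the productivity valuation, which assumes roughly $150K per developer FTE:

```text
Downtime avoided:  (52 h − 4.4 h) × $50K/h ≈ $2.38M/yr
Feature revenue:                              $5.00M/yr
Productivity:      35% × 50 devs ≈ 17.5 FTE ≈ $2.60M/yr
Total value:                                 ~$9.98M/yr
ROI:               ($9.98M − $2.82M) / $2.82M ≈ 254%
```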

package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/expert-troubleshooting-shift.yaml
ADDED
@@ -0,0 +1,65 @@
+meta:
+  id: expert-troubleshooting-shift
+  level: 4
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Combined expert troubleshooting shift — manage a multi-cluster incident requiring cross-team coordination, executive communication, and architectural decisions"
+  tags: [Kubernetes, troubleshooting, combined, shift-simulation, multi-cluster, expert]
+
+state: {}
+
+trigger: |
+  You're the platform engineering lead. A regional cloud provider outage
+  affects your primary US cluster (prod-us-1). You must coordinate
+  response across multiple teams and communicate to leadership.
+
+  9:00 AM — Cloud provider status page: "Investigating increased error
+  rates in us-east-1"
+
+  Cluster impact:
+  - prod-us-1 (us-east-1): 40% of API calls failing, 3 nodes NotReady
+  - prod-us-2 (us-west-2): Healthy, running at 60% capacity
+  - prod-eu-1 (eu-west-1): Healthy
+
+  Your decision matrix:
+  1. Do you failover US traffic to prod-us-2?
+     - Pro: Restores service immediately
+     - Con: prod-us-2 may not handle 100% US traffic
+     - Risk: If us-west-2 also fails, complete US outage
+
+  2. Do you split traffic between prod-us-2 and prod-eu-1?
+     - Pro: Distributes load
+     - Con: EU latency for US users, GDPR implications for EU processing
+
+  3. Do you wait for cloud provider to resolve?
+     - Pro: Minimal risk of making things worse
+     - Con: Extended downtime, SLO violation
+
+  Complicating factors:
+  - The database is in us-east-1 with read replicas in us-west-2
+  - Database writes will fail if primary is affected
+  - 3 teams have critical releases scheduled for today
+  - Board meeting at 2 PM expects a platform stability update
+  - Customer support reporting 500+ tickets in the last hour
+  - Media coverage of the cloud provider outage
+
+  Task: Walk through managing this multi-cluster incident. Write: the
+  decision framework for failover (when to failover vs wait), the
+  communication plan (technical teams, leadership, customers), traffic
+  management across clusters, database write handling during partial
+  outage, post-incident analysis, and how this shapes future DR
+  architecture investments.
+
+assertions:
+  - type: llm_judge
+    criteria: "Decision framework is structured — assess severity and duration estimate from cloud provider status page. If degradation < 30 min expected: wait with monitoring. If 30+ min or getting worse: failover to prod-us-2 for reads, queue writes or fail gracefully. If 2+ hours: full failover including database promotion in us-west-2. Decision criteria: SLO budget remaining, customer impact, risk of action vs inaction"
+    weight: 0.35
+    description: "Decision framework"
+  - type: llm_judge
+    criteria: "Communication plan is multi-layered — incident commander coordinates all communication. Technical teams: dedicated Slack channel, 15-min sync calls, clear ownership assignments. Leadership: executive summary every 30 min (impact, actions, ETA). Customers: status page update within 15 min, customer support talking points. Board meeting: prepare brief on incident response demonstrating platform resilience and DR investment need"
+    weight: 0.35
+    description: "Communication plan"
+  - type: llm_judge
+    criteria: "Technical response and post-incident are practical — traffic management: Route53 weighted routing to shift traffic, scale up prod-us-2 HPA limits and node count. Database: promote read replica if primary down > 30 min, accept brief data inconsistency. Freeze all non-critical deployments. Post-incident: review single-region database as SPOF, justify multi-region active-active investment, update DR runbooks with this scenario, reduce RTO target"
+    weight: 0.30
+    description: "Technical response"
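
The pre-scale step in the technical-response assertion reduces to a pair of kubectl commands run against the healthy cluster. A sketch; the context and resource names are placeholders for this scenario:

```bash
# Raise the autoscaling ceiling on the healthy cluster before shifting traffic
kubectl --context prod-us-2 patch hpa web-hpa --type merge \
  -p '{"spec":{"minReplicas":10,"maxReplicas":40}}'

# Pre-scale the deployment so capacity exists before Route53 weights change
kubectl --context prod-us-2 scale deployment web --replicas=10
```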