dojo.md 0.2.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (222)
  1. package/courses/GENERATION_LOG.md +45 -0
  2. package/courses/aws-lambda-debugging/course.yaml +11 -0
  3. package/courses/aws-lambda-debugging/scenarios/level-1/api-gateway-integration.yaml +71 -0
  4. package/courses/aws-lambda-debugging/scenarios/level-1/cloudwatch-logs-basics.yaml +64 -0
  5. package/courses/aws-lambda-debugging/scenarios/level-1/cold-start-basics.yaml +70 -0
  6. package/courses/aws-lambda-debugging/scenarios/level-1/environment-variable-issues.yaml +72 -0
  7. package/courses/aws-lambda-debugging/scenarios/level-1/first-debugging-shift.yaml +73 -0
  8. package/courses/aws-lambda-debugging/scenarios/level-1/handler-import-errors.yaml +71 -0
  9. package/courses/aws-lambda-debugging/scenarios/level-1/iam-permission-errors.yaml +68 -0
  10. package/courses/aws-lambda-debugging/scenarios/level-1/invocation-errors.yaml +72 -0
  11. package/courses/aws-lambda-debugging/scenarios/level-1/lambda-timeout-errors.yaml +65 -0
  12. package/courses/aws-lambda-debugging/scenarios/level-1/memory-and-oom.yaml +70 -0
  13. package/courses/aws-lambda-debugging/scenarios/level-2/async-invocation-failures.yaml +72 -0
  14. package/courses/aws-lambda-debugging/scenarios/level-2/cold-start-optimization.yaml +76 -0
  15. package/courses/aws-lambda-debugging/scenarios/level-2/dynamodb-streams-debugging.yaml +70 -0
  16. package/courses/aws-lambda-debugging/scenarios/level-2/intermediate-debugging-shift.yaml +71 -0
  17. package/courses/aws-lambda-debugging/scenarios/level-2/lambda-concurrency-management.yaml +70 -0
  18. package/courses/aws-lambda-debugging/scenarios/level-2/lambda-layers-debugging.yaml +76 -0
  19. package/courses/aws-lambda-debugging/scenarios/level-2/sam-local-debugging.yaml +74 -0
  20. package/courses/aws-lambda-debugging/scenarios/level-2/sqs-event-source.yaml +72 -0
  21. package/courses/aws-lambda-debugging/scenarios/level-2/vpc-networking-issues.yaml +71 -0
  22. package/courses/aws-lambda-debugging/scenarios/level-2/xray-tracing.yaml +62 -0
  23. package/courses/aws-lambda-debugging/scenarios/level-3/advanced-debugging-shift.yaml +72 -0
  24. package/courses/aws-lambda-debugging/scenarios/level-3/container-image-lambda.yaml +79 -0
  25. package/courses/aws-lambda-debugging/scenarios/level-3/cross-account-invocation.yaml +72 -0
  26. package/courses/aws-lambda-debugging/scenarios/level-3/eventbridge-patterns.yaml +79 -0
  27. package/courses/aws-lambda-debugging/scenarios/level-3/iac-deployment-debugging.yaml +68 -0
  28. package/courses/aws-lambda-debugging/scenarios/level-3/kinesis-stream-processing.yaml +64 -0
  29. package/courses/aws-lambda-debugging/scenarios/level-3/lambda-at-edge.yaml +64 -0
  30. package/courses/aws-lambda-debugging/scenarios/level-3/lambda-extensions-debugging.yaml +67 -0
  31. package/courses/aws-lambda-debugging/scenarios/level-3/powertools-observability.yaml +79 -0
  32. package/courses/aws-lambda-debugging/scenarios/level-3/step-functions-debugging.yaml +80 -0
  33. package/courses/aws-lambda-debugging/scenarios/level-4/cost-optimization-strategy.yaml +67 -0
  34. package/courses/aws-lambda-debugging/scenarios/level-4/expert-debugging-shift.yaml +62 -0
  35. package/courses/aws-lambda-debugging/scenarios/level-4/incident-management-serverless.yaml +61 -0
  36. package/courses/aws-lambda-debugging/scenarios/level-4/multi-region-serverless.yaml +67 -0
  37. package/courses/aws-lambda-debugging/scenarios/level-4/observability-platform-design.yaml +71 -0
  38. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-architecture-design.yaml +64 -0
  39. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-data-architecture.yaml +66 -0
  40. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-migration-strategy.yaml +65 -0
  41. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-security-design.yaml +60 -0
  42. package/courses/aws-lambda-debugging/scenarios/level-4/serverless-testing-strategy.yaml +62 -0
  43. package/courses/aws-lambda-debugging/scenarios/level-5/board-serverless-strategy.yaml +63 -0
  44. package/courses/aws-lambda-debugging/scenarios/level-5/consulting-serverless-adoption.yaml +57 -0
  45. package/courses/aws-lambda-debugging/scenarios/level-5/industry-serverless-patterns.yaml +62 -0
  46. package/courses/aws-lambda-debugging/scenarios/level-5/ma-serverless-integration.yaml +75 -0
  47. package/courses/aws-lambda-debugging/scenarios/level-5/master-debugging-shift.yaml +61 -0
  48. package/courses/aws-lambda-debugging/scenarios/level-5/organizational-serverless-transformation.yaml +65 -0
  49. package/courses/aws-lambda-debugging/scenarios/level-5/regulatory-serverless.yaml +61 -0
  50. package/courses/aws-lambda-debugging/scenarios/level-5/serverless-economics.yaml +65 -0
  51. package/courses/aws-lambda-debugging/scenarios/level-5/serverless-future-technology.yaml +66 -0
  52. package/courses/aws-lambda-debugging/scenarios/level-5/serverless-platform-design.yaml +71 -0
  53. package/courses/docker-container-debugging/course.yaml +11 -0
  54. package/courses/docker-container-debugging/scenarios/level-1/container-exit-codes.yaml +59 -0
  55. package/courses/docker-container-debugging/scenarios/level-1/container-networking-basics.yaml +69 -0
  56. package/courses/docker-container-debugging/scenarios/level-1/docker-logs-debugging.yaml +67 -0
  57. package/courses/docker-container-debugging/scenarios/level-1/dockerfile-build-failures.yaml +71 -0
  58. package/courses/docker-container-debugging/scenarios/level-1/environment-variable-issues.yaml +74 -0
  59. package/courses/docker-container-debugging/scenarios/level-1/first-debugging-shift.yaml +70 -0
  60. package/courses/docker-container-debugging/scenarios/level-1/image-pull-failures.yaml +68 -0
  61. package/courses/docker-container-debugging/scenarios/level-1/port-mapping-issues.yaml +67 -0
  62. package/courses/docker-container-debugging/scenarios/level-1/resource-limits-oom.yaml +70 -0
  63. package/courses/docker-container-debugging/scenarios/level-1/volume-mount-problems.yaml +66 -0
  64. package/courses/docker-container-debugging/scenarios/level-2/container-health-checks.yaml +73 -0
  65. package/courses/docker-container-debugging/scenarios/level-2/docker-compose-debugging.yaml +66 -0
  66. package/courses/docker-container-debugging/scenarios/level-2/docker-exec-debugging.yaml +71 -0
  67. package/courses/docker-container-debugging/scenarios/level-2/image-layer-optimization.yaml +81 -0
  68. package/courses/docker-container-debugging/scenarios/level-2/intermediate-debugging-shift.yaml +73 -0
  69. package/courses/docker-container-debugging/scenarios/level-2/logging-and-log-rotation.yaml +76 -0
  70. package/courses/docker-container-debugging/scenarios/level-2/multi-stage-build-debugging.yaml +76 -0
  71. package/courses/docker-container-debugging/scenarios/level-2/network-debugging-tools.yaml +67 -0
  72. package/courses/docker-container-debugging/scenarios/level-2/pid1-signal-handling.yaml +71 -0
  73. package/courses/docker-container-debugging/scenarios/level-2/security-scanning-basics.yaml +67 -0
  74. package/courses/docker-container-debugging/scenarios/level-3/advanced-debugging-shift.yaml +77 -0
  75. package/courses/docker-container-debugging/scenarios/level-3/buildkit-optimization.yaml +67 -0
  76. package/courses/docker-container-debugging/scenarios/level-3/container-filesystem-debugging.yaml +70 -0
  77. package/courses/docker-container-debugging/scenarios/level-3/container-security-hardening.yaml +74 -0
  78. package/courses/docker-container-debugging/scenarios/level-3/disk-space-management.yaml +74 -0
  79. package/courses/docker-container-debugging/scenarios/level-3/docker-api-automation.yaml +72 -0
  80. package/courses/docker-container-debugging/scenarios/level-3/docker-daemon-issues.yaml +73 -0
  81. package/courses/docker-container-debugging/scenarios/level-3/docker-in-docker-ci.yaml +69 -0
  82. package/courses/docker-container-debugging/scenarios/level-3/overlay-network-debugging.yaml +70 -0
  83. package/courses/docker-container-debugging/scenarios/level-3/production-container-ops.yaml +71 -0
  84. package/courses/docker-container-debugging/scenarios/level-4/cicd-pipeline-design.yaml +66 -0
  85. package/courses/docker-container-debugging/scenarios/level-4/container-monitoring-observability.yaml +63 -0
  86. package/courses/docker-container-debugging/scenarios/level-4/container-orchestration-strategy.yaml +62 -0
  87. package/courses/docker-container-debugging/scenarios/level-4/container-performance-engineering.yaml +64 -0
  88. package/courses/docker-container-debugging/scenarios/level-4/container-security-architecture.yaml +66 -0
  89. package/courses/docker-container-debugging/scenarios/level-4/enterprise-image-management.yaml +58 -0
  90. package/courses/docker-container-debugging/scenarios/level-4/expert-debugging-shift.yaml +63 -0
  91. package/courses/docker-container-debugging/scenarios/level-4/incident-response-containers.yaml +70 -0
  92. package/courses/docker-container-debugging/scenarios/level-4/multi-environment-management.yaml +65 -0
  93. package/courses/docker-container-debugging/scenarios/level-4/stateful-service-containers.yaml +65 -0
  94. package/courses/docker-container-debugging/scenarios/level-5/board-infrastructure-strategy.yaml +58 -0
  95. package/courses/docker-container-debugging/scenarios/level-5/consulting-container-strategy.yaml +61 -0
  96. package/courses/docker-container-debugging/scenarios/level-5/container-platform-architecture.yaml +67 -0
  97. package/courses/docker-container-debugging/scenarios/level-5/container-platform-economics.yaml +67 -0
  98. package/courses/docker-container-debugging/scenarios/level-5/container-technology-evolution.yaml +67 -0
  99. package/courses/docker-container-debugging/scenarios/level-5/disaster-recovery-containers.yaml +66 -0
  100. package/courses/docker-container-debugging/scenarios/level-5/industry-container-patterns.yaml +71 -0
  101. package/courses/docker-container-debugging/scenarios/level-5/master-debugging-shift.yaml +62 -0
  102. package/courses/docker-container-debugging/scenarios/level-5/organizational-transformation.yaml +67 -0
  103. package/courses/docker-container-debugging/scenarios/level-5/regulatory-compliance-containers.yaml +61 -0
  104. package/courses/kubernetes-deployment-troubleshooting/course.yaml +12 -0
  105. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/configmap-secret-issues.yaml +69 -0
  106. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/crashloopbackoff.yaml +68 -0
  107. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/deployment-rollout.yaml +56 -0
  108. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/first-troubleshooting-shift.yaml +65 -0
  109. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/health-probe-failures.yaml +70 -0
  110. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/imagepullbackoff.yaml +57 -0
  111. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/kubectl-debugging-basics.yaml +56 -0
  112. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/oomkilled.yaml +70 -0
  113. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/pending-pods.yaml +68 -0
  114. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/service-not-reachable.yaml +66 -0
  115. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/dns-resolution-failures.yaml +63 -0
  116. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/helm-deployment-failures.yaml +63 -0
  117. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/hpa-scaling-issues.yaml +62 -0
  118. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/ingress-routing-issues.yaml +63 -0
  119. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/init-container-failures.yaml +63 -0
  120. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/intermediate-troubleshooting-shift.yaml +66 -0
  121. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/network-policy-blocking.yaml +67 -0
  122. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/persistent-volume-issues.yaml +69 -0
  123. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/rbac-permission-denied.yaml +57 -0
  124. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/resource-quota-limits.yaml +64 -0
  125. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/advanced-troubleshooting-shift.yaml +69 -0
  126. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/cluster-upgrade-failures.yaml +71 -0
  127. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/gitops-drift-detection.yaml +62 -0
  128. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/job-cronjob-failures.yaml +67 -0
  129. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/monitoring-alerting-gaps.yaml +64 -0
  130. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/multi-container-debugging.yaml +68 -0
  131. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/node-pressure-evictions.yaml +70 -0
  132. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/pod-disruption-budgets.yaml +59 -0
  133. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/service-mesh-debugging.yaml +64 -0
  134. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/statefulset-troubleshooting.yaml +69 -0
  135. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/capacity-planning.yaml +65 -0
  136. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/cost-optimization.yaml +57 -0
  137. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/disaster-recovery-design.yaml +56 -0
  138. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/executive-communication.yaml +62 -0
  139. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/expert-troubleshooting-shift.yaml +65 -0
  140. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/incident-management-process.yaml +59 -0
  141. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-cluster-operations.yaml +62 -0
  142. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-tenancy-design.yaml +55 -0
  143. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/platform-engineering.yaml +59 -0
  144. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/security-hardening.yaml +58 -0
  145. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/behavioral-science.yaml +62 -0
  146. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/board-strategy.yaml +61 -0
  147. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/cloud-native-future.yaml +65 -0
  148. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/comprehensive-platform.yaml +57 -0
  149. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/consulting-engagement.yaml +62 -0
  150. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/industry-benchmarks.yaml +58 -0
  151. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/ma-integration.yaml +62 -0
  152. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/master-troubleshooting-shift.yaml +73 -0
  153. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/product-development.yaml +65 -0
  154. package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/regulatory-compliance.yaml +76 -0
  155. package/courses/mysql-query-optimization/course.yaml +11 -0
  156. package/courses/mysql-query-optimization/scenarios/level-1/buffer-pool-basics.yaml +65 -0
  157. package/courses/mysql-query-optimization/scenarios/level-1/explain-basics.yaml +66 -0
  158. package/courses/mysql-query-optimization/scenarios/level-1/first-optimization-shift.yaml +78 -0
  159. package/courses/mysql-query-optimization/scenarios/level-1/innodb-index-fundamentals.yaml +68 -0
  160. package/courses/mysql-query-optimization/scenarios/level-1/join-basics.yaml +66 -0
  161. package/courses/mysql-query-optimization/scenarios/level-1/n-plus-one-queries.yaml +67 -0
  162. package/courses/mysql-query-optimization/scenarios/level-1/query-rewriting-basics.yaml +66 -0
  163. package/courses/mysql-query-optimization/scenarios/level-1/select-star-problems.yaml +68 -0
  164. package/courses/mysql-query-optimization/scenarios/level-1/slow-query-diagnosis.yaml +65 -0
  165. package/courses/mysql-query-optimization/scenarios/level-1/where-clause-optimization.yaml +65 -0
  166. package/courses/mysql-query-optimization/scenarios/level-2/buffer-pool-tuning.yaml +64 -0
  167. package/courses/mysql-query-optimization/scenarios/level-2/composite-index-design.yaml +71 -0
  168. package/courses/mysql-query-optimization/scenarios/level-2/covering-and-invisible-indexes.yaml +69 -0
  169. package/courses/mysql-query-optimization/scenarios/level-2/cte-and-window-functions.yaml +78 -0
  170. package/courses/mysql-query-optimization/scenarios/level-2/intermediate-optimization-shift.yaml +68 -0
  171. package/courses/mysql-query-optimization/scenarios/level-2/join-optimization.yaml +67 -0
  172. package/courses/mysql-query-optimization/scenarios/level-2/performance-schema-analysis.yaml +69 -0
  173. package/courses/mysql-query-optimization/scenarios/level-2/query-optimizer-hints.yaml +74 -0
  174. package/courses/mysql-query-optimization/scenarios/level-2/subquery-optimization.yaml +70 -0
  175. package/courses/mysql-query-optimization/scenarios/level-2/write-optimization.yaml +63 -0
  176. package/courses/mysql-query-optimization/scenarios/level-3/advanced-optimization-shift.yaml +71 -0
  177. package/courses/mysql-query-optimization/scenarios/level-3/connection-management.yaml +67 -0
  178. package/courses/mysql-query-optimization/scenarios/level-3/full-text-search.yaml +77 -0
  179. package/courses/mysql-query-optimization/scenarios/level-3/json-optimization.yaml +87 -0
  180. package/courses/mysql-query-optimization/scenarios/level-3/lock-contention-analysis.yaml +68 -0
  181. package/courses/mysql-query-optimization/scenarios/level-3/monitoring-alerting.yaml +63 -0
  182. package/courses/mysql-query-optimization/scenarios/level-3/online-schema-changes.yaml +79 -0
  183. package/courses/mysql-query-optimization/scenarios/level-3/partitioning-strategies.yaml +83 -0
  184. package/courses/mysql-query-optimization/scenarios/level-3/query-profiling-deep-dive.yaml +84 -0
  185. package/courses/mysql-query-optimization/scenarios/level-3/replication-optimization.yaml +66 -0
  186. package/courses/mysql-query-optimization/scenarios/level-4/aurora-vs-rds-evaluation.yaml +61 -0
  187. package/courses/mysql-query-optimization/scenarios/level-4/data-architecture.yaml +62 -0
  188. package/courses/mysql-query-optimization/scenarios/level-4/database-migration-planning.yaml +59 -0
  189. package/courses/mysql-query-optimization/scenarios/level-4/enterprise-governance.yaml +50 -0
  190. package/courses/mysql-query-optimization/scenarios/level-4/executive-communication.yaml +54 -0
  191. package/courses/mysql-query-optimization/scenarios/level-4/expert-optimization-shift.yaml +67 -0
  192. package/courses/mysql-query-optimization/scenarios/level-4/high-availability-architecture.yaml +60 -0
  193. package/courses/mysql-query-optimization/scenarios/level-4/optimizer-internals.yaml +62 -0
  194. package/courses/mysql-query-optimization/scenarios/level-4/performance-sla-design.yaml +52 -0
  195. package/courses/mysql-query-optimization/scenarios/level-4/read-replica-scaling.yaml +51 -0
  196. package/courses/mysql-query-optimization/scenarios/level-5/ai-database-future.yaml +45 -0
  197. package/courses/mysql-query-optimization/scenarios/level-5/behavioral-science.yaml +44 -0
  198. package/courses/mysql-query-optimization/scenarios/level-5/benchmark-design.yaml +47 -0
  199. package/courses/mysql-query-optimization/scenarios/level-5/board-strategy.yaml +48 -0
  200. package/courses/mysql-query-optimization/scenarios/level-5/comprehensive-platform.yaml +49 -0
  201. package/courses/mysql-query-optimization/scenarios/level-5/consulting-engagement.yaml +52 -0
  202. package/courses/mysql-query-optimization/scenarios/level-5/ma-database-integration.yaml +47 -0
  203. package/courses/mysql-query-optimization/scenarios/level-5/master-optimization-shift.yaml +56 -0
  204. package/courses/mysql-query-optimization/scenarios/level-5/product-development.yaml +48 -0
  205. package/courses/mysql-query-optimization/scenarios/level-5/regulatory-compliance.yaml +48 -0
  206. package/courses/postgresql-query-optimization/scenarios/level-5/comprehensive-database-system.yaml +70 -0
  207. package/courses/postgresql-query-optimization/scenarios/level-5/database-ai-future.yaml +81 -0
  208. package/courses/postgresql-query-optimization/scenarios/level-5/database-behavioral-science.yaml +63 -0
  209. package/courses/postgresql-query-optimization/scenarios/level-5/database-board-strategy.yaml +77 -0
  210. package/courses/postgresql-query-optimization/scenarios/level-5/database-consulting-engagement.yaml +61 -0
  211. package/courses/postgresql-query-optimization/scenarios/level-5/database-industry-benchmarks.yaml +64 -0
  212. package/courses/postgresql-query-optimization/scenarios/level-5/database-ma-integration.yaml +71 -0
  213. package/courses/postgresql-query-optimization/scenarios/level-5/database-product-development.yaml +72 -0
  214. package/courses/postgresql-query-optimization/scenarios/level-5/database-regulatory-landscape.yaml +76 -0
  215. package/courses/postgresql-query-optimization/scenarios/level-5/master-optimization-shift.yaml +66 -0
  216. package/courses/terraform-infrastructure-setup/course.yaml +11 -0
  217. package/courses/terraform-infrastructure-setup/scenarios/level-1/terraform-init-errors.yaml +72 -0
  218. package/dist/mcp/session-manager.d.ts +7 -4
  219. package/dist/mcp/session-manager.d.ts.map +1 -1
  220. package/dist/mcp/session-manager.js +23 -8
  221. package/dist/mcp/session-manager.js.map +1 -1
  222. package/package.json +1 -1
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/network-policy-blocking.yaml
@@ -0,0 +1,67 @@
+ meta:
+   id: network-policy-blocking
+   level: 2
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug NetworkPolicy issues — diagnose why pods can't communicate when network policies restrict traffic"
+   tags: [Kubernetes, NetworkPolicy, networking, security, traffic, intermediate]
+
+ state: {}
+
+ trigger: |
+   After a security team applied NetworkPolicies, your microservices
+   stopped communicating:
+
+   $ kubectl exec -it frontend-pod -- curl -s http://api-service:8080/health
+   curl: (7) Failed to connect to api-service port 8080: Connection timed out
+
+   Before the NetworkPolicies, this worked fine. The services are in the
+   same namespace.
+
+   $ kubectl get networkpolicy -n app
+   NAME             POD-SELECTOR   AGE
+   deny-all         <none>         5m
+   allow-frontend   app=frontend   5m
+
+   $ kubectl describe networkpolicy deny-all
+   Spec:
+     PodSelector: <none> (applies to all pods)
+     Allowing ingress traffic: <none> (deny all ingress)
+     Allowing egress traffic: <none> (deny all egress)
+
+   $ kubectl describe networkpolicy allow-frontend
+   Spec:
+     PodSelector: app=frontend
+     Allowing ingress traffic:
+       From: <any> (allow all ingress to frontend)
+     Allowing egress traffic: <none> (not specified)
+
+   The deny-all policy blocks all ingress AND egress for every pod.
+   The allow-frontend policy allows ingress TO frontend but doesn't
+   allow egress FROM frontend. And there's no policy allowing ingress
+   to the api-service.
+
+   Problems:
+   1. Frontend can't send requests (egress blocked by deny-all)
+   2. API service can't receive requests (ingress blocked by deny-all)
+   3. DNS resolution also blocked (egress to kube-dns on port 53)
+
+   Task: Explain NetworkPolicies and how to debug them. Write: how
+   NetworkPolicies work (default allow-all, additive when applied),
+   ingress vs egress rules, how to read policy selectors (podSelector,
+   namespaceSelector, ipBlock), why DNS breaks when egress is blocked,
+   how to write a working zero-trust policy set, and debugging techniques.
+
+ assertions:
+   - type: llm_judge
+     criteria: "NetworkPolicy behavior is explained — by default all traffic allowed, applying any policy to a pod makes it deny-all for that direction (ingress/egress), then rules are additive (allow specific traffic). A deny-all policy with empty podSelector applies to ALL pods in namespace. Policies are namespace-scoped. Must explicitly allow both directions — if frontend needs to talk to API, frontend needs egress rule AND API needs ingress rule"
+     weight: 0.35
+     description: "NetworkPolicy behavior"
+   - type: llm_judge
+     criteria: "DNS and common pitfalls are addressed — when egress is denied, DNS resolution breaks (CoreDNS runs in kube-system on port 53 UDP/TCP). Must add egress rule allowing port 53 to kube-system namespace. Common pitfall: forget to allow DNS, forget both directions needed, label selectors don't match pods, missing namespaceSelector for cross-namespace policies"
+     weight: 0.35
+     description: "DNS and pitfalls"
+   - type: llm_judge
+     criteria: "Debugging workflow is practical — check what policies apply to a pod (kubectl get netpol, match podSelector to pod labels), verify traffic is blocked vs application error (test with kubectl exec curl/nc), check if CNI plugin supports NetworkPolicy (not all do — e.g., Flannel doesn't by default, Calico does), use packet capture or logging to trace blocked connections"
+     weight: 0.30
+     description: "Debugging workflow"
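The zero-trust policy set that this scenario asks learners to write could be sketched as follows. This is a hedged illustration, not part of the package: the namespace `app`, frontend label, and port 8080 come from the scenario text, while the `app: api` label on the API pods and the `kubernetes.io/metadata.name` namespace label (present by default on Kubernetes 1.22+) are assumptions.

```yaml
# Sketch: zero-trust set for the scenario's app namespace.
# Assumes api-service pods carry the label app=api (not shown in the scenario).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-egress
  namespace: app
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes: [Egress]
  egress:
    # frontend -> api on 8080
    - to:
        - podSelector:
            matchLabels:
              app: api
      ports:
        - { protocol: TCP, port: 8080 }
    # frontend -> CoreDNS in kube-system, or name resolution breaks
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-ingress
  namespace: app
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - { protocol: TCP, port: 8080 }
```

Note that both policies are needed: egress from the frontend and ingress to the API are evaluated independently, which is exactly the "both directions" pitfall the first assertion checks for.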
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/persistent-volume-issues.yaml
@@ -0,0 +1,69 @@
+ meta:
+   id: persistent-volume-issues
+   level: 2
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug PersistentVolume issues — diagnose PVC binding failures, mount errors, and storage class misconfigurations"
+   tags: [Kubernetes, PersistentVolume, PVC, storage, StorageClass, intermediate]
+
+ state: {}
+
+ trigger: |
+   Your StatefulSet for PostgreSQL won't start. The pods are stuck in
+   Pending because their PVCs can't bind:
+
+   $ kubectl get pods
+   NAME         READY   STATUS    RESTARTS   AGE
+   postgres-0   0/1     Pending   0          10m
+
+   $ kubectl get pvc
+   NAME              STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
+   data-postgres-0   Pending                                      fast-ssd       10m
+
+   $ kubectl describe pvc data-postgres-0
+   Events:
+     Warning  ProvisioningFailed  2m  persistentvolume-controller
+       storageclass.storage.k8s.io "fast-ssd" not found
+
+   The StorageClass "fast-ssd" doesn't exist in this cluster. But there's
+   more — even after creating the StorageClass, a second issue appears:
+
+   $ kubectl get pvc
+   NAME              STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
+   data-postgres-0   Pending                                      fast-ssd       15m
+
+   $ kubectl describe pvc data-postgres-0
+   Events:
+     Warning  ProvisioningFailed  1m  ebs.csi.aws.com_ebs-csi-controller
+       could not create volume in zone "us-east-1c": volume limit of 25
+       per node reached
+
+   The CSI driver can't provision because the node has hit its volume
+   attachment limit.
+
+   Key concepts:
+   - PV/PVC binding: PVCs request storage, PVs provide it
+   - StorageClass: defines how storage is provisioned dynamically
+   - Access modes: RWO, ROX, RWX — must match between PV and PVC
+   - volumeBindingMode: Immediate vs WaitForFirstConsumer
+   - CSI drivers: external storage providers
+
+   Task: Explain how Kubernetes persistent storage works and how to
+   debug issues. Write: the PV/PVC/StorageClass relationship, dynamic
+   vs static provisioning, access modes and their constraints, common
+   binding failures (missing StorageClass, capacity, access mode
+   mismatch, zone affinity), CSI driver issues, and volumeBindingMode.
+
+ assertions:
+   - type: llm_judge
+     criteria: "PV/PVC/StorageClass relationship is explained — PVCs are requests for storage (size, access mode, StorageClass), PVs are the actual storage resources, StorageClass defines the provisioner and parameters for dynamic provisioning. When a PVC is created, the StorageClass provisioner automatically creates a PV. Static provisioning means pre-creating PVs for PVCs to bind to"
+     weight: 0.35
+     description: "Storage relationship explained"
+   - type: llm_judge
+     criteria: "Common binding failures are covered — missing StorageClass, insufficient capacity, access mode mismatch (requesting RWX when storage only supports RWO), zone affinity (PV in different zone than pod), CSI driver issues (volume attachment limits, driver not installed), volumeBindingMode Immediate creates PV immediately vs WaitForFirstConsumer waits until pod is scheduled (better for zone-aware provisioning)"
+     weight: 0.35
+     description: "Binding failures covered"
+   - type: llm_judge
+     criteria: "Debugging workflow is practical — check PVC status and events (kubectl describe pvc), check StorageClass exists (kubectl get sc), verify CSI driver pods running (kubectl get pods -n kube-system), check node volume limits, verify access modes match. For StatefulSets: PVCs persist even after pod deletion, must manually delete PVC to re-provision"
+     weight: 0.30
+     description: "Debugging workflow"
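A minimal sketch of the missing StorageClass for this scenario, using the `ebs.csi.aws.com` provisioner named in the error output. The `gp3` volume type and the 10Gi request are assumptions for illustration; only the `fast-ssd` name and the EBS CSI driver come from the scenario.

```yaml
# Sketch: StorageClass matching the scenario's "fast-ssd" name.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3                    # volume type is an assumption
# Delay provisioning until the pod is scheduled, so the volume is
# created in the zone where the pod actually lands (avoids zone affinity
# failures like the us-east-1c error above).
volumeBindingMode: WaitForFirstConsumer
---
# The PVC a StatefulSet volumeClaimTemplate would generate for postgres-0.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-0
spec:
  accessModes: [ReadWriteOnce]  # EBS volumes only support RWO
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 10Gi             # requested size is an assumption
```

The node-level volume attachment limit in the second error is a separate constraint: it is fixed by spreading pods across nodes or adding nodes, not by any StorageClass setting.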
@@ -0,0 +1,57 @@
+ meta:
+   id: rbac-permission-denied
+   level: 2
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug RBAC permission issues — diagnose Forbidden errors when pods or users can't access Kubernetes API resources"
+   tags: [Kubernetes, RBAC, ServiceAccount, permissions, security, intermediate]
+
+ state: {}
+
+ trigger: |
+   Your application pod needs to list other pods for service discovery,
+   but it's getting Forbidden errors:
+
+   $ kubectl logs discovery-agent-pod
+   Error: pods is forbidden: User "system:serviceaccount:app:default"
+   cannot list resource "pods" in API group "" in the namespace "app"
+
+   The application uses the Kubernetes API client to discover peer pods
+   for cluster coordination. It's using the default ServiceAccount, which
+   has no permissions beyond the basics.
+
+   $ kubectl auth can-i list pods --as=system:serviceaccount:app:default -n app
+   no
+
+   You need to create proper RBAC resources. But which combination?
+
+   Current state of RBAC resources:
+   - A Role "pod-reader" exists but only grants "get" (not "list" or "watch")
+   - A RoleBinding exists but binds to ServiceAccount "discovery-agent"
+     (which doesn't exist) instead of "default"
+   - The pod is using the "default" ServiceAccount
+
+   Multiple issues:
+   1. Role needs "list" and "watch" verbs, not just "get"
+   2. RoleBinding references the wrong ServiceAccount name
+   3. Should create a dedicated ServiceAccount instead of using default
+
+   Task: Explain Kubernetes RBAC and how to debug permission issues.
+   Write: the RBAC model (Role, ClusterRole, RoleBinding, ClusterRoleBinding),
+   how ServiceAccounts work (pods authenticate as their SA), how to
+   check permissions (kubectl auth can-i), common RBAC mistakes, and
+   the principle of least privilege.
+
+ assertions:
+   - type: llm_judge
+     criteria: "RBAC model is explained — Role defines permissions (verbs on resources) within a namespace, ClusterRole defines cluster-wide permissions, RoleBinding grants a Role to a subject (User, Group, ServiceAccount) within a namespace, ClusterRoleBinding grants cluster-wide. Pods authenticate to the API server using their ServiceAccount token mounted at /var/run/secrets/kubernetes.io/serviceaccount/"
+     weight: 0.35
+     description: "RBAC model explained"
+   - type: llm_judge
+     criteria: "Debugging permission issues is systematic — use kubectl auth can-i to test specific permissions, check which ServiceAccount the pod uses (kubectl get pod -o yaml | grep serviceAccountName), verify Role has correct verbs and resources, verify RoleBinding references correct Role and ServiceAccount, check namespace scope (Role vs ClusterRole). The 'system:serviceaccount:<namespace>:<name>' format identifies ServiceAccounts"
+     weight: 0.35
+     description: "Permission debugging"
+   - type: llm_judge
+     criteria: "Best practices and fixes are covered — create dedicated ServiceAccounts per application (don't use default), grant minimum required permissions (least privilege), use Role/RoleBinding for namespace-scoped access (prefer over ClusterRole), audit permissions regularly (kubectl auth can-i --list), never use cluster-admin for application ServiceAccounts"
+     weight: 0.30
+     description: "Best practices"
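One way the corrected resources for this scenario could look — a sketch reusing the names from the trigger text (`pod-reader`, `discovery-agent`, namespace `app`); the course's official answer may differ:

```yaml
# Fixes all three issues: dedicated SA, list/watch verbs, correct binding subject.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: discovery-agent          # dedicated SA instead of "default"
  namespace: app
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: app
rules:
  - apiGroups: [""]              # "" is the core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]   # "get" alone was not enough
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: app
subjects:
  - kind: ServiceAccount
    name: discovery-agent        # must reference an SA that actually exists
    namespace: app
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

After setting `serviceAccountName: discovery-agent` in the pod spec, the check should flip: `kubectl auth can-i list pods --as=system:serviceaccount:app:discovery-agent -n app` returns `yes`.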
@@ -0,0 +1,64 @@
+ meta:
+   id: resource-quota-limits
+   level: 2
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug ResourceQuota and LimitRange issues — diagnose why deployments fail due to namespace resource constraints"
+   tags: [Kubernetes, ResourceQuota, LimitRange, QoS, resource-management, intermediate]
+
+ state: {}
+
+ trigger: |
+   Your team's deployment suddenly fails to create new pods:
+
+   $ kubectl apply -f deployment.yaml
+   Error from server (Forbidden): error when creating "deployment.yaml":
+   pods "cache-svc-7a8b9c0d-e1f2" is forbidden: exceeded quota:
+   team-alpha-quota, requested: cpu=500m,memory=512Mi,
+   used: cpu=3700m,memory=7Gi, limited: cpu=4,memory=8Gi
+
+   $ kubectl get resourcequota -n team-alpha
+   NAME              AGE   REQUEST                         LIMIT
+   team-alpha-quota  30d   cpu: 3700m/4, memory: 7Gi/8Gi   cpu: 7/8, memory: 14Gi/16Gi
+
+   The namespace has a ResourceQuota and you're hitting the ceiling.
+   But wait — your deployment doesn't specify resource requests:
+
+   $ kubectl get deployment cache-svc -o yaml | grep -A5 resources
+   resources: {}
+
+   Yet the error says it's requesting 500m CPU and 512Mi memory. Why?
+
+   $ kubectl get limitrange -n team-alpha
+   NAME             CREATED AT
+   default-limits   2025-11-01T00:00:00Z
+
+   $ kubectl describe limitrange default-limits
+   Type       Resource  Min    Max   Default Request  Default Limit
+   Container  cpu       100m   2     500m             1
+   Container  memory    128Mi  4Gi   512Mi            1Gi
+
+   The LimitRange is automatically injecting default requests! The pod
+   gets 500m CPU and 512Mi memory even though the deployment doesn't
+   specify them.
+
+   Task: Explain ResourceQuotas and LimitRanges. Write: how ResourceQuotas
+   enforce namespace-level limits, how LimitRanges set per-container
+   defaults and min/max, how they interact (when a quota exists, all pods
+   must have requests — LimitRange provides defaults), QoS classes
+   (Guaranteed, Burstable, BestEffort) and eviction priority, and how
+   to right-size resource settings.
+
+ assertions:
+   - type: llm_judge
+     criteria: "ResourceQuota and LimitRange interaction is explained — ResourceQuota sets total namespace limits (aggregate CPU, memory, pod count). When ResourceQuota exists, every pod MUST specify resource requests — if not set, LimitRange provides defaults. LimitRange also enforces min/max per container. If no LimitRange default and no request specified, pod creation fails with quota enabled"
+     weight: 0.35
+     description: "Quota and LimitRange interaction"
+   - type: llm_judge
+     criteria: "QoS classes are explained — Guaranteed (requests=limits for all containers, highest priority, last evicted), Burstable (requests<limits or partial specification, medium priority), BestEffort (no requests or limits, first evicted under pressure). Understanding QoS helps prioritize which pods survive during resource contention"
+     weight: 0.35
+     description: "QoS classes explained"
+   - type: llm_judge
+     criteria: "Right-sizing and debugging are practical — check quota usage (kubectl describe quota), check LimitRange defaults (kubectl describe limitrange), use kubectl top pod for actual usage vs requests, review QoS class with kubectl get pod -o yaml. Right-sizing: set requests to P95 usage, limits to peak + buffer. Avoid BestEffort for production workloads"
+     weight: 0.30
+     description: "Right-sizing and debugging"
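As context for the right-sizing criteria above, an explicitly specified `resources` block avoids depending on LimitRange injection entirely. A sketch for the scenario's `cache-svc` deployment; the numbers are hypothetical and should really come from observed usage (`kubectl top pod`):

```yaml
# Hypothetical right-sized settings: requests ~ P95 usage, limits ~ peak + buffer.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cache-svc
  namespace: team-alpha
spec:
  replicas: 2
  selector:
    matchLabels: { app: cache-svc }
  template:
    metadata:
      labels: { app: cache-svc }
    spec:
      containers:
        - name: cache
          image: redis:7             # placeholder image
          resources:
            requests:
              cpu: 250m              # counted against the ResourceQuota REQUEST column
              memory: 256Mi
            limits:
              cpu: 500m              # requests < limits → Burstable QoS class
              memory: 512Mi
```

Setting `requests` equal to `limits` instead would make the pod Guaranteed, trading burst headroom for eviction protection.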
@@ -0,0 +1,69 @@
+ meta:
+   id: advanced-troubleshooting-shift
+   level: 3
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Combined advanced troubleshooting shift — diagnose a production incident involving node failures, storage issues, cascading service outages, and monitoring gaps"
+   tags: [Kubernetes, troubleshooting, combined, shift-simulation, production-incident, advanced]
+
+ state: {}
+
+ trigger: |
+   3:00 AM alert: "Multiple services degraded in production." You're the
+   on-call SRE. The incident cascaded from a single root cause:
+
+   Timeline:
+   - 2:30 AM: worker-5 hits MemoryPressure, starts evicting pods
+   - 2:35 AM: Evicted pods reschedule to worker-2 and worker-3
+   - 2:40 AM: worker-2 runs out of disk space from container logs
+   - 2:45 AM: StatefulSet database pods fail to reschedule (PVC zone affinity)
+   - 2:50 AM: Services depending on the database start failing
+   - 2:55 AM: Alertmanager didn't page because the Slack webhook expired
+   - 3:00 AM: Customer reports trigger manual investigation
+
+   Current state:
+   $ kubectl get nodes
+   NAME      STATUS                        ROLES    VERSION
+   worker-1  Ready                         worker   v1.29.0
+   worker-2  Ready,SchedulingDisabled      worker   v1.29.0
+   worker-3  Ready                         worker   v1.29.0
+   worker-4  Ready                         worker   v1.29.0
+   worker-5  NotReady,SchedulingDisabled   worker   v1.29.0
+
+   $ kubectl get pods -n critical --field-selector=status.phase!=Running
+   NAME                       STATUS    RESTARTS   AGE
+   postgres-1                 Pending   0          25m
+   search-indexer-7a8b-c9d0   Failed    0          25m
+   api-cache-1e2f-g3h4        Failed    0          25m
+
+   Issues to address:
+   1. worker-5 MemoryPressure — identify the memory-hungry pod, assess if
+      the node can recover or needs replacement
+   2. worker-2 DiskPressure — container log rotation not configured,
+      /var/lib/docker full
+   3. postgres-1 Pending — PVC bound to a volume in worker-5's AZ, can't
+      reschedule to another AZ
+   4. Cascading failures — services failing because the database is unavailable
+   5. Monitoring gap — Alertmanager webhook expired, no backup channel
+   6. No PDB on critical services — evictions were uncontrolled
+
+   Task: Walk through this cascading incident. Write: the root cause
+   analysis (chain of events), immediate remediation steps for each
+   issue, how to restore service for the database (PVC zone affinity
+   options), why the monitoring gap let this escalate for 30 minutes,
+   and the post-incident improvements (PDBs, log rotation, monitoring
+   redundancy, capacity planning).
+
+ assertions:
+   - type: llm_judge
+     criteria: "Cascading failure chain is explained — initial trigger was MemoryPressure on worker-5 causing evictions. Evicted pods redistributed, overloading worker-2 which then hit DiskPressure. StatefulSet pods couldn't reschedule due to PVC zone affinity. Dependent services failed. Monitoring gap delayed response by 30 minutes. Each failure amplified the next"
+     weight: 0.35
+     description: "Cascade chain explained"
+   - type: llm_judge
+     criteria: "Immediate remediation is practical — (1) free disk on worker-2 (truncate/rotate logs, clear unused images: crictl rmi --prune), (2) for postgres-1: either bring worker-5 back online or create new PV in available AZ and restore from backup, (3) restart failed pods once dependencies are back, (4) fix Alertmanager config immediately. Prioritize database recovery as it's the dependency for other services"
+     weight: 0.35
+     description: "Immediate remediation"
+   - type: llm_judge
+     criteria: "Post-incident improvements are comprehensive — add PodDisruptionBudgets for critical services, configure container log rotation (logrotate or container runtime config), add backup alerting channels (PagerDuty + Slack), implement capacity alerts before pressure (warn at 80%), use WaitForFirstConsumer for PVCs, run regular backup/restore tests, add pod anti-affinity to spread critical pods across nodes"
+     weight: 0.30
+     description: "Post-incident improvements"
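The first post-incident improvement in the criteria above, a PodDisruptionBudget, is a short manifest. A sketch for the scenario's database; the selector and threshold are assumptions, not part of the course files:

```yaml
# With minAvailable: 2, voluntary evictions (drains, rollouts) are refused
# whenever they would leave fewer than 2 postgres pods running.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb
  namespace: critical
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: postgres      # hypothetical label on the StatefulSet pods
```

Note that a PDB only governs voluntary disruptions; node-pressure evictions like the one that started this incident bypass it, which is why capacity alerts are listed alongside it.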
@@ -0,0 +1,71 @@
+ meta:
+   id: cluster-upgrade-failures
+   level: 3
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug Kubernetes cluster upgrade failures — diagnose API deprecations, node rotation issues, and workload disruptions during upgrades"
+   tags: [Kubernetes, cluster-upgrade, API-deprecation, node-rotation, advanced]
+
+ state: {}
+
+ trigger: |
+   Your team is upgrading a Kubernetes cluster from 1.28 to 1.30 and
+   encountering multiple failures:
+
+   Phase 1 — Control plane upgrade:
+   Pre-upgrade validation shows deprecated API usage:
+
+   $ kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
+   apiserver_requested_deprecated_apis{group="extensions",version="v1beta1",
+     resource="ingresses"} 47
+   apiserver_requested_deprecated_apis{group="policy",version="v1beta1",
+     resource="podsecuritypolicies"} 12
+
+   47 Ingress resources still use extensions/v1beta1 (removed in 1.22)
+   and 12 PodSecurityPolicy resources use policy/v1beta1 (removed in 1.25).
+
+   Phase 2 — Node rotation:
+   After the control plane upgrade, worker node drain fails:
+
+   $ kubectl drain worker-1 --ignore-daemonsets
+   error: unable to drain node "worker-1": cannot evict pod
+   "critical-db-0": disruption budget prevents eviction
+
+   The PDB blocks eviction. After increasing replicas to allow the drain:
+
+   $ kubectl get nodes
+   NAME      STATUS                     VERSION
+   master-1  Ready                      v1.30.0
+   worker-1  Ready,SchedulingDisabled   v1.28.5
+   worker-2  Ready                      v1.30.0
+   worker-3  Ready                      v1.28.5
+
+   A mixed-version cluster. The skew policy lets kubelets lag the API
+   server by up to three minor versions (two, before 1.28), so the 1.28
+   workers remain supported, but finish rotating them promptly.
+
+   Phase 3 — Workload issues post-upgrade:
+   After all nodes are upgraded, some DaemonSets fail because they use
+   privileged containers and the new Pod Security Standards enforce the
+   restricted profile on certain namespaces.
+
+   Task: Explain the Kubernetes cluster upgrade process and troubleshooting.
+   Write: the upgrade sequence (control plane first, then nodes), API
+   deprecation checking and migration, the version skew policy (kubelet may
+   lag the API server by up to three minor versions, never lead it),
+   PDB considerations during node drain, Pod Security Standards/Admission
+   replacing PodSecurityPolicy, and upgrade planning best practices.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Upgrade sequence is explained — always upgrade control plane first (API server, controller manager, scheduler, etcd), then worker nodes one by one. Version skew policy: kubelet may be up to three minor versions older than the API server (two, before 1.28) and never newer; control-plane components stay within one minor version. Cannot skip minor versions (must go 1.28→1.29→1.30). Check deprecated API usage before upgrade with metrics or kubectl commands"
+     weight: 0.35
+     description: "Upgrade sequence"
+   - type: llm_judge
+     criteria: "API deprecation and migration are covered — deprecated APIs continue working until removed. Track deprecations with apiserver_requested_deprecated_apis metric. Use kubectl convert or manual manifest updates to migrate (e.g., extensions/v1beta1 Ingress → networking.k8s.io/v1). PodSecurityPolicy removed in 1.25, replaced by Pod Security Standards/Admission. Plan migration before upgrading"
+     weight: 0.35
+     description: "API deprecation"
+   - type: llm_judge
+     criteria: "Operational considerations are practical — pre-upgrade checklist: check API deprecations, review PDBs for drain compatibility, backup etcd, test in staging first. During upgrade: drain nodes one at a time, monitor workload health, use surge capacity for zero-downtime. Post-upgrade: verify all nodes at new version, run integration tests, check for Pod Security Standard violations. Use managed Kubernetes (EKS/GKE/AKS) to simplify"
+     weight: 0.30
+     description: "Operational considerations"
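The Ingress migration mentioned in the criteria (`extensions/v1beta1` → `networking.k8s.io/v1`) mostly comes down to the backend schema and the new required `pathType`. A sketch with a hypothetical host and service name:

```yaml
# networking.k8s.io/v1 form. The old extensions/v1beta1 form used
# serviceName/servicePort directly under backend and had no pathType field.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  rules:
    - host: example.internal     # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix     # required in v1
            backend:
              service:
                name: web        # was: serviceName: web
                port:
                  number: 80     # was: servicePort: 80
```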
@@ -0,0 +1,62 @@
+ meta:
+   id: gitops-drift-detection
+   level: 3
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug GitOps drift — diagnose when cluster state diverges from Git and how ArgoCD/Flux detect and reconcile drift"
+   tags: [Kubernetes, GitOps, ArgoCD, Flux, drift-detection, reconciliation, advanced]
+
+ state: {}
+
+ trigger: |
+   Your ArgoCD dashboard shows several applications as "OutOfSync" and
+   one as "Degraded":
+
+   ArgoCD Applications:
+   | NAME          | SYNC STATUS | HEALTH STATUS | MESSAGE                |
+   |---------------|-------------|---------------|------------------------|
+   | api-gateway   | OutOfSync   | Healthy       | live != desired        |
+   | user-service  | Synced      | Degraded      | 0/3 pods available     |
+   | order-service | OutOfSync   | Healthy       | manual override        |
+   | monitoring    | Unknown     | Unknown       | ComparisonError        |
+
+   Investigation:
+
+   1. api-gateway (OutOfSync, Healthy) — someone used kubectl edit to
+      change the replica count from 2 to 5 directly in the cluster.
+      ArgoCD detects the drift but auto-sync is disabled, so it shows
+      OutOfSync without correcting it.
+
+   2. user-service (Synced, Degraded) — Git has the correct manifest but
+      the pods are failing. ArgoCD shows Synced because the desired state
+      matches Git, but the health check shows Degraded because the pods
+      aren't running. The issue is in the application code, not GitOps.
+
+   3. order-service (OutOfSync, Healthy) — a Helm values override was
+      applied directly, bypassing the Git repo. The rendered manifests
+      differ from what Git produces.
+
+   4. monitoring (Unknown, ComparisonError) — ArgoCD can't compare the
+      desired state because a CRD was deleted from the cluster, making
+      the custom resources unresolvable.
+
+   Task: Explain GitOps drift detection and reconciliation. Write: what
+   drift means (cluster state differs from the Git source of truth), how
+   ArgoCD detects drift (periodic comparison), sync vs health status,
+   auto-sync vs manual sync, why manual kubectl changes are problematic
+   in GitOps, how to handle legitimate emergency changes, and CRD
+   dependency management.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Drift detection is explained — ArgoCD periodically compares live cluster state with the desired state from Git. OutOfSync means live != Git. Sync status and health status are independent: an app can be Synced but Degraded (code bug), or OutOfSync but Healthy (manual scale change). Auto-sync automatically corrects drift by applying Git state. Manual sync requires human approval"
+     weight: 0.35
+     description: "Drift detection"
+   - type: llm_judge
+     criteria: "Manual changes problem is addressed — kubectl edit/apply bypasses Git, creating drift that ArgoCD flags. In GitOps, ALL changes should go through Git (PR → merge → sync). For emergencies: make the change directly BUT immediately commit the same change to Git so the source of truth is updated. ArgoCD self-heal option auto-reverts manual changes. Flux uses similar reconciliation loop"
+     weight: 0.35
+     description: "Manual changes problem"
+   - type: llm_judge
+     criteria: "CRD and operational issues are covered — CRD deletion causes ComparisonError because ArgoCD can't parse custom resources without the CRD. Fix: ensure CRDs are managed by ArgoCD with proper sync waves (CRDs before resources). Use app-of-apps pattern for dependency ordering. Monitor ArgoCD itself for health. Use argocd app diff to see what changed, argocd app sync --dry-run to preview"
+     weight: 0.30
+     description: "CRD and operations"
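The self-heal behavior the criteria describe is a per-Application setting. A sketch of an Argo CD Application with drift auto-correction enabled; the repo URL and paths are placeholders, not part of the course files:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-gateway
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/manifests.git  # placeholder
    targetRevision: main
    path: api-gateway
  destination:
    server: https://kubernetes.default.svc
    namespace: app
  syncPolicy:
    automated:
      prune: true      # delete cluster resources that were removed from Git
      selfHeal: true   # automatically revert manual kubectl changes
```

With `selfHeal: true`, the api-gateway scenario above plays out differently: the manual `kubectl edit` to 5 replicas would be reverted to Git's 2 on the next reconciliation instead of lingering as OutOfSync.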
@@ -0,0 +1,67 @@
+ meta:
+   id: job-cronjob-failures
+   level: 3
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug Job and CronJob failures — diagnose stuck jobs, missed schedules, concurrent execution issues, and backoff limits"
+   tags: [Kubernetes, Job, CronJob, batch-processing, scheduling, advanced]
+
+ state: {}
+
+ trigger: |
+   Your data pipeline CronJobs are failing silently — reports aren't
+   being generated but nobody noticed for 3 days:
+
+   $ kubectl get cronjobs -n data
+   NAME            SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
+   daily-report    0 2 * * *   False     1        3d              30d
+   hourly-etl      0 * * * *   False     3        5m              30d
+   weekly-cleanup  0 0 * * 0   True      0        7d              30d
+
+   Issues discovered:
+
+   1. daily-report — last scheduled 3 days ago, no recent Jobs:
+      $ kubectl get jobs -n data --sort-by=.status.startTime | grep daily-report
+      NAME                    COMPLETIONS   DURATION   AGE
+      daily-report-28930560   0/1           3d         3d
+
+      The Job is still running (stuck) from 3 days ago. The CronJob's
+      concurrencyPolicy is "Forbid", so no new Jobs are created while the
+      old one is active. The Job's pod is stuck in Init:0/1 waiting for
+      a database migration init container that will never complete.
+
+   2. hourly-etl — 3 active Jobs running concurrently:
+      concurrencyPolicy is "Allow" (the default), so every hour a new Job
+      starts even if previous ones haven't finished. Jobs are piling up,
+      consuming resources and causing database contention.
+
+   3. weekly-cleanup — SUSPEND=True:
+      Someone suspended it during debugging and forgot to re-enable it.
+
+   Additionally:
+   $ kubectl get jobs -n data | grep -c "0/1"
+   47
+
+   47 incomplete Jobs sitting in the namespace, never cleaned up: history
+     limits only prune finished Jobs, and no activeDeadlineSeconds or
+     ttlSecondsAfterFinished was set to clean them up.
+
+   Task: Explain Job and CronJob troubleshooting. Write: how Jobs work
+   (completions, parallelism, backoffLimit), CronJob scheduling (schedule
+   syntax, concurrencyPolicy, startingDeadlineSeconds), common failure
+   modes (stuck jobs blocking schedules, missed schedules, resource
+   accumulation), cleanup and history limits, and monitoring Jobs
+   effectively.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Job mechanics are explained — completions: how many pod completions needed, parallelism: how many pods run simultaneously, backoffLimit: retries before marking as Failed (default 6), activeDeadlineSeconds: maximum runtime. Jobs create pods that run to completion. Failed pods are retried with exponential backoff. A Job stuck in active state blocks CronJobs with Forbid concurrency"
+     weight: 0.35
+     description: "Job mechanics"
+   - type: llm_judge
+     criteria: "CronJob-specific issues are covered — concurrencyPolicy: Allow (multiple jobs simultaneously, risk of pile-up), Forbid (skip if previous still running, risk of stuck blocking), Replace (kill previous, start new). startingDeadlineSeconds: if a schedule is missed by more than this, skip it. successfulJobsHistoryLimit and failedJobsHistoryLimit control cleanup (default 3/1). Suspended field pauses scheduling. Always check for suspended CronJobs during debugging"
+     weight: 0.35
+     description: "CronJob issues"
+   - type: llm_judge
+     criteria: "Debugging and monitoring are practical — check CronJob last schedule time, list active Jobs, check Job pod status and logs. For stuck Jobs: delete the Job or set activeDeadlineSeconds. Monitor: alert on CronJobs that haven't run within expected window, alert on failed Job count, use ttlSecondsAfterFinished (K8s 1.23+) for automatic cleanup. Set appropriate history limits to prevent namespace resource accumulation"
+     weight: 0.30
+     description: "Debugging and monitoring"
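The knobs discussed in the criteria above fit into a single CronJob spec. A sketch of a defensively configured `daily-report`; the schedule matches the scenario, while the image and the timeout values are illustrative assumptions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-report
  namespace: data
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid        # skip a run while the previous Job is active
  startingDeadlineSeconds: 3600    # give up on a missed run after 1h
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 3                 # pod retries before the Job is marked Failed
      activeDeadlineSeconds: 7200     # kill stuck Jobs so Forbid can't block forever
      ttlSecondsAfterFinished: 86400  # auto-delete finished Jobs after a day
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: registry.example.com/report:latest  # placeholder
```

`activeDeadlineSeconds` plus `ttlSecondsAfterFinished` together prevent both failure modes in the scenario: a hung Job blocking the Forbid schedule, and dead Jobs accumulating in the namespace.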
@@ -0,0 +1,64 @@
+ meta:
+   id: monitoring-alerting-gaps
+   level: 3
+   course: kubernetes-deployment-troubleshooting
+   type: output
+   description: "Debug monitoring and alerting gaps — diagnose why Prometheus isn't scraping metrics, Grafana dashboards show no data, and alerts don't fire"
+   tags: [Kubernetes, Prometheus, Grafana, monitoring, alerting, advanced]
+
+ state: {}
+
+ trigger: |
+   Your team deployed a new service but it doesn't appear in Grafana
+   dashboards and no alerts are configured for it:
+
+   $ kubectl get pods -n monitoring
+   NAME                    READY   STATUS    RESTARTS   AGE
+   prometheus-server-0     2/2     Running   0          30d
+   grafana-7a8b9c0d-e1f2   1/1     Running   0          30d
+   alertmanager-0          1/1     Running   0          30d
+
+   Prometheus is running but not scraping the new service:
+
+   The Prometheus targets UI shows the new service is not listed.
+
+   $ kubectl get servicemonitor -n app
+   No resources found in app namespace.
+
+   The team expected Prometheus to auto-discover the service, but:
+   1. No ServiceMonitor resource was created for the new service
+   2. The Prometheus instance is configured to only watch the "monitoring"
+      namespace for ServiceMonitors (serviceMonitorNamespaceSelector)
+   3. The application exposes metrics at /metrics on port 9090, but the
+      ServiceMonitor the team drafted targets port 8080
+
+   Additionally, existing alerts aren't firing during outages:
+
+   $ kubectl get prometheusrules -n monitoring
+   NAME            AGE
+   default-rules   30d
+
+   The alert rules exist but Alertmanager isn't sending notifications:
+   - The Alertmanager config has a Slack webhook URL that expired
+   - Alert routing doesn't match the team's namespace labels
+   - Inhibition rules (inhibit_rules) are suppressing lower-severity alerts
+
+   Task: Explain Kubernetes monitoring and alerting setup. Write: how
+   Prometheus discovers targets (ServiceMonitor, PodMonitor, annotations),
+   how to debug missing metrics (targets page, scrape config), how
+   Grafana connects to Prometheus, the alert pipeline (PrometheusRule →
+   Alertmanager → notification channel), and common monitoring gaps.
+
+ assertions:
+   - type: llm_judge
+     criteria: "Prometheus service discovery is explained — ServiceMonitor CRD tells Prometheus which Services to scrape (port, path, interval). PodMonitor for pod-level scraping. Prometheus must be configured to watch the namespace where ServiceMonitors exist (serviceMonitorNamespaceSelector). Legacy method: annotations (prometheus.io/scrape=true). Targets page shows what's being scraped and any errors"
+     weight: 0.35
+     description: "Prometheus service discovery"
+   - type: llm_judge
+     criteria: "Alert pipeline is explained — PrometheusRule defines alerting conditions (PromQL expressions with for duration), Prometheus evaluates rules and sends firing alerts to Alertmanager, Alertmanager routes alerts to receivers (Slack, PagerDuty, email) based on labels, inhibitRules can suppress alerts. Debug: check Prometheus /alerts page, Alertmanager UI for silences and routing, verify receiver config"
+     weight: 0.35
+     description: "Alert pipeline"
+   - type: llm_judge
+     criteria: "Common gaps and fixes are covered — missing ServiceMonitor for new services (should be part of deployment template), Prometheus not watching correct namespaces, wrong port/path in ServiceMonitor, Alertmanager webhook URLs expiring, alert routing not matching labels, no alerts defined for new services. Fix: include ServiceMonitor in Helm charts, use namespace-wide selectors, test alerts regularly"
+     weight: 0.30
+     description: "Common gaps and fixes"
+ description: "Common gaps and fixes"