npm - dojo.md - Versions diffs - 0.2.0 → 0.2.2 - Mend

dojo.md 0.2.0 → 0.2.2

Files changed (225)

package/courses/docker-container-debugging/scenarios/level-5/master-debugging-shift.yaml ADDED

@@ -0,0 +1,62 @@
+meta:
+  id: master-debugging-shift
+  level: 5
+  course: docker-container-debugging
+  type: output
+  description: "Combined master debugging shift — serve as fractional CTO advising on a complete container platform strategy encompassing technology, people, process, and business alignment"
+  tags: [Docker, troubleshooting, combined, shift-simulation, CTO, master]
+state: {}
+trigger: |
+  You're engaged as a fractional CTO for a Series B startup ($30M
+  raised, 80 engineers, growing to 200). They've been running on
+  Docker Compose across 5 servers for 2 years. Growing pains are
+  severe:
+  Technical challenges:
+  - 40 microservices on Docker Compose, manual deployment
+  - Last week: wrong image tag deployed, 3-hour outage ($150K impact)
+  - No security scanning — audit found 12 CRITICAL CVEs in production
+  - Database containers without proper backup — near-miss data loss
+  - Developers wait 30+ minutes for CI builds (no caching)
+  - No centralized logging — debugging requires SSH to 5 servers
+  People challenges:
+  - 2-person DevOps team overwhelmed (they're firefighting 80% of time)
+  - Developers have no container training, copy-paste Dockerfiles
+  - No on-call rotation — same 2 DevOps engineers handle everything
+  - Engineering velocity declining as team grows
+  Business context:
+  - Series C fundraising in 9 months — need to demonstrate scalability
+  - Enterprise customer prospects require SOC2 compliance
+  - Planning international expansion (EU data residency requirements)
+  - Board expects 99.95% availability SLA for enterprise tier
+  Budget: $500K for platform investment over next 12 months
+  Hiring: Can add 3-4 people
+  The CEO asks: "Give me a 12-month roadmap that gets us to Series C
+  ready. We need to stop firefighting and start scaling."
+  Task: Design the comprehensive 12-month roadmap. Write: the phased
+  approach (stabilize → automate → scale → optimize), technology
+  decisions (stay on Compose? Move to K8s? Use managed services?),
+  hiring plan (who to hire first and why), SOC2 compliance path,
+  cost breakdown and ROI, risk register with mitigations, and the
+  key milestones that demonstrate investor readiness.
+assertions:
+  - type: llm_judge
+    criteria: "Phased roadmap is realistic — Month 1-3 (Stabilize): fix critical security CVEs, implement image scanning in CI, set up proper database backups, add health checks, configure log rotation, basic monitoring (Prometheus + Grafana). Month 4-6 (Automate): CI/CD pipeline with automated build/scan/deploy, move to managed Kubernetes (EKS/GKE), centralized logging, on-call rotation. Month 7-9 (Scale): SOC2 controls implementation, multi-region readiness, developer self-service platform. Month 10-12 (Optimize): cost optimization, performance tuning, DR testing, compliance audit. Each phase builds on the previous"
+    weight: 0.35
+    description: "Phased roadmap"
+  - type: llm_judge
+    criteria: "Technology and hiring decisions are justified — recommend managed Kubernetes over Compose for 40+ services (Compose doesn't scale operationally). Managed K8s (EKS/GKE) over self-hosted (team too small to manage control plane). Hiring priority: (1) senior platform engineer (lead the migration), (2) security engineer (SOC2 + scanning), (3) SRE (on-call, monitoring, incident response), (4) DevOps engineer (CI/CD, automation). This relieves the existing 2-person team and adds specialization. Budget allocation: $200K tooling, $300K hiring (partial year)"
+    weight: 0.35
+    description: "Technology and hiring"
+  - type: llm_judge
+    criteria: "Investor readiness and compliance are addressed — SOC2 Type I achievable in 6-9 months (show controls exist), Type II requires 6+ months of evidence (start collecting immediately). Key investor metrics: 99.95% availability (track from month 3), deployment frequency (daily by month 6), MTTR < 30 minutes, security posture (0 CRITICAL CVEs). EU expansion: GDPR compliance, data residency (EU region deployment). Present to board quarterly: progress against milestones, risk reduction, cost efficiency. Risk register: migration timeline slippage (mitigate: phased approach), hiring delays (mitigate: start immediately), scope creep (mitigate: strict prioritization)"
+    weight: 0.30
+    description: "Investor readiness"
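
The 99.95% availability target in the scenario above translates into a concrete downtime budget. The following is a minimal sketch of that arithmetic, not part of the package; the function name `downtime_budget_minutes` and the choice of a 30-day month are assumptions made for illustration.

```python
def downtime_budget_minutes(availability_pct: float, period_hours: float) -> float:
    """Return the downtime (in minutes) allowed by an availability target
    over a period of the given length in hours."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_hours * 60.0


if __name__ == "__main__":
    yearly = downtime_budget_minutes(99.95, 365 * 24)   # 365-day year
    monthly = downtime_budget_minutes(99.95, 30 * 24)   # assumed 30-day month
    print(f"per year: {yearly:.1f} min, per 30-day month: {monthly:.1f} min")
```

Under these assumptions the budget is about 262.8 minutes per year and 21.6 minutes per 30-day month, so the 3-hour outage described in the trigger (180 minutes) would by itself exceed a monthly budget.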

package/courses/docker-container-debugging/scenarios/level-5/organizational-transformation.yaml ADDED

@@ -0,0 +1,67 @@
+meta:
+  id: organizational-transformation
+  level: 5
+  course: docker-container-debugging
+  type: output
+  description: "Lead organizational transformation through containerization — manage cultural change, team restructuring, and DevOps transformation alongside container adoption"
+  tags: [Docker, organizational-change, DevOps, culture, transformation, leadership, master]
+state: {}
+trigger: |
+  You're leading container adoption at a 5,000-person enterprise.
+  The technology is ready but the organization isn't. Three months
+  into the initiative, adoption is stalling:
+  Resistance patterns observed:
+  Operations team (50 people):
+  "Containers are a fad. We've managed VMs for 15 years. Why change?"
+  Fear: automation will eliminate their jobs.
+  Reality: need to transform from VM operators to platform engineers.
+  Development teams (200 people across 30 teams):
+  Only 4 teams have adopted containers. Others: "We're too busy
+  delivering features to learn new deployment tools."
+  Fear: containers add complexity to their already-complex workflow.
+  Reality: containers simplify deployment once learned.
+  Security team (10 people):
+  "Containers increase our attack surface. We can't audit them."
+  Fear: loss of visibility and control.
+  Reality: containers can improve security posture with proper tooling.
+  Management:
+  CTO sponsors the initiative but middle managers are neutral.
+  Project managers: "Container migration isn't in our roadmap."
+  No incentives aligned with container adoption.
+  Change management strategy needed:
+  1. Create urgency — show real costs of current approach
+  2. Build a coalition — identify champions in each group
+  3. Quick wins — solve real pain points first
+  4. Enablement — training, documentation, support
+  5. Incentive alignment — tie container adoption to OKRs
+  6. Celebrate success — publicize wins, recognize adopters
+  7. Sustain — embed in hiring, onboarding, promotion criteria
+  Task: Design the organizational transformation strategy. Write:
+  the change management framework, addressing each stakeholder
+  group's concerns, training and enablement program, metrics for
+  transformation progress, common transformation failure modes,
+  and the role of leadership in driving technology adoption.
+assertions:
+  - type: llm_judge
+    criteria: "Change management framework is structured — use Kotter's 8-step model or similar: create urgency (show competitor advantage, calculate cost of status quo), build coalition (executive sponsor + tech leads from willing teams + operations champion), quick wins (solve a visible pain point in 30 days), scale (expand from pilot teams to adjacent teams). Transformation timeline: 12-18 months for meaningful adoption, 3-5 years for full organizational shift. Don't announce a 'container mandate' — enable and incentivize instead"
+    weight: 0.35
+    description: "Change framework"
+  - type: llm_judge
+    criteria: "Stakeholder-specific strategies are empathetic — operations: retrain as platform engineers (higher-value role, not elimination), pair with developers, give ownership of the container platform. Developers: provide golden paths (make containers easier than current approach), don't ask teams to stop feature work — integrate container adoption into existing projects. Security: give better tools (runtime monitoring, automated scanning gives more visibility than VMs), involve in platform design. Management: show metrics (deployment speed, incident reduction), align with business OKRs"
+    weight: 0.35
+    description: "Stakeholder strategies"
+  - type: llm_judge
+    criteria: "Failure modes and measurement are realistic — common failures: mandating adoption without enablement, moving too fast (teams overwhelmed), moving too slow (initiative loses momentum), not investing in platform team (adoption stalls without support), not celebrating wins (no positive reinforcement). Measure: adoption rate (% services containerized), developer satisfaction (NPS), deployment frequency per team, time to onboard new service, support ticket volume for container issues. Transformation is a people problem, not a technology problem — treat it accordingly"
+    weight: 0.30
+    description: "Failures and measurement"

package/courses/docker-container-debugging/scenarios/level-5/regulatory-compliance-containers.yaml ADDED

@@ -0,0 +1,61 @@
+meta:
+  id: regulatory-compliance-containers
+  level: 5
+  course: docker-container-debugging
+  type: output
+  description: "Design regulatory compliance for containers — implement controls for SOC2, PCI-DSS, HIPAA, and FedRAMP in containerized environments"
+  tags: [Docker, compliance, SOC2, PCI-DSS, HIPAA, FedRAMP, regulation, master]
+state: {}
+trigger: |
+  Your company is pursuing SOC2 Type II certification and PCI-DSS
+  compliance. The auditor has questions about your container platform:
+  Auditor question 1: "How do you ensure only authorized images
+  run in production?"
+  Current answer: "We trust our developers." (Not acceptable)
+  Required: Image signing, admission control, approved registry list.
+  Auditor question 2: "How do you track who deployed what and when?"
+  Current answer: "We can check git history." (Insufficient)
+  Required: Immutable audit logs of all deployment actions with
+  identity, timestamp, image digest, and approval chain.
+  Auditor question 3: "How do you ensure containers don't contain
+  known vulnerabilities?"
+  Current answer: "We scan periodically." (When? How? What's the SLA?)
+  Required: Automated scanning in CI/CD with defined severity thresholds,
+  documented exception process, SLA for patching (CRITICAL: 24h,
+  HIGH: 7d, MEDIUM: 30d).
+  Auditor question 4: "How do you isolate cardholder data environments?"
+  Current answer: "Different Docker network." (Insufficient for PCI)
+  Required: Network segmentation with documented firewall rules,
+  encrypted communication (mTLS), access logging, separate
+  infrastructure for CDE.
+  Auditor question 5: "How do you handle secrets and encryption keys?"
+  Current answer: "Environment variables in Docker Compose."
+  Required: Dedicated secrets manager (Vault), encryption at rest,
+  rotation policy, access auditing.
+  Task: Design compliance controls for containerized environments.
+  Write: control mappings for SOC2 and PCI-DSS, image governance
+  (signing, scanning, admission), audit logging architecture,
+  network segmentation for compliance, secrets management, and the
+  continuous compliance monitoring approach.
+assertions:
+  - type: llm_judge
+    criteria: "Control mappings are specific — SOC2: CC6.1 (logical access) → RBAC, namespace isolation, registry access control. CC7.1 (monitoring) → container runtime monitoring, audit logs. CC8.1 (change management) → GitOps, immutable images, deployment approvals. PCI-DSS: Requirement 2 (secure configuration) → hardened container images, CIS benchmarks. Requirement 6 (secure development) → image scanning in CI. Requirement 10 (logging) → centralized audit logs with tamper protection. Requirement 11 (testing) → regular vulnerability scanning"
+    weight: 0.35
+    description: "Control mappings"
+  - type: llm_judge
+    criteria: "Image governance and audit are covered — image lifecycle: build → scan → sign → approve → deploy. Admission controller (OPA/Kyverno) rejects unsigned or unscanned images. Approved registry allowlist prevents pulling from public registries. Audit logging: every docker/kubectl command logged with identity (who), action (what), resource (which container/image), timestamp (when), result (success/fail). Logs must be immutable (append-only, shipped to SIEM). Retention: at least 1 year for PCI-DSS (Requirement 10); for SOC2, as defined by the organization's documented retention policy"
+    weight: 0.35
+    description: "Governance and audit"
+  - type: llm_judge
+    criteria: "Network and secrets compliance are practical — PCI CDE isolation: separate cluster or namespace with strict network policies. mTLS between all services in CDE (service mesh). No direct internet access from CDE containers. Secrets: HashiCorp Vault or cloud KMS, automatic rotation, access auditing, encryption at rest. Never in environment variables, Docker Compose files, or image layers. Continuous compliance: automated scanning against CIS Docker Benchmark, regular penetration testing, compliance dashboards for auditors, automated evidence collection"
+    weight: 0.30
+    description: "Network and secrets"
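
The patching SLA quoted in auditor question 3 (CRITICAL: 24h, HIGH: 7d, MEDIUM: 30d) can be turned into a simple deadline check. This is a hedged sketch only: the names `PATCH_SLA`, `patch_deadline`, and `is_overdue` are hypothetical, and a real pipeline would take the detection time from its scanner's report.

```python
from datetime import datetime, timedelta, timezone

# SLA windows taken from the scenario text; the mapping name is hypothetical.
PATCH_SLA = {
    "CRITICAL": timedelta(hours=24),
    "HIGH": timedelta(days=7),
    "MEDIUM": timedelta(days=30),
}


def patch_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the time by which a finding of this severity must be patched."""
    return detected_at + PATCH_SLA[severity.upper()]


def is_overdue(severity: str, detected_at: datetime, now: datetime) -> bool:
    """True if the finding is still open past its SLA deadline."""
    return now > patch_deadline(severity, detected_at)


if __name__ == "__main__":
    detected = datetime(2025, 1, 1, tzinfo=timezone.utc)
    print(patch_deadline("CRITICAL", detected))
    print(is_overdue("HIGH", detected, detected + timedelta(days=8)))
```

A check like this could feed the documented exception process: anything overdue either has an approved exception on file or counts as an SLA breach.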

package/courses/kubernetes-deployment-troubleshooting/course.yaml ADDED

@@ -0,0 +1,12 @@
+id: kubernetes-deployment-troubleshooting
+name: "Kubernetes Deployment Troubleshooting"
+description: >
+  Master Kubernetes deployment troubleshooting from pod debugging basics
+  to enterprise platform operations. Learn to diagnose CrashLoopBackOff,
+  ImagePullBackOff, OOMKilled pods, fix networking and service issues,
+  manage storage and RBAC, optimize resources with HPA/VPA, implement
+  GitOps pipelines, and design multi-cluster disaster recovery
+  strategies for large-scale Kubernetes deployments.
+levels: 5
+scenarios_per_level: 10
+tags: [development, Kubernetes, DevOps, troubleshooting, containers, deployment, cloud-native]

package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/configmap-secret-issues.yaml ADDED

@@ -0,0 +1,69 @@
+meta:
+  id: configmap-secret-issues
+  level: 1
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Debug ConfigMap and Secret issues — diagnose missing environment variables, mount failures, and configuration mismatches"
+  tags: [Kubernetes, ConfigMap, Secret, environment-variables, configuration, beginner]
+state: {}
+trigger: |
+  Your application deployed successfully but is returning 500 errors.
+  The logs show it can't find its configuration:
+  $ kubectl get pods
+  NAME                       READY   STATUS    RESTARTS   AGE
+  auth-svc-2b3c4d5e-fg67   1/1     Running   0          2m
+  $ kubectl logs auth-svc-2b3c4d5e-fg67
+  Error: Missing required environment variable: JWT_SECRET
+  Error: Missing required environment variable: REDIS_URL
+  Error: Configuration validation failed, shutting down...
+  Process exited with code 1
+  Wait — the pod status shows Running, not CrashLoopBackOff? That's
+  because the container has a startup delay and the crash happened after
+  the liveness probe window.
+  $ kubectl describe pod auth-svc-2b3c4d5e-fg67
+  Environment:
+    DATABASE_URL:   <set to the key 'DATABASE_URL' in secret 'auth-secrets'>
+    JWT_SECRET:     <set to the key 'JWT_SECRET' in secret 'auth-secrets'>
+    REDIS_URL:      <set to the key 'REDIS_URL' in configmap 'auth-config'>
+  $ kubectl get secret auth-secrets -o jsonpath='{.data}' | jq
+  {
+    "DATABASE_URL": "cG9zdGdyZXM6Ly8uLi4="
+  }
+  The secret exists but only has DATABASE_URL — JWT_SECRET is missing
+  from the secret! And the ConfigMap:
+  $ kubectl get configmap auth-config
+  Error from server (NotFound): configmaps "auth-config" not found
+  The ConfigMap doesn't exist at all. Two issues:
+  1. Secret auth-secrets is missing the JWT_SECRET key
+  2. ConfigMap auth-config was never created
+  Task: Explain how ConfigMaps and Secrets work in Kubernetes. Write:
+  how pods consume configuration (env vars vs volume mounts), what
+  happens when a referenced ConfigMap/Secret is missing (pod may or may
+  not start depending on optional flag), how to debug missing config,
+  the difference between ConfigMaps and Secrets, and best practices for
+  configuration management.
+assertions:
+  - type: llm_judge
+    criteria: "ConfigMap and Secret consumption is explained — two methods: environment variables (envFrom or env with valueFrom) and volume mounts. envFrom loads all keys, env with valueFrom loads specific keys. Volume mounts project keys as files. If a referenced ConfigMap/Secret doesn't exist, the pod fails to start unless the reference is marked as optional"
+    weight: 0.35
+    description: "ConfigMap/Secret consumption"
+  - type: llm_judge
+    criteria: "Debugging steps are systematic — check pod events (kubectl describe pod), verify ConfigMap/Secret exists (kubectl get cm/secret), inspect keys (kubectl get secret -o jsonpath or -o yaml), verify key names match what the deployment references, check if the reference is in the correct namespace. Shows how base64 encoding works for secrets"
+    weight: 0.35
+    description: "Debugging steps"
+  - type: llm_judge
+    criteria: "Differences and best practices are covered — ConfigMaps for non-sensitive config, Secrets for sensitive data (base64 encoded, can be encrypted at rest in etcd). Best practices: use optional references where appropriate, validate config in CI/CD, use sealed-secrets or external secret managers for production, consider volume mounts for auto-updates vs env vars requiring restart"
+    weight: 0.30
+    description: "Differences and best practices"
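
The debugging steps in this scenario come down to two checks: decode the base64 values in the Secret, and compare the keys the pod references against the keys that actually exist. The sketch below illustrates both on the data shown in the trigger; the helper names `decode_secret_data` and `missing_keys` are hypothetical, and the input is assumed to be the `.data` map already parsed from `kubectl get secret -o json`.

```python
import base64


def decode_secret_data(data: dict[str, str]) -> dict[str, str]:
    """Decode the base64-encoded values of a Secret's .data map."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


def missing_keys(referenced: list[str], data: dict[str, str]) -> list[str]:
    """Return the referenced keys that are absent from the Secret or ConfigMap."""
    return [key for key in referenced if key not in data]


if __name__ == "__main__":
    # .data of auth-secrets as shown in the scenario output.
    secret_data = {"DATABASE_URL": "cG9zdGdyZXM6Ly8uLi4="}
    # Keys the pod spec pulls from auth-secrets, per kubectl describe.
    referenced = ["DATABASE_URL", "JWT_SECRET"]

    print(decode_secret_data(secret_data))
    print(missing_keys(referenced, secret_data))
```

Note that base64 is an encoding, not encryption: anyone who can read the Secret can recover the value this way.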

package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/crashloopbackoff.yaml ADDED

@@ -0,0 +1,68 @@
+meta:
+  id: crashloopbackoff
+  level: 1
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Debug CrashLoopBackOff — diagnose why a pod keeps crashing and restarting, using kubectl logs, describe, and events"
+  tags: [Kubernetes, CrashLoopBackOff, debugging, pod-lifecycle, beginner]
+state: {}
+trigger: |
+  You deployed a new version of your API service and kubectl shows:
+  $ kubectl get pods
+  NAME                        READY   STATUS             RESTARTS   AGE
+  api-server-7b8f9c4d-x2k9l  0/1     CrashLoopBackOff   5          3m
+  api-server-7b8f9c4d-m4p7q  0/1     CrashLoopBackOff   5          3m
+  api-server-7b8f9c4d-r9n1j  0/1     CrashLoopBackOff   5          3m
+  All 3 replicas are in CrashLoopBackOff. The previous version was
+  running fine. The team is panicking because the API is down.
+  kubectl describe pod api-server-7b8f9c4d-x2k9l shows:
+  State:          Waiting
+    Reason:       CrashLoopBackOff
+  Last State:     Terminated
+    Reason:       Error
+    Exit Code:    1
+    Started:      2025-12-01T10:00:05Z
+    Finished:     2025-12-01T10:00:06Z
+  Events:
+    Warning  BackOff  3m  kubelet  Back-off restarting failed container
+  kubectl logs api-server-7b8f9c4d-x2k9l shows:
+  Error: connect ECONNREFUSED 10.100.50.3:5432
+  Error: Unable to connect to database
+  Process exited with code 1
+  The application requires a PostgreSQL database connection. The
+  database is running in the same cluster as a StatefulSet. Nothing
+  changed about the database — only the API image was updated.
+  Investigation reveals:
+  - The new image version changed the env var name from DATABASE_URL
+    to DB_CONNECTION_STRING
+  - The ConfigMap still has DATABASE_URL
+  - The container starts, fails to connect (wrong env var), and exits
+  Task: Explain how to debug CrashLoopBackOff. Write: what
+  CrashLoopBackOff means (the restart backoff mechanism), the debugging
+  workflow (kubectl logs → describe → events), common causes (app crash,
+  missing env vars, missing dependencies, OOM, wrong command), the fix
+  for this specific case, and how to prevent this in the future.
+assertions:
+  - type: llm_judge
+    criteria: "CrashLoopBackOff is explained — the container starts, crashes, Kubernetes restarts it with exponential backoff (10s, 20s, 40s... up to 5 minutes). Exit code 1 indicates application error (not OOMKilled=137, not SIGTERM=143). The backoff means Kubernetes is waiting longer between each restart attempt"
+    weight: 0.35
+    description: "CrashLoopBackOff explained"
+  - type: llm_judge
+    criteria: "Debugging workflow is systematic — (1) kubectl logs <pod> to see application error, (2) kubectl logs <pod> --previous if container already restarted, (3) kubectl describe pod to see events, exit codes, and container state, (4) check env vars with kubectl exec (if pod stays up) or kubectl get cm/secret. Identifies the env var mismatch as root cause"
+    weight: 0.35
+    description: "Systematic debugging workflow"
+  - type: llm_judge
+    criteria: "Fix and prevention are practical — immediate fix: update ConfigMap to include DB_CONNECTION_STRING, or update deployment to map the old var name. Prevention: validate env vars in CI/CD, use health checks (readiness probe on DB connection), and consider using a shared config schema between app versions"
+    weight: 0.30
+    description: "Fix and prevention"
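
The restart backoff named in the first assertion (10s, 20s, 40s, up to 5 minutes) is a capped doubling sequence. A minimal sketch of that schedule follows; the function name `restart_backoff_delays` is hypothetical, and the model deliberately ignores details such as the kubelet resetting the timer after a container has run cleanly for a while.

```python
def restart_backoff_delays(restarts: int, base: int = 10, cap: int = 300) -> list[int]:
    """Return the wait (in seconds) before each of the first `restarts` restarts,
    doubling from `base` and never exceeding `cap`."""
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays


if __name__ == "__main__":
    print(restart_backoff_delays(7))
```

With the defaults assumed here, the sixth restart is the first to hit the 300-second cap, which is why a pod that has been crashing for a few minutes appears to "sit" in CrashLoopBackOff between attempts.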

package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/deployment-rollout.yaml ADDED

@@ -0,0 +1,56 @@
+meta:
+  id: deployment-rollout
+  level: 1
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Debug deployment rollout issues — understand rolling updates, rollback, and why a deployment might get stuck"
+  tags: [Kubernetes, Deployment, rollout, rolling-update, rollback, beginner]
+state: {}
+trigger: |
+  You updated your deployment to a new image version but the rollout
+  is stuck. Old pods are still running alongside partially updated ones:
+  $ kubectl rollout status deployment/user-svc
+  Waiting for deployment "user-svc" spec update to be observed...
+  Waiting for rollout to finish: 1 out of 3 new replicas have been updated...
+  $ kubectl get pods
+  NAME                        READY   STATUS             RESTARTS   AGE
+  user-svc-7a8b9c0d-old1    1/1     Running            0          2h
+  user-svc-7a8b9c0d-old2    1/1     Running            0          2h
+  user-svc-7a8b9c0d-old3    1/1     Running            0          2h
+  user-svc-3e4f5g6h-new1    0/1     CrashLoopBackOff   4          5m
+  The deployment strategy is RollingUpdate with maxSurge=1 and
+  maxUnavailable=0. The new pod keeps crashing, so the rollout can't
+  proceed (it needs at least 1 new pod Ready before terminating old ones).
+  $ kubectl logs user-svc-3e4f5g6h-new1
+  Error: Cannot connect to database — migration v15 not applied
+  The new version requires a database migration that wasn't run.
+  The deployment is stuck: new pods crash, old pods keep serving traffic,
+  but the deployment never completes. After 10 minutes, progressDeadlineSeconds
+  (default 600s) expires and the rollout is reported as failed (ProgressDeadlineExceeded).
+  Task: Explain how Kubernetes deployment rollouts work. Write: the
+  RollingUpdate strategy (maxSurge, maxUnavailable), what happens when
+  a rollout gets stuck, how to check rollout status and history, how
+  to rollback (kubectl rollout undo), progressDeadlineSeconds, and the
+  Recreate strategy as an alternative.
+assertions:
+  - type: llm_judge
+    criteria: "Rolling update mechanics are explained — maxSurge controls how many extra pods can be created during update, maxUnavailable controls how many pods can be down. With maxSurge=1 maxUnavailable=0, Kubernetes creates 1 new pod, waits for it to be Ready, then terminates 1 old pod. If the new pod never becomes Ready, the rollout stalls"
+    weight: 0.35
+    description: "Rolling update mechanics"
+  - type: llm_judge
+    criteria: "Stuck rollout debugging is explained — kubectl rollout status shows progress, kubectl rollout history shows revision history, progressDeadlineSeconds (default 600s) marks deployment as Failed if no progress within deadline. A stuck rollout means old pods continue serving traffic (safe but incomplete update)"
+    weight: 0.35
+    description: "Stuck rollout debugging"
+  - type: llm_judge
+    criteria: "Rollback and alternatives are covered — kubectl rollout undo deployment/<name> reverts to previous revision, kubectl rollout undo --to-revision=N reverts to specific version. Recreate strategy: terminates all old pods before creating new ones (causes downtime but avoids version mixing). kubectl rollout pause/resume for controlled rollouts"
+    weight: 0.30
+    description: "Rollback and alternatives"
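
The maxSurge and maxUnavailable settings bound how many pods may exist and how many must stay available during a rolling update. The sketch below works through that arithmetic for absolute and percentage values; `resolve` and `rollout_bounds` are hypothetical helper names, and the rounding (percentages round up for maxSurge, down for maxUnavailable) follows the documented Deployment behavior.

```python
import math


def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute count or a percentage string like '25%' into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rollout_bounds(replicas: int, max_surge, max_unavailable) -> tuple[int, int]:
    """Return (maximum total pods, minimum available pods) during the rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas + surge, replicas - unavailable


if __name__ == "__main__":
    # The scenario's settings: 3 replicas, maxSurge=1, maxUnavailable=0.
    print(rollout_bounds(3, 1, 0))
    # The default strategy values, applied to 10 replicas.
    print(rollout_bounds(10, "25%", "25%"))
```

For the scenario this gives at most 4 pods and at least 3 available, which matches the observed state: three old pods keep running and the single surge pod is the one crashing.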

package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/first-troubleshooting-shift.yaml ADDED

@@ -0,0 +1,65 @@
+meta:
+  id: first-troubleshooting-shift
+  level: 1
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Combined troubleshooting shift — diagnose multiple pod failures across a namespace using the full beginner debugging toolkit"
+  tags: [Kubernetes, troubleshooting, debugging, combined, shift-simulation, beginner]
+state: {}
+trigger: |
+  You're on-call and get paged: "Multiple services down in staging
+  namespace." You run:
+  $ kubectl get pods -n staging
+  NAME                          READY   STATUS             RESTARTS   AGE
+  api-gw-5a6b7c8d-e9f0        0/1     CrashLoopBackOff   6          15m
+  user-svc-1g2h3i4j-k5l6      0/1     ImagePullBackOff   0          15m
+  order-svc-7m8n9o0p-q1r2     0/1     Pending            0          15m
+  payment-svc-3s4t5u6v-w7x8   1/1     Running            0          15m
+  notification-svc-9a0b-c1d2  0/1     Running            0          15m
+  Five services, five different problems:
+  1. api-gw — CrashLoopBackOff, exit code 1
+     Logs: "Error: REDIS_HOST environment variable not set"
+     The ConfigMap was deleted during a cleanup.
+  2. user-svc — ImagePullBackOff
+     Events: "Failed to pull image 'registry.company.com/user-svc:v3.2.1':
+     unauthorized"
+     The registry token expired overnight.
+  3. order-svc — Pending
+     Events: "0/3 nodes are available: 3 Insufficient memory"
+     Requests 4Gi memory but largest available is 2Gi.
+  4. payment-svc — Running and 1/1 Ready, but the Service has no
+     endpoints. Label mismatch: Service selector app=payment, pod
+     label app=payment-svc.
+  5. notification-svc — Running, 0/1 Ready
+     Readiness probe failing: HTTP 503 on /ready
+     The downstream email service is unreachable.
+  Task: Walk through diagnosing all five issues. Write: the triage
+  approach (start with kubectl get pods, identify status patterns), the
+  debugging steps for each issue type (CrashLoopBackOff → logs,
+  ImagePullBackOff → events, Pending → describe, Service issues →
+  endpoints, Readiness → probe config), and how to prioritize which to
+  fix first in a real incident.
+assertions:
+  - type: llm_judge
+    criteria: "All five issues are correctly diagnosed — (1) CrashLoopBackOff from missing ConfigMap causing missing env var, (2) ImagePullBackOff from expired registry credentials, (3) Pending from insufficient memory on nodes, (4) Service-pod label mismatch causing empty endpoints despite pod running, (5) Readiness probe failing due to downstream dependency. Each diagnosis maps to the correct kubectl command"
+    weight: 0.35
+    description: "All issues diagnosed"
+  - type: llm_judge
+    criteria: "Triage methodology is systematic — start with kubectl get pods for overview, group by status type, check most impactful services first. Use kubectl describe for events, kubectl logs for application errors, kubectl get endpoints for service connectivity. Prioritize: fix what unblocks the most services first (e.g., ConfigMap might affect multiple services)"
+    weight: 0.35
+    description: "Triage methodology"
+  - type: llm_judge
+    criteria: "Fixes and prioritization are practical — prioritize payment-svc (revenue-critical, quick label fix), then api-gw (gateway affects all downstream, recreate ConfigMap), then user-svc (refresh registry token), then order-svc (scale down request or add capacity), then notification-svc (investigate downstream dependency). Shows the actual kubectl fix commands"
+    weight: 0.30
+    description: "Fixes and prioritization"
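
Issue 4 in this scenario (a healthy pod with no Service endpoints) follows from how equality-based selectors work: every key/value pair in the Service selector must appear in the pod's labels. The following is an illustrative sketch, with `selector_matches` as a hypothetical helper name, not a reproduction of any Kubernetes API.

```python
def selector_matches(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """True if every selector key/value pair is present in the pod's labels.

    An empty selector is treated as matching nothing here, since a Service
    without a selector does not pick up pods automatically.
    """
    if not selector:
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())


if __name__ == "__main__":
    service_selector = {"app": "payment"}
    pod_labels = {"app": "payment-svc"}
    # The scenario's mismatch: the values differ, so the pod is not selected.
    print(selector_matches(service_selector, pod_labels))
```

Extra labels on the pod do not prevent a match; only a missing or different value for a selector key does, which is why `app=payment` against `app=payment-svc` yields an empty endpoints list.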

package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/health-probe-failures.yaml ADDED

@@ -0,0 +1,70 @@
+meta:
+  id: health-probe-failures
+  level: 1
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Debug health probe failures — understand liveness, readiness, and startup probes and why misconfigured probes cause restarts or traffic loss"
+  tags: [Kubernetes, health-probes, liveness, readiness, startup, beginner]
+state: {}
+trigger: |
+  Your application is running but keeps getting restarted, and users
+  report intermittent 503 errors:
+  $ kubectl get pods
+  NAME                       READY   STATUS    RESTARTS   AGE
+  web-app-4a5b6c7d-mn01    1/1     Running   12         1h
+  web-app-4a5b6c7d-op23    0/1     Running   0          30s
+  web-app-4a5b6c7d-qr45    1/1     Running   8          1h
+  $ kubectl describe pod web-app-4a5b6c7d-mn01
+  Containers:
+    web-app:
+      Liveness:   http-get http://:8080/healthz delay=5s timeout=1s
+                  period=10s #success=1 #failure=3
+      Readiness:  http-get http://:8080/ready delay=5s timeout=1s
+                  period=10s #success=1 #failure=3
+  Events:
+    Warning  Unhealthy  2m   kubelet  Liveness probe failed: HTTP probe
+             failed with statuscode: 503
+    Normal   Killing    2m   kubelet  Container web-app failed liveness
+             check, will be restarted
+    Warning  Unhealthy  30s  kubelet  Readiness probe failed: HTTP probe
+             failed with statuscode: 503
+  The application takes 45 seconds to fully start up (loading caches,
+  warming connections). But the liveness probe starts checking at 5
+  seconds with only 3 failures allowed (5s + 3*10s = 35s), so it kills
+  the container before it's ready.
+  Additionally, the readiness probe uses the same aggressive timing,
+  so during startup the pod is removed from the Service endpoints,
+  causing 503s for users.
+  Problems:
+  1. Liveness probe starts too early — kills the container during startup
+  2. No startup probe — would protect the container during initialization
+  3. Readiness probe timing causes premature endpoint removal
+  Task: Explain Kubernetes health probes. Write: the difference between
+  liveness, readiness, and startup probes, what each one controls (restart
+  vs traffic vs startup protection), how to configure timing parameters
+  (initialDelaySeconds, periodSeconds, failureThreshold, timeoutSeconds),
+  common misconfiguration patterns, and best practices for slow-starting
+  applications.
+assertions:
+  - type: llm_judge
+    criteria: "All three probe types are explained — liveness: kills and restarts the container if it fails (detects deadlocks/hangs), readiness: removes pod from Service endpoints if it fails (controls traffic routing), startup: disables liveness/readiness until it succeeds (protects slow-starting containers). Each probe serves a different purpose and should not use the same endpoint"
+    weight: 0.35
+    description: "Probe types explained"
+  - type: llm_judge
+    criteria: "Timing parameters are explained with the math — initialDelaySeconds (wait before first check), periodSeconds (interval between checks), failureThreshold (consecutive failures before action), timeoutSeconds (per-check timeout). Total startup tolerance = initialDelaySeconds + (failureThreshold * periodSeconds). In this case: 5 + (3*10) = 35s < 45s startup time, so the container gets killed"
+    weight: 0.35
+    description: "Timing parameters explained"
+  - type: llm_judge
+    criteria: "Fix and best practices are practical: add a startup probe with a generous window (e.g., failureThreshold=30 * periodSeconds=5 = 150s), keep the liveness probe for runtime health only (not startup), make readiness check actual dependency availability, never use the same endpoint for all three probes if they serve different purposes"
+    weight: 0.30
+    description: "Fix and best practices"
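
For reference, the fix described in the last assertion could look roughly like the fragment below. This is a minimal sketch, not part of the package: the container name, image, port 8080, and the `/healthz` and `/ready` paths are hypothetical, and only the startup window (failureThreshold 30, periodSeconds 5) comes from the scenario text.

```yaml
# Sketch of a container spec fragment for a slow-starting app (~45s startup).
# Assumed, not from the scenario: name, image, port, and both HTTP paths.
containers:
  - name: app
    image: example.com/app:1.0.0
    ports:
      - containerPort: 8080
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 30   # up to 30 * 5s = 150s before the container is restarted
    livenessProbe:           # not run until the startup probe has succeeded
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      timeoutSeconds: 2
      failureThreshold: 3
    readinessProbe:          # controls Service endpoint membership only, never restarts
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 3
```

With a startup probe in place, the liveness probe no longer needs an `initialDelaySeconds` tuned to the startup time, so its timing can stay tight for detecting hangs at runtime.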

package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/imagepullbackoff.yaml ADDED

@@ -0,0 +1,57 @@
+meta:
+  id: imagepullbackoff
+  level: 1
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Debug ImagePullBackOff — diagnose why Kubernetes can't pull a container image from the registry"
+  tags: [Kubernetes, ImagePullBackOff, container-registry, authentication, beginner]
+state: {}
+trigger: |
+  You deployed a new service but pods are stuck in ImagePullBackOff:
+  $ kubectl get pods
+  NAME                          READY   STATUS             RESTARTS   AGE
+  payment-svc-5c6d7e8f-abc12   0/1     ImagePullBackOff   0          5m
+  payment-svc-5c6d7e8f-def34   0/1     ImagePullBackOff   0          5m
+  $ kubectl describe pod payment-svc-5c6d7e8f-abc12
+  Events:
+    Normal   Scheduled  5m  default-scheduler  Successfully assigned...
+    Normal   Pulling    5m  kubelet  Pulling image "gcr.io/mycompany/payment:v2.1.0"
+    Warning  Failed     5m  kubelet  Failed to pull image "gcr.io/mycompany/payment:v2.1.0":
+             rpc error: code = Unknown desc = Error response from daemon:
+             unauthorized: You don't have the needed permissions to perform
+             this operation, and you may have invalid credentials.
+    Warning  Failed     5m  kubelet  Error: ImagePullBackOff
+  The v2.0.0 tag works fine. v2.1.0 was just pushed to GCR.
+  Possible causes to investigate:
+  1. Image doesn't exist (typo in tag or repository)
+  2. Registry authentication expired or misconfigured
+  3. Image was pushed to a different registry/project
+  4. imagePullSecrets not configured on the ServiceAccount
+  5. Network issue between cluster and registry
+  6. Image pull policy: IfNotPresent with a mutable tag
+  Task: Explain how to debug ImagePullBackOff. Write: the common
+  causes (authentication, typo, network, pull policy), the debugging
+  steps (verify image exists, check secrets, test pull manually), how
+  imagePullSecrets work, image pull policies (Always, IfNotPresent,
+  Never), and best practices for image management.
+assertions:
+  - type: llm_judge
+    criteria: "Common causes are listed — authentication failure (expired token, missing imagePullSecrets), image tag doesn't exist (typo, not pushed yet), registry network issue, wrong image pull policy. The error message 'unauthorized' points to authentication as the likely cause in this scenario"
+    weight: 0.35
+    description: "Common causes listed"
+  - type: llm_judge
+    criteria: "Debugging steps are actionable — verify image exists (docker pull or crane/skopeo), check imagePullSecrets (kubectl get sa default -o yaml, kubectl get secret), verify secret has correct credentials (kubectl get secret -o jsonpath), test from node directly. Shows the kubectl commands for each step"
+    weight: 0.35
+    description: "Actionable debugging steps"
+  - type: llm_judge
+    criteria: "Pull policies and best practices are explained: Always (re-pull every time, good for mutable tags like :latest), IfNotPresent (use cached if available, good for immutable tags), Never (only use locally loaded). Best practices: use immutable image references (digest or semver tag, not :latest), rotate registry credentials, set imagePullSecrets on ServiceAccount rather than per-pod"
+    weight: 0.30
+    description: "Pull policies and best practices"
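
As an illustration of the ServiceAccount-level `imagePullSecrets` and pull policy settings the assertions refer to, a manifest might look like the sketch below. The secret name `gcr-pull-secret`, the `default` namespace, and the pod name are assumptions for illustration; only the image reference comes from the scenario.

```yaml
# Sketch only: "gcr-pull-secret", the namespace, and the pod name are hypothetical.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: default
imagePullSecrets:
  - name: gcr-pull-secret   # a kubernetes.io/dockerconfigjson secret in the same namespace
---
apiVersion: v1
kind: Pod
metadata:
  name: payment-svc-example
  namespace: default
spec:
  serviceAccountName: default   # pods using this account inherit its imagePullSecrets
  containers:
    - name: payment
      image: gcr.io/mycompany/payment:v2.1.0
      imagePullPolicy: IfNotPresent   # reasonable only if the v2.1.0 tag is never overwritten
```

Attaching the secret to the ServiceAccount means every pod that runs under it picks up the credentials, which avoids repeating `imagePullSecrets` in each pod spec.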