dojo.md 0.2.0 → 0.2.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/courses/GENERATION_LOG.md +45 -0
- package/courses/aws-lambda-debugging/course.yaml +11 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/api-gateway-integration.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/cloudwatch-logs-basics.yaml +64 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/cold-start-basics.yaml +70 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/environment-variable-issues.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/first-debugging-shift.yaml +73 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/handler-import-errors.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/iam-permission-errors.yaml +68 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/invocation-errors.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/lambda-timeout-errors.yaml +65 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/memory-and-oom.yaml +70 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/async-invocation-failures.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/cold-start-optimization.yaml +76 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/dynamodb-streams-debugging.yaml +70 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/intermediate-debugging-shift.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/lambda-concurrency-management.yaml +70 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/lambda-layers-debugging.yaml +76 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/sam-local-debugging.yaml +74 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/sqs-event-source.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/vpc-networking-issues.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/xray-tracing.yaml +62 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/advanced-debugging-shift.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/container-image-lambda.yaml +79 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/cross-account-invocation.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/eventbridge-patterns.yaml +79 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/iac-deployment-debugging.yaml +68 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/kinesis-stream-processing.yaml +64 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/lambda-at-edge.yaml +64 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/lambda-extensions-debugging.yaml +67 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/powertools-observability.yaml +79 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/step-functions-debugging.yaml +80 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/cost-optimization-strategy.yaml +67 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/expert-debugging-shift.yaml +62 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/incident-management-serverless.yaml +61 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/multi-region-serverless.yaml +67 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/observability-platform-design.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-architecture-design.yaml +64 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-data-architecture.yaml +66 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-migration-strategy.yaml +65 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-security-design.yaml +60 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-testing-strategy.yaml +62 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/board-serverless-strategy.yaml +63 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/consulting-serverless-adoption.yaml +57 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/industry-serverless-patterns.yaml +62 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/ma-serverless-integration.yaml +75 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/master-debugging-shift.yaml +61 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/organizational-serverless-transformation.yaml +65 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/regulatory-serverless.yaml +61 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/serverless-economics.yaml +65 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/serverless-future-technology.yaml +66 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/serverless-platform-design.yaml +71 -0
- package/courses/docker-container-debugging/course.yaml +11 -0
- package/courses/docker-container-debugging/scenarios/level-1/container-exit-codes.yaml +59 -0
- package/courses/docker-container-debugging/scenarios/level-1/container-networking-basics.yaml +69 -0
- package/courses/docker-container-debugging/scenarios/level-1/docker-logs-debugging.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-1/dockerfile-build-failures.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-1/environment-variable-issues.yaml +74 -0
- package/courses/docker-container-debugging/scenarios/level-1/first-debugging-shift.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-1/image-pull-failures.yaml +68 -0
- package/courses/docker-container-debugging/scenarios/level-1/port-mapping-issues.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-1/resource-limits-oom.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-1/volume-mount-problems.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-2/container-health-checks.yaml +73 -0
- package/courses/docker-container-debugging/scenarios/level-2/docker-compose-debugging.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-2/docker-exec-debugging.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-2/image-layer-optimization.yaml +81 -0
- package/courses/docker-container-debugging/scenarios/level-2/intermediate-debugging-shift.yaml +73 -0
- package/courses/docker-container-debugging/scenarios/level-2/logging-and-log-rotation.yaml +76 -0
- package/courses/docker-container-debugging/scenarios/level-2/multi-stage-build-debugging.yaml +76 -0
- package/courses/docker-container-debugging/scenarios/level-2/network-debugging-tools.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-2/pid1-signal-handling.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-2/security-scanning-basics.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-3/advanced-debugging-shift.yaml +77 -0
- package/courses/docker-container-debugging/scenarios/level-3/buildkit-optimization.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-3/container-filesystem-debugging.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-3/container-security-hardening.yaml +74 -0
- package/courses/docker-container-debugging/scenarios/level-3/disk-space-management.yaml +74 -0
- package/courses/docker-container-debugging/scenarios/level-3/docker-api-automation.yaml +72 -0
- package/courses/docker-container-debugging/scenarios/level-3/docker-daemon-issues.yaml +73 -0
- package/courses/docker-container-debugging/scenarios/level-3/docker-in-docker-ci.yaml +69 -0
- package/courses/docker-container-debugging/scenarios/level-3/overlay-network-debugging.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-3/production-container-ops.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-4/cicd-pipeline-design.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-4/container-monitoring-observability.yaml +63 -0
- package/courses/docker-container-debugging/scenarios/level-4/container-orchestration-strategy.yaml +62 -0
- package/courses/docker-container-debugging/scenarios/level-4/container-performance-engineering.yaml +64 -0
- package/courses/docker-container-debugging/scenarios/level-4/container-security-architecture.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-4/enterprise-image-management.yaml +58 -0
- package/courses/docker-container-debugging/scenarios/level-4/expert-debugging-shift.yaml +63 -0
- package/courses/docker-container-debugging/scenarios/level-4/incident-response-containers.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-4/multi-environment-management.yaml +65 -0
- package/courses/docker-container-debugging/scenarios/level-4/stateful-service-containers.yaml +65 -0
- package/courses/docker-container-debugging/scenarios/level-5/board-infrastructure-strategy.yaml +58 -0
- package/courses/docker-container-debugging/scenarios/level-5/consulting-container-strategy.yaml +61 -0
- package/courses/docker-container-debugging/scenarios/level-5/container-platform-architecture.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-5/container-platform-economics.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-5/container-technology-evolution.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-5/disaster-recovery-containers.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-5/industry-container-patterns.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-5/master-debugging-shift.yaml +62 -0
- package/courses/docker-container-debugging/scenarios/level-5/organizational-transformation.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-5/regulatory-compliance-containers.yaml +61 -0
- package/courses/kubernetes-deployment-troubleshooting/course.yaml +12 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/configmap-secret-issues.yaml +69 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/crashloopbackoff.yaml +68 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/deployment-rollout.yaml +56 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/first-troubleshooting-shift.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/health-probe-failures.yaml +70 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/imagepullbackoff.yaml +57 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/kubectl-debugging-basics.yaml +56 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/oomkilled.yaml +70 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/pending-pods.yaml +68 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/service-not-reachable.yaml +66 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/dns-resolution-failures.yaml +63 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/helm-deployment-failures.yaml +63 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/hpa-scaling-issues.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/ingress-routing-issues.yaml +63 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/init-container-failures.yaml +63 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/intermediate-troubleshooting-shift.yaml +66 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/network-policy-blocking.yaml +67 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/persistent-volume-issues.yaml +69 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/rbac-permission-denied.yaml +57 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/resource-quota-limits.yaml +64 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/advanced-troubleshooting-shift.yaml +69 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/cluster-upgrade-failures.yaml +71 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/gitops-drift-detection.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/job-cronjob-failures.yaml +67 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/monitoring-alerting-gaps.yaml +64 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/multi-container-debugging.yaml +68 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/node-pressure-evictions.yaml +70 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/pod-disruption-budgets.yaml +59 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/service-mesh-debugging.yaml +64 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/statefulset-troubleshooting.yaml +69 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/capacity-planning.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/cost-optimization.yaml +57 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/disaster-recovery-design.yaml +56 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/executive-communication.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/expert-troubleshooting-shift.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/incident-management-process.yaml +59 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-cluster-operations.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-tenancy-design.yaml +55 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/platform-engineering.yaml +59 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/security-hardening.yaml +58 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/behavioral-science.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/board-strategy.yaml +61 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/cloud-native-future.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/comprehensive-platform.yaml +57 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/consulting-engagement.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/industry-benchmarks.yaml +58 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/ma-integration.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/master-troubleshooting-shift.yaml +73 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/product-development.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/regulatory-compliance.yaml +76 -0
- package/courses/mysql-query-optimization/course.yaml +11 -0
- package/courses/mysql-query-optimization/scenarios/level-1/buffer-pool-basics.yaml +65 -0
- package/courses/mysql-query-optimization/scenarios/level-1/explain-basics.yaml +66 -0
- package/courses/mysql-query-optimization/scenarios/level-1/first-optimization-shift.yaml +78 -0
- package/courses/mysql-query-optimization/scenarios/level-1/innodb-index-fundamentals.yaml +68 -0
- package/courses/mysql-query-optimization/scenarios/level-1/join-basics.yaml +66 -0
- package/courses/mysql-query-optimization/scenarios/level-1/n-plus-one-queries.yaml +67 -0
- package/courses/mysql-query-optimization/scenarios/level-1/query-rewriting-basics.yaml +66 -0
- package/courses/mysql-query-optimization/scenarios/level-1/select-star-problems.yaml +68 -0
- package/courses/mysql-query-optimization/scenarios/level-1/slow-query-diagnosis.yaml +65 -0
- package/courses/mysql-query-optimization/scenarios/level-1/where-clause-optimization.yaml +65 -0
- package/courses/mysql-query-optimization/scenarios/level-2/buffer-pool-tuning.yaml +64 -0
- package/courses/mysql-query-optimization/scenarios/level-2/composite-index-design.yaml +71 -0
- package/courses/mysql-query-optimization/scenarios/level-2/covering-and-invisible-indexes.yaml +69 -0
- package/courses/mysql-query-optimization/scenarios/level-2/cte-and-window-functions.yaml +78 -0
- package/courses/mysql-query-optimization/scenarios/level-2/intermediate-optimization-shift.yaml +68 -0
- package/courses/mysql-query-optimization/scenarios/level-2/join-optimization.yaml +67 -0
- package/courses/mysql-query-optimization/scenarios/level-2/performance-schema-analysis.yaml +69 -0
- package/courses/mysql-query-optimization/scenarios/level-2/query-optimizer-hints.yaml +74 -0
- package/courses/mysql-query-optimization/scenarios/level-2/subquery-optimization.yaml +70 -0
- package/courses/mysql-query-optimization/scenarios/level-2/write-optimization.yaml +63 -0
- package/courses/mysql-query-optimization/scenarios/level-3/advanced-optimization-shift.yaml +71 -0
- package/courses/mysql-query-optimization/scenarios/level-3/connection-management.yaml +67 -0
- package/courses/mysql-query-optimization/scenarios/level-3/full-text-search.yaml +77 -0
- package/courses/mysql-query-optimization/scenarios/level-3/json-optimization.yaml +87 -0
- package/courses/mysql-query-optimization/scenarios/level-3/lock-contention-analysis.yaml +68 -0
- package/courses/mysql-query-optimization/scenarios/level-3/monitoring-alerting.yaml +63 -0
- package/courses/mysql-query-optimization/scenarios/level-3/online-schema-changes.yaml +79 -0
- package/courses/mysql-query-optimization/scenarios/level-3/partitioning-strategies.yaml +83 -0
- package/courses/mysql-query-optimization/scenarios/level-3/query-profiling-deep-dive.yaml +84 -0
- package/courses/mysql-query-optimization/scenarios/level-3/replication-optimization.yaml +66 -0
- package/courses/mysql-query-optimization/scenarios/level-4/aurora-vs-rds-evaluation.yaml +61 -0
- package/courses/mysql-query-optimization/scenarios/level-4/data-architecture.yaml +62 -0
- package/courses/mysql-query-optimization/scenarios/level-4/database-migration-planning.yaml +59 -0
- package/courses/mysql-query-optimization/scenarios/level-4/enterprise-governance.yaml +50 -0
- package/courses/mysql-query-optimization/scenarios/level-4/executive-communication.yaml +54 -0
- package/courses/mysql-query-optimization/scenarios/level-4/expert-optimization-shift.yaml +67 -0
- package/courses/mysql-query-optimization/scenarios/level-4/high-availability-architecture.yaml +60 -0
- package/courses/mysql-query-optimization/scenarios/level-4/optimizer-internals.yaml +62 -0
- package/courses/mysql-query-optimization/scenarios/level-4/performance-sla-design.yaml +52 -0
- package/courses/mysql-query-optimization/scenarios/level-4/read-replica-scaling.yaml +51 -0
- package/courses/mysql-query-optimization/scenarios/level-5/ai-database-future.yaml +45 -0
- package/courses/mysql-query-optimization/scenarios/level-5/behavioral-science.yaml +44 -0
- package/courses/mysql-query-optimization/scenarios/level-5/benchmark-design.yaml +47 -0
- package/courses/mysql-query-optimization/scenarios/level-5/board-strategy.yaml +48 -0
- package/courses/mysql-query-optimization/scenarios/level-5/comprehensive-platform.yaml +49 -0
- package/courses/mysql-query-optimization/scenarios/level-5/consulting-engagement.yaml +52 -0
- package/courses/mysql-query-optimization/scenarios/level-5/ma-database-integration.yaml +47 -0
- package/courses/mysql-query-optimization/scenarios/level-5/master-optimization-shift.yaml +56 -0
- package/courses/mysql-query-optimization/scenarios/level-5/product-development.yaml +48 -0
- package/courses/mysql-query-optimization/scenarios/level-5/regulatory-compliance.yaml +48 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/comprehensive-database-system.yaml +70 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-ai-future.yaml +81 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-behavioral-science.yaml +63 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-board-strategy.yaml +77 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-consulting-engagement.yaml +61 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-industry-benchmarks.yaml +64 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-ma-integration.yaml +71 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-product-development.yaml +72 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-regulatory-landscape.yaml +76 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/master-optimization-shift.yaml +66 -0
- package/courses/terraform-infrastructure-setup/course.yaml +11 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-1/hcl-syntax-errors.yaml +65 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-1/provider-configuration.yaml +62 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-1/terraform-init-errors.yaml +72 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-1/variable-and-output-errors.yaml +78 -0
- package/dist/mcp/session-manager.d.ts +7 -4
- package/dist/mcp/session-manager.d.ts.map +1 -1
- package/dist/mcp/session-manager.js +23 -8
- package/dist/mcp/session-manager.js.map +1 -1
- package/package.json +3 -2
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/incident-management-process.yaml
ADDED
@@ -0,0 +1,59 @@
+meta:
+  id: incident-management-process
+  level: 4
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Design Kubernetes incident management process — escalation procedures, communication templates, postmortem culture, and SLA/SLO framework"
+  tags: [Kubernetes, incident-management, SRE, SLO, postmortem, process, expert]
+
+state: {}
+
+trigger: |
+  Your organization experienced 3 major Kubernetes incidents in the
+  past month, each lasting 2+ hours. Post-incident reviews revealed
+  systemic issues:
+
+  Incident 1 — Database Outage (3.5 hours):
+  - Root cause: Accidental ConfigMap update pushed to production
+  - Detection: Customer complaint after 45 minutes
+  - Impact: 100% of write operations failed
+  - Issue: No one knew who to page, no runbook existed
+
+  Incident 2 — Memory Leak Causing Cascading Failures (2.5 hours):
+  - Root cause: New deployment had memory leak, OOMKilled repeatedly
+  - Detection: Prometheus alert fired but went to wrong Slack channel
+  - Impact: 3 dependent services degraded
+  - Issue: No circuit breaker, no isolation between services
+
+  Incident 3 — Node Failure During Cluster Upgrade (2 hours):
+  - Root cause: Drain failed due to PDB, manual override corrupted state
+  - Detection: Immediate (during planned maintenance)
+  - Impact: 30% of pods evicted uncontrollably
+  - Issue: No change management process, no rollback plan
+
+  Common themes:
+  - No standardized severity levels
+  - No on-call rotation for Kubernetes platform
+  - No incident commander role defined
+  - Postmortems are blame-focused, findings not actioned
+  - No SLOs defined for the platform
+
+  Task: Design a Kubernetes incident management process. Write: severity
+  classification (SEV1-4), on-call structure and escalation paths,
+  incident response playbook for common Kubernetes failures, SLO/SLI
+  framework for the platform, blameless postmortem template, and how
+  to build a reliability culture.
+
+assertions:
+  - type: llm_judge
+    criteria: "Severity classification is defined — SEV1: full production outage affecting all users (15 min response, incident commander, all-hands), SEV2: partial outage affecting subset of users (30 min response, on-call lead), SEV3: degraded performance but functional (2h response, next business day), SEV4: minor issue no user impact (ticket-based). Clear criteria for each level based on user impact, revenue impact, and data integrity"
+    weight: 0.35
+    description: "Severity classification"
+  - type: llm_judge
+    criteria: "Incident response process is structured — on-call rotation for platform team (primary + secondary), incident commander role (coordinates communication, not debugging), dedicated incident Slack channel per SEV1/2, status page updates every 15 minutes, stakeholder communication templates. Playbooks for common Kubernetes failures: CrashLoopBackOff, node failure, storage issues, network partitions — with decision trees and kubectl commands"
+    weight: 0.35
+    description: "Incident response"
+  - type: llm_judge
+    criteria: "SLOs and postmortems are practical — SLIs: availability (% of successful requests), latency (P99 < Xms), error rate. SLOs: 99.9% availability = 43.8 min/month downtime budget. Error budget policy: when budget exhausted, freeze deployments until reliability improves. Blameless postmortem template: timeline, root cause, contributing factors, action items with owners and deadlines. Schedule postmortem within 48 hours, track action items to completion"
+    weight: 0.30
+    description: "SLOs and postmortems"
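The SLO criteria above assume an availability SLI and a monthly error budget (0.1% of the roughly 43,800 minutes in an average month is about 43.8 minutes). A minimal sketch of how that SLI and a budget alert could be expressed as Prometheus rules, assuming a generic `http_requests_total` counter (the metric name, rule names, and severity label are illustrative and not taken from the course files):

```yaml
# Illustrative Prometheus rule file for a 99.9% availability SLO.
# http_requests_total and the sev2 label are assumptions, not course content.
groups:
  - name: platform-slo
    rules:
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      - alert: AvailabilitySLOBreach
        expr: sli:availability:ratio_rate5m < 0.999
        for: 10m
        labels:
          severity: sev2
        annotations:
          summary: "Availability below the 99.9% SLO for 10 minutes"
```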
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-cluster-operations.yaml
ADDED
@@ -0,0 +1,62 @@
+meta:
+  id: multi-cluster-operations
+  level: 4
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Design multi-cluster troubleshooting strategy — cross-cluster service discovery, federated monitoring, and disaster recovery across regions"
+  tags: [Kubernetes, multi-cluster, federation, disaster-recovery, cross-region, expert]
+
+state: {}
+
+trigger: |
+  Your organization runs 5 Kubernetes clusters across 3 regions for a
+  global e-commerce platform:
+
+  | Cluster     | Region     | Purpose        | Nodes | Workloads |
+  |-------------|------------|----------------|-------|-----------|
+  | prod-us-1   | us-east-1  | Primary US     | 50    | 200+      |
+  | prod-us-2   | us-west-2  | Secondary US   | 30    | 150+      |
+  | prod-eu-1   | eu-west-1  | EU primary     | 40    | 180+      |
+  | prod-apac-1 | ap-south-1 | APAC primary   | 25    | 100+      |
+  | staging     | us-east-1  | Pre-production | 15    | 200+      |
+
+  Current challenges:
+  1. Cross-cluster service discovery — services in prod-us-1 need to
+     call services in prod-eu-1 for GDPR compliance (EU user data stays
+     in EU). Currently using hardcoded LoadBalancer IPs.
+
+  2. Monitoring fragmentation — each cluster has its own Prometheus,
+     but no unified view. Alert fatigue from 5 separate Alertmanagers.
+
+  3. Disaster recovery — when prod-us-1 had a control plane issue last
+     month, failover to prod-us-2 took 45 minutes (manual DNS switch,
+     database failover, config updates). Target RTO is 5 minutes.
+
+  4. Configuration drift — clusters have diverged. Same app deployed
+     with different resource limits, different RBAC policies, different
+     network policies across clusters.
+
+  5. Upgrade coordination — upgrading all clusters takes weeks of
+     sequential work, and version skew between clusters causes API
+     compatibility issues.
+
+  Task: Design a multi-cluster troubleshooting and operations strategy.
+  Write: cross-cluster service discovery approaches (Istio multi-cluster,
+  Submariner, DNS-based), federated monitoring architecture (Thanos/
+  Cortex for global Prometheus view), automated failover design, GitOps
+  for multi-cluster consistency, and the operational runbook for
+  cross-cluster incidents.
+
+assertions:
+  - type: llm_judge
+    criteria: "Cross-cluster connectivity is addressed — options: Istio multi-cluster mesh (automatic cross-cluster service discovery via shared control plane or remote clusters), Submariner (L3 connectivity between clusters), DNS-based with external-dns + global load balancer, or service export/import API. Trade-offs between complexity, latency, and feature richness. Istio provides mTLS across clusters but adds operational overhead"
+    weight: 0.35
+    description: "Cross-cluster connectivity"
+  - type: llm_judge
+    criteria: "Federated monitoring is designed — use Thanos or Cortex for global Prometheus query layer across all clusters. Each cluster runs local Prometheus, Thanos sidecar uploads to object storage, Thanos Query aggregates across clusters. Unified Alertmanager with deduplication to prevent alert fatigue. Grafana dashboards with cluster selector for unified view"
+    weight: 0.35
+    description: "Federated monitoring"
+  - type: llm_judge
+    criteria: "Failover and consistency are practical — automated failover: global load balancer (Route53/CloudFlare) with health checks, pre-provisioned standby capacity, database replication with automatic promotion. GitOps: single Git repo with cluster-specific overlays (Kustomize/Helm), ArgoCD ApplicationSets for multi-cluster sync, policy-as-code (OPA/Kyverno) for consistency enforcement. Upgrade strategy: rolling cluster upgrades starting with staging"
+    weight: 0.30
+    description: "Failover and consistency"
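The failover-and-consistency criteria mention ArgoCD ApplicationSets for multi-cluster sync. A minimal sketch of that pattern, assuming ArgoCD's cluster generator and per-cluster Kustomize overlays (the repository URL, application name, and overlay layout are placeholders):

```yaml
# Illustrative ArgoCD ApplicationSet: one Application per registered cluster,
# each rendered from a cluster-specific overlay. Repo URL and paths are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: checkout
  namespace: argocd
spec:
  generators:
    - clusters: {}                  # fan out to every cluster ArgoCD knows about
  template:
    metadata:
      name: 'checkout-{{name}}'     # e.g. checkout-prod-eu-1
    spec:
      project: default
      source:
        repoURL: https://example.com/platform/manifests.git
        targetRevision: main
        path: 'apps/checkout/overlays/{{name}}'
      destination:
        server: '{{server}}'
        namespace: checkout
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Because every registered cluster gets its Application rendered from the same template, this is one way the five clusters in the scenario could be kept from drifting.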
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-tenancy-design.yaml
ADDED
@@ -0,0 +1,55 @@
+meta:
+  id: multi-tenancy-design
+  level: 4
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Design Kubernetes multi-tenancy strategy — namespace isolation, RBAC, NetworkPolicy, ResourceQuota, and tenant troubleshooting workflows"
+  tags: [Kubernetes, multi-tenancy, isolation, RBAC, namespace, security, expert]
+
+state: {}
+
+trigger: |
+  Your company is consolidating from 12 separate Kubernetes clusters
+  (one per team) to 3 shared clusters. Each team currently has full
+  cluster-admin access. Leadership wants to reduce infrastructure costs
+  by 40% through better utilization.
+
+  Teams and their requirements:
+  | Team       | Services | CPU/Mem  | Compliance | Access Level  |
+  |------------|----------|----------|------------|---------------|
+  | Platform   | 15       | 32c/64G  | SOC2       | Cluster admin |
+  | Payments   | 8        | 16c/32G  | PCI-DSS    | Namespace     |
+  | ML/Data    | 6        | 64c/256G | None       | Namespace+GPU |
+  | Frontend   | 12       | 8c/16G   | None       | Namespace     |
+  | Backend    | 20       | 24c/48G  | None       | Namespace     |
+  | Mobile API | 10       | 16c/32G  | None       | Namespace     |
+
+  Concerns raised by teams:
+  1. Payments team: "PCI-DSS requires network isolation. How do we
+     ensure no other team's pods can reach our services?"
+  2. ML team: "We need GPU nodes. Other teams shouldn't schedule on them."
+  3. All teams: "We need to deploy without waiting for a platform team
+     ticket. We currently have cluster-admin."
+  4. Platform team: "How do we prevent one team from consuming all
+     cluster resources during a traffic spike?"
+  5. Security: "How do we audit who did what across all tenants?"
+
+  Task: Design a multi-tenancy strategy addressing all concerns. Write:
+  namespace isolation architecture (RBAC, NetworkPolicy, ResourceQuota
+  per team), PCI-DSS compliance approach for payments namespace, GPU
+  node isolation with taints/tolerations, self-service deployment model
+  (GitOps per team), resource fairness and priority, and audit logging.
+
+assertions:
+  - type: llm_judge
+    criteria: "Namespace isolation is comprehensive — each team gets dedicated namespace(s) with: RBAC roles scoped to their namespace (not cluster-admin), NetworkPolicy default-deny with allow rules only for intra-team and approved cross-team traffic, ResourceQuotas matching their allocation, LimitRanges for per-pod defaults. Platform team retains cluster-level access for infrastructure management"
+    weight: 0.35
+    description: "Namespace isolation"
+  - type: llm_judge
+    criteria: "Specific requirements are addressed — PCI-DSS: dedicated node pool with taints, strict NetworkPolicy (no ingress from non-payment namespaces), encrypted secrets, audit logging on all API access. GPU isolation: taints on GPU nodes (dedicated=gpu:NoSchedule), only ML namespace pods have tolerations. Self-service: ArgoCD AppProjects per team (can deploy to their namespace only). Priority classes for critical workloads"
+    weight: 0.35
+    description: "Specific requirements"
+  - type: llm_judge
+    criteria: "Operational model is practical — GitOps per team: each team owns their manifests in Git, ArgoCD syncs to their namespace. Resource fairness: ResourceQuotas prevent resource hogging, PriorityClasses ensure critical workloads (payments) aren't preempted. Audit: enable Kubernetes audit logging, ship to central SIEM, configure audit policy to log all write operations. Consider vCluster for teams needing CRDs or more isolation"
+    weight: 0.30
+    description: "Operational model"
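The namespace-isolation criteria boil down to stock Kubernetes objects applied per tenant. A minimal sketch using the Payments team's allocation from the table above, assuming a `payments` namespace (the quota values mirror the 16c/32G row; the object names are illustrative):

```yaml
# Illustrative per-tenant baseline for the payments namespace:
# default-deny ingress plus a quota matching the team's 16c/32G allocation.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: payments
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes: ["Ingress"] # no ingress until an explicit allow rule exists
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payments-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 32Gi
    limits.cpu: "16"
    limits.memory: 32Gi
```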
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/platform-engineering.yaml
ADDED
@@ -0,0 +1,59 @@
+meta:
+  id: platform-engineering
+  level: 4
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Design an internal developer platform on Kubernetes — self-service deployment, golden paths, and developer experience optimization"
+  tags: [Kubernetes, platform-engineering, developer-experience, IDP, self-service, expert]
+
+state: {}
+
+trigger: |
+  Your VP of Engineering asks: "Developers spend 40% of their time on
+  Kubernetes YAML, debugging deployments, and waiting for platform
+  tickets. How do we fix this?"
+
+  Current developer experience survey results:
+  - "Writing Kubernetes manifests is my least favorite part of the job"
+  - "I don't understand why my deployment failed, the error messages are cryptic"
+  - "I need to wait 3 days for a platform team ticket to create a new namespace"
+  - "Every team has different deployment patterns, no consistency"
+  - "I can't run my service locally in a way that mimics production"
+
+  Current state:
+  - 50 developers, 12 teams
+  - Each team writes raw Kubernetes YAML
+  - No standardized CI/CD pipeline
+  - Manual namespace provisioning (platform team ticket)
+  - No local development environment that mirrors Kubernetes
+  - Average time from code merge to production: 4 hours
+  - 30% of deployments require rollback
+
+  Target state:
+  - Developers focus on code, not Kubernetes primitives
+  - Self-service for common operations (new namespace, new service)
+  - Standardized deployment pipeline with guardrails
+  - Local development parity with production
+  - Merge to production in 15 minutes
+  - Rollback rate < 5%
+
+  Task: Design an internal developer platform (IDP) on Kubernetes.
+  Write: the abstraction layers (what developers see vs what Kubernetes
+  sees), golden paths for common patterns (web service, worker, cron job),
+  self-service capabilities (Backstage/Port catalog), CI/CD standardization,
+  local development approach (Tilt/Skaffold/DevSpace), and how to measure
+  developer productivity improvements.
+
+assertions:
+  - type: llm_judge
+    criteria: "Abstraction layers are defined — developers interact with simplified service definitions (not raw YAML), platform team maintains templates/operators that translate to Kubernetes resources. Options: Backstage service catalog with templates, Crossplane compositions, custom CRDs with operators, or Helm chart library. Developer writes: service name, image, port, resources, env vars. Platform generates: Deployment, Service, Ingress, HPA, PDB, NetworkPolicy, ServiceMonitor"
+    weight: 0.35
+    description: "Abstraction layers"
+  - type: llm_judge
+    criteria: "Golden paths and CI/CD are practical — standardized templates for web service (Deployment + HPA + Ingress), worker (Deployment without Ingress), cron job (CronJob with monitoring), stateful service (StatefulSet + PVC). CI/CD pipeline: build → test → security scan → deploy to staging → automated tests → deploy to production. Progressive delivery with canary deployments. Rollback automation on error rate increase"
+    weight: 0.35
+    description: "Golden paths and CI/CD"
+  - type: llm_judge
+    criteria: "Local development and metrics are covered — local development: Tilt (live reload Kubernetes dev), DevSpace, or Telepresence (route cluster traffic to local machine). Developer productivity metrics: DORA metrics (deployment frequency, lead time, change failure rate, MTTR), developer satisfaction surveys, time-to-first-deployment for new services, percentage of deployments using golden paths"
+    weight: 0.30
+    description: "Local dev and metrics"
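The abstraction-layer criteria describe developers supplying a handful of fields while the platform expands them into full manifests. A hypothetical example of such a simplified definition, for instance a values file consumed by a shared platform Helm chart (every field name here is invented for illustration):

```yaml
# Hypothetical developer-facing service definition (field names invented for illustration).
# A shared platform chart or operator would expand this into full Kubernetes manifests.
service:
  name: checkout-api
  image: registry.example.com/checkout-api:1.4.2
  port: 8080
  replicas: 3
  resources:
    requests: { cpu: 250m, memory: 256Mi }
    limits: { cpu: "1", memory: 512Mi }
  env:
    LOG_LEVEL: info
  ingress:
    host: checkout.example.com
```

A platform chart or operator would expand a file like this into the Deployment, Service, Ingress, HPA, PDB, NetworkPolicy, and ServiceMonitor listed in the criteria.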
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/security-hardening.yaml
ADDED
@@ -0,0 +1,58 @@
+meta:
+  id: security-hardening
+  level: 4
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Design Kubernetes security hardening strategy — Pod Security Standards, supply chain security, runtime protection, and compliance"
+  tags: [Kubernetes, security, Pod-Security-Standards, supply-chain, compliance, expert]
+
+state: {}
+
+trigger: |
+  A security audit of your Kubernetes platform revealed critical findings:
+
+  CRITICAL findings:
+  1. 65% of pods run as root
+  2. 40% of containers have privileged=true
+  3. No pod security admission is enforced
+  4. Container images pulled from public registries without verification
+  5. Secrets stored as base64 in etcd (not encrypted at rest)
+
+  HIGH findings:
+  6. No network segmentation between namespaces
+  7. ServiceAccount tokens auto-mounted in all pods
+  8. No image vulnerability scanning in CI/CD pipeline
+  9. Kubernetes API server accessible from all pods
+  10. No audit logging enabled
+
+  MEDIUM findings:
+  11. No resource limits on 30% of pods (DoS risk)
+  12. ConfigMaps contain credentials that should be in Secrets
+  13. Helm charts pulled from untrusted repositories
+  14. No RBAC review process (stale permissions accumulate)
+  15. Kubernetes dashboard exposed without authentication
+
+  The security team requires all CRITICAL and HIGH findings remediated
+  within 90 days for SOC2 compliance. You can't just enforce everything
+  at once — that would break 200+ running services.
+
+  Task: Design a phased security hardening plan. Write: the remediation
+  priority and timeline, Pod Security Standards rollout strategy (audit
+  → warn → enforce), supply chain security (image signing, scanning,
+  admission control), secrets management (external vault, encryption at
+  rest), network segmentation approach, and how to balance security
+  with developer velocity.
+
+assertions:
+  - type: llm_judge
+    criteria: "Phased rollout is realistic — Phase 1 (weeks 1-4): enable audit logging, encrypt secrets at rest, disable auto-mounting SA tokens, scan existing images. Phase 2 (weeks 5-8): Pod Security Standards in audit mode, add NetworkPolicies in monitoring mode, image signing. Phase 3 (weeks 9-12): enforce Pod Security Standards (warn → enforce), block unsigned images, RBAC review. The progression audit → warn → enforce prevents breaking running services"
+    weight: 0.35
+    description: "Phased rollout"
+  - type: llm_judge
+    criteria: "Supply chain security is addressed — image scanning in CI/CD (Trivy/Grype), admission control (Kyverno/OPA Gatekeeper) to reject vulnerable or unsigned images, private registry as proxy cache for public images, image signing with Cosign/Notation, SBOM generation, base image standardization. Supply chain attacks are a top Kubernetes risk — images are the primary attack vector"
+    weight: 0.35
+    description: "Supply chain security"
+  - type: llm_judge
+    criteria: "Operational security balance is practical — developer velocity: provide pre-hardened base images (non-root, minimal), self-service exception process for privileged containers (approval workflow), automated RBAC provisioning tied to team membership, policy-as-code in Git (Kyverno/OPA policies reviewed via PR). Monitoring: runtime security (Falco) for detecting anomalous behavior, regular penetration testing"
+    weight: 0.30
+    description: "Security-velocity balance"
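The audit → warn → enforce progression in the phased-rollout criteria maps directly to Pod Security Admission namespace labels. A minimal sketch of a namespace mid-rollout, enforcing the permissive baseline profile while only auditing and warning on restricted (the namespace name is illustrative):

```yaml
# Illustrative namespace mid-rollout: enforce stays on "baseline" while
# "restricted" violations are surfaced via audit log entries and client warnings.
apiVersion: v1
kind: Namespace
metadata:
  name: backend
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```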
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/behavioral-science.yaml
ADDED
@@ -0,0 +1,62 @@
+meta:
+  id: behavioral-science
+  level: 5
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Behavioral science of Kubernetes troubleshooting — cognitive biases in incident response, team dynamics under pressure, and building a learning organization"
+  tags: [Kubernetes, behavioral-science, cognitive-bias, incident-response, learning-organization, master]
+
+state: {}
+
+trigger: |
+  Your post-incident reviews reveal recurring human factors that
+  contribute to incident severity and duration:
+
+  Pattern 1 — Anchoring Bias:
+  During a cascading failure, the on-call engineer focused on a
+  CrashLoopBackOff pod (the most visible symptom) for 45 minutes,
+  ignoring the root cause: a network partition isolating the database.
+  They anchored on the first symptom they saw.
+
+  Pattern 2 — Escalation Commitment:
+  An engineer spent 2 hours manually patching pods during a cluster
+  upgrade failure instead of rolling back. They had already invested
+  time and didn't want to "lose progress." The rollback would have
+  taken 5 minutes.
+
+  Pattern 3 — Diffusion of Responsibility:
+  With 8 people in the incident channel, nobody took ownership.
+  Everyone assumed someone else was investigating. The MTTR was 3x
+  longer than incidents with a clear owner.
+
+  Pattern 4 — Confirmation Bias:
+  The team suspected a recent deployment caused an outage. They spent
+  an hour analyzing the deployment diff. The actual cause was a node
+  hardware failure unrelated to any deployment. They only looked for
+  evidence that confirmed their initial hypothesis.
+
+  Pattern 5 — Normalcy Bias:
+  Warning alerts fired for disk pressure for 3 days before the node
+  went down. The team ignored them because "it always warns and never
+  actually fails." Until it did.
+
+  Task: Analyze the behavioral science of Kubernetes troubleshooting.
+  Write: the cognitive biases most common in incident response, how
+  each manifests in Kubernetes contexts, countermeasures for each bias
+  (checklists, structured investigation, time-boxing), how to build a
+  blameless learning culture, and how incident command structure
+  mitigates human factors.
+
+assertions:
+  - type: llm_judge
+    criteria: "Cognitive biases are identified with K8s context — anchoring (focus on first symptom not root cause), escalation commitment (continuing failing approach instead of rollback), diffusion of responsibility (no clear owner in group incidents), confirmation bias (seeking evidence for initial hypothesis only), normalcy bias (ignoring recurring warnings). Each pattern maps to specific Kubernetes troubleshooting scenarios"
+    weight: 0.35
+    description: "Biases identified"
+  - type: llm_judge
+    criteria: "Countermeasures are practical and specific — anchoring: use structured investigation checklist (always check events, logs, nodes, network before deep-diving), time-box symptom investigation (15 min max before stepping back). Escalation commitment: establish rollback thresholds upfront ('if not fixed in 20 min, rollback'). Diffusion: incident commander assigns explicit owners. Confirmation bias: devil's advocate role, check alternative hypotheses. Normalcy: auto-escalate warnings that persist > threshold"
+    weight: 0.35
+    description: "Countermeasures"
+  - type: llm_judge
+    criteria: "Learning organization culture is addressed — blameless postmortems (focus on systemic factors, not individual blame), share incident learnings broadly (incident review meetings), celebrate good catches (not just response), invest in simulation and game days (practice under controlled stress), create psychological safety so engineers escalate early rather than hiding problems. Incident command: clear roles reduce cognitive load during high-stress situations"
+    weight: 0.30
+    description: "Learning culture"
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/board-strategy.yaml
ADDED
@@ -0,0 +1,61 @@
+meta:
+  id: board-strategy
+  level: 5
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Board-level Kubernetes strategy — present cloud-native infrastructure as competitive advantage to board of directors"
+  tags: [Kubernetes, board-strategy, cloud-native, competitive-advantage, leadership, master]
+
+state: {}
+
+trigger: |
+  You're the CTO preparing a board presentation on the company's
+  cloud-native infrastructure strategy. The board includes a mix of
+  technical (2 former CTOs) and non-technical (3 finance/operations)
+  members.
+
+  Context:
+  - Company: B2B SaaS, $500M ARR, 2000 employees
+  - Current platform: Kubernetes across 3 cloud providers
+  - Annual cloud spend: $18M (3.6% of revenue)
+  - Platform team: 25 engineers
+  - Supporting 200+ microservices, 150 engineers
+
+  Board questions from pre-read:
+  1. "Cloud spend grew 40% last year but revenue grew 25%. When does
+     the infrastructure investment start paying dividends?"
+
+  2. "A competitor claims they run on serverless and their infrastructure
+     team is 5 people. Are we over-engineering?"
+
+  3. "If we acquired a company running on Azure, how hard is it to
+     integrate their services with our platform?"
+
+  4. "What's our risk exposure if AWS has a major outage? We had 2 hours
+     of downtime last quarter from a cloud provider issue."
+
+  5. "The engineering team is asking for $5M to build an 'internal
+     developer platform.' What does that even mean and why should we
+     fund it?"
+
+  Task: Prepare the board presentation. Write: the infrastructure as
+  competitive advantage narrative (speed-to-market, reliability, multi-
+  cloud flexibility), the cloud spend efficiency analysis (cost per
+  transaction, cost as % of revenue trends), serverless vs Kubernetes
+  trade-off for the board, M&A integration capabilities, the $5M IDP
+  investment business case, and a risk framework for infrastructure
+  decisions.
+
+assertions:
+  - type: llm_judge
+    criteria: "Infrastructure as competitive advantage is framed for board — cloud-native platform enables: 50x more frequent deployments (speed-to-market), 99.95% availability (customer trust), multi-cloud capability (negotiation leverage, M&A readiness). Cloud spend at 3.6% of revenue is below SaaS median (5-8%). The 40% cost growth funded a 3x increase in deployment frequency — cost per deployment actually decreased 60%"
+    weight: 0.35
+    description: "Competitive advantage"
+  - type: llm_judge
+    criteria: "Board questions are answered directly — serverless competitor: they likely have hidden costs (Lambda pricing at scale, vendor lock-in premium, limited workload types). Their 5-person team works because they outsource to the cloud provider — we trade cost for control and flexibility. M&A integration: Kubernetes standardizes deployment regardless of cloud — integration timeline weeks vs months for non-Kubernetes. AWS outage: multi-cloud strategy limits blast radius, DR investment reduces RTO"
+    weight: 0.35
+    description: "Board answers"
+  - type: llm_judge
+    criteria: "IDP investment case is compelling — $5M IDP investment yields: 35% developer productivity gain (equivalent to hiring 50 engineers at $150K = $7.5M value), reduce change failure rate from 12% to 5% (fewer customer-impacting incidents), enable self-service reducing platform team ticket backlog 80%. ROI: 150%+ in year 1. Frame as 'investing in engineering leverage' not 'building infrastructure.' Show payback period and comparison to hiring equivalent headcount"
+    weight: 0.30
+    description: "IDP business case"
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/cloud-native-future.yaml
ADDED
@@ -0,0 +1,65 @@
+meta:
+  id: cloud-native-future
+  level: 5
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Future of Kubernetes and cloud-native — WebAssembly, AI/ML workloads, edge computing, eBPF, and the evolution of container orchestration"
+  tags: [Kubernetes, future, WebAssembly, AI-ML, edge, eBPF, cloud-native, master]
+
+state: {}
+
+trigger: |
+  Your VP of Engineering asks you to present a technology radar for the
+  next 3 years: "Where is Kubernetes headed, and how should we position
+  our platform for the future?"
+
+  Current technology landscape (2026):
+
+  Emerging trends:
+  1. WebAssembly (Wasm) on Kubernetes — SpinKube and containerd-wasm
+     allow running Wasm workloads alongside containers. Cold start: <1ms
+     vs 1-5s for containers. Memory: 1-10MB vs 50-500MB per container.
+     Use case: edge computing, serverless functions, plugin systems.
+
+  2. AI/ML workload orchestration — GPU scheduling, model serving
+     (KServe, vLLM), training pipelines (Kubeflow). GPU sharing and
+     fractional GPU allocation. Multi-tenant GPU clusters.
+
+  3. eBPF for networking and observability — Cilium replacing kube-proxy
+     and iptables. Kernel-level observability without sidecars. Tetragon
+     for runtime security. 40% less network overhead than traditional CNI.
+
+  4. Edge Kubernetes — K3s, KubeEdge for running Kubernetes at the edge
+     (retail stores, IoT gateways, vehicles). Challenges: intermittent
+     connectivity, limited resources, fleet management.
+
+  5. Platform engineering maturity — Backstage, Kratix, Crossplane
+     abstracting Kubernetes for developers. The trend toward "Kubernetes
+     disappears" — developers never see YAML.
+
+  6. Sustainable computing — carbon-aware scheduling, right-sizing for
+     energy efficiency, cloud carbon footprint tracking.
+
+  Your organization's exposure:
+  - Currently: no Wasm, no AI/ML on K8s, basic CNI, no edge deployments
+  - Planned: AI feature development, potential retail edge deployment
+  - Risk: falling behind competitors on AI inference speed
+
+  Task: Write the technology radar and positioning strategy. Include:
+  assessment of each trend (adopt/trial/assess/hold), impact on current
+  platform architecture, investment recommendations, skills development
+  plan, migration paths from current state, and risk of not adopting.
+
+assertions:
+  - type: llm_judge
+    criteria: "Technology assessment is balanced — Adopt: eBPF/Cilium (proven, measurable benefits, drop-in replacement). Trial: AI/ML workloads on K8s (growing need, evaluate KServe/vLLM). Assess: WebAssembly (promising but ecosystem still maturing), edge Kubernetes (depends on business use case). Hold: nothing to abandon, but evaluate sidecar-based service mesh vs eBPF alternatives. Each assessment includes reasoning and timeline"
+    weight: 0.35
+    description: "Technology assessment"
+  - type: llm_judge
+    criteria: "Platform impact analysis is specific — eBPF: replace kube-proxy with Cilium, remove service mesh sidecars (reduce resource overhead 30-40%), gain kernel-level observability. AI/ML: add GPU node pools, implement fractional GPU sharing, deploy model serving infrastructure (KServe), integrate with training pipelines. Wasm: start with edge/serverless experiments, evaluate for latency-sensitive workloads. Each has clear migration path"
+    weight: 0.35
+    description: "Platform impact"
+  - type: llm_judge
+    criteria: "Investment and skills recommendations are actionable — prioritize eBPF (immediate ROI, Q1), then AI/ML infrastructure (business demand, Q2-Q3), then evaluate Wasm for edge (Q4). Skills: send 2 engineers for Cilium certification, hire ML platform engineer, establish innovation lab for Wasm experimentation. Risk of not adopting: competitors with faster AI inference will win deals, edge deployment delays lose retail opportunity"
+    weight: 0.30
+    description: "Investment recommendations"
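The AI/ML trend above (GPU node pools, multi-tenant GPU clusters) ultimately shows up as ordinary scheduling constraints on workloads. A minimal sketch of a pod pinned to a tainted GPU pool and requesting a whole GPU through the NVIDIA device plugin (the taint key, pool label, and image are assumptions, not part of the course content):

```yaml
# Illustrative pod pinned to a dedicated GPU pool; taint key, label, and image are assumed.
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  nodeSelector:
    node-pool: gpu                  # assumed label on the GPU node pool
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule            # matches a dedicated=gpu:NoSchedule taint
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest
      resources:
        limits:
          nvidia.com/gpu: 1         # whole-GPU request via the NVIDIA device plugin
```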
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/comprehensive-platform.yaml
ADDED
@@ -0,0 +1,57 @@
+meta:
+  id: comprehensive-platform
+  level: 5
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Design a comprehensive Kubernetes platform — architect an enterprise-grade platform from scratch covering all operational concerns"
+  tags: [Kubernetes, platform-architecture, enterprise, comprehensive, design, master]
+
+state: {}
+
+trigger: |
+  You're hired as the founding platform architect for a rapidly growing
+  startup that's about to hit $100M ARR. They have 100 engineers, 40
+  microservices running on Heroku, and are hitting Heroku's limits
+  (cost, performance, flexibility).
+
+  The CEO says: "We're migrating to Kubernetes. Design the platform
+  that will carry us to $1B ARR."
+
+  Requirements gathered from stakeholders:
+  - CTO: "Multi-region, 99.99% availability, sub-100ms latency globally"
+  - VP Engineering: "Developers should deploy in < 10 minutes with zero
+    Kubernetes knowledge"
+  - CISO: "SOC2, GDPR, PCI-DSS compliance. Zero-trust networking."
+  - CFO: "Cloud costs should be < 5% of revenue at scale"
+  - Head of Data: "Support ML training workloads alongside production"
+  - VP Product: "A/B testing infrastructure, feature flags, canary
+    deployments as first-class features"
+
+  Constraints:
+  - 6-month timeline to migrate first 10 services
+  - 18 months to complete full migration
+  - Platform team budget: 8 engineers (hiring from 0)
+  - Prefer AWS but must be portable enough for multi-cloud if needed
+  - Must support Python, Go, Java, Node.js, and Rust services
+
+  Task: Design the comprehensive platform architecture. Write: the
+  infrastructure architecture (multi-region EKS, networking, storage),
+  developer experience layer (IDP, CI/CD, local dev), operational
+  excellence (monitoring, alerting, incident response), security and
+  compliance layer, data platform integration (ML workloads, GPU
+  scheduling), the migration strategy from Heroku, hiring plan for
+  the platform team, and the 18-month execution roadmap.
+
+assertions:
+  - type: llm_judge
+    criteria: "Infrastructure architecture is production-grade — multi-region EKS (us-east-1, eu-west-1, ap-southeast-1) with global load balancing. Networking: VPC per region with transit gateway, Cilium CNI with eBPF, Istio or Cilium service mesh. Storage: EBS for databases, EFS for shared storage, S3 for objects. Cluster design: separate clusters for production, staging, data/ML workloads. Karpenter for node autoscaling, spot instances for non-critical workloads"
+    weight: 0.35
+    description: "Infrastructure architecture"
+  - type: llm_judge
+    criteria: "Developer experience and operational layers are comprehensive — IDP: Backstage catalog with service templates (golden paths), ArgoCD for GitOps, automated CI/CD (build → test → security scan → canary deploy → promote). Local dev: Tilt or DevSpace. Monitoring: Prometheus + Thanos (cross-cluster), Grafana dashboards, Loki for logs, Tempo for traces. Alerting: Alertmanager with PagerDuty + Slack, SLO-based alerting. Incident response: documented runbooks, game day program"
+    weight: 0.35
+    description: "DevEx and operations"
+  - type: llm_judge
+    criteria: "Migration and hiring are realistic — Heroku migration: parallel-run strategy (run on both Heroku and K8s, shift traffic gradually). Start with stateless services, migrate stateful last. Hiring plan: hire platform lead month 1, then 2 engineers/month. Skills needed: Kubernetes, Terraform, Go, SRE. 18-month roadmap: months 1-3 (foundation: clusters, CI/CD, first service), months 4-9 (migrate 50% services, build IDP), months 10-15 (complete migration, ML platform), months 16-18 (optimization, multi-region active-active)"
+    weight: 0.30
+    description: "Migration and hiring"
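
Note: the infrastructure criteria above pair Karpenter node autoscaling with spot capacity for non-critical workloads. A minimal sketch of what that could look like, assuming the Karpenter v1 API and an existing EC2NodeClass named "default" (both assumptions; all values are placeholders to tune):

    # Hedged sketch: a Karpenter NodePool that provisions spot capacity for non-critical workloads.
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: batch-spot
    spec:
      template:
        spec:
          nodeClassRef:
            group: karpenter.k8s.aws
            kind: EC2NodeClass
            name: default                # assumption: an EC2NodeClass named "default" already exists
          requirements:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot"]
            - key: kubernetes.io/arch
              operator: In
              values: ["amd64"]
      limits:
        cpu: "256"                       # placeholder cap on total CPU this pool may provision
      disruption:
        consolidationPolicy: WhenEmptyOrUnderutilized

Latency-sensitive production services would stay on a separate on-demand pool; only interruptible workloads land here.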
package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/consulting-engagement.yaml
ADDED
@@ -0,0 +1,62 @@
+meta:
+  id: consulting-engagement
+  level: 5
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Kubernetes consulting engagement — assess a client's troubled Kubernetes migration, design remediation roadmap, and present recommendations to their leadership"
+  tags: [Kubernetes, consulting, migration, assessment, remediation, master]
+
+state: {}
+
+trigger: |
+  You're a Kubernetes platform consultant hired by a mid-size fintech
+  company (500 engineers, $200M revenue). Their Kubernetes migration
+  started 18 months ago and is "failing." The CTO says: "We spent $3M
+  on this migration and we're worse off than before."
+
+  Your assessment findings after 2 weeks:
+
+  Technical:
+  - 30% of services migrated to Kubernetes, 70% still on EC2
+  - The 30% on Kubernetes have MORE downtime than EC2 services
+  - 15 critical incidents in the last quarter (vs 3 for EC2 services)
+  - No centralized monitoring (each team runs their own Prometheus)
+  - No standardized deployment pipeline (each team deploys differently)
+  - RBAC is cluster-admin for everyone
+  - No network policies, no pod security enforcement
+  - 85% of pods have no resource limits
+  - Persistent storage on local SSDs (no replication, no backup)
+
+  Organizational:
+  - No platform team exists — "everyone owns Kubernetes"
+  - Each of 8 dev teams learned Kubernetes independently
+  - No internal documentation or runbooks
+  - Engineers spend ~30% of time on Kubernetes ops instead of features
+  - Leadership measured success by "% of services migrated", not outcomes
+  - The migration was mandated top-down without a training plan
+
+  Cultural:
+  - Engineers are frustrated and some are leaving
+  - "Kubernetes is too complex" sentiment is growing
+  - Shadow IT: some teams secretly running new services on EC2
+
+  Task: Write the consulting report and remediation roadmap. Include:
+  root cause analysis (why the migration is failing), the immediate
+  stabilization plan (first 30 days), the 90-day remediation roadmap,
+  the 12-month platform maturity plan, organizational changes needed
+  (form platform team, training program), success metrics, and the
+  executive presentation framework.
+
+assertions:
+  - type: llm_judge
+    criteria: "Root cause analysis is accurate — the migration failed because of organizational and process gaps, not Kubernetes itself. Key failures: no platform team to provide expertise, no standardization (each team reinvented), no training (mandated without enablement), wrong success metric (% migrated vs reliability), no shared infrastructure (monitoring, CI/CD, security). Technical debt compounded because nobody owned the platform"
+    weight: 0.35
+    description: "Root cause analysis"
+  - type: llm_judge
+    criteria: "Stabilization and remediation are phased — 30-day stabilization: fix the worst reliability issues (add resource limits, deploy centralized monitoring, create incident response process, pause migration of new services). 90-day remediation: form platform team (3-4 engineers), deploy standardized CI/CD, implement RBAC and network policies, centralize logging. 12-month maturity: complete migration with golden paths, implement DR, achieve SLO targets"
+    weight: 0.35
+    description: "Phased remediation"
+  - type: llm_judge
+    criteria: "Organizational and metrics recommendations are practical — form a dedicated platform team (hire or reassign), invest in training program (certifications, workshops, pair programming with platform team), redefine success metrics (availability, MTTR, developer satisfaction, deployment frequency — not % migrated), create internal documentation and runbooks, establish architecture review board. Executive framing: this is a recovery, not a restart — the $3M invested built valuable experience"
+    weight: 0.30
+    description: "Organizational recommendations"
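
Note: the 30-day stabilization step "add resource limits" can be enforced per namespace without editing every manifest. A minimal sketch, assuming one namespace per team (the namespace name and the request/limit values below are placeholders to tune per workload):

    # Hedged sketch: a LimitRange that injects default requests/limits into containers
    # that declare none, addressing the "85% of pods have no resource limits" finding.
    apiVersion: v1
    kind: LimitRange
    metadata:
      name: default-container-limits
      namespace: team-a                  # placeholder: apply one per team namespace
    spec:
      limits:
        - type: Container
          defaultRequest:                # used when a container sets no requests
            cpu: 100m
            memory: 128Mi
          default:                       # used when a container sets no limits
            cpu: 500m
            memory: 512Mi

Teams can still override these per service; the point is a safety net while the platform team rolls out proper right-sizing.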