dojo.md 0.2.0 → 0.2.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/courses/GENERATION_LOG.md +45 -0
- package/courses/aws-lambda-debugging/course.yaml +11 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/api-gateway-integration.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/cloudwatch-logs-basics.yaml +64 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/cold-start-basics.yaml +70 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/environment-variable-issues.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/first-debugging-shift.yaml +73 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/handler-import-errors.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/iam-permission-errors.yaml +68 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/invocation-errors.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/lambda-timeout-errors.yaml +65 -0
- package/courses/aws-lambda-debugging/scenarios/level-1/memory-and-oom.yaml +70 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/async-invocation-failures.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/cold-start-optimization.yaml +76 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/dynamodb-streams-debugging.yaml +70 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/intermediate-debugging-shift.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/lambda-concurrency-management.yaml +70 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/lambda-layers-debugging.yaml +76 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/sam-local-debugging.yaml +74 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/sqs-event-source.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/vpc-networking-issues.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-2/xray-tracing.yaml +62 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/advanced-debugging-shift.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/container-image-lambda.yaml +79 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/cross-account-invocation.yaml +72 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/eventbridge-patterns.yaml +79 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/iac-deployment-debugging.yaml +68 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/kinesis-stream-processing.yaml +64 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/lambda-at-edge.yaml +64 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/lambda-extensions-debugging.yaml +67 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/powertools-observability.yaml +79 -0
- package/courses/aws-lambda-debugging/scenarios/level-3/step-functions-debugging.yaml +80 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/cost-optimization-strategy.yaml +67 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/expert-debugging-shift.yaml +62 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/incident-management-serverless.yaml +61 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/multi-region-serverless.yaml +67 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/observability-platform-design.yaml +71 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-architecture-design.yaml +64 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-data-architecture.yaml +66 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-migration-strategy.yaml +65 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-security-design.yaml +60 -0
- package/courses/aws-lambda-debugging/scenarios/level-4/serverless-testing-strategy.yaml +62 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/board-serverless-strategy.yaml +63 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/consulting-serverless-adoption.yaml +57 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/industry-serverless-patterns.yaml +62 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/ma-serverless-integration.yaml +75 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/master-debugging-shift.yaml +61 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/organizational-serverless-transformation.yaml +65 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/regulatory-serverless.yaml +61 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/serverless-economics.yaml +65 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/serverless-future-technology.yaml +66 -0
- package/courses/aws-lambda-debugging/scenarios/level-5/serverless-platform-design.yaml +71 -0
- package/courses/docker-container-debugging/course.yaml +11 -0
- package/courses/docker-container-debugging/scenarios/level-1/container-exit-codes.yaml +59 -0
- package/courses/docker-container-debugging/scenarios/level-1/container-networking-basics.yaml +69 -0
- package/courses/docker-container-debugging/scenarios/level-1/docker-logs-debugging.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-1/dockerfile-build-failures.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-1/environment-variable-issues.yaml +74 -0
- package/courses/docker-container-debugging/scenarios/level-1/first-debugging-shift.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-1/image-pull-failures.yaml +68 -0
- package/courses/docker-container-debugging/scenarios/level-1/port-mapping-issues.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-1/resource-limits-oom.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-1/volume-mount-problems.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-2/container-health-checks.yaml +73 -0
- package/courses/docker-container-debugging/scenarios/level-2/docker-compose-debugging.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-2/docker-exec-debugging.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-2/image-layer-optimization.yaml +81 -0
- package/courses/docker-container-debugging/scenarios/level-2/intermediate-debugging-shift.yaml +73 -0
- package/courses/docker-container-debugging/scenarios/level-2/logging-and-log-rotation.yaml +76 -0
- package/courses/docker-container-debugging/scenarios/level-2/multi-stage-build-debugging.yaml +76 -0
- package/courses/docker-container-debugging/scenarios/level-2/network-debugging-tools.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-2/pid1-signal-handling.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-2/security-scanning-basics.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-3/advanced-debugging-shift.yaml +77 -0
- package/courses/docker-container-debugging/scenarios/level-3/buildkit-optimization.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-3/container-filesystem-debugging.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-3/container-security-hardening.yaml +74 -0
- package/courses/docker-container-debugging/scenarios/level-3/disk-space-management.yaml +74 -0
- package/courses/docker-container-debugging/scenarios/level-3/docker-api-automation.yaml +72 -0
- package/courses/docker-container-debugging/scenarios/level-3/docker-daemon-issues.yaml +73 -0
- package/courses/docker-container-debugging/scenarios/level-3/docker-in-docker-ci.yaml +69 -0
- package/courses/docker-container-debugging/scenarios/level-3/overlay-network-debugging.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-3/production-container-ops.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-4/cicd-pipeline-design.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-4/container-monitoring-observability.yaml +63 -0
- package/courses/docker-container-debugging/scenarios/level-4/container-orchestration-strategy.yaml +62 -0
- package/courses/docker-container-debugging/scenarios/level-4/container-performance-engineering.yaml +64 -0
- package/courses/docker-container-debugging/scenarios/level-4/container-security-architecture.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-4/enterprise-image-management.yaml +58 -0
- package/courses/docker-container-debugging/scenarios/level-4/expert-debugging-shift.yaml +63 -0
- package/courses/docker-container-debugging/scenarios/level-4/incident-response-containers.yaml +70 -0
- package/courses/docker-container-debugging/scenarios/level-4/multi-environment-management.yaml +65 -0
- package/courses/docker-container-debugging/scenarios/level-4/stateful-service-containers.yaml +65 -0
- package/courses/docker-container-debugging/scenarios/level-5/board-infrastructure-strategy.yaml +58 -0
- package/courses/docker-container-debugging/scenarios/level-5/consulting-container-strategy.yaml +61 -0
- package/courses/docker-container-debugging/scenarios/level-5/container-platform-architecture.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-5/container-platform-economics.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-5/container-technology-evolution.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-5/disaster-recovery-containers.yaml +66 -0
- package/courses/docker-container-debugging/scenarios/level-5/industry-container-patterns.yaml +71 -0
- package/courses/docker-container-debugging/scenarios/level-5/master-debugging-shift.yaml +62 -0
- package/courses/docker-container-debugging/scenarios/level-5/organizational-transformation.yaml +67 -0
- package/courses/docker-container-debugging/scenarios/level-5/regulatory-compliance-containers.yaml +61 -0
- package/courses/kubernetes-deployment-troubleshooting/course.yaml +12 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/configmap-secret-issues.yaml +69 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/crashloopbackoff.yaml +68 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/deployment-rollout.yaml +56 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/first-troubleshooting-shift.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/health-probe-failures.yaml +70 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/imagepullbackoff.yaml +57 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/kubectl-debugging-basics.yaml +56 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/oomkilled.yaml +70 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/pending-pods.yaml +68 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-1/service-not-reachable.yaml +66 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/dns-resolution-failures.yaml +63 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/helm-deployment-failures.yaml +63 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/hpa-scaling-issues.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/ingress-routing-issues.yaml +63 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/init-container-failures.yaml +63 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/intermediate-troubleshooting-shift.yaml +66 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/network-policy-blocking.yaml +67 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/persistent-volume-issues.yaml +69 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/rbac-permission-denied.yaml +57 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-2/resource-quota-limits.yaml +64 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/advanced-troubleshooting-shift.yaml +69 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/cluster-upgrade-failures.yaml +71 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/gitops-drift-detection.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/job-cronjob-failures.yaml +67 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/monitoring-alerting-gaps.yaml +64 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/multi-container-debugging.yaml +68 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/node-pressure-evictions.yaml +70 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/pod-disruption-budgets.yaml +59 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/service-mesh-debugging.yaml +64 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-3/statefulset-troubleshooting.yaml +69 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/capacity-planning.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/cost-optimization.yaml +57 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/disaster-recovery-design.yaml +56 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/executive-communication.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/expert-troubleshooting-shift.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/incident-management-process.yaml +59 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-cluster-operations.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/multi-tenancy-design.yaml +55 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/platform-engineering.yaml +59 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-4/security-hardening.yaml +58 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/behavioral-science.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/board-strategy.yaml +61 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/cloud-native-future.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/comprehensive-platform.yaml +57 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/consulting-engagement.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/industry-benchmarks.yaml +58 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/ma-integration.yaml +62 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/master-troubleshooting-shift.yaml +73 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/product-development.yaml +65 -0
- package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/regulatory-compliance.yaml +76 -0
- package/courses/mysql-query-optimization/course.yaml +11 -0
- package/courses/mysql-query-optimization/scenarios/level-1/buffer-pool-basics.yaml +65 -0
- package/courses/mysql-query-optimization/scenarios/level-1/explain-basics.yaml +66 -0
- package/courses/mysql-query-optimization/scenarios/level-1/first-optimization-shift.yaml +78 -0
- package/courses/mysql-query-optimization/scenarios/level-1/innodb-index-fundamentals.yaml +68 -0
- package/courses/mysql-query-optimization/scenarios/level-1/join-basics.yaml +66 -0
- package/courses/mysql-query-optimization/scenarios/level-1/n-plus-one-queries.yaml +67 -0
- package/courses/mysql-query-optimization/scenarios/level-1/query-rewriting-basics.yaml +66 -0
- package/courses/mysql-query-optimization/scenarios/level-1/select-star-problems.yaml +68 -0
- package/courses/mysql-query-optimization/scenarios/level-1/slow-query-diagnosis.yaml +65 -0
- package/courses/mysql-query-optimization/scenarios/level-1/where-clause-optimization.yaml +65 -0
- package/courses/mysql-query-optimization/scenarios/level-2/buffer-pool-tuning.yaml +64 -0
- package/courses/mysql-query-optimization/scenarios/level-2/composite-index-design.yaml +71 -0
- package/courses/mysql-query-optimization/scenarios/level-2/covering-and-invisible-indexes.yaml +69 -0
- package/courses/mysql-query-optimization/scenarios/level-2/cte-and-window-functions.yaml +78 -0
- package/courses/mysql-query-optimization/scenarios/level-2/intermediate-optimization-shift.yaml +68 -0
- package/courses/mysql-query-optimization/scenarios/level-2/join-optimization.yaml +67 -0
- package/courses/mysql-query-optimization/scenarios/level-2/performance-schema-analysis.yaml +69 -0
- package/courses/mysql-query-optimization/scenarios/level-2/query-optimizer-hints.yaml +74 -0
- package/courses/mysql-query-optimization/scenarios/level-2/subquery-optimization.yaml +70 -0
- package/courses/mysql-query-optimization/scenarios/level-2/write-optimization.yaml +63 -0
- package/courses/mysql-query-optimization/scenarios/level-3/advanced-optimization-shift.yaml +71 -0
- package/courses/mysql-query-optimization/scenarios/level-3/connection-management.yaml +67 -0
- package/courses/mysql-query-optimization/scenarios/level-3/full-text-search.yaml +77 -0
- package/courses/mysql-query-optimization/scenarios/level-3/json-optimization.yaml +87 -0
- package/courses/mysql-query-optimization/scenarios/level-3/lock-contention-analysis.yaml +68 -0
- package/courses/mysql-query-optimization/scenarios/level-3/monitoring-alerting.yaml +63 -0
- package/courses/mysql-query-optimization/scenarios/level-3/online-schema-changes.yaml +79 -0
- package/courses/mysql-query-optimization/scenarios/level-3/partitioning-strategies.yaml +83 -0
- package/courses/mysql-query-optimization/scenarios/level-3/query-profiling-deep-dive.yaml +84 -0
- package/courses/mysql-query-optimization/scenarios/level-3/replication-optimization.yaml +66 -0
- package/courses/mysql-query-optimization/scenarios/level-4/aurora-vs-rds-evaluation.yaml +61 -0
- package/courses/mysql-query-optimization/scenarios/level-4/data-architecture.yaml +62 -0
- package/courses/mysql-query-optimization/scenarios/level-4/database-migration-planning.yaml +59 -0
- package/courses/mysql-query-optimization/scenarios/level-4/enterprise-governance.yaml +50 -0
- package/courses/mysql-query-optimization/scenarios/level-4/executive-communication.yaml +54 -0
- package/courses/mysql-query-optimization/scenarios/level-4/expert-optimization-shift.yaml +67 -0
- package/courses/mysql-query-optimization/scenarios/level-4/high-availability-architecture.yaml +60 -0
- package/courses/mysql-query-optimization/scenarios/level-4/optimizer-internals.yaml +62 -0
- package/courses/mysql-query-optimization/scenarios/level-4/performance-sla-design.yaml +52 -0
- package/courses/mysql-query-optimization/scenarios/level-4/read-replica-scaling.yaml +51 -0
- package/courses/mysql-query-optimization/scenarios/level-5/ai-database-future.yaml +45 -0
- package/courses/mysql-query-optimization/scenarios/level-5/behavioral-science.yaml +44 -0
- package/courses/mysql-query-optimization/scenarios/level-5/benchmark-design.yaml +47 -0
- package/courses/mysql-query-optimization/scenarios/level-5/board-strategy.yaml +48 -0
- package/courses/mysql-query-optimization/scenarios/level-5/comprehensive-platform.yaml +49 -0
- package/courses/mysql-query-optimization/scenarios/level-5/consulting-engagement.yaml +52 -0
- package/courses/mysql-query-optimization/scenarios/level-5/ma-database-integration.yaml +47 -0
- package/courses/mysql-query-optimization/scenarios/level-5/master-optimization-shift.yaml +56 -0
- package/courses/mysql-query-optimization/scenarios/level-5/product-development.yaml +48 -0
- package/courses/mysql-query-optimization/scenarios/level-5/regulatory-compliance.yaml +48 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/comprehensive-database-system.yaml +70 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-ai-future.yaml +81 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-behavioral-science.yaml +63 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-board-strategy.yaml +77 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-consulting-engagement.yaml +61 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-industry-benchmarks.yaml +64 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-ma-integration.yaml +71 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-product-development.yaml +72 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/database-regulatory-landscape.yaml +76 -0
- package/courses/postgresql-query-optimization/scenarios/level-5/master-optimization-shift.yaml +66 -0
- package/courses/terraform-infrastructure-setup/course.yaml +11 -0
- package/courses/terraform-infrastructure-setup/scenarios/level-1/terraform-init-errors.yaml +72 -0
- package/dist/mcp/session-manager.d.ts +7 -4
- package/dist/mcp/session-manager.d.ts.map +1 -1
- package/dist/mcp/session-manager.js +23 -8
- package/dist/mcp/session-manager.js.map +1 -1
- package/package.json +1 -1
package/courses/docker-container-debugging/scenarios/level-3/production-container-ops.yaml
ADDED
@@ -0,0 +1,71 @@
+meta:
+  id: production-container-ops
+  level: 3
+  course: docker-container-debugging
+  type: output
+  description: "Debug production container operations — diagnose zero-downtime deployment failures, container update strategies, and production runtime issues"
+  tags: [Docker, production, zero-downtime, rolling-update, deployment, advanced]
+
+state: {}
+
+trigger: |
+  Your production Docker deployment experiences issues during updates:
+
+  Deployment strategy uses Docker Compose with rolling updates:
+  $ docker compose up -d --no-deps --build api
+
+  Problem 1 — Downtime during update:
+  The old container is killed before the new one is ready. For 5-10
+  seconds, requests to the API return "connection refused":
+
+  $ docker compose up -d api
+  Recreating api ... done
+
+  The default behavior: stop old → start new. No overlap period.
+  Health check is configured but Compose doesn't wait for it by
+  default during updates.
+
+  Problem 2 — Container doesn't stop gracefully:
+  $ docker stop api
+  (waits 10 seconds, then kills with SIGKILL)
+
+  The application doesn't handle SIGTERM. Running connections are
+  severed. In-progress requests return errors. The app needs a
+  graceful shutdown handler that:
+  - Stops accepting new connections
+  - Finishes processing current requests
+  - Closes database connections
+  - Exits cleanly
+
+  Problem 3 — Rollback needed but old image was overwritten:
+  $ docker compose up -d # deploys broken version
+  # Need to roll back but :latest was overwritten
+  # No way to get the previous version!
+
+  Problem 4 — Resource leak over time:
+  $ docker stats api
+  CONTAINER   CPU%   MEM USAGE / LIMIT   MEM%   NET I/O
+  api         145%   1.8GiB / 2GiB       90%    45GB / 12GB
+
+  Memory at 90% and climbing. The container has a memory leak and
+  will eventually OOM. Restart is a band-aid, not a fix.
+
+  Task: Explain production container operations. Write: zero-downtime
+  deployment strategies (blue-green, rolling), graceful shutdown
+  (signal handling, stop_grace_period), image tagging strategies (never
+  use :latest in production), resource monitoring with docker stats,
+  handling memory leaks, and production readiness checklist.
+
+assertions:
+  - type: llm_judge
+    criteria: "Zero-downtime deployment is explained — blue-green: run new version alongside old, switch traffic after health check passes. Rolling update: gradually replace instances. With Compose: use health checks + depends_on condition: service_healthy. deploy.update_config in Swarm: parallelism, delay, order (start-first vs stop-first). start-first ensures new container is healthy before stopping old one. Without orchestration: use a reverse proxy (nginx, traefik) to manage traffic switching"
+    weight: 0.35
+    description: "Zero-downtime deployment"
+  - type: llm_judge
+    criteria: "Graceful shutdown is covered — Docker sends SIGTERM, waits stop_grace_period (default 10s), then SIGKILL. Application must handle SIGTERM: stop accepting new connections, drain existing connections, close resources, exit 0. Node.js: process.on('SIGTERM', ...). Python: signal.signal(SIGTERM, ...). stop_grace_period in Compose should match the app's drain time. PID 1 problem: shell form CMD doesn't forward signals — use exec form or tini"
+    weight: 0.35
+    description: "Graceful shutdown"
+  - type: llm_judge
+    criteria: "Production best practices are practical — never use :latest in production (not reproducible). Tag with git SHA or semantic version. Keep previous image tags for rollback. Monitor with docker stats (CPU, memory, I/O, network). Set resource limits to prevent runaway containers from affecting host. Memory leaks: monitor trend over time, set restart policy (--restart unless-stopped) as safety net, fix root cause. Production checklist: health checks, log rotation, resource limits, graceful shutdown, backup strategy, monitoring alerts"
+    weight: 0.30
+    description: "Production practices"
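
For reference, the Compose settings this scenario points toward look roughly like the fragment below. This is a minimal sketch, assuming a curl-capable image and an /health endpoint; the service names, image tag, and port are hypothetical.

services:
  api:
    image: registry.example.com/api:a1b2c3d   # pin a git SHA, not :latest
    stop_grace_period: 30s                    # drain window before SIGKILL
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 5s
      timeout: 3s
      retries: 3
      start_period: 10s
  proxy:
    image: nginx:1.27
    depends_on:
      api:
        condition: service_healthy   # proxy waits until api passes its health check

Sizing stop_grace_period to the app's actual drain time and gating dependents on service_healthy addresses Problems 1 and 2; the pinned tag keeps a rollback target for Problem 3.
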
package/courses/docker-container-debugging/scenarios/level-4/cicd-pipeline-design.yaml
ADDED
@@ -0,0 +1,66 @@
+meta:
+  id: cicd-pipeline-design
+  level: 4
+  course: docker-container-debugging
+  type: output
+  description: "Design container CI/CD pipelines — implement build, test, scan, sign, and deploy workflows for containerized applications at scale"
+  tags: [Docker, CI/CD, pipeline, deployment, automation, GitOps, expert]
+
+state: {}
+
+trigger: |
+  Your organization is standardizing its container CI/CD pipeline.
+  Currently, each team has ad-hoc scripts:
+
+  Team A: docker build && docker push && ssh prod docker pull && restart
+  Team B: Jenkins pipeline with docker-compose up on build agent
+  Team C: GitHub Actions with manual approval for production
+
+  None have: security scanning, image signing, staged rollouts,
+  automated rollback, or audit trails.
+
+  Design a standard pipeline:
+
+  Stage 1 — Build:
+  - BuildKit with layer caching (registry-based cache)
+  - Multi-architecture builds (amd64 + arm64) via docker buildx
+  - Deterministic builds: pinned base image digests, lock files
+  - Build metadata: git SHA, build timestamp, CI job URL as labels
+
+  Stage 2 — Test:
+  - Unit tests in build stage (fail fast)
+  - Integration tests with docker compose (spin up dependencies)
+  - Smoke tests against built image (start container, hit health endpoint)
+  - Test containers clean up after themselves (--rm, compose down -v)
+
+  Stage 3 — Security:
+  - Trivy scan: block CRITICAL, warn HIGH
+  - SBOM generation (syft) attached as attestation
+  - Image signing (cosign) with CI identity
+  - License compliance check
+
+  Stage 4 — Deploy:
+  - Push to registry with git SHA tag + branch tag
+  - Staging: automatic deploy, run integration tests
+  - Production: manual approval gate, canary deployment (10% → 50% → 100%)
+  - Automated rollback if error rate exceeds threshold
+
+  Task: Design the container CI/CD pipeline. Write: each stage in
+  detail, caching strategies for fast builds, testing containers in
+  CI (compose-based integration tests), security gates, deployment
+  strategies (canary, blue-green), rollback automation, and how to
+  standardize across teams without being overly rigid.
+
+assertions:
+  - type: llm_judge
+    criteria: "Build and test stages are thorough — BuildKit cache: --cache-from type=registry for cross-CI cache sharing. Multi-arch: docker buildx build --platform linux/amd64,linux/arm64. Pin base images by digest for reproducibility. Test in CI: docker compose up -d to start dependencies, run tests, compose down -v to clean up. Smoke test: start the built image, wait for health check, hit /health endpoint. All test containers must clean up (no orphaned resources in CI)"
+    weight: 0.35
+    description: "Build and test"
+  - type: llm_judge
+    criteria: "Security and deployment gates are covered — scan before push (shift left). Trivy with --exit-code 1 --severity CRITICAL fails the pipeline. SBOM with syft for supply chain transparency. Image signing with cosign (keyless via OIDC in CI). Deploy: push to registry with immutable tags (git SHA, never :latest). Canary: deploy to subset, monitor error rate, auto-promote or rollback. Blue-green: run both versions, switch traffic at load balancer. Rollback: keep previous N images, automate based on health metrics"
+    weight: 0.35
+    description: "Security and deployment"
+  - type: llm_judge
+    criteria: "Standardization approach is practical — provide a shared pipeline template/library that teams extend. Core stages (build, scan, sign) are mandatory and team-managed. Testing and deployment stages are customizable per team. Platform team maintains the template, teams consume via CI includes. Don't enforce identical pipelines — allow team-specific test suites and deployment strategies. Measure: build times, deployment frequency, change failure rate, mean time to recovery (DORA metrics)"
+    weight: 0.30
+    description: "Standardization"
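
As a concrete anchor for Stages 1 and 3, here is a minimal sketch in GitHub Actions syntax. It assumes the docker/setup-buildx-action, docker/build-push-action, and aquasecurity/trivy-action actions; the registry path and cache ref are hypothetical, and the registry login step is omitted.

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      # Multi-arch build with a registry-backed BuildKit cache
      - uses: docker/build-push-action@v6
        with:
          platforms: linux/amd64,linux/arm64
          push: true
          tags: registry.example.com/api:${{ github.sha }}
          cache-from: type=registry,ref=registry.example.com/api:buildcache
          cache-to: type=registry,ref=registry.example.com/api:buildcache,mode=max
      # Security gate: fail the job on CRITICAL findings
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: registry.example.com/api:${{ github.sha }}
          exit-code: "1"
          severity: CRITICAL
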
package/courses/docker-container-debugging/scenarios/level-4/container-monitoring-observability.yaml
ADDED
@@ -0,0 +1,63 @@
+meta:
+  id: container-monitoring-observability
+  level: 4
+  course: docker-container-debugging
+  type: output
+  description: "Design container monitoring and observability — implement metrics, logging, and tracing for containerized applications at enterprise scale"
+  tags: [Docker, monitoring, observability, Prometheus, logging, tracing, expert]
+
+state: {}
+
+trigger: |
+  Your containerized platform has 100+ services. The current monitoring
+  is "docker stats on the host and hope for the best." Last month,
+  three production incidents were discovered by customers, not by your
+  team. You need to design a comprehensive observability strategy.
+
+  Current state:
+  - No centralized metrics — each team checks docker stats manually
+  - Logs go to docker logs (json-file driver) — no aggregation
+  - No distributed tracing — debugging request flows across services
+    requires correlating timestamps across multiple docker logs outputs
+  - Alert on "server down" only — no application-level alerts
+  - Post-incident: "we don't know what happened because logs rotated"
+
+  Target state (three pillars of observability):
+
+  1. Metrics — Prometheus + cAdvisor + Grafana:
+     - cAdvisor exports container metrics (CPU, memory, disk, network)
+     - Application metrics via /metrics endpoints (RED method)
+     - Grafana dashboards per service and per host
+     - Alerts: container restart > 3/hour, memory > 80%, error rate > 1%
+
+  2. Logging — Fluentd/Fluent Bit + Elasticsearch + Kibana:
+     - Structured JSON logging from applications
+     - Docker logging driver: fluentd (forward to aggregator)
+     - Centralized search, retention policies, log correlation
+     - Keep 30 days hot, 90 days warm, 1 year cold storage
+
+  3. Tracing — OpenTelemetry + Jaeger:
+     - Distributed trace propagation across services
+     - Trace ID in logs for correlation
+     - Latency analysis, dependency mapping
+     - Sampling strategy: 100% for errors, 10% for normal traffic
+
+  Task: Design the container observability strategy. Write: the three
+  pillars approach (metrics, logs, traces), tooling selection and
+  architecture, Docker-specific considerations (logging drivers, cAdvisor,
+  container labels for discovery), alert design (what to alert on),
+  and cost/complexity trade-offs for different organization sizes.
+
+assertions:
+  - type: llm_judge
+    criteria: "Three pillars architecture is explained — Metrics: Prometheus scrapes cAdvisor (container metrics) and application /metrics endpoints. Service discovery via Docker labels or DNS. Grafana for visualization. Logging: structured JSON from apps, collected via Docker logging driver (fluentd) or sidecar, shipped to Elasticsearch/Loki. Tracing: OpenTelemetry SDK in applications, export to Jaeger/Tempo. Correlation: trace ID embedded in logs and metrics labels connects all three"
+    weight: 0.35
+    description: "Three pillars"
+  - type: llm_judge
+    criteria: "Docker-specific monitoring is covered — cAdvisor runs as privileged container, mounts /var/lib/docker and /sys for metrics. Docker logging drivers determine log pipeline: json-file (local only), fluentd (forward), journald (systemd). Dual logging (Docker 20.10+) allows docker logs to work alongside remote driver. Container labels for Prometheus service discovery. Health check status as a metric. docker events stream for container lifecycle tracking. Monitor the Docker daemon itself (memory, goroutines)"
+    weight: 0.35
+    description: "Docker-specific monitoring"
+  - type: llm_judge
+    criteria: "Alerting and scaling are practical — alert on symptoms not causes: error rate, latency percentile (p99), saturation (CPU/memory %). Avoid alert fatigue: page only for customer-impacting issues. Container-specific alerts: restart loops, OOM kills, health check failures, disk pressure. Scaling considerations: Prometheus needs storage planning, Elasticsearch is resource-intensive. For smaller orgs: Grafana Loki (lighter than Elasticsearch), Grafana Tempo (lighter than Jaeger). Cost grows with retention and cardinality"
+    weight: 0.30
+    description: "Alerting and scaling"
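
The metrics pillar can be stood up with a small Compose file plus a scrape config. A minimal sketch, assuming cAdvisor's documented mount points; image tags and ports are illustrative.

services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
  prometheus:
    image: prom/prometheus:v2.53.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro

# prometheus.yml fragment: scrape cAdvisor on its default port
# scrape_configs:
#   - job_name: cadvisor
#     static_configs:
#       - targets: ["cadvisor:8080"]
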
package/courses/docker-container-debugging/scenarios/level-4/container-orchestration-strategy.yaml
ADDED
@@ -0,0 +1,62 @@
+meta:
+  id: container-orchestration-strategy
+  level: 4
+  course: docker-container-debugging
+  type: output
+  description: "Design container orchestration strategy — evaluate Docker Swarm vs Kubernetes, define deployment architectures, and plan migration paths for enterprise containerized applications"
+  tags: [Docker, orchestration, Swarm, Kubernetes, architecture, strategy, expert]
+
+state: {}
+
+trigger: |
+  Your company runs 50+ microservices on Docker Compose across 8
+  bare-metal servers. The setup works but pain points are growing:
+
+  - Manual deployment: SSH into each server, docker compose pull && up
+  - No automatic failover: if a server dies, its services are down
+  - Scaling: manually adding containers and updating nginx upstream
+  - Secret management: .env files on each server, rotated manually
+  - No resource governance: containers compete for CPU/memory
+
+  Management wants "orchestration" and asks you to evaluate options:
+
+  Option A — Docker Swarm:
+  Pros: Built into Docker Engine, minimal learning curve, uses existing
+  docker-compose.yaml (with minor changes to deploy section), simple
+  setup (docker swarm init, docker swarm join).
+  Cons: Smaller community, fewer features, no auto-scaling, limited
+  ecosystem, uncertain future (Docker Inc. focus shifted to Desktop).
+
+  Option B — Kubernetes:
+  Pros: Industry standard, massive ecosystem, advanced scheduling,
+  auto-scaling (HPA/VPA), extensive networking (CNI plugins), strong
+  secret management, RBAC, namespace isolation.
+  Cons: Steep learning curve, complex operations, requires dedicated
+  team, YAML complexity, overkill for small deployments.
+
+  Option C — Managed Kubernetes (EKS/GKE/AKS):
+  Pros: Control plane managed by cloud provider, integrated with cloud
+  services, automatic upgrades, SLA-backed.
+  Cons: Cloud vendor dependency, cost, networking complexity with
+  existing on-prem services, data sovereignty concerns.
+
+  The decision affects 3+ years of infrastructure investment.
+
+  Task: Evaluate container orchestration options. Write: comparison
+  matrix (Swarm vs K8s vs managed K8s), migration path from Compose
+  to orchestration, decision criteria (team size, scale, budget,
+  compliance), rollout strategy, and risk mitigation plan.
+
+assertions:
+  - type: llm_judge
+    criteria: "Orchestration comparison is thorough — Docker Swarm: simple, uses Compose files, good for < 20 services and small teams. Kubernetes: complex but standard, good for > 20 services, multi-team, auto-scaling needs. Managed K8s: removes operational burden of K8s control plane. Decision factors: team expertise, number of services, scaling requirements, compliance needs, budget (managed K8s has cloud costs), existing infrastructure (on-prem favors Swarm or self-hosted K8s)"
+    weight: 0.35
+    description: "Orchestration comparison"
+  - type: llm_judge
+    criteria: "Migration strategy is covered — phased approach: (1) containerize properly first (health checks, graceful shutdown, 12-factor), (2) start with non-critical services, (3) migrate stateless before stateful, (4) run hybrid during transition. Compose to Swarm: add deploy section, docker stack deploy. Compose to K8s: use kompose for initial conversion, then refine. Keep Compose for local development regardless of production orchestration"
+    weight: 0.35
+    description: "Migration strategy"
+  - type: llm_judge
+    criteria: "Risk mitigation is practical — run parallel environments during migration, blue-green at infrastructure level. Keep rollback path to Compose for 6+ months. Staff training before migration (K8s has steep curve). Start with a platform team of 2-3 dedicated engineers for K8s. Budget for monitoring/observability platform. Document runbooks for new platform. Consider: is the complexity justified? Many companies successfully run on Compose/Swarm at significant scale"
+    weight: 0.30
+    description: "Risk mitigation"
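
The "minor changes to deploy section" that Option A mentions amount to roughly the following. A minimal sketch of a Swarm-ready service (deployed with docker stack deploy); the replica count, timings, and image reference are hypothetical.

services:
  api:
    image: registry.example.com/api:a1b2c3d
    deploy:
      replicas: 3
      update_config:
        parallelism: 1            # replace one task at a time
        delay: 10s
        order: start-first        # start the new task before stopping the old
        failure_action: rollback  # revert automatically if the update fails
      restart_policy:
        condition: on-failure
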
package/courses/docker-container-debugging/scenarios/level-4/container-performance-engineering.yaml
ADDED
@@ -0,0 +1,64 @@
+meta:
+  id: container-performance-engineering
+  level: 4
+  course: docker-container-debugging
+  type: output
+  description: "Design container performance engineering — implement performance testing, profiling, resource optimization, and capacity planning for containerized applications"
+  tags: [Docker, performance, profiling, resource-optimization, capacity-planning, expert]
+
+state: {}
+
+trigger: |
+  Your containerized API handles 10,000 requests/second in production.
+  After a deployment, latency increased from p99 of 50ms to 800ms.
+  Docker stats shows:
+
+  $ docker stats --no-stream
+  CONTAINER   CPU%   MEM USAGE/LIMIT   MEM%   NET I/O       BLOCK I/O
+  api-1       195%   1.8GiB/2GiB       90%    2.1GB/800MB   500MB/2.3GB
+  api-2       190%   1.7GiB/2GiB       85%    2.0GB/790MB   480MB/2.1GB
+  api-3        45%   900MiB/2GiB       44%    200MB/80MB    50MB/100MB
+
+  Observations:
+  - api-1 and api-2 are CPU-saturated and near memory limit
+  - api-3 has low utilization — load balancing is uneven
+  - Block I/O is suspiciously high for an API service
+
+  Investigation:
+
+  1. CPU: container has --cpus=2 but the Go runtime defaults to
+     GOMAXPROCS=runtime.NumCPU() = 32 (host cores). Goroutines
+     compete for 2 CPU cores, causing excessive context switching.
+     Fix: set GOMAXPROCS to match container CPU limit.
+
+  2. Memory: the application uses an in-memory cache that grows
+     unbounded. Near the 2GiB limit, the Go GC runs constantly
+     (GC pressure), consuming CPU cycles for garbage collection.
+
+  3. Block I/O: the application writes temporary files to the
+     container's writable layer (overlay2). This goes through the
+     storage driver with copy-on-write overhead. Should use tmpfs.
+
+  4. Load balancing: Docker's internal DNS round-robin sends requests
+     to all instances equally, but api-3 was just restarted and its
+     JIT/cache is cold. Need weighted or least-connections balancing.
+
+  Task: Design container performance engineering. Write: how containers
+  affect application performance (cgroups, namespace overhead, storage
+  drivers), profiling containers (docker stats, nsenter, perf), common
+  performance pitfalls (CPU throttling, memory pressure, I/O through
+  overlay2), resource right-sizing, and load testing containerized apps.
+
+assertions:
+  - type: llm_judge
+    criteria: "Container performance impact is explained — cgroups enforce CPU and memory limits but applications may not be aware of them. CPU: --cpus=2 means CFS quota, not dedicated cores. Applications should read cgroup limits, not /proc/cpuinfo (Java: -XX:+UseContainerSupport, Go: GOMAXPROCS=container limit, Node: --max-old-space-size). Memory: OOM killer triggers at cgroup limit. GC-heavy languages suffer near the limit. Storage driver adds I/O overhead for writable layer. Network: NAT overhead for published ports"
+    weight: 0.35
+    description: "Performance impact"
+  - type: llm_judge
+    criteria: "Profiling techniques are covered — docker stats: real-time CPU, memory, I/O, network per container. nsenter: enter container namespaces for host-level tools (perf, strace, tcpdump). docker exec with profiling tools (if available in image). cAdvisor: detailed container metrics over time. Application-level: pprof (Go), async-profiler (Java), py-spy (Python). Flame graphs for CPU analysis. Memory profiling to identify leaks. Compare: container metrics vs application metrics to identify if bottleneck is container-level or application-level"
+    weight: 0.35
+    description: "Profiling techniques"
+  - type: llm_judge
+    criteria: "Resource right-sizing is practical — start with generous limits, monitor actual usage, tighten. CPU: observe throttling (nr_throttled in cgroup stats). Memory: set limit 20-30% above normal usage to handle spikes and GC. Use tmpfs for temporary files (avoid overlay2 write overhead). Load testing: use tools like k6/wrk against containerized app with production-like resource limits. Capacity planning: requests per container × containers = total capacity. Account for startup latency (cold JIT, cache warming) in scaling calculations"
+    weight: 0.30
+    description: "Resource right-sizing"
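
Fixes 1 and 3 from the investigation translate to Compose configuration along these lines. A minimal sketch; the image reference is hypothetical, and go.uber.org/automaxprocs is an in-app alternative to hard-coding the value.

services:
  api:
    image: registry.example.com/api:a1b2c3d
    cpus: "2"
    mem_limit: 2g
    environment:
      GOMAXPROCS: "2"   # match the CFS quota so the runtime stops oversubscribing
    tmpfs:
      - /tmp            # temp files bypass the overlay2 copy-on-write path
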
package/courses/docker-container-debugging/scenarios/level-4/container-security-architecture.yaml
ADDED
@@ -0,0 +1,66 @@
+meta:
+  id: container-security-architecture
+  level: 4
+  course: docker-container-debugging
+  type: output
+  description: "Design container security architecture — implement defense-in-depth for containerized applications including runtime security, network policies, and incident response"
+  tags: [Docker, security, architecture, defense-in-depth, runtime, compliance, expert]
+
+state: {}
+
+trigger: |
+  After a security breach where an attacker gained access to a
+  container and attempted lateral movement, your CISO requests a
+  comprehensive container security architecture. The current state:
+
+  - All containers run as root
+  - No network segmentation — all containers on default bridge network
+  - Images pulled from public Docker Hub without scanning
+  - Docker socket mounted into several containers "for monitoring"
+  - No runtime threat detection
+  - Secrets in environment variables visible via docker inspect
+
+  Attack path reconstruction:
+  1. Attacker exploited an application vulnerability (RCE)
+  2. Gained shell inside container (as root)
+  3. Found Docker socket mounted → created a privileged container
+  4. Mounted host filesystem from privileged container
+  5. Accessed other containers' secrets via docker inspect
+  6. Exfiltrated data from database container
+
+  Defense-in-depth layers needed:
+
+  Layer 1 — Build time: scan images, use minimal base images, no
+  secrets in images, sign images
+
+  Layer 2 — Configuration: non-root, drop capabilities, read-only
+  filesystem, resource limits, no privileged containers
+
+  Layer 3 — Network: network segmentation, firewall rules between
+  service tiers (frontend can't reach database directly)
+
+  Layer 4 — Runtime: anomaly detection (unexpected processes,
+  network connections, file modifications), audit logging
+
+  Layer 5 — Secrets: Docker secrets or external vault, rotated
+  automatically, never in environment variables or images
+
+  Task: Design the container security architecture. Write: each
+  defense layer with specific controls, the Docker socket security
+  problem and alternatives, network segmentation for containers,
+  secrets management, runtime security monitoring, and compliance
+  requirements (SOC2, PCI-DSS) for containerized environments.
+
+assertions:
+  - type: llm_judge
+    criteria: "Defense-in-depth layers are explained — Build: scan with Trivy/Scout, use distroless/alpine bases, multi-stage builds, sign images. Config: USER directive, --cap-drop ALL --cap-add <specific>, --read-only --tmpfs /tmp, no --privileged ever. Network: custom bridge networks per tier, frontend ↔ api ↔ database (no frontend → database). Runtime: Falco/Sysdig for anomaly detection (unexpected exec, network, file access). Secrets: Docker secrets, HashiCorp Vault, never env vars for sensitive data"
+    weight: 0.35
+    description: "Defense layers"
+  - type: llm_judge
+    criteria: "Docker socket and lateral movement prevention are covered — Docker socket = root access to host. NEVER mount into application containers. Alternatives: Docker API proxy with authorization (Tecnativa docker-socket-proxy), rootless Docker, monitoring via cAdvisor (read-only metrics without socket). Lateral movement prevention: network segmentation (containers can only reach needed services), no shared volumes between security tiers, separate Docker networks per service group"
+    weight: 0.35
+    description: "Socket and lateral movement"
+  - type: llm_judge
+    criteria: "Compliance and incident response are practical — SOC2/PCI-DSS requirements: audit logging of all container operations, access control (who can deploy, exec into containers), encryption at rest and in transit, vulnerability management program. Incident response for containers: preserve container (don't delete), capture filesystem (docker export), collect logs, analyze with forensic tools. Runtime monitoring: alert on docker exec in production, unexpected outbound connections, privilege escalation attempts"
+    weight: 0.30
+    description: "Compliance and response"
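
Layers 2 and 3 map onto Compose keys fairly directly. A minimal sketch; the UID, capability choice, and network names are hypothetical and depend on what the image actually needs.

services:
  web:
    image: nginx:1.27
    networks: [frontend_net]       # web can reach api, but has no route to db
  api:
    image: registry.example.com/api:a1b2c3d
    user: "10001:10001"            # non-root UID:GID created in the image
    cap_drop: [ALL]
    cap_add: [NET_BIND_SERVICE]    # only if the app binds a port below 1024
    read_only: true
    tmpfs: [/tmp]
    security_opt:
      - no-new-privileges:true
    networks: [frontend_net, backend_net]
  db:
    image: postgres:16
    networks: [backend_net]        # reachable only from the api tier

networks:
  frontend_net:
  backend_net:
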
package/courses/docker-container-debugging/scenarios/level-4/enterprise-image-management.yaml
ADDED
@@ -0,0 +1,58 @@
+meta:
+  id: enterprise-image-management
+  level: 4
+  course: docker-container-debugging
+  type: output
+  description: "Design enterprise container image management — implement private registries, image signing, vulnerability policies, and golden image pipelines"
+  tags: [Docker, registry, image-signing, vulnerability, enterprise, governance, expert]
+
+state: {}
+
+trigger: |
+  Your organization has 200+ developers building Docker images with
+  no governance. A security audit reveals:
+
+  1. Developers pull random base images from Docker Hub — some contain
+     known vulnerabilities, some are abandoned/unmaintained.
+
+  2. No image provenance — can't verify who built an image or if it
+     was tampered with between build and deployment.
+
+  3. Production runs images with CRITICAL vulnerabilities because
+     there's no gate between build and deploy.
+
+  4. Multiple teams independently build similar base images with
+     different security configurations.
+
+  5. Docker Hub rate limits (100 pulls/6hrs for anonymous, 200 for
+     free accounts) cause CI failures during peak hours.
+
+  Proposed enterprise image management architecture:
+
+  - Private registry (Harbor) as pull-through cache and primary store
+  - Golden base images maintained by the platform team, pre-hardened
+    and scanned, rebuilt weekly with latest patches
+  - Image signing with Docker Content Trust / cosign
+  - Admission controller that rejects unsigned/unscanned images
+  - Vulnerability policy: block CRITICAL, alert HIGH, log MEDIUM
+  - Image retention policy: keep 10 latest tags, delete untagged after 7 days
+
+  Task: Design enterprise image management. Write: private registry
+  architecture (Harbor, ECR, GCR), golden base image strategy, image
+  signing and verification (cosign, Notary), vulnerability gating in
+  CI/CD, image retention and garbage collection, and developer
+  experience considerations (don't slow down development).
+
+assertions:
+  - type: llm_judge
+    criteria: "Registry architecture is explained — private registry serves as: pull-through cache (reduces Docker Hub dependency, avoids rate limits), primary store for internal images, security scan integration point. Harbor: open-source, includes scanning (Trivy), signing (cosign/Notary), replication, RBAC, audit logging. Cloud registries (ECR, GCR, ACR): managed, integrated with cloud IAM. Registry should be highly available and backed up"
+    weight: 0.35
+    description: "Registry architecture"
+  - type: llm_judge
+    criteria: "Golden images and signing are covered — golden base images: maintained by platform team, pre-hardened (non-root, minimal packages, security config), scanned and signed, rebuilt on schedule (weekly or on CVE). Image signing: cosign for keyless signing (works with Sigstore), Docker Content Trust (DCT) for Docker-native. Admission control: reject unsigned images in production. Supply chain: SBOM generation (syft), provenance attestation (SLSA)"
+    weight: 0.35
+    description: "Golden images and signing"
+  - type: llm_judge
+    criteria: "Developer experience is balanced with security — developers should be able to build and test locally without friction. CI pipeline handles scanning/signing automatically. Clear documentation on approved base images. Fast feedback: scan in CI before merge, not just at deploy. Escape valve for urgent deployments (with audit trail). Self-service: developers can request new base images through a defined process. Metrics: track mean time from vulnerability disclosure to patched image deployment"
+    weight: 0.30
+    description: "Developer experience"
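
The signing half of the proposal might look like this in CI, sketched in GitHub Actions syntax with keyless (Sigstore OIDC) signing. It assumes the sigstore/cosign-installer action; the registry path is hypothetical.

  sign:
    runs-on: ubuntu-latest
    permissions:
      id-token: write    # lets cosign obtain the job's OIDC identity
      packages: write
    steps:
      - uses: sigstore/cosign-installer@v3
      # Keyless signing: the signature is tied to this workflow's identity
      - run: cosign sign --yes registry.example.com/api:${{ github.sha }}

An admission controller (or a deploy-time cosign verify) then rejects anything not signed by the expected workflow identity.
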
package/courses/docker-container-debugging/scenarios/level-4/expert-debugging-shift.yaml
ADDED
@@ -0,0 +1,63 @@
+meta:
+  id: expert-debugging-shift
+  level: 4
+  course: docker-container-debugging
+  type: output
+  description: "Combined expert debugging shift — diagnose and design solutions for a production container platform with orchestration, security, performance, and operational challenges"
+  tags: [Docker, troubleshooting, combined, shift-simulation, expert]
+
+state: {}
+
+trigger: |
+  You're the newly hired container platform lead. On day one, you
+  discover the platform has accumulated significant technical debt.
+  The CTO wants a 90-day improvement plan.
+
+  Current state assessment:
+
+  Infrastructure: 20 Docker hosts, 150+ containers, Docker Compose
+  on each host, no orchestration. Manual deployments via SSH scripts.
+
+  Security audit findings:
+  - 40% of containers run as root with --privileged
+  - Docker socket mounted in 12 containers
+  - No image scanning — 3 containers have CRITICAL CVEs from 2023
+  - Secrets in docker-compose.yml files committed to git
+  - No network segmentation — all containers on default bridge
+
+  Performance issues:
+  - 5 containers with memory leaks, restarted nightly via cron
+  - No resource limits on 80% of containers
+  - Disk fills up monthly — manual cleanup each time
+  - No monitoring beyond Nagios ping checks
+
+  Operational gaps:
+  - No centralized logging — engineers SSH to each host for logs
+  - Deployments take 2 hours (manual process on 20 hosts)
+  - Rollback requires re-deploying the previous version manually
+  - Last week's incident: deployed wrong image tag to production,
+    took 4 hours to detect because no health checks
+
+  Team: 3 DevOps engineers, 30 developers, 0 security engineers
+
+  Budget: Can hire 2 more people, $50K/year for tooling
+
+  Task: Design the 90-day improvement plan. Write: the priority
+  ranking (what to fix first and why), quick wins vs long-term
+  improvements, orchestration decision, security remediation plan,
+  observability implementation, deployment automation strategy, team
+  structure and hiring priorities, and success metrics.
+
+assertions:
+  - type: llm_judge
+    criteria: "Priority ranking is justified — Phase 1 (days 1-30): security remediation (remove --privileged, remove socket mounts, rotate exposed secrets, scan images) and quick wins (health checks, resource limits, log rotation, automated disk cleanup). Phase 2 (days 31-60): deployment automation (CI/CD pipeline, image registry, automated rollouts) and observability (centralized logging, container monitoring). Phase 3 (days 61-90): orchestration evaluation (Swarm vs K8s based on team and scale), network segmentation, performance optimization"
+    weight: 0.35
+    description: "Priority ranking"
+  - type: llm_judge
+    criteria: "Resource and team strategy is realistic — $50K budget allocation: private registry (Harbor, free), monitoring (Prometheus + Grafana, free), CI/CD (GitLab CI or GitHub Actions, existing). Hiring: 1 security-focused DevOps + 1 platform engineer. Team structure: platform team (3 existing + 2 new) owns infrastructure, provides self-service to 30 developers. Developer enablement: standardized Dockerfiles, CI templates, documentation. Don't try to do everything at once — incremental improvements with measurable outcomes"
+    weight: 0.35
+    description: "Resource strategy"
+  - type: llm_judge
+    criteria: "Success metrics are measurable — security: 0 CRITICAL CVEs, 0 --privileged containers, 0 exposed secrets. Performance: all containers with resource limits, automated disk management, no manual restarts. Operations: deployment time < 15 minutes (from 2 hours), rollback < 5 minutes, MTTR < 30 minutes. Observability: centralized logs for all containers, alerting on container health/resources, dashboard coverage. Track DORA metrics: deployment frequency, lead time, change failure rate, MTTR"
+    weight: 0.30
+    description: "Success metrics"
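
Several of the Phase 1 quick wins (log rotation, resource limits, health checks, a restart safety net) are one Compose stanza per service. A minimal sketch; the limits, sizes, and endpoint are placeholders to tune per workload.

services:
  api:
    image: registry.example.com/api:a1b2c3d
    mem_limit: 1g
    cpus: "1.0"
    restart: unless-stopped        # safety net for the leaky services
    logging:
      driver: json-file
      options:
        max-size: "10m"            # caps log growth so disks stop filling up
        max-file: "5"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      retries: 3
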
package/courses/docker-container-debugging/scenarios/level-4/incident-response-containers.yaml
ADDED
@@ -0,0 +1,70 @@
+meta:
+  id: incident-response-containers
+  level: 4
+  course: docker-container-debugging
+  type: output
+  description: "Design container incident response — implement forensic procedures, evidence preservation, root cause analysis, and post-incident improvements for container security events"
+  tags: [Docker, incident-response, forensics, security, post-incident, expert]
+
+state: {}
+
+trigger: |
+  Your security team detects anomalous behavior from a production
+  container: unexpected outbound connections to an unknown IP,
+  unusual processes running, and a spike in CPU usage.
+
+  Alert timeline:
+  10:15 — Falco alert: "Shell spawned in container api-prod-3"
+  10:16 — Network alert: Outbound connection to 198.51.100.42:4444
+  10:17 — Falco alert: "Sensitive file opened: /etc/shadow"
+  10:18 — CPU spike to 400% in api-prod-3
+
+  You are the incident commander. What do you do?
+
+  WRONG approach:
+  $ docker stop api-prod-3
+  $ docker rm api-prod-3
+  # Evidence destroyed! Can't determine what happened.
+
+  CORRECT approach:
+
+  Step 1 — Isolate without destroying:
+  $ docker network disconnect production-net api-prod-3
+  # Container still running but network-isolated
+
+  Step 2 — Preserve evidence:
+  $ docker export api-prod-3 > api-prod-3-filesystem.tar
+  $ docker logs api-prod-3 > api-prod-3-logs.txt
+  $ docker inspect api-prod-3 > api-prod-3-inspect.json
+  $ docker diff api-prod-3 > api-prod-3-diff.txt
+  $ docker top api-prod-3 > api-prod-3-processes.txt
+
+  Step 3 — Analyze:
+  $ docker exec api-prod-3 cat /proc/net/tcp # active connections
+  $ docker exec api-prod-3 find / -newer /app/server.js # recently modified files
+  $ docker exec api-prod-3 cat /proc/1/environ # check for injected env vars
+
+  Step 4 — Determine blast radius:
+  - What other containers could this container reach?
+  - Were any secrets or tokens accessible?
+  - What data was in the container's network segment?
+
+  Task: Design container incident response procedures. Write: the
+  isolation strategy (network disconnect vs stop), evidence collection
+  (export, logs, inspect, diff), forensic analysis techniques, blast
+  radius assessment, communication plan (who to notify), and
+  post-incident improvements to prevent recurrence.
+
+assertions:
+  - type: llm_judge
+    criteria: "Isolation and evidence preservation are explained — ISOLATE FIRST: docker network disconnect removes network access while keeping container running for investigation. Do NOT docker rm — this destroys the writable layer and all evidence. Preserve: docker export (full filesystem tar), docker logs (stdout/stderr), docker inspect (full container config including env vars, mounts, network), docker diff (filesystem changes), docker top (running processes). Chain of custody: hash all evidence files, timestamp collection"
+    weight: 0.35
+    description: "Isolation and evidence"
+  - type: llm_judge
+    criteria: "Forensic analysis is covered — analyze filesystem changes: docker diff shows what was modified/added. Look for: new binaries, modified configs, dropped tools, cryptocurrency miners. Check /proc: /proc/net/tcp for connections, /proc/*/cmdline for processes, /proc/*/environ for environment. Network forensics: captured packets if tcpdump was running. Image comparison: diff the running container against the original image to find all attacker modifications. Timeline reconstruction from logs and filesystem timestamps"
+    weight: 0.35
+    description: "Forensic analysis"
+  - type: llm_judge
+    criteria: "Blast radius and post-incident are practical — blast radius: check what networks the container was on (docker network inspect), what secrets it had access to (docker inspect environment), what volumes were shared, what other services it could reach. Rotate all credentials the container had access to. Post-incident: add runtime monitoring (Falco), implement network segmentation, remove unnecessary capabilities, add read-only filesystem, review image for vulnerabilities that enabled the initial compromise. Write an incident report with timeline, impact, root cause, and remediation actions"
+    weight: 0.30
+    description: "Blast radius and post-incident"
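
Alerts like the ones at 10:15 and 10:17 come from rules written in Falco's YAML rule syntax, roughly as below. A minimal sketch, assuming the spawned_process and container macros from Falco's default ruleset; the rule name, shell list, and container-name filter are hypothetical.

- rule: Shell Spawned in Production API Container
  desc: Detect an interactive shell starting inside an api-prod container
  condition: >
    spawned_process and container
    and proc.name in (bash, sh, zsh)
    and container.name startswith "api-prod"
  output: >
    Shell spawned in container (user=%user.name
    container=%container.name cmdline=%proc.cmdline)
  priority: WARNING
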