npm - dojo.md - Versions diffs - 0.2.0 → 0.2.1 - Mend

dojo.md 0.2.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (222)

package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/industry-benchmarks.yaml ADDED

@@ -0,0 +1,58 @@
+meta:
+  id: industry-benchmarks
+  level: 5
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Kubernetes platform benchmarking — compare organizational maturity against industry standards, DORA metrics, and CNCF maturity model"
+  tags: [Kubernetes, benchmarking, DORA, CNCF, maturity-model, industry-standards, master]
+state: {}
+trigger: |
+  You're asked to benchmark your organization's Kubernetes platform
+  maturity against industry standards. The board wants to know: "Are
+  we best-in-class, average, or behind?"
+  Your organization's metrics:
+  | Metric                        | Your Org  | Industry Avg | Elite    |
+  |-------------------------------|-----------|--------------|----------|
+  | Deployment frequency          | 50/week   | 10/week      | 500/week |
+  | Lead time (commit to prod)    | 4 hours   | 1 week       | < 1 hour |
+  | Change failure rate           | 12%       | 15%          | < 5%     |
+  | Mean time to recovery (MTTR)  | 30 min    | 4 hours      | < 10 min |
+  | Availability                  | 99.9%     | 99.5%        | 99.99%   |
+  | Container adoption            | 75%       | 60%          | 95%      |
+  | GitOps adoption               | 40%       | 25%          | 90%      |
+  | Automated canary deployments  | 10%       | 5%           | 60%      |
+  | Security scan in CI/CD        | 30%       | 20%          | 95%      |
+  | Platform team ratio           | 1:50      | 1:25         | 1:75     |
+  Your organization is above average but significantly behind elite
+  performers in several areas. The biggest gaps are in security scanning,
+  GitOps adoption, automated canary deployments, and MTTR.
+  The board asks:
+  1. What would it take to reach "Elite" in 18 months?
+  2. Where should we focus investment for maximum impact?
+  3. How do we compare to specific competitors?
+  4. What's the business value of each improvement?
+  Task: Write the benchmarking report with recommendations. Include:
+  maturity assessment against CNCF cloud-native maturity model, DORA
+  metrics analysis, gap prioritization (impact vs effort matrix),
+  investment roadmap to reach elite level, competitive positioning
+  analysis, and the business case for each improvement area.
+assertions:
+  - type: llm_judge
+    criteria: "Maturity assessment is structured — use CNCF Cloud Native Maturity Model levels (Level 1-5: Build, Operate, Scale, Improve, Optimize). Organization is at Level 3 (Scale) for most dimensions, Level 2 (Operate) for security and observability. DORA metrics show High performer level (not Elite). Gap analysis identifies specific areas holding back each dimension"
+    weight: 0.35
+    description: "Maturity assessment"
+  - type: llm_judge
+    criteria: "Prioritization uses impact vs effort — highest ROI investments: (1) GitOps adoption (high impact, medium effort — standardizes deployments, reduces MTTR), (2) security scanning (high impact, low effort — integrate Trivy into CI/CD), (3) automated canary deployments (high impact, high effort — reduces change failure rate). Each priority has a business value quantification (reduced downtime $ impact, faster feature delivery)"
+    weight: 0.35
+    description: "Investment prioritization"
+  - type: llm_judge
+    criteria: "Roadmap and competitive framing are practical — 18-month roadmap: Q1 (GitOps + security scanning), Q2 (canary deployments + MTTR reduction), Q3 (platform self-service maturity), Q4-6 (advanced observability + chaos engineering). Competitive positioning: focus on areas that differentiate (deployment velocity directly impacts time-to-market). Don't compare to FAANG — compare to similar-sized companies in the same industry"
+    weight: 0.30
+    description: "Roadmap and framing"

package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/ma-integration.yaml ADDED

@@ -0,0 +1,62 @@
+meta:
+  id: ma-integration
+  level: 5
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "M&A Kubernetes platform integration — evaluate and merge acquired company's infrastructure with your platform"
+  tags: [Kubernetes, M&A, integration, migration, platform-consolidation, master]
+state: {}
+trigger: |
+  Your company (running multi-cluster Kubernetes on AWS) just acquired
+  a competitor running on Azure with a completely different stack:
+  Your platform:
+  - AWS EKS across 3 regions, 200+ microservices
+  - GitOps with ArgoCD, Istio service mesh
+  - Go and Java services, gRPC + REST APIs
+  - PostgreSQL on RDS, Redis on ElastiCache
+  - Prometheus + Grafana monitoring stack
+  Acquired company (AcquireCo):
+  - Azure AKS single region, 80 microservices
+  - Jenkins CI/CD, no GitOps
+  - Python and Node.js services, REST-only APIs
+  - MongoDB Atlas, Azure Cache for Redis
+  - Datadog for monitoring
+  Integration challenges:
+  1. Services need to communicate cross-platform during transition
+  2. Different authentication systems (AWS IAM vs Azure AD)
+  3. Different monitoring stacks create visibility gaps
+  4. Cultural: AcquireCo team used to full Azure access, now must follow
+     your RBAC and GitOps policies
+  5. Data residency: AcquireCo has EU customers, data must stay in EU
+  6. Combined customer base requires 99.99% availability (was 99.9%)
+  7. Timeline pressure: CEO wants "fully integrated" in 12 months
+  8. Budget: $2M for integration project
+  The CTO asks you to develop the integration strategy and lead
+  execution. 40 AcquireCo engineers need to be productive on your
+  platform within 3 months.
+  Task: Design the M&A platform integration strategy. Write: the
+  assessment framework for evaluating AcquireCo's platform, the phased
+  migration approach (coexistence → migration → consolidation),
+  cross-cloud connectivity during transition, team onboarding plan,
+  data migration strategy, risk analysis, and timeline with milestones.
+assertions:
+  - type: llm_judge
+    criteria: "Assessment and phased approach are defined — Phase 1 (months 1-3): Coexistence — establish cross-cloud connectivity (VPN/peering), unified monitoring (ship AcquireCo metrics to your Prometheus), single incident management process. Phase 2 (months 4-8): Migration — migrate AcquireCo services to your EKS clusters one by one, starting with stateless services. Phase 3 (months 9-12): Consolidation — decommission AKS, unified platform. Don't rush: premature migration causes outages"
+    weight: 0.35
+    description: "Phased approach"
+  - type: llm_judge
+    criteria: "Cross-cloud and data challenges are addressed — connectivity: site-to-site VPN or cloud interconnect (AWS-Azure), DNS delegation for service discovery. Authentication: federate Azure AD into AWS IAM or use platform-level auth (OIDC). Data residency: maintain EU region for AcquireCo EU data (EU EKS cluster), migrate MongoDB to PostgreSQL or run MongoDB on Kubernetes. Data migration must be carefully planned with rollback capability"
+    weight: 0.35
+    description: "Technical challenges"
+  - type: llm_judge
+    criteria: "Team and risk management are practical — onboarding plan: 2-week intensive on your platform (GitOps, ArgoCD, Istio), pair programming with your engineers, gradual RBAC permission expansion as comfort grows. Risk: rushing migration causes outages affecting combined customer base. Mitigation: feature-flag traffic routing to enable instant rollback to AKS. Budget allocation: $800K infrastructure, $600K team augmentation, $400K tooling, $200K contingency"
+    weight: 0.30
+    description: "Team and risk"

package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/master-troubleshooting-shift.yaml ADDED

@@ -0,0 +1,73 @@
+meta:
+  id: master-troubleshooting-shift
+  level: 5
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Master-level troubleshooting shift — manage a global multi-cluster crisis combining technical failures, organizational challenges, and strategic decisions"
+  tags: [Kubernetes, troubleshooting, combined, shift-simulation, global-crisis, master]
+state: {}
+trigger: |
+  You're the VP of Platform Engineering. It's 2:00 AM and you receive
+  a call: "We're having a global incident affecting all regions."
+  The situation (unfolding over the next 4 hours):
+  2:00 AM — Initial alert:
+  - prod-us-1: 30% of pods OOMKilled after a memory leak in a shared
+    library updated 6 hours ago. HPA scaling rapidly.
+  - prod-eu-1: healthy but seeing 2x traffic as users retry failed US
+    requests through CDN failover
+  2:15 AM — Escalation:
+  - The shared library was updated by 15 different services via
+    automated dependency bot (Renovate) — all merged without human
+    review
+  - Rollback requires coordinating 15 service teams
+  - The memory leak is slow: pods run fine for 1-2 hours then OOMKill
+  - HPA is scaling to max, exhausting cluster capacity
+  2:45 AM — Cascade begins:
+  - prod-us-1 runs out of cluster capacity. New pods can't schedule.
+  - Cluster autoscaler hits AWS account EC2 limit (quota reached)
+  - prod-eu-1 starting to see memory growth in the same services
+  - EU compliance officer: "US traffic hitting EU cluster may violate
+    GDPR for certain customers"
+  3:15 AM — Decision point:
+  - Board member calls: "I saw on Twitter we're having an outage. What's
+    the customer impact and when will it be resolved?"
+  - Customer success: "Our top 3 enterprise accounts are escalating.
+    One is threatening to invoke the SLA penalty clause ($500K)."
+  - Engineering: "We can rollback the library but it requires CI builds
+    for all 15 services — estimated 90 minutes"
+  - Security: "The dependency bot shouldn't have auto-merged. This is
+    a supply chain risk."
+  4:00 AM — Resolution path:
+  You need to simultaneously: contain the blast radius, communicate
+  to all stakeholders, make strategic decisions about GDPR exposure,
+  and plan the remediation.
+  Task: Walk through managing this global crisis end-to-end. Write:
+  the technical incident response (contain → triage → remediate),
+  the multi-stakeholder communication strategy (board, customers,
+  teams, compliance), the strategic decisions required and their
+  trade-offs, the organizational failures that enabled this incident,
+  the post-incident improvement program, and how this incident shapes
+  future platform architecture and governance.
+assertions:
+  - type: llm_judge
+    criteria: "Technical response is immediate and effective — containment: set memory limits on affected pods (kill early, restart fresh to buy time), disable HPA scaling to prevent cluster exhaustion, request AWS quota increase. Triage: identify the shared library as root cause, determine which services are affected. Remediate: pin the library to previous version in dependency bot, trigger CI builds for all 15 services, deploy rollbacks as they complete. Meanwhile: rate-limit CDN failover to prevent GDPR exposure"
+    weight: 0.35
+    description: "Technical response"
+  - type: llm_judge
+    criteria: "Communication is multi-layered and timely — board member: concise impact statement, ETA, reassurance of response capability (2 min update). Enterprise accounts: personal outreach from customer success with honest ETA, proactive SLA credit offer before they ask. Engineering teams: clear incident channel, assigned rollback owners per service, 15-min status updates. Compliance: document GDPR exposure window, assess if EU data was actually processed in wrong region, prepare breach assessment if needed"
+    weight: 0.35
+    description: "Stakeholder communication"
+  - type: llm_judge
+    criteria: "Organizational failures and improvements are systemic — failures: automated dependency updates without human review for critical libraries, no canary period for shared library updates, no memory limit enforcement, AWS quota not sized for surge, GDPR traffic routing not enforced at CDN level. Improvements: require human review for shared/transitive dependencies, implement dependency update canary (1 service first, monitor 24h), enforce memory limits as admission policy, implement traffic geo-fencing, regular capacity stress tests. Frame as governance evolution, not blame"
+    weight: 0.30
+    description: "Failures and improvements"

package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/product-development.yaml ADDED

@@ -0,0 +1,65 @@
+meta:
+  id: product-development
+  level: 5
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Build a Kubernetes troubleshooting product — design a commercial platform for automated root cause analysis and remediation"
+  tags: [Kubernetes, product-development, troubleshooting-tool, automation, AI-ops, master]
+state: {}
+trigger: |
+  You're founding a startup that automates Kubernetes troubleshooting.
+  Your thesis: "80% of Kubernetes incidents have patterns that can be
+  detected and remediated automatically."
+  Market research:
+  - 78% of organizations report Kubernetes troubleshooting as their
+    #1 operational pain point
+  - Average MTTR for Kubernetes incidents: 2 hours
+  - $50K-500K cost per hour of downtime depending on company size
+  - Existing tools (Komodor, Robusta, Kubecost) address pieces but no
+    unified automated remediation platform
+  Product vision: An AI-powered platform that:
+  1. Continuously monitors cluster health
+  2. Detects anomalies before they become incidents
+  3. Automatically diagnoses root cause using knowledge graph
+  4. Suggests or auto-executes remediation
+  5. Learns from past incidents to prevent recurrence
+  Technical challenges:
+  - How to access cluster state without excessive permissions
+  - How to distinguish normal behavior from anomalies
+  - How to safely auto-remediate without making things worse
+  - How to work across different Kubernetes distributions (EKS, GKE, AKS,
+    on-prem)
+  - How to handle the blast radius of automated actions
+  - How to build trust — engineers won't let automation touch production
+    without confidence
+  Your advisor asks:
+  - "What's the minimum viable product?"
+  - "How do you differentiate from kubectl + Prometheus + PagerDuty?"
+  - "What's the GTM strategy for selling to platform teams?"
+  - "How do you handle the 'it deleted my pod' support ticket?"
+  Task: Design the product and go-to-market strategy. Write: the
+  product architecture (data collection, analysis engine, remediation
+  framework), the MVP scope and features, differentiation and moat,
+  the trust-building approach (observe → suggest → auto-remediate
+  progression), pricing model, and go-to-market strategy.
+assertions:
+  - type: llm_judge
+    criteria: "Product architecture is technically sound — data collection: read-only agent (DaemonSet) collecting events, logs, metrics, resource state via K8s API and node-level telemetry. Analysis engine: pattern matching against known failure modes (CrashLoopBackOff + specific log patterns), ML for anomaly detection on resource usage trends, knowledge graph linking symptoms to root causes. Remediation: runbook automation engine with approval workflows and blast radius controls"
+    weight: 0.35
+    description: "Product architecture"
+  - type: llm_judge
+    criteria: "Trust and GTM strategy are realistic — trust progression: Phase 1 (observe) just dashboards and alerts, Phase 2 (suggest) recommended actions with explanation, Phase 3 (auto-remediate) automated actions with undo capability and audit log. MVP: focus on top 10 failure patterns (CrashLoopBackOff, OOMKilled, Pending, ImagePullBackOff, etc.) with automated diagnosis. GTM: target platform engineering teams at 50-500 engineer companies, freemium for < 3 clusters, land-and-expand pricing"
+    weight: 0.35
+    description: "Trust and GTM"
+  - type: llm_judge
+    criteria: "Differentiation and business model are compelling — moat: incident knowledge graph grows with each customer (network effects), cross-cluster pattern detection impossible with single-cluster tools. Differentiation from kubectl+Prometheus: automated root cause analysis (not just alerting), cross-signal correlation (logs + events + metrics), remediation automation. Pricing: per-node or per-cluster SaaS ($X/node/month). Competitive positioning: 'reduce MTTR by 80%' with measurable before/after metrics"
+    weight: 0.30
+    description: "Differentiation and model"

package/courses/kubernetes-deployment-troubleshooting/scenarios/level-5/regulatory-compliance.yaml ADDED

@@ -0,0 +1,76 @@
+meta:
+  id: regulatory-compliance
+  level: 5
+  course: kubernetes-deployment-troubleshooting
+  type: output
+  description: "Kubernetes regulatory compliance — navigate SOC2, PCI-DSS, HIPAA, and GDPR requirements in Kubernetes environments"
+  tags: [Kubernetes, compliance, SOC2, PCI-DSS, HIPAA, GDPR, regulatory, master]
+state: {}
+trigger: |
+  Your company's Kubernetes platform must pass audits for multiple
+  compliance frameworks simultaneously. The compliance team has flagged
+  gaps that must be addressed before the next audit cycle (90 days).
+  Framework requirements and Kubernetes implications:
+  SOC2 (all services):
+  - Access control: Who can access what in the cluster?
+  - Change management: How are changes tracked and approved?
+  - Monitoring: Is there continuous monitoring and alerting?
+  - Incident response: Is there a documented process?
+  - Data protection: Are secrets and data encrypted?
+  PCI-DSS (payment services):
+  - Network segmentation: Payment services isolated from other workloads
+  - Encryption in transit: All communication encrypted (mTLS)
+  - Access logging: Every API call logged and auditable
+  - Vulnerability scanning: Container images scanned before deployment
+  - Penetration testing: Regular pen testing of the infrastructure
+  HIPAA (health services):
+  - PHI encryption: Patient data encrypted at rest and in transit
+  - Access controls: Minimum necessary access to PHI
+  - Audit trails: All access to PHI logged
+  - BAA with cloud provider: Business Associate Agreement
+  - Breach notification: Process for breach detection and notification
+  GDPR (EU customers):
+  - Data residency: EU customer data processed in EU region
+  - Data minimization: Only collect necessary data
+  - Right to deletion: Ability to purge user data across all services
+  - Data processing records: Document all data flows
+  - DPO appointment: Data Protection Officer
+  Current gaps (from pre-audit assessment):
+  1. RBAC is too permissive (cluster-admin granted broadly)
+  2. No network segmentation between namespaces
+  3. Kubernetes audit logging not enabled
+  4. Secrets not encrypted at rest in etcd
+  5. No image vulnerability scanning
+  6. No documented change management process
+  7. EU data sometimes processed in US clusters (GDPR violation)
+  8. No data deletion workflow across microservices
+  Task: Design the compliance remediation plan. Write: the mapping of
+  each compliance requirement to Kubernetes controls, the technical
+  implementation for each gap (RBAC tightening, NetworkPolicy, audit
+  logging, encryption, scanning), the organizational processes needed
+  (change management, incident response documentation), how to maintain
+  compliance continuously (not just for audits), and the evidence
+  collection automation for audit readiness.
+assertions:
+  - type: llm_judge
+    criteria: "Compliance-to-Kubernetes control mapping is comprehensive — RBAC maps to SOC2 access control + PCI-DSS access + HIPAA minimum necessary. NetworkPolicy maps to PCI-DSS segmentation + HIPAA access controls. Kubernetes audit logging maps to SOC2 monitoring + PCI-DSS logging + HIPAA audit trails. etcd encryption maps to SOC2 + PCI-DSS + HIPAA data protection. Image scanning maps to PCI-DSS vulnerability management. Data residency controls map to GDPR requirements"
+    weight: 0.35
+    description: "Control mapping"
+  - type: llm_judge
+    criteria: "Technical implementations are specific — RBAC: namespace-scoped roles, remove cluster-admin, implement just-in-time access for emergency. NetworkPolicy: default-deny + explicit allow between payment services. Audit logging: configure kube-apiserver audit policy (log all write operations, log read operations on sensitive resources). etcd encryption: enable encryption provider with KMS integration. Image scanning: Trivy in CI/CD + admission controller blocking vulnerable images"
+    weight: 0.35
+    description: "Technical implementations"
+  - type: llm_judge
+    criteria: "Continuous compliance and automation are addressed — shift from point-in-time audits to continuous compliance. Implement: OPA/Kyverno policies enforcing compliance rules (prevent non-compliant resources), automated compliance dashboards (policy violations, RBAC drift, unscanned images), evidence collection automation (export audit logs, RBAC reports, scan results to compliance platform). Regular compliance reviews (monthly), automated drift detection. GDPR: implement data flow tracking, build cross-service data deletion API"
+    weight: 0.30
+    description: "Continuous compliance"

package/courses/mysql-query-optimization/course.yaml ADDED

@@ -0,0 +1,11 @@
+id: mysql-query-optimization
+name: "Database Query Optimization (MySQL)"
+description: >
+  Master MySQL query optimization from reading execution plans to
+  enterprise database architecture. Learn InnoDB internals, indexing
+  strategies, JOIN optimization, partitioning, replication, connection
+  management, write optimization, monitoring, and high-availability
+  configurations for large-scale MySQL deployments.
+levels: 5
+scenarios_per_level: 10
+tags: [development, MySQL, database, query-optimization, performance, SQL, InnoDB, DevOps]

package/courses/mysql-query-optimization/scenarios/level-1/buffer-pool-basics.yaml ADDED

@@ -0,0 +1,65 @@
+meta:
+  id: buffer-pool-basics
+  level: 1
+  course: mysql-query-optimization
+  type: output
+  description: "Understand InnoDB buffer pool — learn how MySQL caches data in memory, why buffer pool size matters, and how to monitor hit ratios"
+  tags: [MySQL, InnoDB, buffer-pool, memory, caching, beginner]
+state: {}
+trigger: |
+  Your MySQL server has 64GB of RAM but the InnoDB buffer pool is set
+  to the default 128MB. The server is hitting disk for almost every
+  query. A colleague says "just set innodb_buffer_pool_size to 48GB"
+  but you want to understand what that actually does before changing a
+  production setting.
+  Current state:
+  SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool%';
+  +-------------------------------------+----------+
+  | Variable_name                       | Value    |
+  +-------------------------------------+----------+
+  | Innodb_buffer_pool_read_requests    | 50000000 |
+  | Innodb_buffer_pool_reads            | 15000000 |
+  | Innodb_buffer_pool_pages_total      | 8192     |
+  | Innodb_buffer_pool_pages_data       | 8100     |
+  | Innodb_buffer_pool_pages_dirty      | 200      |
+  | Innodb_buffer_pool_pages_free       | 12       |
+  +-------------------------------------+----------+
+  Calculations from this data:
+  - Buffer pool hit ratio: (50M - 15M) / 50M = 70% (terrible — should be 99%+)
+  - Buffer pool is full: only 12 free pages out of 8,192
+  - 200 dirty pages waiting to be flushed to disk
+  - Total data in buffer pool: 8,100 × 16KB = ~126MB (matches 128MB setting)
+  Database size: 30GB of data + 10GB of indexes = 40GB total
+  Available RAM: 64GB (OS and other processes need ~8GB)
+  Questions to answer:
+  1. What is the buffer pool and why is it critical for InnoDB?
+  2. What should innodb_buffer_pool_size be set to?
+  3. What happens when data isn't in the buffer pool (disk read)?
+  4. How does InnoDB decide what to evict (LRU algorithm)?
+  5. What are dirty pages and how does flushing work?
+  Task: Explain the InnoDB buffer pool. Write: what it is (in-memory
+  cache for data and indexes), how to size it (rule of thumb and
+  monitoring approach), how the LRU eviction works (midpoint insertion),
+  what dirty pages are and how checkpoint/flushing works, and how to
+  monitor buffer pool health.
+assertions:
+  - type: llm_judge
+    criteria: "Buffer pool is correctly explained — InnoDB's main memory cache holding both data pages and index pages (16KB each). All reads and writes go through the buffer pool first. A read request either finds the page in memory (cache hit) or reads from disk (cache miss). The 70% hit ratio is correctly identified as terrible and the 128MB setting is identified as far too small"
+    weight: 0.35
+    description: "Buffer pool basics"
+  - type: llm_judge
+    criteria: "Sizing recommendation is practical — suggests 70-80% of total RAM for dedicated MySQL servers (so ~48-50GB for this 64GB server), explains why the working set (40GB) should ideally fit in the buffer pool, mentions innodb_buffer_pool_instances for multi-core efficiency (at least 1GB per instance, up to 64 instances), and notes that resizing requires a restart before MySQL 5.7.5 but is dynamic from 5.7.5 onward, including 8.0"
+    weight: 0.35
+    description: "Sizing recommendation"
+  - type: llm_judge
+    criteria: "LRU and flushing are explained — InnoDB uses a modified LRU with midpoint insertion (new pages go to midpoint, not head, to prevent one-time full table scans from evicting hot data), dirty pages are modified pages not yet written to disk, and checkpoint/flushing writes dirty pages back to tablespace files (controlled by innodb_io_capacity)"
+    weight: 0.30
+    description: "LRU and flushing explained"
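
Reviewer sketch for the buffer-pool scenario above: the hit-ratio arithmetic in the trigger can be reproduced directly on the server, and the resize can be applied online. This is a minimal sketch that assumes MySQL 5.7.5 or later (where `performance_schema.global_status` and online buffer pool resizing are available). The 48GB figure is the one quoted in the scenario, not a general recommendation.

```sql
-- Hit ratio from the two counters shown in the scenario.
-- VARIABLE_VALUE is stored as text; MySQL converts it to a number for arithmetic.
SELECT
  rr.VARIABLE_VALUE AS read_requests,
  r.VARIABLE_VALUE  AS disk_reads,
  ROUND(100 * (1 - r.VARIABLE_VALUE / rr.VARIABLE_VALUE), 2) AS hit_ratio_pct
FROM performance_schema.global_status AS rr
JOIN performance_schema.global_status AS r
  ON r.VARIABLE_NAME = 'Innodb_buffer_pool_reads'
WHERE rr.VARIABLE_NAME = 'Innodb_buffer_pool_read_requests';

-- Resize online. 51539607552 bytes = 48 * 1024^3 (the 48GB from the scenario).
-- InnoDB rounds the value to a multiple of
-- innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances.
SET GLOBAL innodb_buffer_pool_size = 51539607552;

-- The resize runs in the background; this status variable reports progress.
SHOW STATUS LIKE 'Innodb_buffer_pool_resize_status';
```

With the counters listed in the trigger, the first query should report roughly 70%, matching the scenario's own calculation. `SET GLOBAL` does not survive a restart, so the new size also needs to go into the server configuration (or be set with `SET PERSIST` on MySQL 8.0).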

package/courses/mysql-query-optimization/scenarios/level-1/explain-basics.yaml ADDED

@@ -0,0 +1,66 @@
+meta:
+  id: explain-basics
+  level: 1
+  course: mysql-query-optimization
+  type: output
+  description: "Read MySQL EXPLAIN output — understand execution plans, access types, key usage, and rows examined to diagnose slow queries"
+  tags: [MySQL, EXPLAIN, execution-plan, access-types, beginner]
+state: {}
+trigger: |
+  You're a backend developer and your team lead asks you to investigate
+  why the product listing page takes 8 seconds to load. You've traced
+  it to this query:
+  SELECT p.*, c.name AS category_name, b.name AS brand_name
+  FROM products p
+  JOIN categories c ON c.id = p.category_id
+  JOIN brands b ON b.id = p.brand_id
+  WHERE p.status = 'active'
+    AND p.price BETWEEN 10 AND 100
+  ORDER BY p.created_at DESC
+  LIMIT 20;
+  You run EXPLAIN and get:
+  +----+-------------+-------+--------+---------+------+----------+-----------------------------+
+  | id | select_type | table | type   | key     | rows | filtered | Extra                       |
+  +----+-------------+-------+--------+---------+------+----------+-----------------------------+
+  |  1 | SIMPLE      | p     | ALL    | NULL    | 2M   | 5.00     | Using where; Using filesort |
+  |  1 | SIMPLE      | c     | eq_ref | PRIMARY | 1    | 100.00   | NULL                        |
+  |  1 | SIMPLE      | b     | eq_ref | PRIMARY | 1    | 100.00   | NULL                        |
+  +----+-------------+-------+--------+---------+------+----------+-----------------------------+
+  The products table has 2 million rows. The EXPLAIN shows:
+  - type=ALL for products (full table scan)
+  - "Using filesort" in Extra (an extra sort pass, in memory or on disk)
+  - rows=2M (examining all rows)
+  - filtered=5.00 (only 5% of rows match the WHERE)
+  Available indexes on products:
+  - PRIMARY (id)
+  - idx_category_id (category_id)
+  - idx_brand_id (brand_id)
+  No index covers the status + price + created_at combination.
+  Task: Explain what the EXPLAIN output means. Write: what each column
+  tells you (id, select_type, table, type, possible_keys, key, rows,
+  filtered, Extra), why this query is slow (full table scan + filesort),
+  what the access type hierarchy means (system > const > eq_ref > ref >
+  range > index > ALL), and what index you would add to fix this query.
+assertions:
+  - type: llm_judge
+    criteria: "EXPLAIN columns are correctly explained — id (query identifier), select_type (SIMPLE, SUBQUERY, DERIVED, etc.), table (which table), type (access method), possible_keys (candidate indexes), key (chosen index), rows (estimated rows examined), filtered (percentage matching WHERE), Extra (additional information like Using index, Using filesort, Using temporary)"
+    weight: 0.35
+    description: "EXPLAIN columns explained"
+  - type: llm_judge
+    criteria: "Access type hierarchy is explained — from best to worst: system (1 row), const (unique index lookup), eq_ref (unique index per join row), ref (non-unique index), range (index range scan), index (full index scan), ALL (full table scan). The current ALL type is identified as the problem, and the eq_ref for categories/brands is correctly identified as optimal"
+    weight: 0.35
+    description: "Access type hierarchy"
+  - type: llm_judge
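+    criteria: "Fix is correct — recommends a composite index such as (status, created_at), which serves the equality filter and returns rows already in sort order so the filesort disappears and LIMIT 20 can stop early, or (status, price, created_at), which narrows the scan by price range but still needs a sort because created_at follows a range column. Explains why column order matters in the index (equality columns first, then either the range column or the sort column, since an index cannot serve a sort on a column that follows a range column)"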
+    weight: 0.30
+    description: "Correct index fix"
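
Reviewer sketch for the EXPLAIN scenario above: one way the index fix could look in practice. The index name `idx_status_created` is hypothetical, and whether the optimizer actually chooses the index depends on table statistics, so the `EXPLAIN` should be re-run to confirm rather than assumed.

```sql
-- Equality column first (status), then the ORDER BY column (created_at).
-- The price range is then checked row by row while reading in sort order.
ALTER TABLE products
  ADD INDEX idx_status_created (status, created_at);

-- Re-check the plan for the original query from the scenario.
EXPLAIN
SELECT p.*, c.name AS category_name, b.name AS brand_name
FROM products p
JOIN categories c ON c.id = p.category_id
JOIN brands b ON b.id = p.brand_id
WHERE p.status = 'active'
  AND p.price BETWEEN 10 AND 100
ORDER BY p.created_at DESC
LIMIT 20;
```

If the index is used, the `type` for `p` should move from `ALL` to `ref`, and "Using filesort" should no longer appear in `Extra`. If few active rows fall in the price range, the scan may read many index entries before finding 20 matches, which is the trade-off against a (status, price, created_at) index.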

package/courses/mysql-query-optimization/scenarios/level-1/first-optimization-shift.yaml ADDED

@@ -0,0 +1,78 @@
+meta:
+  id: first-optimization-shift
+  level: 1
+  course: mysql-query-optimization
+  type: output
+  description: "First optimization shift — diagnose and fix a slow e-commerce checkout combining multiple beginner optimization techniques"
+  tags: [MySQL, shift-simulation, checkout, diagnosis, beginner]
+state: {}
+trigger: |
+  It's your first on-call shift as a junior DBA. The checkout page is
+  loading slowly (8 seconds) and the engineering manager asks you to
+  investigate. You have access to the MySQL server and the application
+  logs.
+  Step 1 — You check SHOW PROCESSLIST and see 15 queries related to
+  the checkout endpoint, all from the same user session.
+  Step 2 — You enable the slow query log and capture these queries
+  from a single checkout request:
+  Query A (2 seconds):
+  SELECT * FROM products WHERE id IN (101, 205, 308, 412, 567);
+  -- EXPLAIN: type=range, key=PRIMARY, rows=5, Extra=NULL
+  -- Fast! Not the problem.
+  Query B (3 seconds):
+  SELECT * FROM inventory
+  WHERE product_id IN (101, 205, 308, 412, 567)
+    AND warehouse_id = 3
+    AND CAST(quantity AS CHAR) > '0';
+  -- EXPLAIN: type=ALL, rows=2M, Extra=Using where
+  -- Full table scan! The CAST prevents index usage.
+  Query C (0.5ms each, but called 5 times):
+  SELECT * FROM shipping_rates
+  WHERE zone_id = (SELECT zone_id FROM addresses WHERE user_id = 42);
+  -- Not slow individually, but it's an N+1 pattern (called per item)
+  Query D (2 seconds):
+  SELECT p.name, p.price, COUNT(r.id) AS review_count,
+         AVG(r.rating) AS avg_rating
+  FROM products p
+  LEFT JOIN reviews r ON r.product_id = p.id
+  WHERE p.id IN (101, 205, 308, 412, 567)
+  GROUP BY p.id;
+  -- EXPLAIN: reviews table type=ALL, rows=5M, Using temporary; Using filesort
+  -- No index on reviews.product_id!
+  Query E (1 second):
+  SELECT * FROM promotions
+  WHERE NOW() BETWEEN start_date AND end_date
+    AND minimum_order > 0
+  ORDER BY discount_percent DESC;
+  -- EXPLAIN: type=ALL, rows=50K
+  -- No index on start_date or end_date. NOW() is constant for the statement, so it does not block a range scan by itself
+  Total: ~8 seconds across these queries.
+  Task: Diagnose and fix each slow query. Write: the root cause of each
+  query's performance problem, the specific fix (index, query rewrite,
+  or both), the expected improvement, and the priority order for
+  implementing fixes (which fix gives the biggest improvement first).
+assertions:
+  - type: llm_judge
+    criteria: "All root causes identified. Query B: CAST() on the indexed column makes the predicate non-sargable and prevents index usage. Query C: N+1 pattern (should be a single query with a JOIN). Query D: missing index on reviews.product_id. Query E: no usable index on start_date/end_date (NOW() is evaluated once per statement, so it is not itself the cause). Query A is correctly identified as not a problem"
+    weight: 0.35
+    description: "Root causes identified"
+  - type: llm_judge
+    criteria: "Fixes are correct — Query B: change to quantity > 0 (no CAST needed for numeric comparison), Query C: rewrite as single JOIN query, Query D: ADD INDEX idx_product_id (product_id) on reviews, Query E: rewrite WHERE clause to compare dates directly or add composite index on (end_date, start_date). Priority order puts highest-impact fixes first"
+    weight: 0.35
+    description: "Correct fixes"
+  - type: llm_judge
+    criteria: "Expected improvements are realistic — quantifies improvement for each fix (e.g., Query B from 3s to <10ms with proper index, Query D from 2s to <50ms with index on reviews.product_id). Total checkout time reduction from 8s to well under 1 second"
+    weight: 0.30
+    description: "Realistic improvement estimates"
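
The CAST problem in Query B above can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in (SQLite's planner is not MySQL's, and its plan text differs from MySQL's EXPLAIN), and the composite index `idx_wh_qty` is hypothetical, since the scenario does not say which indexes exist on `inventory`. It compares the plan for the wrapped predicate with the plan for a bare numeric comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory ("
    " product_id INTEGER, warehouse_id INTEGER, quantity INTEGER)"
)
# Hypothetical composite index; not part of the original scenario.
conn.execute("CREATE INDEX idx_wh_qty ON inventory (warehouse_id, quantity)")


def plan(sql):
    """Return SQLite's query-plan detail text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # 4th column holds the detail


# Predicate wrapped in CAST, as in Query B (TEXT is SQLite's analogue of CHAR).
wrapped = plan(
    "SELECT * FROM inventory"
    " WHERE warehouse_id = 3 AND CAST(quantity AS TEXT) > '0'"
)
# Bare numeric comparison, the fix named in the assertions.
bare = plan(
    "SELECT * FROM inventory WHERE warehouse_id = 3 AND quantity > 0"
)

print("wrapped:", wrapped)
print("bare:   ", bare)
```

In this sketch only the bare comparison lets the `quantity` column take part in the index range; with the CAST, the planner can narrow by `warehouse_id` alone and must evaluate the expression row by row.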

package/courses/mysql-query-optimization/scenarios/level-1/innodb-index-fundamentals.yaml ADDED

@@ -0,0 +1,68 @@
+meta:
+  id: innodb-index-fundamentals
+  level: 1
+  course: mysql-query-optimization
+  type: output
+  description: "Understand InnoDB indexing — learn clustered indexes, secondary indexes, and how InnoDB's B+tree structure affects query performance"
+  tags: [MySQL, InnoDB, indexes, B-tree, clustered-index, beginner]
+state: {}
+trigger: |
+  Your team is debating whether to use UUID or auto-increment INT as
+  the primary key for a new high-traffic table. A senior engineer says
+  "UUIDs will kill insert performance in InnoDB." You need to understand
+  why.
+  Context:
+  - Table: user_events (expected 100M rows in first year)
+  - Current schema option A (auto-increment):
+    CREATE TABLE user_events (
+      id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
+      user_id BIGINT UNSIGNED NOT NULL,
+      event_type VARCHAR(50) NOT NULL,
+      payload JSON,
+      created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+      INDEX idx_user_events (user_id, created_at)
+    ) ENGINE=InnoDB;
+  - Current schema option B (UUID):
+    CREATE TABLE user_events (
+      id CHAR(36) PRIMARY KEY,
+      user_id BIGINT UNSIGNED NOT NULL,
+      event_type VARCHAR(50) NOT NULL,
+      payload JSON,
+      created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+      INDEX idx_user_events (user_id, created_at)
+    ) ENGINE=InnoDB;
+  Your measurements after inserting 10M rows:
+  - Option A: 15 minutes, 2GB on disk, sequential I/O during inserts
+  - Option B: 45 minutes, 3.5GB on disk, random I/O during inserts
+  The team wants to understand:
+  1. What is a clustered index and why does InnoDB always have one?
+  2. Why do UUIDs cause random I/O on insert?
+  3. Why is Option B 75% larger on disk?
+  4. How do secondary indexes work in InnoDB (why do they store the PK)?
+  5. What is a "covering index" and how does it avoid table lookups?
+  Task: Explain InnoDB's indexing architecture. Write: how the clustered
+  index works (data stored in PK order), why UUIDs cause page splits
+  and random I/O, how secondary indexes reference the primary key (not
+  row pointers), what a covering index is (index-only scan), and your
+  recommendation for the primary key choice.
+assertions:
+  - type: llm_judge
+    criteria: "Clustered index is correctly explained — InnoDB stores table data in primary key order (the clustered index IS the table), inserts with auto-increment append to the end (sequential I/O), while random UUIDs insert into random positions causing page splits and fragmentation"
+    weight: 0.35
+    description: "Clustered index explanation"
+  - type: llm_judge
+    criteria: "Secondary indexes are explained: InnoDB secondary indexes store the primary key value (not a row pointer), so lookups via a secondary index require a 'bookmark lookup' back to the clustered index. Larger PKs (CHAR(36) UUID = 36 bytes vs BIGINT = 8 bytes) therefore make every secondary index larger, which together with half-empty pages left by page splits explains the 75% size difference"
+    weight: 0.35
+    description: "Secondary index mechanics"
+  - type: llm_judge
+    criteria: "Recommendation is practical: recommends auto-increment BIGINT for this use case (high insert volume), or, if UUIDs are needed, suggests BINARY(16) storage via UUID_TO_BIN() with the swap flag (time-ordered for version 1 UUIDs) or ULID/UUIDv7 (time-sorted) as a compromise that preserves insert performance while maintaining uniqueness"
+    weight: 0.30
+    description: "Practical PK recommendation"
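
The page-split argument in this scenario can be illustrated with a toy model. This is a rough sketch, not InnoDB's actual algorithm: `PAGE_CAPACITY`, the `simulate` helper, and the split rule (start a new page on an append, split in half otherwise) are simplifying assumptions made for illustration. It inserts the same number of keys in ascending order and in scattered order, then reports page count, split count, and fill factor. The last lines work out the raw key bytes added by a 36-byte primary key, using the row count and index layout from the scenario.

```python
import bisect
import random

PAGE_CAPACITY = 100  # keys per leaf page; an illustrative number, not InnoDB's


def simulate(keys):
    """Insert keys into toy sorted leaf pages; return (pages, splits, fill)."""
    pages = [[]]                # each page is a sorted list of keys
    firsts = [float("-inf")]    # lowest key routed to each page
    splits = 0
    for key in keys:
        i = bisect.bisect_right(firsts, key) - 1
        page = pages[i]
        if len(page) < PAGE_CAPACITY:
            bisect.insort(page, key)
        elif i == len(pages) - 1 and key > page[-1]:
            # Append pattern: leave the full page alone, start a new one.
            pages.append([key])
            firsts.append(key)
        else:
            # Insert into the middle of a full page: split it in half.
            mid = len(page) // 2
            left, right = page[:mid], page[mid:]
            pages[i] = left
            pages.insert(i + 1, right)
            firsts.insert(i + 1, right[0])
            bisect.insort(left if key < right[0] else right, key)
            splits += 1
    fill = sum(len(p) for p in pages) / (len(pages) * PAGE_CAPACITY)
    return len(pages), splits, fill


N = 20_000
sequential = simulate(range(N))  # auto-increment-like keys
scattered = simulate(random.Random(42).sample(range(10**12), N))  # UUID-like

print("sequential (pages, splits, fill):", sequential)
print("scattered  (pages, splits, fill):", scattered)

# Raw key bytes only: each row's PK is stored once in the clustered index and
# once more in the single secondary index (idx_user_events).
ROWS = 10_000_000
extra_bytes = ROWS * (36 - 8) * 2
print("extra PK bytes for CHAR(36) vs BIGINT:", extra_bytes)
```

In this model the ascending keys fill 200 pages completely with no splits, while the scattered keys need more pages at a lower fill factor. The key-byte arithmetic gives 560,000,000 bytes, roughly a third of the measured 1.5 GB gap, which suggests that fragmentation and per-record overhead account for much of the rest.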