npm - dojo.md - Versions diffs - 0.2.0 → 0.2.1 - Mend

dojo.md 0.2.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (222)

package/courses/mysql-query-optimization/scenarios/level-4/aurora-vs-rds-evaluation.yaml ADDED

@@ -0,0 +1,61 @@
+meta:
+  id: aurora-vs-rds-evaluation
+  level: 4
+  course: mysql-query-optimization
+  type: output
+  description: "Evaluate Aurora vs RDS MySQL — compare performance, cost, features, and operational trade-offs for enterprise workloads"
+  tags: [MySQL, Aurora, RDS, AWS, vendor-evaluation, TCO, expert]
+state: {}
+trigger: |
+  Your company runs 10 MySQL RDS instances and the CTO wants to
+  evaluate migrating to Aurora MySQL. The AWS sales team claims
+  "5x performance improvement and lower TCO." You need to validate
+  these claims with real analysis.
+  Current state (RDS MySQL):
+  - 10 instances: 5 primary + 5 read replicas
+  - Instance type: db.r6g.4xlarge ($2.46/hr per instance)
+  - Storage: 5TB gp3 across all instances
+  - IOPS: 16,000 provisioned per primary
+  - Monthly cost: $22K (compute) + $3K (storage) + $1K (IOPS) = $26K
+  - Replication lag: 5-30 seconds (async)
+  - Failover time: 60-120 seconds
+  - Backup: automated daily, 7-day retention
+  Aurora proposal (AWS sales):
+  - 5 instances: 1 writer + 4 readers (Aurora needs fewer due to
+    shared storage)
+  - Instance type: db.r6g.4xlarge ($3.28/hr per instance)
+  - Storage: auto-scaling, pay per GB-month ($0.10/GB)
+  - I/O: pay per million requests ($0.20 per million)
+  - Estimated monthly cost: $12K (compute) + $0.5K (storage) + $4K (I/O) = $16.5K
+  - Replication lag: <100ms (shared storage, no log replay)
+  - Failover time: <30 seconds
+  - Backup: continuous, 35-day retention
+  Concerns raised by the team:
+  1. Aurora's "5x faster" claim — is it real for our workload?
+  2. I/O costs are unpredictable — could they spike?
+  3. Lock-in: Aurora has proprietary storage, can we migrate away?
+  4. MySQL compatibility: are there breaking differences?
+  5. The 4-reader setup: can it handle our analytics workload?
+  Task: Write the evaluation report. Include: the TCO comparison (with
+  hidden costs), the performance benchmark plan, the compatibility
+  assessment, the lock-in risk analysis, and the recommendation.
+assertions:
+  - type: llm_judge
+    criteria: "TCO comparison includes hidden costs — accounts for I/O costs (which can be significant and unpredictable for write-heavy workloads), data transfer costs, backup storage beyond included, and the cost of migration itself. 3-year projection shows when each option is cheaper. Validates or refutes the $26K vs $16.5K estimate"
+    weight: 0.35
+    description: "TCO with hidden costs"
+  - type: llm_judge
+    criteria: "Performance claims are critically analyzed — the '5x faster' claim applies to specific workloads (Aurora's distributed storage eliminates checkpoint stalls), not universally. Recommends benchmarking with actual workload, identifies where Aurora is genuinely faster (failover, replication lag, crash recovery) and where it may not be (single-query latency for simple queries)"
+    weight: 0.35
+    description: "Performance claims analyzed"
+  - type: llm_judge
+    criteria: "Lock-in and compatibility are honestly assessed: Aurora's proprietary storage means you can't simply mysqldump/restore back to RDS MySQL. Compatibility differences are covered (minor SQL behavior, parameter differences, Aurora-specific features). A migration path back to RDS exists but requires downtime. Lock-in risk is quantified in terms of migration effort and cost"
+    weight: 0.30
+    description: "Lock-in and compatibility"
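
The compute and I/O figures in this scenario can be sanity-checked from the quoted unit prices. The sketch below assumes a 730-hour month and plain on-demand pricing (neither is stated in the scenario), and the function names are illustrative only.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def monthly_compute(instances: int, hourly_rate: float) -> float:
    """On-demand compute cost per month, in dollars."""
    return round(instances * hourly_rate * HOURS_PER_MONTH, 2)


def monthly_io(million_requests: float, price_per_million: float = 0.20) -> float:
    """Aurora I/O cost per month at the quoted per-million-request price."""
    return round(million_requests * price_per_million, 2)


rds_compute = monthly_compute(10, 2.46)    # scenario: 10 x db.r6g.4xlarge on RDS
aurora_compute = monthly_compute(5, 3.28)  # scenario: 5 x db.r6g.4xlarge on Aurora

# I/O volume implied by the $4K/month estimate, in millions of requests
implied_io_millions = 4000 / 0.20

print(f"RDS compute:    ${rds_compute:,.0f}/month")
print(f"Aurora compute: ${aurora_compute:,.0f}/month")
print(f"Implied I/O:    {implied_io_millions:,.0f} million requests/month")
print(f"I/O at 3x load: ${monthly_io(3 * implied_io_millions):,.0f}/month")
```

Under these assumptions the RDS compute line comes to roughly $18K rather than the $22K in the trigger, while the Aurora compute line matches the quoted $12K. A report could flag that gap (Multi-AZ or other charges might explain it) and use the last line to show how the I/O charge scales with load.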

package/courses/mysql-query-optimization/scenarios/level-4/data-architecture.yaml ADDED

@@ -0,0 +1,62 @@
+meta:
+  id: data-architecture
+  level: 4
+  course: mysql-query-optimization
+  type: output
+  description: "Design MySQL data architecture — separate OLTP from OLAP, implement sharding strategies, and plan for 10x growth"
+  tags: [MySQL, data-architecture, sharding, OLTP, OLAP, scaling, expert]
+state: {}
+trigger: |
+  Your SaaS platform has outgrown a single MySQL cluster. The primary
+  database is 5TB, handles 100K QPS, and analytics queries are
+  degrading OLTP performance. The CTO asks you to design the data
+  architecture for the next 3 years of 10x growth.
+  Current state:
+  - Single Aurora MySQL cluster: 1 writer + 4 readers
+  - 5TB data, 500 tables, 100K QPS
+  - OLTP and analytics on the same cluster
+  - Largest table: events (2B rows, 2TB)
+  - Analytics queries run on readers but cause replication lag
+  Growth projections:
+  - Users: 5M → 50M in 3 years
+  - Data: 5TB → 50TB
+  - QPS: 100K → 1M
+  - Analytics: real-time dashboards, ML feature extraction
+  Architecture options:
+  A. Vertical scaling (bigger instances)
+  B. Read replicas (more readers)
+  C. Functional sharding (split by domain: users, orders, events)
+  D. Horizontal sharding (split by tenant_id or user_id)
+  E. Vitess (MySQL sharding proxy from YouTube/PlanetScale)
+  F. TiDB (MySQL-compatible distributed database)
+  G. OLTP stays MySQL + OLAP moves to ClickHouse/BigQuery
+  Constraints:
+  - Team knows MySQL, limited distributed systems experience
+  - Must support transactions within a single shard
+  - Cross-shard JOINs needed for some reports
+  - Data locality: some jurisdictions require data residency
+  Task: Design the data architecture. Write: the recommended approach
+  with justification, the sharding strategy (if applicable), the
+  OLTP/OLAP separation design, the data pipeline architecture, and
+  the migration plan from current state.
+assertions:
+  - type: llm_judge
+    criteria: "Architecture addresses 10x growth — evaluates all options, recommends a phased approach (functional sharding first, then horizontal if needed), separates OLTP and OLAP workloads (MySQL for OLTP, analytical database for OLAP), and explains why each rejected option doesn't meet requirements"
+    weight: 0.35
+    description: "10x growth architecture"
+  - type: llm_judge
+    criteria: "Sharding strategy is practical — if recommending sharding, defines the shard key (tenant_id for multi-tenant SaaS), explains how cross-shard queries are handled (fan-out or analytical database), addresses data locality requirements, and considers Vitess or ProxySQL for shard routing. If recommending against sharding, explains the alternative scaling path"
+    weight: 0.35
+    description: "Practical sharding strategy"
+  - type: llm_judge
+    criteria: "OLTP/OLAP separation is designed — explains the data pipeline from MySQL to analytical database (CDC via Debezium, binlog streaming, or DMS), data freshness requirements, and how this eliminates analytics impact on OLTP. Migration plan is phased with rollback capability at each stage"
+    weight: 0.30
+    description: "OLTP/OLAP separation"
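
The tenant-based shard key and the data-residency constraint in this scenario can be combined in a small routing function. This is a minimal sketch, not a recommended production design: the region names, shard names, and `shard_for` helper are all hypothetical, and real deployments would more likely delegate this to a routing layer such as Vitess or ProxySQL.

```python
import hashlib

# Hypothetical shard layout: each region owns its own shard list so that
# tenants with residency requirements never leave their jurisdiction.
SHARDS_BY_REGION = {
    "eu": ["eu-shard-0", "eu-shard-1"],
    "us": ["us-shard-0", "us-shard-1", "us-shard-2"],
}


def shard_for(tenant_id: int, region: str) -> str:
    """Map a tenant to one shard inside its home region.

    Uses a stable hash (not Python's built-in hash(), which is salted per
    process for strings) so every application server picks the same shard.
    """
    shards = SHARDS_BY_REGION[region]
    digest = hashlib.sha256(str(tenant_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    for tenant in (1, 42, 1001):
        print(tenant, shard_for(tenant, "eu"), shard_for(tenant, "us"))
```

Because every row for a tenant lands on one shard, single-shard transactions keep working. The trade-off of modulo hashing is that changing the shard count remaps most tenants, which is one reason a directory table or a managed routing layer is often preferred.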

package/courses/mysql-query-optimization/scenarios/level-4/database-migration-planning.yaml ADDED

@@ -0,0 +1,59 @@
+meta:
+  id: database-migration-planning
+  level: 4
+  course: mysql-query-optimization
+  type: output
+  description: "Plan a MySQL migration — migrate from MySQL 5.7 to 8.0, or from self-managed to Aurora, with near-zero downtime"
+  tags: [MySQL, migration, upgrade, Aurora, zero-downtime, expert]
+state: {}
+trigger: |
+  Your company needs to migrate from MySQL 5.7 (self-managed on EC2)
+  to MySQL 8.0 on Aurora. The database processes $30M in transactions
+  daily with a 99.95% availability SLA. MySQL 5.7 reaches end of
+  extended support, and the business needs Aurora's read scaling.
+  Source: MySQL 5.7.44 on EC2, 3TB, 150 tables, 80K QPS peak
+  Target: Aurora MySQL 8.0 (latest)
+  Challenges:
+  1. 3TB data takes ~5 hours via mysqldump
+  2. MySQL 5.7→8.0 has breaking changes:
+     - Default charset changes (latin1 → utf8mb4)
+     - GROUP BY implicit ordering removed
+     - Reserved word changes (RANK, GROUPS, etc.)
+     - JSON function behavior changes
+     - Authentication plugin changes (mysql_native_password → caching_sha2_password)
+  3. Application uses mysql_native_password — Aurora 8.0 defaults to
+     caching_sha2_password
+  4. 15 stored procedures using deprecated syntax
+  5. 3 applications use MySQL-specific features (GET_LOCK, FOUND_ROWS)
+  6. Read replicas serving 3 analytics dashboards must migrate too
+  7. Maximum acceptable downtime: 10 minutes
+  Migration approaches:
+  A. mysqldump + restore (simple, hours of downtime)
+  B. AWS DMS (managed migration, CDC for near-zero downtime)
+  C. Binlog replication (source→target, then cutover)
+  D. Percona XtraBackup + binlog replay
+  E. Blue-green with Aurora cloning
+  Task: Design the migration plan. Write: the compatibility assessment
+  (what breaks between 5.7 and 8.0), the chosen approach with
+  justification, the step-by-step migration procedure, the cutover
+  choreography (the 10-minute window), and the rollback plan.
+assertions:
+  - type: llm_judge
+    criteria: "Compatibility assessment is thorough — identifies the key 5.7→8.0 breaking changes (charset defaults, GROUP BY ordering, reserved words, authentication plugin, deprecated syntax), provides a testing plan to validate application compatibility, and addresses the stored procedure and MySQL-specific feature issues before migration day"
+    weight: 0.35
+    description: "Thorough compatibility assessment"
+  - type: llm_judge
+    criteria: "Migration approach achieves <10 min downtime — uses DMS or binlog replication for initial data sync + ongoing CDC, then a brief cutover window (stop writes, let replication catch up, switch endpoints, verify, resume). The 3TB initial sync happens while source is live. Cutover choreography has timing estimates for each step"
+    weight: 0.35
+    description: "Near-zero downtime approach"
+  - type: llm_judge
+    criteria: "Rollback plan is viable — maintains source MySQL 5.7 as hot standby during migration, reverse replication from Aurora back to source if needed, DNS/endpoint switching allows quick rollback, and defines go/no-go criteria (data validation, application smoke tests, latency checks) before committing to the cutover"
+    weight: 0.30
+    description: "Viable rollback plan"

package/courses/mysql-query-optimization/scenarios/level-4/enterprise-governance.yaml ADDED

@@ -0,0 +1,50 @@
+meta:
+  id: enterprise-governance
+  level: 4
+  course: mysql-query-optimization
+  type: output
+  description: "Design MySQL governance for enterprises — standardize configurations, automate query review, and implement organization-wide database health monitoring"
+  tags: [MySQL, governance, enterprise, standardization, CI-CD, expert]
+state: {}
+trigger: |
+  You're the Director of Database Engineering at a company with 800
+  engineers and 120 MySQL databases. An audit revealed systemic issues:
+  - 120 databases, 25 teams, no consistent standards
+  - 40% of databases use default innodb_buffer_pool_size (128MB)
+  - 60 databases have no query-level monitoring
+  - Query performance varies 50x between teams for similar workloads
+  - 6 production incidents last year from unoptimized queries
+  - No query review before deployment
+  - Connection pooling used by only 3 teams
+  - Backups tested by 0 teams
+  The CTO wants a governance framework that:
+  - Standardizes MySQL configuration across all 120 databases
+  - Prevents slow queries from reaching production
+  - Ensures backup and recovery readiness
+  - Provides visibility into database health organization-wide
+  - Maintains team autonomy (no DBA bottleneck)
+  Your team: 4 DBAs (overwhelmed with tickets), no platform tooling.
+  Task: Design the governance framework. Write: the configuration
+  standards (tiered by workload size), the automated query review
+  pipeline (CI/CD integration), the monitoring standards, the backup
+  verification program, and the organizational structure.
+assertions:
+  - type: llm_judge
+    criteria: "Configuration standards are tiered — defines settings for small/medium/large databases (buffer pool sizing, connection limits, innodb settings), provides a base my.cnf template, and addresses the 40% on defaults as immediate remediation. Standards are enforceable via configuration management (Ansible, Terraform) not just documentation"
+    weight: 0.35
+    description: "Tiered configuration standards"
+  - type: llm_judge
+    criteria: "Query review is automated — CI/CD pipeline captures SQL changes, runs EXPLAIN against a staging database, flags queries with full table scans or high estimated rows, blocks deployment if thresholds exceeded. Self-service: teams see results immediately without waiting for DBA review. Includes pt-query-digest analysis of test suite queries"
+    weight: 0.35
+    description: "Automated query review"
+  - type: llm_judge
+    criteria: "Monitoring and backup programs are comprehensive — unified PMM or equivalent dashboard across 120 databases, standard alerting rules (replication lag, disk space, slow queries, connections), and backup verification: automated monthly restore tests with documented RTO/RPO per database. Organizational shift from ticket-taking DBAs to platform engineering team"
+    weight: 0.30
+    description: "Monitoring and backup programs"
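
The automated query review described in this scenario amounts to a gate over EXPLAIN output. The sketch below assumes the CI job has already run EXPLAIN against staging and parsed each output row into a dict keyed by the standard `table`, `type`, and `rows` columns. The threshold and function name are hypothetical.

```python
from typing import Dict, List

MAX_ESTIMATED_ROWS = 100_000  # hypothetical threshold; tune per database tier


def review_plan(plan_rows: List[Dict], max_rows: int = MAX_ESTIMATED_ROWS) -> List[str]:
    """Return a list of violations for one query's EXPLAIN output.

    A row with type "ALL" is a full table scan; `rows` is the optimizer's
    estimate of rows examined for that table.
    """
    violations = []
    for row in plan_rows:
        table = row.get("table", "?")
        if row.get("type") == "ALL":
            violations.append(f"{table}: full table scan")
        estimated = row.get("rows") or 0
        if estimated > max_rows:
            violations.append(f"{table}: ~{estimated} rows examined (limit {max_rows})")
    return violations


# Illustrative plan: one indexed lookup and one unindexed 200M-row table.
sample_plan = [
    {"table": "up", "type": "ref", "rows": 12},
    {"table": "ph", "type": "ALL", "rows": 200_000_000},
]
problems = review_plan(sample_plan)
for p in problems:
    print("BLOCK:", p)
```

A pipeline step could fail the build whenever the returned list is non-empty, which gives teams immediate feedback without a DBA in the loop. Estimates from a small staging dataset can understate production row counts, so thresholds would need calibrating against realistic data volumes.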

package/courses/mysql-query-optimization/scenarios/level-4/executive-communication.yaml ADDED

@@ -0,0 +1,54 @@
+meta:
+  id: executive-communication
+  level: 4
+  course: mysql-query-optimization
+  type: output
+  description: "Communicate MySQL performance to executives — translate optimization into ROI, business impact, and strategic investment"
+  tags: [MySQL, executive-communication, ROI, business-impact, strategy, expert]
+state: {}
+trigger: |
+  You're presenting a MySQL optimization program to the C-suite. Your
+  team spent 4 months optimizing the database infrastructure and the
+  CFO wants ROI numbers.
+  Optimization results:
+  1. Query optimization: Fixed 150 slow queries
+     - P99 latency: 3 seconds → 150ms (95% improvement)
+  2. Index cleanup: Removed 200 unused indexes, added 50 targeted ones
+     - Storage: reclaimed 15GB from unused indexes
+     - Write performance: 20% improvement
+  3. Connection pooling: Deployed ProxySQL across all services
+     - Connection-related errors: eliminated (was 500/day)
+  4. Buffer pool tuning: 128MB → 48GB
+     - Buffer pool hit ratio: 70% → 99.5%
+     - Disk reads: reduced 90%
+  5. Aurora migration for 3 clusters
+     - Failover time: 120 seconds → 30 seconds
+     - Cost increase: $5K/month
+  Financial context:
+  - Database infrastructure: $300K/year
+  - Engineering team (3 DBAs): $750K/year
+  - Revenue impact of downtime: $25K/minute
+  - Last year's downtime: 22 hours ($33M potential impact)
+  - This year so far: 45 minutes ($1.1M potential impact)
+  Task: Write the executive presentation. Include: ROI calculation,
+  technical debt assessment in business terms, scaling roadmap for 5x
+  growth, risk register, and recommended next investments.
+assertions:
+  - type: llm_judge
+    criteria: "ROI is in business terms — calculates value of reduced downtime ($33M→$1.1M risk reduction), faster page loads (conversion rate impact), and compares total investment against returns. Clear ROI ratio that a CFO can understand"
+    weight: 0.35
+    description: "Business-terms ROI"
+  - type: llm_judge
+    criteria: "Technical debt is translated for executives — presents remaining optimization opportunities as business risk/cost (not technical jargon), estimates cost of inaction, and prioritizes by business impact"
+    weight: 0.35
+    description: "Executive-friendly technical debt"
+  - type: llm_judge
+    criteria: "Scaling roadmap is concrete — projects costs at 2x, 3x, 5x growth, identifies architectural inflection points (when to shard, when to add caching layer), and provides timeline with investment milestones"
+    weight: 0.30
+    description: "Concrete scaling roadmap"
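
The downtime figures in this scenario follow directly from the $25K/minute rate, and the same arithmetic gives a first-pass ROI ratio. In the sketch below, attributing four months of the DBA team's annual cost plus one year of the Aurora cost increase to the program is an assumption, not something the scenario states.

```python
COST_PER_MINUTE = 25_000  # from the scenario: revenue impact of downtime


def downtime_exposure(minutes: float) -> float:
    """Potential revenue impact, in dollars, of the given downtime."""
    return minutes * COST_PER_MINUTE


last_year = downtime_exposure(22 * 60)  # 22 hours of downtime
this_year = downtime_exposure(45)       # 45 minutes so far
avoided = last_year - this_year

# Assumed program cost: 4 of 12 months of the $750K DBA team,
# plus the $5K/month Aurora increase over a full year.
program_cost = 750_000 * 4 / 12 + 5_000 * 12

print(f"Last year exposure: ${last_year:,.0f}")
print(f"This year exposure: ${this_year:,.0f}")
print(f"Exposure avoided:   ${avoided:,.0f}")
print(f"Program cost:       ${program_cost:,.0f}")
print(f"Ratio:              {avoided / program_cost:.0f}x")
```

Two caveats belong next to any such ratio in the presentation: the exposure is potential rather than realized revenue loss, and "this year so far" covers only part of a year, so the comparison flatters the result unless it is annualized.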

package/courses/mysql-query-optimization/scenarios/level-4/expert-optimization-shift.yaml ADDED

@@ -0,0 +1,67 @@
+meta:
+  id: expert-optimization-shift
+  level: 4
+  course: mysql-query-optimization
+  type: output
+  description: "Expert optimization shift — manage a multi-cluster MySQL crisis during a high-traffic event with cascading failures"
+  tags: [MySQL, shift-simulation, crisis, multi-cluster, cascading-failure, expert]
+state: {}
+trigger: |
+  You're the database lead during Cyber Monday. Your company's MySQL
+  infrastructure spans 5 clusters. At 09:00 AM, traffic hits 5x
+  normal and cascading failures begin.
+  Timeline:
+  09:00 — Traffic starts at 5x normal. All green initially.
+  09:12 — Cluster A (Orders): Primary CPU 95%. ProxySQL backend
+  connection queue growing. P99 latency at 3 seconds.
+  Cause: New recommendation engine query deployed at 08:55:
+    SELECT p.* FROM products p
+    JOIN purchase_history ph ON ph.product_id = p.id
+    JOIN user_preferences up ON up.category = p.category
+    WHERE up.user_id = ?
+    ORDER BY ph.purchase_count DESC LIMIT 20;
+  Missing index on purchase_history, scanning 200M rows per request.
+  09:18 — Cluster B (Users): InnoDB deadlock rate 200/minute.
+  The loyalty points update is deadlocking with the cart update:
+    -- Transaction 1 (loyalty): UPDATE users SET points = points + 10...
+    -- Transaction 2 (cart): UPDATE users SET cart_total = ...
+  Both acquiring row locks on the same user rows.
+  09:22 — Cluster C (Inventory): Disk space at 92%. Binary logs and
+  relay logs accumulating. All 3 replicas lagging >2 minutes.
+  Cause: Massive write volume (inventory decrements for every purchase)
+  generating 2GB/hour of binlog.
+  09:25 — Cluster A Replica 1 removed from ProxySQL (lag > 120s).
+  All read traffic shifts to Replica 2, which also starts lagging.
+  API error rate hits 30%.
+  09:28 — Cluster D (Analytics): 5 long-running queries holding
+  metadata locks, blocking a scheduled ALTER TABLE that's been
+  waiting for 10 minutes. Other DDL operations queuing behind it.
+  09:30 — Revenue loss: $50K/minute. CEO in war room.
+  Task: Write the incident response. Include: immediate triage (first
+  5 minutes), stabilization plan (next 30 minutes), root cause per
+  cluster, post-incident improvements, and war-room communications.
+assertions:
+  - type: llm_judge
+    criteria: "Triage is correctly prioritized — addresses the highest revenue impact first: Cluster A (kill the bad query or add emergency index, increase ProxySQL limits), then Cluster C disk space (PURGE BINARY LOGS, increase disk), then Cluster B deadlocks (separate the conflicting transactions), then Cluster D (kill waiting DDL, reschedule). Each step has a time estimate"
+    weight: 0.35
+    description: "Prioritized triage"
+  - type: llm_judge
+    criteria: "All 4 clusters are stabilized — Cluster A (add missing index or roll back deployment), Cluster B (redesign lock ordering or use optimistic locking), Cluster C (purge binlogs, add disk, tune binlog retention), Cluster D (cancel DDL, implement DDL scheduling policy). ProxySQL backend connections are restored"
+    weight: 0.35
+    description: "All clusters stabilized"
+  - type: llm_judge
+    criteria: "Post-incident improvements prevent recurrence — mandatory query review before deployment (especially before high-traffic events), deployment freeze during traffic events, binlog retention policy, disk space monitoring with 3-day projection alerts, and load testing with realistic traffic patterns before major events"
+    weight: 0.30
+    description: "Preventive improvements"

package/courses/mysql-query-optimization/scenarios/level-4/high-availability-architecture.yaml ADDED

@@ -0,0 +1,60 @@
+meta:
+  id: high-availability-architecture
+  level: 4
+  course: mysql-query-optimization
+  type: output
+  description: "Design MySQL high availability — architect InnoDB Cluster, Group Replication, and multi-region deployments for 99.99% uptime"
+  tags: [MySQL, high-availability, InnoDB-Cluster, Group-Replication, failover, expert]
+state: {}
+trigger: |
+  Your company's SLA requires 99.99% database availability (52 minutes
+  of downtime per year). Currently you have a single MySQL primary with
+  two asynchronous replicas. Last year's downtime: 22 hours.
+  Past incidents:
+  1. Primary disk failure (6 hours): Promoted replica manually, lost
+     45 seconds of data (async replication lag at time of failure)
+  2. Failed MySQL upgrade (8 hours): Upgraded primary, application
+     broke on MySQL 8.0 incompatibility, had to roll back
+  3. Network partition (4 hours): Primary lost connection to replicas,
+     split-brain risk, manually stopped writes until resolved
+  4. Connection storm (4 hours): 15K connections overwhelmed primary,
+     OOM kill, cascading failure to replicas
+  Architecture options:
+  A. InnoDB Cluster (Group Replication + MySQL Router + MySQL Shell)
+  B. Galera Cluster (synchronous multi-master)
+  C. Orchestrator + async replication (automated failover)
+  D. AWS Aurora MySQL (managed, distributed storage)
+  E. ProxySQL + async replication (read/write split, manual failover)
+  Requirements:
+  - 99.99% availability (<52 min/year downtime)
+  - RPO <5 seconds (max data loss)
+  - RTO <30 seconds (time to recover)
+  - Multi-AZ deployment
+  - Read scaling for analytics
+  - Zero-downtime upgrades
+  Current infrastructure: AWS, 3 AZs, single region.
+  Task: Design the HA architecture. Write: the architecture
+  recommendation with justification, the failover mechanism (automatic
+  detection and promotion), the replication topology, the zero-downtime
+  upgrade strategy, and the DR plan.
+assertions:
+  - type: llm_judge
+    criteria: "Architecture is well-justified — evaluates all 5 options against requirements (availability, RPO, RTO, zero-downtime upgrades, read scaling), recommends InnoDB Cluster or Aurora with clear reasoning, and explains why async replication alone can't meet RPO <5 seconds reliably. Addresses how each past incident would be handled"
+    weight: 0.35
+    description: "Well-justified architecture"
+  - type: llm_judge
+    criteria: "Failover achieves RTO <30 seconds — automatic failure detection (health checks, consensus), automatic promotion (InnoDB Cluster election or Aurora failover), application routing (MySQL Router or Aurora endpoints), and explains the quorum mechanism that prevents split-brain (requires majority of nodes for writes)"
+    weight: 0.35
+    description: "Fast automatic failover"
+  - type: llm_judge
+    criteria: "Zero-downtime upgrades and DR are addressed — rolling upgrade strategy (upgrade secondaries, failover, upgrade old primary), or blue-green with replication. Cross-AZ deployment for local HA, with cross-region async replica for DR. RPO and RTO quantified for both local failures and regional disasters"
+    weight: 0.30
+    description: "Upgrades and DR"
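
The availability numbers in this scenario can be reproduced with a few lines of arithmetic. The sketch assumes a 365-day year, and the helper names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def downtime_budget_minutes(availability_pct: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def achieved_availability_pct(downtime_minutes: float) -> float:
    """Availability actually delivered given total downtime in a year."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


incident_hours = [6, 8, 4, 4]  # the four past incidents in the scenario
total_minutes = sum(incident_hours) * 60

budget = downtime_budget_minutes(99.99)
achieved = achieved_availability_pct(total_minutes)

print(f"99.99% budget:       {budget:.2f} minutes/year")
print(f"Last year downtime:  {total_minutes} minutes")
print(f"Achieved:            {achieved:.3f}%")
print(f"Over budget by:      {total_minutes / budget:.0f}x")
```

The four incidents sum to the 22 hours quoted in the trigger, which works out to roughly 99.75% availability, about 25 times over the 99.99% budget. That gap is a useful framing for why manual promotion cannot meet the target.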

package/courses/mysql-query-optimization/scenarios/level-4/optimizer-internals.yaml ADDED

@@ -0,0 +1,62 @@
+meta:
+  id: optimizer-internals
+  level: 4
+  course: mysql-query-optimization
+  type: output
+  description: "Understand MySQL optimizer internals — learn cost-based optimization, join order selection, and how to diagnose plan instability"
+  tags: [MySQL, optimizer, cost-model, statistics, planner, expert]
+state: {}
+trigger: |
+  Your team has a critical query that sometimes runs in 5ms and
+  sometimes in 30 seconds. The plan switches unpredictably between an
+  index scan and a full table scan. The team wants to "force" the fast
+  plan with FORCE INDEX but you want to understand root cause.
+  The query:
+  SELECT * FROM orders
+  WHERE status = ? AND customer_id = ?
+  ORDER BY created_at DESC LIMIT 10;
+  Indexes: idx_status (status), idx_customer (customer_id),
+           idx_status_customer (status, customer_id)
+  Fast plan (status='pending', customer_id=42):
+  - Index scan on idx_status_customer
+  - Rows examined: 15
+  - Time: 5ms
+  Slow plan (status='completed', customer_id=42):
+  - Full table scan with filesort
+  - Rows examined: 5,000,000
+  - Time: 30 seconds
+  Why the difference: 'pending' has 500 rows (0.01%), 'completed'
+  has 5M rows (50%). The optimizer estimates that scanning the index
+  for 5M rows + sorting is more expensive than a full table scan.
+  Additional mysteries:
+  1. ANALYZE TABLE sometimes changes the plan — why?
+  2. Histogram statistics (MySQL 8.0) — when do they help?
+  3. Optimizer trace shows different cost estimates each time — why?
+  4. Prepared statements get different plans than ad-hoc SQL — why?
+  Task: Explain MySQL's optimizer decision-making. Write: the cost
+  model (how MySQL estimates query cost), how statistics influence
+  estimates, why the plan switches for different parameter values,
+  when to use histograms, and best practices for plan stability.
+assertions:
+  - type: llm_judge
+    criteria: "Cost model is explained: MySQL estimates cost based on disk I/O (pages to read, sequential vs random), CPU (comparisons, sorting), and memory. Full table scan cost is proportional to table size; index scan cost includes random I/O per row. When the optimizer estimates that more than roughly 20-30% of table rows match, it prefers a full table scan over an index scan because random I/O cost exceeds sequential scan cost"
+    weight: 0.35
+    description: "Cost model explained"
+  - type: llm_judge
+    criteria: "Statistics and plan instability are addressed — ANALYZE TABLE updates index cardinality estimates (InnoDB samples random pages, so estimates vary). Histograms (MySQL 8.0) provide value distribution for non-indexed columns, helping the optimizer estimate selectivity for skewed data. The plan switches because 'pending' (0.01%) vs 'completed' (50%) have vastly different selectivity"
+    weight: 0.35
+    description: "Statistics and instability"
+  - type: llm_judge
+    criteria: "Practical stability recommendations are provided — a composite index (status, customer_id, created_at) handles both cases efficiently, histogram on status column helps optimizer with skewed distributions, optimizer hints as short-term fix, and ANALYZE TABLE schedule for up-to-date statistics"
+    weight: 0.30
+    description: "Stability recommendations"

package/courses/mysql-query-optimization/scenarios/level-4/performance-sla-design.yaml ADDED

@@ -0,0 +1,52 @@
+meta:
+  id: performance-sla-design
+  level: 4
+  course: mysql-query-optimization
+  type: output
+  description: "Design MySQL performance SLAs — define latency budgets, throughput targets, and capacity planning for enterprise workloads"
+  tags: [MySQL, SLA, performance, latency, capacity-planning, expert]
+state: {}
+trigger: |
+  Your company is formalizing database performance SLAs. The platform
+  serves 30M daily active users across 8 product teams. There are no
+  documented standards — teams say "the database should be fast" but
+  there's no definition of "fast."
+  Current state:
+  - 10 MySQL clusters (mix of RDS and Aurora)
+  - P50: 3ms, P95: 30ms, P99: 300ms, P99.9: 3 seconds
+  - The P99.9 tail causes user-visible timeouts
+  - No query budget per team — one team's query impacts others
+  - No capacity planning — discover limits during traffic spikes
+  Product teams:
+  1. Checkout (critical): Needs <5ms P99 for payment queries
+  2. Product Search: Needs <50ms P95 for search results
+  3. Analytics: 15-minute aggregation queries, needs throughput
+  4. Notifications: Burst writes 30K/sec during campaigns
+  5. User Profile: Read-heavy, needs 99.99% availability
+  The VP asks:
+  - "What should our database SLAs be?"
+  - "How do we prevent one team from degrading another?"
+  - "When do we need to add capacity?"
+  Task: Design the performance SLA framework. Write: tiered SLA
+  definitions, latency budget allocation per team, capacity planning
+  model, isolation strategy, and cost-performance trade-off analysis.
+assertions:
+  - type: llm_judge
+    criteria: "SLA tiers are defined with specific metrics — at least 3 tiers (critical/standard/batch) with latency targets (P50, P95, P99), throughput limits, and availability per tier. Maps the 5 teams to appropriate tiers. Critical tier (checkout) has strictest requirements"
+    weight: 0.35
+    description: "Specific SLA tiers"
+  - type: llm_judge
+    criteria: "Isolation strategy prevents noisy neighbors — separate clusters or ProxySQL query routing per team, connection limits per team/service, query timeout enforcement (MAX_EXECUTION_TIME), and resource groups (MySQL 8.0) for CPU allocation. Explains how to detect approaching limits before incidents"
+    weight: 0.35
+    description: "Noisy neighbor isolation"
+  - type: llm_judge
+    criteria: "Capacity planning is quantified — utilization thresholds for scaling (70% CPU triggers planning, 85% triggers action), growth projections based on historical data, lead time for provisioning, and the cost of each SLA tier to help teams make informed decisions"
+    weight: 0.30
+    description: "Quantified capacity planning"
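
The 70%/85% utilization thresholds in the capacity-planning criterion can be turned into a simple check plus a lead-time estimate. This is a minimal sketch assuming compound monthly growth. The growth rate is a hypothetical input that would come from historical data, and the function names are illustrative.

```python
import math
from typing import Optional

PLAN_THRESHOLD = 70.0  # percent CPU: start capacity planning
ACT_THRESHOLD = 85.0   # percent CPU: add capacity now


def capacity_status(cpu_pct: float) -> str:
    """Classify current utilization against the two thresholds."""
    if cpu_pct >= ACT_THRESHOLD:
        return "act"
    if cpu_pct >= PLAN_THRESHOLD:
        return "plan"
    return "ok"


def months_until(threshold_pct: float, current_pct: float,
                 monthly_growth: float) -> Optional[int]:
    """Whole months until utilization reaches the threshold.

    monthly_growth is a fraction, e.g. 0.05 for 5% per month.
    Returns 0 if already at or above the threshold, None if there is no growth.
    """
    if current_pct >= threshold_pct:
        return 0
    if monthly_growth <= 0:
        return None
    ratio = math.log(threshold_pct / current_pct) / math.log(1 + monthly_growth)
    return math.ceil(ratio)


print(capacity_status(72.0))
print(months_until(PLAN_THRESHOLD, 60.0, 0.05))
print(months_until(ACT_THRESHOLD, 60.0, 0.05))
```

Comparing the months-until figure with the provisioning lead time tells a team whether planning has to start now, which is the question the VP asks in the trigger.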

package/courses/mysql-query-optimization/scenarios/level-4/read-replica-scaling.yaml ADDED

@@ -0,0 +1,51 @@
+meta:
+  id: read-replica-scaling
+  level: 4
+  course: mysql-query-optimization
+  type: output
+  description: "Scale MySQL reads — design read replica topology, implement intelligent routing, and manage replication lag for consistency"
+  tags: [MySQL, read-replicas, replication, routing, consistency, scaling, expert]
+state: {}
+trigger: |
+  Your platform serves 100K read QPS and the primary is at 90% CPU.
+  You have 3 read replicas but they're not helping enough — load is
+  uneven, some replicas are lagging, and users see stale data after
+  updates.
+  Current architecture:
+  - Primary: db.r6g.4xlarge, 100K QPS total (30K writes, 70K reads)
+  - Replica 1: db.r6g.4xlarge — API reads (overloaded)
+  - Replica 2: db.r6g.2xlarge — analytics (lagging 60s)
+  - Replica 3: db.r6g.2xlarge — mostly idle (recently added)
+  Problems:
+  1. Stale reads: User updates profile, refreshes, sees old data
+     (read went to replica lagging 5 seconds)
+  2. Uneven load: Round-robin routing sends equal traffic to all
+     replicas despite different sizes
+  3. Replica 2 lag: Analytics queries consume all CPU, SQL applier
+     can't keep up, lag grows to 60 seconds
+  4. No lag-aware routing: Application connects to replicas regardless
+     of their lag
+  5. Failover: When primary fails, which replica to promote?
+  Task: Design the read replica scaling architecture. Write: the
+  replica topology, the routing strategy (ProxySQL or MySQL Router),
+  the read-after-write consistency solution, the lag monitoring and
+  management, and the promotion criteria for failover.
+assertions:
+  - type: llm_judge
+    criteria: "Routing strategy is intelligent — uses ProxySQL or MySQL Router with weighted distribution (larger instances get more traffic), lag-aware routing (remove replicas exceeding lag threshold from pool), and workload separation (dedicated replicas for analytics vs OLTP reads)"
+    weight: 0.35
+    description: "Intelligent routing"
+  - type: llm_judge
+    criteria: "Read-after-write consistency is solved — implements session-based routing (reads after a write go to primary for N seconds), or GTID-based routing (route to replica only if it has applied the GTID from the user's last write), or application-level cache invalidation. Explains trade-offs of each approach"
+    weight: 0.35
+    description: "Read-after-write consistency"
+  - type: llm_judge
+    criteria: "Lag management and failover are addressed — monitors Seconds_Behind_Source per replica, auto-removes lagging replicas from read pool, separates analytics to prevent lag on OLTP replicas, and defines promotion criteria (lowest lag, same instance size as primary, sync replica if available)"
+    weight: 0.30
+    description: "Lag and failover management"
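
The weighted, lag-aware routing this scenario asks for can be illustrated in a few lines. The sketch below is application-side logic for illustration only. The 2:1 weights for 4xlarge versus 2xlarge instances and the 5-second lag threshold are assumptions, and the class and function names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    weight: int         # relative capacity (assumed 2 for 4xlarge, 1 for 2xlarge)
    lag_seconds: float  # e.g. sampled from Seconds_Behind_Source


def eligible(replicas: List[Replica], max_lag: float) -> List[Replica]:
    """Drop replicas whose lag exceeds the threshold."""
    return [r for r in replicas if r.lag_seconds <= max_lag]


def pick_replica(replicas: List[Replica], max_lag: float = 5.0,
                 rng: Optional[random.Random] = None) -> Optional[Replica]:
    """Weighted random choice among healthy replicas.

    Returns None when no replica is within the lag threshold, which the
    caller should treat as "read from the primary".
    """
    pool = eligible(replicas, max_lag)
    if not pool:
        return None
    chooser = rng if rng is not None else random
    return chooser.choices(pool, weights=[r.weight for r in pool], k=1)[0]


fleet = [
    Replica("replica-1", weight=2, lag_seconds=0.4),
    Replica("replica-2", weight=1, lag_seconds=60.0),  # analytics replica, lagging
    Replica("replica-3", weight=1, lag_seconds=0.2),
]
print([r.name for r in eligible(fleet, 5.0)])
print(pick_replica(fleet).name)
```

In practice this logic usually lives in the proxy rather than the application: ProxySQL, for example, exposes `weight` and `max_replication_lag` per backend in its `mysql_servers` table. The sketch only shows the decision being made.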

package/courses/mysql-query-optimization/scenarios/level-5/ai-database-future.yaml ADDED

@@ -0,0 +1,45 @@
+meta:
+  id: ai-database-future
+  level: 5
+  course: mysql-query-optimization
+  type: output
+  description: "Explore AI-powered MySQL optimization — evaluate autonomous tuning, AI query optimization, and the evolving DBA role"
+  tags: [MySQL, AI, autonomous-tuning, machine-learning, future, master]
+state: {}
+trigger: |
+  Your CTO asks you to evaluate how AI will transform MySQL operations
+  and recommend which AI-powered technologies to invest in.
+  Current AI-database landscape:
+  1. Autonomous tuning (OtterTune/Releem): ML-based my.cnf optimization
+  2. AI query optimization: learned optimizers replacing cost-based
+  3. AI-powered indexing: automatic index recommendations
+  4. AI monitoring: anomaly detection, predictive alerting
+  5. Natural language to SQL: text-to-SQL for non-technical users
+  6. MySQL HeatWave: Oracle's ML-in-database for analytics
+  Evaluation criteria:
+  - Production readiness
+  - ROI vs complexity
+  - Risk when AI is wrong
+  - Team skill requirements
+  Task: Write the technology evaluation. Include: assessment of each
+  technology, adoption roadmap, text-to-SQL safety framework, and
+  prediction of how the DBA role evolves.
+assertions:
+  - type: llm_judge
+    criteria: "Technologies are honestly assessed — autonomous tuning (semi-ready, good ROI), learned optimizers (research stage), index advisors (ready), AI monitoring (production-ready, high ROI), text-to-SQL (risky without guardrails), HeatWave (production-ready for analytics). Avoids both hype and dismissiveness"
+    weight: 0.35
+    description: "Honest assessment"
+  - type: llm_judge
+    criteria: "Adoption roadmap is practical — separates 'adopt now' (AI monitoring, index advisors), 'evaluate' (autonomous tuning, text-to-SQL), and 'watch' (learned optimizers). Each has implementation complexity and expected timeline"
+    weight: 0.35
+    description: "Practical roadmap"
+  - type: llm_judge
+    criteria: "DBA role evolution and safety are addressed — AI augments not replaces DBAs, text-to-SQL needs guardrails (read-only, query cost limits, audit logging), and the shift from manual tuning to AI supervision"
+    weight: 0.30
+    description: "DBA evolution and safety"