npm - dojo.md - Versions diffs - 0.1.0 → 0.2.0 - Mend

dojo.md 0.1.0 → 0.2.0


Files changed (243)

package/courses/postgresql-query-optimization/scenarios/level-3/specialized-index-types.yaml ADDED

@@ -0,0 +1,67 @@
+meta:
+  id: specialized-index-types
+  level: 3
+  course: postgresql-query-optimization
+  type: output
+  description: "Use specialized index types — choose between GIN, GiST, BRIN, and SP-GiST for non-standard data types and query patterns"
+  tags: [PostgreSQL, GIN, GiST, BRIN, SP-GiST, specialized-indexes, advanced]
+state: {}
+trigger: |
+  Your platform has diverse data types that B-tree indexes can't
+  handle efficiently. Each needs a specialized index type.
+  Scenario 1 — Full-text search (articles table, 20M rows):
+  SELECT * FROM articles
+  WHERE to_tsvector('english', title || ' ' || body) @@
+        to_tsquery('english', 'postgresql & optimization');
+  Currently: Seq Scan, 15 seconds per query.
+  Needs: GIN index on tsvector.
+  Scenario 2 — JSONB queries (products table, 5M rows):
+  SELECT * FROM products WHERE metadata @> '{"color": "red"}';
+  SELECT * FROM products WHERE metadata ? 'warranty';
+  Currently: Seq Scan, 8 seconds per query.
+  Needs: GIN index on JSONB column.
+  Scenario 3 — Geospatial (locations table, 2M rows):
+  SELECT * FROM locations
+  WHERE ST_DWithin(coordinates, ST_MakePoint(-73.98, 40.75), 1000);
+  Currently: Seq Scan with distance calculation on every row.
+  Needs: GiST index on geometry column.
+  Scenario 4 — Time-series data (metrics table, 10B rows):
+  SELECT * FROM metrics
+  WHERE recorded_at BETWEEN '2026-02-26' AND '2026-02-27';
+  Data is inserted in time order (naturally ordered on disk).
+  B-tree index would be 200GB. Needs: BRIN index (tiny, ~10MB).
+  Scenario 5 — Array contains (posts table, 3M rows):
+  SELECT * FROM posts WHERE tags @> ARRAY['postgresql', 'tutorial'];
+  Currently: Seq Scan.
+  Needs: GIN index on array column.
+  Scenario 6 — IP address range (access_logs table, 100M rows):
+  SELECT * FROM access_logs
+  WHERE ip_address <<= '192.168.0.0/16';
+  Needs: SP-GiST or GiST index on inet type.
+  Task: For each scenario, write: the CREATE INDEX statement, the
+  expected query performance improvement, the index size compared to
+  B-tree, and the trade-offs (build time, write overhead, maintenance).
+  Then create a decision matrix for choosing the right index type.
+assertions:
+  - type: llm_judge
+    criteria: "Index type selection is correct — GIN for full-text search, JSONB, and arrays (inverted index for containment), GiST for geospatial (nearest-neighbor, distance), BRIN for naturally-ordered time-series (block range summary), SP-GiST for IP ranges and hierarchical data"
+    weight: 0.35
+    description: "Correct index type selection"
+  - type: llm_judge
+    criteria: "Performance improvements are realistic — GIN on JSONB: 100-1000x improvement, GiST on geometry: spatial index eliminates full scan, BRIN on time-series: tiny index (10MB vs 200GB B-tree) with good pruning on ordered data. Each improvement is justified by the data characteristics"
+    weight: 0.35
+    description: "Realistic performance estimates"
+  - type: llm_judge
+    criteria: "Decision matrix is comprehensive — covers when to use each type (data type, query pattern, data ordering, write frequency), includes trade-offs (GIN: fast queries but slow updates, BRIN: tiny but requires physical ordering, GiST: versatile but larger than BRIN), and handles hybrid approaches"
+    weight: 0.30
+    description: "Comprehensive decision matrix"
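
The index choices this scenario expects can be summarized as a small lookup table. The sketch below is illustrative only: the index names (`idx_*`) are hypothetical, the geospatial entry assumes the PostGIS extension is installed, and none of the statements have been run against the scenario's tables.

```python
# Illustrative mapping from each scenario in the trigger to an index type and
# a candidate CREATE INDEX statement. All index names are hypothetical.
INDEX_CHOICES = {
    "full_text": (
        "GIN",
        "CREATE INDEX idx_articles_fts ON articles "
        "USING gin (to_tsvector('english', title || ' ' || body));",
    ),
    "jsonb": (
        "GIN",
        "CREATE INDEX idx_products_metadata ON products USING gin (metadata);",
    ),
    "geospatial": (
        "GiST",
        "CREATE INDEX idx_locations_coordinates ON locations USING gist (coordinates);",
    ),
    "time_series": (
        "BRIN",
        "CREATE INDEX idx_metrics_recorded_at ON metrics USING brin (recorded_at);",
    ),
    "array": (
        "GIN",
        "CREATE INDEX idx_posts_tags ON posts USING gin (tags);",
    ),
    "ip_range": (
        "SP-GiST",
        "CREATE INDEX idx_access_logs_ip ON access_logs USING spgist (ip_address);",
    ),
}


def index_type_for(scenario: str) -> str:
    """Return the index type chosen for a scenario key."""
    return INDEX_CHOICES[scenario][0]


for name, (index_type, ddl) in INDEX_CHOICES.items():
    print(f"{name:12s} {index_type:8s} {ddl}")
```

The full-text entry is an expression index, so it only helps when the query repeats the same `to_tsvector('english', title || ' ' || body)` expression. A GiST index on the inet column is the alternative named in Scenario 6; there the operator class has to be spelled out, as in `USING gist (ip_address inet_ops)`.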

package/courses/postgresql-query-optimization/scenarios/level-3/write-optimization.yaml ADDED

@@ -0,0 +1,65 @@
+meta:
+  id: write-optimization
+  level: 3
+  course: postgresql-query-optimization
+  type: output
+  description: "Optimize write performance — accelerate bulk inserts, COPY operations, and upserts for data pipeline workloads"
+  tags: [PostgreSQL, write-optimization, COPY, bulk-insert, upsert, advanced]
+state: {}
+trigger: |
+  Your data pipeline ingests 100M rows per day into PostgreSQL. The
+  current pipeline takes 16 hours to load each day's data, meaning
+  you're falling behind. By next month, you'll have a 2-day backlog.
+  Current pipeline:
+  1. Read CSV files from S3 (100 files × 1M rows each)
+  2. For each file: for each row: INSERT INTO events (...) VALUES (...)
+  3. After all inserts: UPDATE statistics tables
+  4. After all updates: run VACUUM ANALYZE
+  Table: events (currently 5B rows)
+  - 15 columns, average row size 500 bytes
+  - 8 indexes (some composite)
+  - 3 triggers (audit logging, updated_at, notification)
+  - Foreign keys to 4 other tables
+  Performance profile:
+  - Single INSERT: 0.5ms per row
+  - 100M rows × 0.5ms = 50,000 seconds ≈ 13.9 hours
+  - Overhead: FK checks (1 hour), trigger execution (1 hour),
+    WAL generation (0.2 hours)
+  - Total: ~16 hours
+  Target: Load 100M rows in < 2 hours
+  Optimization options to evaluate:
+  1. COPY instead of INSERT (binary or CSV format)
+  2. Batch INSERTs (100-1000 rows per statement)
+  3. Drop indexes → load → recreate indexes
+  4. Disable triggers during load
+  5. Disable FK checks during load
+  6. UNLOGGED table for staging
+  7. Increase wal_buffers and checkpoint_timeout
+  8. Parallel loading (multiple COPY workers)
+  Task: Design the optimized pipeline. Evaluate each option with
+  expected speedup. Write: the pipeline architecture, the COPY
+  configuration, the index management strategy, the WAL tuning,
+  and the safety measures (how to validate data integrity after
+  disabling constraints).
+assertions:
+  - type: llm_judge
+    criteria: "Pipeline uses COPY for bulk loading — explains that COPY is 10-100x faster than individual INSERTs, shows the exact COPY command with optimal settings, and uses parallel COPY workers (one per file) with appropriate staging table approach"
+    weight: 0.35
+    description: "COPY-based pipeline"
+  - type: llm_judge
+    criteria: "Index and constraint management is safe — drops non-essential indexes before load, recreates with CREATE INDEX CONCURRENTLY after, disables triggers safely with ALTER TABLE DISABLE TRIGGER, and validates FK constraints after re-enabling with ALTER TABLE VALIDATE CONSTRAINT"
+    weight: 0.35
+    description: "Safe index/constraint management"
+  - type: llm_judge
+    criteria: "WAL optimization is included — increases checkpoint_timeout and max_wal_size for large loads, uses wal_level=minimal for initial load (if possible), and explains the tradeoff between WAL reduction and crash recovery risk. The total estimated time is under 2 hours"
+    weight: 0.30
+    description: "WAL optimization included"
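
The performance profile in this scenario is plain arithmetic, and the 2-hour target implies a minimum sustained throughput. A minimal sketch using only the figures from the trigger (the variable names are illustrative):

```python
# Figures taken from the scenario's performance profile.
ROWS = 100_000_000
INSERT_MS_PER_ROW = 0.5
OVERHEAD_HOURS = 1 + 1 + 0.2  # FK checks + trigger execution + WAL generation
TARGET_HOURS = 2

# Time spent on row-by-row INSERTs alone.
insert_hours = ROWS * INSERT_MS_PER_ROW / 1000 / 3600

# Current end-to-end load time.
total_hours = insert_hours + OVERHEAD_HOURS

# What the optimized pipeline must sustain to finish inside the target window.
required_rows_per_sec = ROWS / (TARGET_HOURS * 3600)
required_speedup = total_hours / TARGET_HOURS

print(f"INSERT time:      {insert_hours:.1f} h")
print(f"Current total:    {total_hours:.1f} h")
print(f"Required rate:    {required_rows_per_sec:,.0f} rows/s")
print(f"Required speedup: {required_speedup:.1f}x")
```

Any combination of the eight options has to deliver roughly an 8x end-to-end speedup, which gives a baseline for judging each option's expected contribution.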

package/courses/postgresql-query-optimization/scenarios/level-4/data-architecture-analytics.yaml ADDED

@@ -0,0 +1,64 @@
+meta:
+  id: data-architecture-analytics
+  level: 4
+  course: postgresql-query-optimization
+  type: output
+  description: "Design data architecture for analytics — separate OLTP from OLAP, implement data warehousing patterns, and optimize PostgreSQL for analytical workloads"
+  tags: [PostgreSQL, data-architecture, OLTP, OLAP, analytics, warehousing, expert]
+state: {}
+trigger: |
+  Your company runs everything on PostgreSQL — OLTP transactions AND
+  analytics queries on the same database. The analytics team's queries
+  are killing production performance. The CTO wants a data architecture
+  that separates concerns without introducing too much complexity.
+  Current state:
+  - Single PostgreSQL cluster handling both OLTP and analytics
+  - OLTP: 50K transactions/second, needs <10ms latency
+  - Analytics: 200 queries/day, each scanning 100M+ rows, taking 5-30 min
+  - Analytics queries cause: table locks, buffer cache pollution, CPU spikes
+  - Data team wants: real-time dashboards, ad-hoc queries, ML feature extraction
+  - Current data volume: 2TB, growing 50GB/month
+  Recent incidents caused by analytics on production:
+  1. Analytics query consumed all work_mem (256MB × 50 parallel workers),
+     causing OOM and PostgreSQL restart
+  2. Sequential scan on 500M-row table evicted hot OLTP data from
+     shared_buffers, causing 10x latency spike for 30 minutes
+  3. Long-running analytics transaction prevented VACUUM, causing table
+     bloat that filled the disk
+  Architecture options being considered:
+  A. Read replica dedicated to analytics (simplest)
+  B. PostgreSQL + columnar extension (Citus columnar, pg_analytics)
+  C. PostgreSQL OLTP + dedicated analytics database (ClickHouse, DuckDB)
+  D. PostgreSQL OLTP + data warehouse (Snowflake, BigQuery, Redshift)
+  E. PostgreSQL OLTP + data lake (S3 + Iceberg + query engine)
+  Constraints:
+  - Engineering team knows PostgreSQL, limited experience with other DBs
+  - Budget: $50K/month for analytics infrastructure
+  - Data freshness: dashboards need data within 5 minutes
+  - Compliance: PII must be masked in analytics layer
+  - 3 data engineers available for implementation
+  Task: Design the data architecture. Write: the recommended approach
+  with justification, the data pipeline design (how data flows from OLTP
+  to analytics), the query optimization for analytical workloads, the
+  PII handling strategy, and the migration plan from current state.
+assertions:
+  - type: llm_judge
+    criteria: "Architecture separates OLTP and OLAP effectively — recommends an approach that eliminates analytics impact on production, justifies the choice against all 5 options considering the team's PostgreSQL expertise and constraints, and explains the data flow from OLTP to analytics layer (CDC, logical replication, or ETL)"
+    weight: 0.35
+    description: "Sound OLTP/OLAP separation"
+  - type: llm_judge
+    criteria: "Data pipeline achieves freshness requirements — designs a pipeline that delivers data within 5 minutes to the analytics layer, handles schema evolution, includes PII masking (column-level encryption, tokenization, or view-based masking), and addresses the 3 past incidents (memory, cache pollution, vacuum blocking)"
+    weight: 0.35
+    description: "Production-ready data pipeline"
+  - type: llm_judge
+    criteria: "Migration plan is realistic — phases the migration (start with read replica, then add dedicated analytics), estimates timeline considering 3 data engineers, identifies risks (data consistency during migration, team learning curve), and provides a cost projection within the $50K/month budget"
+    weight: 0.30
+    description: "Realistic migration plan"
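
Two numbers in this scenario are worth working out before choosing an architecture: the memory the first incident could claim, and how quickly the data volume grows. A rough sketch, assuming one `work_mem`-sized allocation per worker and decimal units (1 TB = 1000 GB):

```python
# Incident 1: work_mem multiplied across parallel workers.
WORK_MEM_MB = 256
PARALLEL_WORKERS = 50

# Assumption: each worker uses exactly one work_mem-sized allocation.
# work_mem applies per sort/hash operation, so real usage can be higher.
analytics_memory_gib = WORK_MEM_MB * PARALLEL_WORKERS / 1024

# Data volume from the trigger: 2 TB today, growing 50 GB per month.
CURRENT_VOLUME_GB = 2000
GROWTH_GB_PER_MONTH = 50


def projected_volume_gb(months: int) -> int:
    """Linear projection of data volume after the given number of months."""
    return CURRENT_VOLUME_GB + GROWTH_GB_PER_MONTH * months


print(f"Memory claimed by one analytics query: {analytics_memory_gib} GiB")
for months in (12, 24, 36):
    print(f"Volume after {months} months: {projected_volume_gb(months)} GB")
```

The memory figure is a lower bound for a query with several sort or hash nodes, which is one reason to cap `work_mem` and parallelism for analytics roles whichever option is chosen.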

package/courses/postgresql-query-optimization/scenarios/level-4/database-executive-communication.yaml ADDED

@@ -0,0 +1,64 @@
+meta:
+  id: database-executive-communication
+  level: 4
+  course: postgresql-query-optimization
+  type: output
+  description: "Communicate database performance to executives — translate query optimization into business impact, ROI, and strategic decisions"
+  tags: [PostgreSQL, executive-communication, ROI, business-impact, strategy, expert]
+state: {}
+trigger: |
+  You're the Director of Database Engineering presenting to the C-suite.
+  Your team has spent 6 months optimizing the PostgreSQL infrastructure,
+  and the CFO wants to know the ROI. The CTO wants to understand the
+  technical debt. The CEO wants to know if the database can support 10x
+  growth.
+  What your team accomplished (6-month optimization program):
+  1. Query optimization: Identified and fixed 200 slow queries
+     - Average response time: 250ms → 45ms (82% improvement)
+     - P99 latency: 5 seconds → 200ms (96% improvement)
+  2. Index optimization: Added 50 indexes, removed 30 unused ones
+     - Storage for indexes: reduced from 800GB to 500GB
+     - Write performance: improved 15% (fewer indexes to maintain)
+  3. Connection pooling: Deployed PgBouncer across all services
+     - Max connections reduced from 5,000 to 500 (to PostgreSQL)
+     - Connection-related timeouts: eliminated
+  4. Partitioning: Partitioned 5 largest tables
+     - Query performance on partitioned tables: 10x faster
+     - VACUUM time: 8 hours → 45 minutes per table
+  5. Infrastructure: Migrated 3 clusters to larger instances
+     - Cost increase: $15K/month ($180K/year)
+  Financial context:
+  - Database infrastructure cost: $1.2M/year
+  - Engineering team (6 DBAs + 4 SREs): $2M/year
+  - Revenue impact of downtime: $50K/minute
+  - Last year's downtime: 18 hours ($54M potential impact)
+  - This year's downtime (so far): 52 minutes ($2.6M potential impact)
+  The executives ask:
+  - CFO: "What's the ROI of this optimization program?"
+  - CTO: "What's our remaining technical debt in the database layer?"
+  - CEO: "Can we support 10x user growth without 10x database cost?"
+  Task: Write the executive presentation. Include: the ROI calculation
+  (quantified business impact), the technical debt assessment (in
+  business terms), the scaling roadmap (cost projections for 10x),
+  the risk register (what could still go wrong), and the recommended
+  next investments.
+assertions:
+  - type: llm_judge
+    criteria: "ROI is quantified in business terms — calculates the value of reduced downtime ($54M→$2.6M potential impact = $51.4M risk reduction), the value of faster response times (conversion rate impact, user experience), and compares against total investment ($180K infrastructure + team costs). Presents clear ROI ratio"
+    weight: 0.35
+    description: "Quantified business ROI"
+  - type: llm_judge
+    criteria: "Technical debt is translated for executives — expresses database technical debt in terms of risk, cost, and timeline rather than technical jargon. Identifies remaining debt items (e.g., tables not yet partitioned, queries not yet optimized, missing monitoring) and estimates the cost/effort to address each"
+    weight: 0.35
+    description: "Executive-friendly technical debt"
+  - type: llm_judge
+    criteria: "Scaling roadmap addresses 10x growth — projects database costs at 2x, 5x, 10x user growth (not linear — explains economies of scale and breaking points), identifies when architectural changes are needed (e.g., sharding at 50TB), and provides a timeline with investment milestones"
+    weight: 0.30
+    description: "10x scaling roadmap"
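
The ROI figures in the first assertion follow directly from the financial context. The sketch below reproduces them. The share of team cost attributed to the program is an assumption (half of the annual $2M, for a 6-month program), since the scenario does not state it.

```python
# Figures from the scenario's financial context.
DOWNTIME_COST_PER_MINUTE = 50_000
LAST_YEAR_DOWNTIME_MIN = 18 * 60
THIS_YEAR_DOWNTIME_MIN = 52
INFRA_COST_INCREASE_PER_YEAR = 180_000
TEAM_COST_PER_YEAR = 2_000_000

# Assumption (not in the source): the 6-month program consumed half of the
# team's annual cost.
PROGRAM_TEAM_COST = TEAM_COST_PER_YEAR // 2

last_year_impact = LAST_YEAR_DOWNTIME_MIN * DOWNTIME_COST_PER_MINUTE
this_year_impact = THIS_YEAR_DOWNTIME_MIN * DOWNTIME_COST_PER_MINUTE
risk_reduction = last_year_impact - this_year_impact

investment = INFRA_COST_INCREASE_PER_YEAR + PROGRAM_TEAM_COST
roi_ratio = risk_reduction / investment

print(f"Last year's potential impact: ${last_year_impact:,}")
print(f"This year's potential impact: ${this_year_impact:,}")
print(f"Risk reduction:               ${risk_reduction:,}")
print(f"Investment (assumed):         ${investment:,}")
print(f"ROI ratio:                    {roi_ratio:.1f}x")
```

This year's downtime is a figure "so far", so a presentation would probably annualize it or state the comparison period explicitly before quoting the ratio.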

package/courses/postgresql-query-optimization/scenarios/level-4/database-migration-planning.yaml ADDED

@@ -0,0 +1,57 @@
+meta:
+  id: database-migration-planning
+  level: 4
+  course: postgresql-query-optimization
+  type: output
+  description: "Plan a terabyte-scale database migration — migrate PostgreSQL with near-zero downtime using logical replication"
+  tags: [PostgreSQL, migration, logical-replication, zero-downtime, expert]
+state: {}
+trigger: |
+  Your company needs to migrate a 5TB PostgreSQL 14 database to
+  PostgreSQL 16 on new hardware. The database processes $50M in
+  transactions daily and has an SLA of 99.99% availability. Maximum
+  acceptable downtime: 5 minutes.
+  Source: PostgreSQL 14 on bare metal, 5TB, 200 tables
+  Target: PostgreSQL 16 on AWS RDS, same schema
+  Challenges:
+  1. 5TB of data takes ~10 hours to pg_dump/restore
+  2. Logical replication needs to catch up on changes during transfer
+  3. 15 custom extensions on the source (3 not available on RDS)
+  4. Sequences must be synchronized at cutover
+  5. Application uses prepared statements tied to backend connections
+  6. The orders table has 2B rows and takes 3 hours to COPY alone
+  7. Read replicas serving analytics must also be migrated
+  8. Rollback plan: if the new database fails, how to go back?
+  Migration options evaluated:
+  A. pg_dump + pg_restore (simple, requires hours of downtime)
+  B. Logical replication (near-zero downtime, complex setup)
+  C. AWS DMS (managed, but performance limitations)
+  D. pglogical extension (more features than native logical rep)
+  The VP of Engineering asked: "Can we do this with zero downtime
+  and zero data loss? What's the actual risk?"
+  Task: Design the migration plan. Write: the chosen approach with
+  justification, the step-by-step migration procedure, the cutover
+  choreography (the 5-minute window), the validation checklist
+  (how to verify the migration succeeded), the rollback plan, and
+  the risk assessment.
+assertions:
+  - type: llm_judge
+    criteria: "Migration approach handles the constraints — addresses the 5TB volume (parallel COPY or initial pg_dump + logical replication for incremental), the 3 incompatible extensions (workarounds or replacements), and sequence synchronization at cutover. The approach achieves <5 minutes downtime"
+    weight: 0.35
+    description: "Constraint-handling approach"
+  - type: llm_judge
+    criteria: "Cutover choreography is precise — defines the exact steps during the 5-minute window (stop writes, ensure replication caught up, sync sequences, switch application connection strings, verify, resume writes), with timing estimates for each step"
+    weight: 0.35
+    description: "Precise cutover choreography"
+  - type: llm_judge
+    criteria: "Validation and rollback are thorough — row count comparison per table, checksum verification for critical tables, application smoke tests, and the rollback plan maintains the original database as hot standby during the migration so it can take over immediately if needed"
+    weight: 0.30
+    description: "Thorough validation and rollback"
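
The per-table row count comparison named in the validation assertion can be sketched as a small function. This is a minimal illustration, not a complete validation suite: the table names and counts are made up, and in practice the two dictionaries would be filled from `SELECT count(*)` queries against source and target.

```python
from typing import Dict, Optional, Tuple

Counts = Dict[str, int]
Mismatch = Tuple[Optional[int], Optional[int]]


def find_count_mismatches(source: Counts, target: Counts) -> Dict[str, Mismatch]:
    """Return tables whose row counts differ, or that exist on one side only."""
    mismatches: Dict[str, Mismatch] = {}
    for table in sorted(set(source) | set(target)):
        src = source.get(table)  # None when the table is missing on the source
        dst = target.get(table)  # None when the table is missing on the target
        if src != dst:
            mismatches[table] = (src, dst)
    return mismatches


# Hypothetical counts, for illustration only.
source_counts = {"orders": 2_000_000_000, "customers": 1_500_000, "audit_log": 900}
target_counts = {"orders": 2_000_000_000, "customers": 1_499_998}

for table, (src, dst) in find_count_mismatches(source_counts, target_counts).items():
    print(f"{table}: source={src} target={dst}")
```

Row counts only catch missing or extra rows. The checksum verification mentioned alongside them is still needed to detect rows whose contents differ.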

package/courses/postgresql-query-optimization/scenarios/level-4/enterprise-database-governance.yaml ADDED

@@ -0,0 +1,52 @@
+meta:
+  id: enterprise-database-governance
+  level: 4
+  course: postgresql-query-optimization
+  type: output
+  description: "Design enterprise database governance — standardize PostgreSQL operations across a large organization with multiple teams and databases"
+  tags: [PostgreSQL, governance, enterprise, standardization, expert]
+state: {}
+trigger: |
+  You're the VP of Data Infrastructure at a company with 1,500
+  engineers and 200 PostgreSQL databases. An audit revealed systemic
+  problems:
+  - 200 databases, 40 teams, no consistent standards
+  - 30% of databases have never been vacuumed manually
+  - 50 databases have no monitoring
+  - 15 databases have default settings (shared_buffers=128MB on
+    256GB RAM servers)
+  - Query performance varies 100x between teams for similar workloads
+  - 8 production incidents last year caused by unoptimized queries
+  - No query review process before deployment
+  - Connection pooling used by only 5 teams
+  - Backup tested by 0 teams (backups exist but never restored)
+  The CTO wants a governance framework that:
+  - Standardizes PostgreSQL configuration across all 200 databases
+  - Prevents slow queries from reaching production
+  - Ensures backup and recovery readiness
+  - Provides visibility into database health organization-wide
+  - Doesn't create a bottleneck (teams must remain autonomous)
+  Task: Design the governance framework. Write: the configuration
+  standards (tiered by database size/criticality), the query review
+  process (automated CI/CD checks), the monitoring and alerting
+  standards, the backup verification program, and the organizational
+  structure (DBA team vs platform team vs embedded).
+assertions:
+  - type: llm_judge
+    criteria: "Configuration standards are tiered and practical — defines standard settings for small/medium/large databases, provides a base configuration template that teams can customize, and addresses the 15 databases with default settings as an immediate remediation"
+    weight: 0.35
+    description: "Tiered configuration standards"
+  - type: llm_judge
+    criteria: "Query review is automated — CI/CD pipeline includes EXPLAIN analysis for new queries, automated detection of missing indexes, N+1 patterns, and sequential scans on large tables. Blocks deployment if query exceeds performance thresholds. Teams maintain autonomy through self-service tooling"
+    weight: 0.35
+    description: "Automated query review"
+  - type: llm_judge
+    criteria: "Monitoring and backup programs are comprehensive — unified monitoring dashboard across 200 databases, alerting on key metrics (replication lag, table bloat, connection count, query latency), and backup recovery is tested quarterly with documented RTO/RPO per database"
+    weight: 0.30
+    description: "Comprehensive monitoring and backup"
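
The "default settings" finding shows why a base configuration template matters. The sketch below assumes the commonly cited starting point of about 25% of RAM for `shared_buffers` on a dedicated server. The function names and the fraction parameter are illustrative, and a real standard would tune the value per tier.

```python
DEFAULT_SHARED_BUFFERS_MB = 128  # PostgreSQL's shipped default


def suggested_shared_buffers_mb(ram_gb: int, fraction: float = 0.25) -> int:
    """Starting value for shared_buffers, assuming a dedicated database host."""
    return int(ram_gb * 1024 * fraction)


def uses_default_shared_buffers(shared_buffers_mb: int) -> bool:
    """Flag a database that still runs with the shipped default."""
    return shared_buffers_mb == DEFAULT_SHARED_BUFFERS_MB


ram_gb = 256  # server size from the audit finding
suggested = suggested_shared_buffers_mb(ram_gb)
print(f"Suggested shared_buffers: {suggested} MB ({suggested // 1024} GB)")
print(f"Default is {suggested // DEFAULT_SHARED_BUFFERS_MB}x smaller")
print(f"Needs remediation: {uses_default_shared_buffers(128)}")
```

A check like `uses_default_shared_buffers` could run as part of an automated fleet audit, which keeps the standard self-service and avoids a manual review bottleneck.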

package/courses/postgresql-query-optimization/scenarios/level-4/expert-optimization-shift.yaml ADDED

@@ -0,0 +1,73 @@
+meta:
+  id: expert-optimization-shift
+  level: 4
+  course: postgresql-query-optimization
+  type: output
+  description: "Expert optimization shift — handle a complex production crisis combining replication failure, query regression, and capacity emergency during a traffic spike"
+  tags: [PostgreSQL, shift-simulation, crisis, replication, capacity, expert]
+state: {}
+trigger: |
+  You're the on-call database lead during Black Friday. Multiple
+  cascading failures hit your PostgreSQL infrastructure simultaneously.
+  The CEO is in the war room. Revenue is dropping $100K/minute.
+  Timeline of events:
+  10:00 AM — Traffic starts at 3x normal. All systems green.
+  10:15 AM — Alert: Primary CPU at 95%. Query latency P99 jumps from
+  200ms to 2 seconds. PgBouncer queue depth growing.
+  10:22 AM — Alert: Replication lag on Replica 2 exceeds 60 seconds.
+  Replica removed from read pool automatically. Remaining 2 replicas
+  now handle all read traffic — they start lagging too.
+  10:25 AM — Alert: Connection pool exhausted. Application returning
+  503 errors. 40% of checkout requests failing. Revenue impact begins.
+  10:30 AM — Investigation reveals: A deployment at 10:05 AM introduced
+  a new query in the checkout flow:
+  SELECT * FROM inventory i
+  JOIN products p ON p.id = i.product_id
+  JOIN warehouses w ON w.id = i.warehouse_id
+  WHERE p.category_id IN (SELECT id FROM categories WHERE parent_id = $1)
+  AND i.quantity > 0
+  ORDER BY w.distance_from(point($2, $3))
+  LIMIT 20;
+  This query does a sequential scan on inventory (200M rows) because
+  the planner doesn't have statistics for the distance_from function.
+  10:35 AM — Alert: Disk space on primary dropping fast. Autovacuum
+  is blocked by a 2-hour-old analytics transaction on Replica 1 that's
+  preventing cleanup of dead tuples. WAL files accumulating.
+  10:40 AM — Disk at 90%. If it hits 100%, PostgreSQL will crash.
+  Simultaneous problems:
+  1. New query causing sequential scan on 200M rows (10:30 AM finding)
+  2. Replication cascade failure (replicas can't keep up)
+  3. Connection pool exhausted (503 errors)
+  4. Disk filling due to WAL accumulation + vacuum blocked
+  5. Revenue loss of $100K/minute
+  Task: Write the incident response plan. Include: the immediate
+  triage (first 5 minutes — what do you do RIGHT NOW), the
+  stabilization plan (next 30 minutes), the root cause analysis,
+  the post-incident improvements, and the communication updates to
+  the war room at each stage.
+assertions:
+  - type: llm_judge
+    criteria: "Immediate triage prioritizes correctly — addresses the disk space emergency first (terminate the blocking analytics transaction, manually trigger VACUUM, or move WAL to another volume), then mitigates the bad query (kill active instances, roll back the deployment or add a query hint/index), and increases connection pool limits or redirects traffic. Timing for each step is estimated"
+    weight: 0.35
+    description: "Correct triage prioritization"
+  - type: llm_judge
+    criteria: "Stabilization addresses all 5 problems — fixes the query (add appropriate index or rewrite), recovers replication (restart WAL replay, potentially rebuild lagging replicas), restores connection capacity (increase pool, reduce per-query connection time), reclaims disk space (vacuum, archive WAL), and provides war-room updates with ETA for recovery"
+    weight: 0.35
+    description: "Comprehensive stabilization"
+  - type: llm_judge
+    criteria: "Post-incident improvements prevent recurrence — recommends query review in CI/CD (EXPLAIN analysis before deployment), query timeout configuration, disk space alerting with more headroom, replication lag circuit breakers, and a deployment freeze policy during high-traffic events. Root cause traces back to missing query review process"
+    weight: 0.30
+    description: "Preventive post-incident improvements"
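
War-room updates usually include the running cost of the incident. A small sketch, assuming losses start at 10:25 AM (when the timeline says revenue impact begins) and continue at a constant $100K per minute; both are simplifications of the scenario.

```python
from datetime import datetime

LOSS_PER_MINUTE = 100_000
IMPACT_START = "10:25"  # assumption: losses begin when 503 errors start


def cumulative_loss(now: str, start: str = IMPACT_START) -> int:
    """Estimated revenue lost between start and now (24-hour HH:MM strings)."""
    fmt = "%H:%M"
    elapsed = datetime.strptime(now, fmt) - datetime.strptime(start, fmt)
    minutes = max(0, int(elapsed.total_seconds() // 60))
    return minutes * LOSS_PER_MINUTE


for checkpoint in ("10:30", "10:35", "10:40"):
    print(f"{checkpoint}  estimated loss so far: ${cumulative_loss(checkpoint):,}")
```

Each further minute of triage costs the same amount, which supports ordering actions by time to effect as well as by severity.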

package/courses/postgresql-query-optimization/scenarios/level-4/high-availability-architecture.yaml ADDED

@@ -0,0 +1,62 @@
+meta:
+  id: high-availability-architecture
+  level: 4
+  course: postgresql-query-optimization
+  type: output
+  description: "Design PostgreSQL high availability — architect streaming replication, automatic failover, and multi-region deployments"
+  tags: [PostgreSQL, high-availability, replication, failover, Patroni, expert]
+state: {}
+trigger: |
+  Your company's SLA requires 99.99% database availability (52
+  minutes of downtime per year). Currently you have a single
+  PostgreSQL instance with no replication. Last year's downtime:
+  18 hours across 4 incidents.
+  Incidents that caused downtime:
+  1. Hardware failure (4 hours): Server disk died, restored from
+     backup. RPO: 1 hour of data lost.
+  2. Failed upgrade (6 hours): PG 15 → 16 upgrade failed, had to
+     roll back. No secondary to fail to.
+  3. DDL lock (3 hours): ALTER TABLE on 500M-row table took
+     AccessExclusive lock, blocking all queries.
+  4. Connection storm (5 hours): Application bug opened 10K
+     connections, server ran out of memory and crashed.
+  Requirements:
+  - 99.99% availability (< 52 minutes downtime/year)
+  - RPO < 5 minutes (maximum data loss in disaster)
+  - RTO < 2 minutes (time to recover from failure)
+  - Zero-downtime deployments (schema changes, PG upgrades)
+  - Multi-region for disaster recovery
+  - Read scaling for analytics workload
+  Architecture options:
+  A. Patroni + streaming replication (self-managed)
+  B. AWS RDS Multi-AZ
+  C. AWS Aurora PostgreSQL
+  D. Citus for horizontal scaling
+  E. Hybrid: Aurora primary + self-managed read replicas
+  Current infrastructure: AWS, 3 availability zones, 2 regions
+  Task: Design the HA architecture. Write: the architecture
+  recommendation (with justification against all 5 options), the
+  failover mechanism (automatic detection and promotion), the
+  replication topology (sync vs async, cascade), the zero-downtime
+  deployment strategy, and the DR plan (cross-region recovery).
+assertions:
+  - type: llm_judge
+    criteria: "Architecture choice is well-justified — evaluates all 5 options against the requirements (availability, RPO, RTO, zero-downtime, multi-region, read scaling), selects the best fit, and explains why the others don't meet requirements or are less optimal"
+    weight: 0.35
+    description: "Well-justified architecture choice"
+  - type: llm_judge
+    criteria: "Failover mechanism achieves RTO < 2 minutes — automatic failure detection (health checks, WAL lag monitoring), automatic promotion (Patroni consensus or Aurora automatic failover), application connection routing (DNS failover, connection string discovery), and the 4 past incidents would all be handled"
+    weight: 0.35
+    description: "RTO-achieving failover"
+  - type: llm_judge
+    criteria: "Zero-downtime deployments and DR are addressed — schema changes use online DDL techniques (CREATE INDEX CONCURRENTLY, pg_repack), PG upgrades use logical replication for live migration, and cross-region DR with async replication achieves RPO < 5 minutes"
+    weight: 0.30
+    description: "Zero-downtime and DR"
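
The relationship between an availability percentage and its yearly downtime budget can be checked with a few lines. The sketch assumes a 365-day year; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(availability_pct: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def achieved_availability_pct(downtime_minutes: float) -> float:
    """Availability actually delivered, given the year's total downtime."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


budget = downtime_budget_minutes(99.99)
last_year = achieved_availability_pct(18 * 60)  # 18 hours across 4 incidents

print(f"99.99% allows {budget:.2f} minutes of downtime per year")
print(f"Last year's 18 hours corresponds to {last_year:.2f}% availability")
```

Last year's 1,080 minutes of downtime is roughly 20 times the budget, so all four incident classes need to be addressed to meet the SLA.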

package/courses/postgresql-query-optimization/scenarios/level-4/optimizer-internals.yaml ADDED

@@ -0,0 +1,69 @@
+meta:
+  id: optimizer-internals
+  level: 4
+  course: postgresql-query-optimization
+  type: output
+  description: "Understand the cost-based optimizer — learn how PostgreSQL estimates query costs and how to influence plan choices"
+  tags: [PostgreSQL, optimizer, cost-model, statistics, planner, expert]
+state: {}
+trigger: |
+  Your team has a critical query that occasionally switches from a
+  fast plan (Index Scan, 5ms) to a slow plan (Seq Scan, 15 seconds).
+  It seems random. The team wants to "force" the fast plan, but you
+  need to understand why the planner sometimes picks the slow one.
+  The query:
+  SELECT o.*, c.name
+  FROM orders o
+  JOIN customers c ON c.id = o.customer_id
+  WHERE o.status = $1 AND o.created_at >= $2;
+  Fast plan (when status = 'pending', 500 rows estimated):
+  Nested Loop Join
+    → Index Scan on orders using idx_orders_status_date
+    → Index Scan on customers (pk)
+  Cost: 1500, Time: 5ms
+  Slow plan (when status = 'completed', 5M rows estimated):
+  Hash Join
+    → Seq Scan on orders (filter: status = 'completed')
+    → Hash on customers
+  Cost: 45000, Time: 15 seconds
+  The problem: With prepared statements, PostgreSQL eventually
+  switches to a generic plan that must work for ALL parameter values.
+  If 'completed' is the most common value, the generic plan chooses
+  Seq Scan (because it's better for 5M rows), even when the actual
+  parameter is 'pending' (500 rows).
+  Additional optimizer mysteries:
+  1. Why does the planner sometimes choose Hash Join over Merge Join
+     even when data is sorted?
+  2. Why does increasing work_mem from 4MB to 256MB make some queries
+     slower? (Hash Join replaces Index Scan)
+  3. The planner estimates 100 rows but actual is 50,000 — why are
+     statistics wrong?
+  4. Two identical queries with different literal values get different
+     plans — what's happening?
+  Task: Explain the optimizer's decision-making process. Write: how
+  cost estimation works (the formula), how statistics influence
+  estimates, why prepared statement plans differ from ad-hoc plans,
+  how to fix the plan instability for the critical query, and when
+  it's appropriate to influence the planner.
+assertions:
+  - type: llm_judge
+    criteria: "Cost model is correctly explained — explains how costs are estimated (each plan node reports a startup cost and a total cost, derived from page fetches and per-row CPU work), how seq_page_cost and random_page_cost affect scan choice, how row estimates come from statistics (histograms, MCVs, distinct values), and why the planner's choice is rational given its estimates"
+    weight: 0.35
+    description: "Correct cost model explanation"
+  - type: llm_judge
+    criteria: "Prepared statement plan instability is addressed — explains custom vs generic plan switching (threshold after ~5 executions), how plan_cache_mode can force custom plans, and alternative solutions (partition by status, partial indexes, or restructuring the query to avoid the problem)"
+    weight: 0.35
+    description: "Plan instability addressed"
+  - type: llm_judge
+    criteria: "All 4 mysteries are explained — Hash vs Merge Join cost comparison, work_mem increase changing plan choice (hash now fits in memory so Hash Join replaces Index Scan), statistics accuracy (insufficient samples, correlated columns), and parameterized query plans using different statistics for different literal values"
+    weight: 0.30
+    description: "All mysteries explained"
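
As a concrete instance of the cost model, the sequential scan estimate can be written out. The sketch follows the formula given in the PostgreSQL documentation (pages read times `seq_page_cost`, plus rows scanned times `cpu_tuple_cost`, plus one `cpu_operator_cost` per row for each filter condition) with the default parameter values. The function name and the table sizes are illustrative.

```python
# Default planner cost parameters.
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01
CPU_OPERATOR_COST = 0.0025


def seq_scan_cost(pages: int, rows: int, filter_conditions: int = 0) -> float:
    """Estimated total cost of a sequential scan.

    pages: number of disk pages in the table
    rows: number of rows scanned
    filter_conditions: number of WHERE-clause operators evaluated per row
    """
    io_cost = pages * SEQ_PAGE_COST
    cpu_cost = rows * CPU_TUPLE_COST
    filter_cost = rows * CPU_OPERATOR_COST * filter_conditions
    return io_cost + cpu_cost + filter_cost


# Illustrative table: 358 pages holding 10,000 rows.
print(seq_scan_cost(pages=358, rows=10_000))
print(seq_scan_cost(pages=358, rows=10_000, filter_conditions=1))
```

Because the scan reads every page regardless of the filter, a more selective predicate does not lower this cost. The planner only moves away from a Seq Scan when an index path's estimated cost, which depends on the estimated row count and `random_page_cost`, comes out lower.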

package/courses/postgresql-query-optimization/scenarios/level-4/performance-sla-design.yaml ADDED

@@ -0,0 +1,58 @@
+meta:
+  id: performance-sla-design
+  level: 4
+  course: postgresql-query-optimization
+  type: output
+  description: "Design database performance SLAs — define latency budgets, throughput targets, and capacity planning for PostgreSQL at scale"
+  tags: [PostgreSQL, SLA, performance, latency, throughput, capacity-planning, expert]
+state: {}
+trigger: |
+  Your company is formalizing database performance SLAs for the first
+  time. The platform serves 50M daily active users across 12 product
+  teams. Each team has different performance needs, but there are no
+  documented standards. Last quarter, 3 incidents were caused by teams
+  with conflicting expectations about what "fast" means.
+  Current state:
+  - 15 PostgreSQL clusters (mix of RDS and self-managed)
+  - No formal latency targets — teams say "it should be fast"
+  - P50 latency: 5ms, P95: 50ms, P99: 500ms, P99.9: 5 seconds
+  - The P99.9 tail is causing user-facing timeouts
+  - No query budget per team — one team's expensive query impacts others
+  - No capacity planning — teams discover limits during traffic spikes
+  - Monitoring exists but no alerting thresholds defined
+  Product teams with different needs:
+  1. Checkout (critical path): Needs <10ms P99 for payment queries
+  2. Search: Needs <100ms P95 for search results
+  3. Analytics: Runs 30-minute aggregation queries, needs throughput
+  4. Notification: Burst writes of 50K/sec during campaigns
+  5. User Profile: Read-heavy, tolerates 50ms P99 but needs 99.99% availability
+  The VP of Engineering asks:
+  - "What should our database SLAs be?"
+  - "How do we prevent one team from degrading another's performance?"
+  - "When do we need to add capacity, and how much lead time do we need?"
+  - "What's the cost of improving P99 from 500ms to 50ms?"
+  Task: Design the performance SLA framework. Write: tiered SLA
+  definitions (by criticality), the latency budget allocation per team,
+  the capacity planning model (when to scale), the isolation strategy
+  (preventing noisy neighbors), and the cost-performance trade-off
+  analysis.
+assertions:
+  - type: llm_judge
+    criteria: "SLA tiers are well-defined — creates at least 3 tiers (e.g., critical/standard/batch) with specific latency targets (P50, P95, P99), throughput limits, and availability requirements per tier. Maps the 5 product teams to appropriate tiers with justification"
+    weight: 0.35
+    description: "Well-defined SLA tiers"
+  - type: llm_judge
+    criteria: "Isolation and capacity planning are addressed — explains workload isolation strategies (separate clusters, connection pool limits per team, resource governor, query timeouts), capacity planning model (utilization thresholds, growth projections, lead time for provisioning), and how to detect approaching limits before they cause incidents"
+    weight: 0.35
+    description: "Isolation and capacity planning"
+  - type: llm_judge
+    criteria: "Cost-performance trade-off is analyzed — explains the diminishing returns of tail latency improvement (P99 500ms→50ms might require 3x infrastructure), quantifies the cost of each SLA tier, and provides a framework for teams to choose their tier based on business impact vs cost"
+    weight: 0.30
+    description: "Cost-performance analysis"
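
SLA tiers are stated as percentiles, so it helps to be explicit about how a percentile is computed from raw latency samples. The sketch below uses the nearest-rank method, one of several common definitions. The sample data is synthetic and the function name is illustrative.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = min(len(ordered), max(1, math.ceil(p * len(ordered) / 100)))
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: 1, 2, ..., 1000.
latencies_ms = list(range(1, 1001))

for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

With this definition, a P99.9 estimate needs at least 1,000 samples per measurement window before it reflects more than the single slowest request, which matters when alerting thresholds are set on the tail.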