npm - dojo.md - Versions diffs - 0.1.0 → 0.2.0 - Mend

dojo.md 0.1.0 → 0.2.0

Files changed (243)

package/courses/postgresql-query-optimization/scenarios/level-1/first-optimization-shift.yaml ADDED

@@ -0,0 +1,77 @@
+meta:
+  id: first-optimization-shift
+  level: 1
+  course: postgresql-query-optimization
+  type: output
+  description: "First optimization shift — triage and fix multiple slow queries during a production performance incident"
+  tags: [PostgreSQL, optimization, shift-simulation, triage, beginner]
+state: {}
+trigger: |
+  You're on-call for a SaaS application when alerts fire at 9:00 AM
+  Monday (peak traffic). Database CPU is at 95% and climbing.
+  Response times have tripled. Users are complaining on Twitter.
+  Your monitoring dashboard shows:
+  Active connections: 180/200 (at limit)
+  Query queue: 45 queries waiting
+  Longest running query: 180 seconds (and counting)
+  Database CPU: 95%
+  Disk I/O: 98% utilization
+  Top 5 queries by current CPU consumption:
+  1. (45% CPU) Running for 180 seconds:
+     SELECT * FROM audit_logs WHERE created_at > '2020-01-01'
+     ORDER BY created_at;
+     (audit_logs has 200M rows, no LIMIT, someone ran it from an
+     admin panel)
+  2. (20% CPU) Running 500 concurrent copies:
+     SELECT u.*, p.* FROM users u
+     JOIN profiles p ON p.user_id = u.id
+     WHERE u.email = $1;
+     (Seq Scan on users — missing index on email)
+  3. (15% CPU) Running 200 concurrent copies:
+     SELECT COUNT(*) FROM products
+     WHERE category_id = $1 AND active = true;
+     (Seq Scan on products — missing index)
+  4. (10% CPU) Running every 5 seconds (background job):
+     UPDATE notifications SET checked_at = NOW()
+     WHERE user_id = $1 AND checked_at IS NULL;
+     (Seq Scan + row lock contention)
+  5. (5% CPU) Running 50 concurrent copies:
+     SELECT * FROM orders WHERE customer_id = $1
+     ORDER BY created_at DESC LIMIT 10;
+     (Using an index but fetching too many columns)
+  Available actions:
+  - Kill specific queries (pg_terminate_backend)
+  - Create indexes (takes time on large tables)
+  - Modify application code (requires deploy)
+  - Add connection pooling (PgBouncer not yet set up)
+  - Scale up database instance
+  Task: Triage and fix this incident. Write: the immediate actions
+  (first 5 minutes), the short-term fixes (next hour), the medium-
+  term improvements (this week), and the monitoring setup to prevent
+  this from recurring.
+assertions:
+  - type: llm_judge
+    criteria: "Immediate triage is correct — kills the 200M-row audit_logs query first (45% CPU), considers killing or limiting long-running queries, and doesn't create indexes under 95% CPU load (would make things worse). Prioritizes by impact"
+    weight: 0.35
+    description: "Correct immediate triage"
+  - type: llm_judge
+    criteria: "Short-term fixes address root causes — index on users(email) for the most frequent query, index on products(category_id, active) for the product count, application-level statement_timeout to prevent unbounded queries, and connection pooling to handle the 180/200 connection pressure"
+    weight: 0.35
+    description: "Root cause fixes"
+  - type: llm_judge
+    criteria: "Prevention measures are practical — sets statement_timeout for different query types, implements pg_stat_statements monitoring, adds connection pooling, puts row limits on admin panel queries, and sets up alerting on slow queries and connection count before hitting limits"
+    weight: 0.30
+    description: "Practical prevention measures"
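
Reviewer note: the triage order this scenario grades for (kill the runaway query first, index later, then cap statement time) could be sketched in SQL roughly as follows. This is a minimal sketch, not part of the package. The 60-second threshold, the pid `12345`, the index names, and the role `admin_panel` are all placeholders assumed for illustration.

```sql
-- Step 1 (first 5 minutes): list active statements running longer than
-- 60 seconds. The threshold is an assumption; tune it to the incident.
SELECT pid,
       now() - query_start AS runtime,
       usename,
       left(query, 80)     AS query_head
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
  AND now() - query_start > interval '60 seconds'
ORDER BY runtime DESC;

-- Terminate the runaway audit_logs backend. 12345 is a placeholder pid
-- taken from the listing above.
SELECT pg_terminate_backend(12345);

-- Step 2 (next hour, once load has dropped): build the missing indexes
-- without blocking writes. CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_users_email
    ON users (email);
CREATE INDEX CONCURRENTLY idx_products_category_active
    ON products (category_id, active);

-- Step 3: cap unbounded statements for the admin panel's role.
-- The role name is hypothetical.
ALTER ROLE admin_panel SET statement_timeout = '30s';
```

The `ALTER ROLE` setting applies only to sessions opened after the change, so existing admin-panel connections would keep their old limit until they reconnect.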

package/courses/postgresql-query-optimization/scenarios/level-1/index-fundamentals.yaml ADDED

@@ -0,0 +1,76 @@
+meta:
+  id: index-fundamentals
+  level: 1
+  course: postgresql-query-optimization
+  type: output
+  description: "Understand index fundamentals — learn when to create B-tree indexes and how they affect query performance"
+  tags: [PostgreSQL, indexes, B-tree, performance, beginner]
+state: {}
+trigger: |
+  You're working on a customer support ticketing system. The database
+  has a tickets table with 5 million rows:
+  CREATE TABLE tickets (
+    id SERIAL PRIMARY KEY,
+    customer_id INTEGER NOT NULL,
+    agent_id INTEGER,
+    status VARCHAR(20) NOT NULL DEFAULT 'open',
+    priority VARCHAR(10) NOT NULL DEFAULT 'medium',
+    subject TEXT NOT NULL,
+    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
+    updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
+    resolved_at TIMESTAMP,
+    category VARCHAR(50),
+    tags TEXT[]
+  );
+  Currently the only index is on id (primary key). These queries are
+  all doing sequential scans on 5M rows:
+  Query 1 — Agent dashboard (runs 500 times/minute):
+  SELECT * FROM tickets
+  WHERE agent_id = 42 AND status = 'open'
+  ORDER BY priority, created_at;
+  Query 2 — Customer history (runs 200 times/minute):
+  SELECT * FROM tickets
+  WHERE customer_id = 12345
+  ORDER BY created_at DESC;
+  Query 3 — Reporting (runs hourly):
+  SELECT status, COUNT(*) FROM tickets
+  WHERE created_at >= '2026-02-01'
+  GROUP BY status;
+  Query 4 — Search (runs 100 times/minute):
+  SELECT * FROM tickets
+  WHERE category = 'billing' AND status IN ('open', 'pending')
+  AND created_at >= NOW() - INTERVAL '7 days';
+  Query 5 — Duplicate detection (runs on every new ticket):
+  SELECT * FROM tickets
+  WHERE customer_id = 12345
+  AND subject = 'Refund request'
+  AND created_at >= NOW() - INTERVAL '24 hours';
+  Task: For each of the 5 queries, recommend the right index. Explain
+  why you chose each index (column order matters!), what type of scan
+  the query will use after indexing, and the trade-offs of adding too
+  many indexes. Then prioritize which indexes to create first based on
+  query frequency and impact.
+assertions:
+  - type: llm_judge
+    criteria: "Index recommendations are correct — multi-column indexes have the right column order (equality columns first, then range/sort columns), each query's access pattern is analyzed (point lookup vs range scan vs sort), and the recommended indexes would eliminate the sequential scans"
+    weight: 0.35
+    description: "Correct index recommendations"
+  - type: llm_judge
+    criteria: "Trade-offs are explained — discusses write overhead (indexes slow down INSERTs/UPDATEs), storage cost, index maintenance (bloat, REINDEX), and why not to create an index for every possible query. Addresses whether some queries can share indexes"
+    weight: 0.35
+    description: "Trade-offs explained"
+  - type: llm_judge
+    criteria: "Prioritization is well-reasoned — considers query frequency (agent dashboard at 500/min is highest priority), business impact, and whether a single index can serve multiple queries. Creates the minimum set of indexes for maximum impact"
+    weight: 0.30
+    description: "Well-reasoned prioritization"
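
Reviewer note: the "equality columns first, then range/sort columns" rule in the first assertion can be illustrated for Queries 1 and 2 as below. This is one plausible answer under the stated schema, not the package's reference solution, and the index names are invented for the example.

```sql
-- Query 1 (agent dashboard): equality columns agent_id and status lead,
-- followed by the ORDER BY columns so the sort can come from the index.
CREATE INDEX CONCURRENTLY idx_tickets_agent_status_prio_created
    ON tickets (agent_id, status, priority, created_at);

-- Query 2 (customer history): equality on customer_id, then created_at
-- for ORDER BY created_at DESC. The same index may also serve Query 5's
-- customer_id + created_at range filter.
CREATE INDEX CONCURRENTLY idx_tickets_customer_created
    ON tickets (customer_id, created_at DESC);

-- Check which scan type the planner actually picks after indexing.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tickets
WHERE agent_id = 42 AND status = 'open'
ORDER BY priority, created_at;
```

Whether the planner uses an Index Scan or a Bitmap Heap Scan depends on table statistics and selectivity, so the `EXPLAIN` output is the thing to verify rather than assume.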

package/courses/postgresql-query-optimization/scenarios/level-1/join-basics.yaml ADDED

@@ -0,0 +1,73 @@
+meta:
+  id: join-basics
+  level: 1
+  course: postgresql-query-optimization
+  type: output
+  description: "Understand JOIN performance — learn how PostgreSQL executes different JOIN types and when each is appropriate"
+  tags: [PostgreSQL, JOIN, nested-loop, hash-join, merge-join, beginner]
+state: {}
+trigger: |
+  You're building a reporting dashboard for a school management
+  system. The queries are slow and you need to understand how JOINs
+  work in PostgreSQL to fix them.
+  Tables:
+  - students (10,000 rows): id, name, grade_level, enrollment_date
+  - courses (500 rows): id, name, department, credits
+  - enrollments (100,000 rows): id, student_id, course_id, semester,
+    grade (references students and courses)
+  - teachers (200 rows): id, name, department
+  - course_assignments (600 rows): course_id, teacher_id, semester
+  Slow queries:
+  Query 1 — Student transcript (Nested Loop is fine here):
+  SELECT c.name, e.grade, e.semester
+  FROM enrollments e
+  JOIN courses c ON c.id = e.course_id
+  WHERE e.student_id = 42;
+  (Returns ~40 rows, but takes 2 seconds)
+  Query 2 — Department report (Hash Join expected):
+  SELECT s.name, c.name, e.grade
+  FROM students s
+  JOIN enrollments e ON e.student_id = s.id
+  JOIN courses c ON c.id = e.course_id
+  WHERE c.department = 'Mathematics';
+  (Returns ~5,000 rows, takes 8 seconds)
+  Query 3 — Full grade export (Merge Join might help):
+  SELECT s.name, c.name, e.grade
+  FROM students s
+  JOIN enrollments e ON e.student_id = s.id
+  JOIN courses c ON c.id = e.course_id
+  ORDER BY s.name, c.name;
+  (Returns 100,000 rows, takes 30 seconds)
+  Query 4 — Students without enrollments:
+  SELECT s.* FROM students s
+  LEFT JOIN enrollments e ON e.student_id = s.id
+  WHERE e.id IS NULL;
+  (Checking for unenrolled students)
+  Task: For each query, explain: which JOIN algorithm PostgreSQL
+  should choose and why, what indexes are needed, the expected
+  execution plan after optimization, and any query rewrites that
+  would help. Then write a general guide for when each JOIN
+  algorithm (Nested Loop, Hash Join, Merge Join) is appropriate.
+assertions:
+  - type: llm_judge
+    criteria: "JOIN algorithm selection is correct — Nested Loop is good for small outer sets (Query 1: student_id = 42 returns few rows), Hash Join for medium sets with equality conditions (Query 2), Merge Join for large pre-sorted datasets (Query 3 with ORDER BY). Each selection is justified by data sizes and access patterns"
+    weight: 0.35
+    description: "Correct JOIN algorithm selection"
+  - type: llm_judge
+    criteria: "Index recommendations are specific — enrollments needs index on student_id (for Query 1 and 3), courses may need index on department (Query 2), and the indexes support the JOIN algorithms (index for Nested Loop inner table, or pre-sorting for Merge Join)"
+    weight: 0.35
+    description: "Specific index recommendations"
+  - type: llm_judge
+    criteria: "General JOIN guide is clear — explains when each algorithm is best (Nested Loop: small outer × indexed inner, Hash Join: medium tables with equality, Merge Join: large pre-sorted tables), how to read the execution plan to see which was chosen, and how work_mem affects Hash Join behavior"
+    weight: 0.30
+    description: "Clear general JOIN guide"
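
Reviewer note: a short sketch of the indexes and the anti-join rewrite this scenario points toward. Index names are assumptions. The join algorithm PostgreSQL actually chooses depends on statistics and settings such as `work_mem`, so the plan should be confirmed with `EXPLAIN`.

```sql
-- Index the foreign keys used as join and filter columns on enrollments.
CREATE INDEX idx_enrollments_student_id ON enrollments (student_id);
CREATE INDEX idx_enrollments_course_id  ON enrollments (course_id);

-- Query 1: with the student_id index, a Nested Loop over ~40 enrollment
-- rows probing courses by primary key is the expected shape.
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.name, e.grade, e.semester
FROM enrollments e
JOIN courses c ON c.id = e.course_id
WHERE e.student_id = 42;

-- Query 4 rewritten as an explicit anti-join. NOT EXISTS states the intent
-- directly and returns the same rows as LEFT JOIN ... WHERE e.id IS NULL.
SELECT s.*
FROM students s
WHERE NOT EXISTS (
    SELECT 1 FROM enrollments e WHERE e.student_id = s.id
);
```

In the `EXPLAIN` output, the join node name (Nested Loop, Hash Join, Merge Join, or an Anti Join variant) shows which algorithm was chosen.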

package/courses/postgresql-query-optimization/scenarios/level-1/n-plus-one-queries.yaml ADDED

@@ -0,0 +1,62 @@
+meta:
+  id: n-plus-one-queries
+  level: 1
+  course: postgresql-query-optimization
+  type: output
+  description: "Fix N+1 query problems — identify and resolve the most common ORM performance anti-pattern"
+  tags: [PostgreSQL, N+1, ORM, performance, anti-pattern, beginner]
+state: {}
+trigger: |
+  Your blog application's homepage takes 12 seconds to load. The page
+  shows the 20 most recent posts with their authors and comment
+  counts. Your monitoring shows the page makes 62 database queries
+  per request.
+  The ORM code (pseudocode):
+  posts = Post.findAll({ limit: 20, order: 'created_at DESC' })
+  for each post in posts:
+    author = User.findById(post.author_id)       // N queries
+    commentCount = Comment.count(post_id: post.id) // N queries
+    categories = post.getCategories()              // N queries
+    post.display(author, commentCount, categories)
+  Database query log (abbreviated):
+  Query 1:  SELECT * FROM posts ORDER BY created_at DESC LIMIT 20
+  Query 2:  SELECT * FROM users WHERE id = 101
+  Query 3:  SELECT COUNT(*) FROM comments WHERE post_id = 1
+  Query 4:  SELECT c.* FROM categories c JOIN post_categories pc
+            ON c.id = pc.category_id WHERE pc.post_id = 1
+  Query 5:  SELECT * FROM users WHERE id = 102
+  Query 6:  SELECT COUNT(*) FROM comments WHERE post_id = 2
+  Query 7:  SELECT c.* FROM categories c JOIN post_categories pc
+            ON c.id = pc.category_id WHERE pc.post_id = 2
+  ... (repeats for all 20 posts: 1 + 20*3 = 61 queries in this log;
+  the monitoring figure of 62 is one higher than the loop accounts for)
+  Table sizes:
+  - posts: 50,000 rows
+  - users: 10,000 rows
+  - comments: 500,000 rows
+  - categories: 50 rows
+  - post_categories: 100,000 rows
+  Task: Rewrite this as optimized SQL. Show: (1) the N+1 problem
+  explained visually (why 62 queries), (2) the optimized version
+  using JOINs (1-2 queries), (3) an alternative using subqueries,
+  (4) how to fix this at the ORM level (eager loading), and (5) the
+  expected performance improvement with EXPLAIN ANALYZE comparison.
+assertions:
+  - type: llm_judge
+    criteria: "N+1 problem is clearly explained — shows why the loop generates 1 + N*3 queries, explains the network round-trip overhead of 62 separate queries, and demonstrates that the database does redundant work (same author fetched multiple times)"
+    weight: 0.35
+    description: "Clear N+1 explanation"
+  - type: llm_judge
+    criteria: "Optimized query is correct — uses JOINs to fetch posts with authors, comment counts (via LEFT JOIN + GROUP BY or subquery), and categories in 1-2 queries instead of 62. The SQL is syntactically correct and handles edge cases (posts with no comments, no categories)"
+    weight: 0.35
+    description: "Correct optimized query"
+  - type: llm_judge
+    criteria: "Multiple solutions are provided — shows the pure SQL approach (JOINs), the ORM approach (eager loading / includes / preload), and discusses when each is appropriate. Includes the expected performance improvement (62 queries → 1-2 queries, 12s → <200ms)"
+    weight: 0.30
+    description: "Multiple solution approaches"
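
Reviewer note: the "1-2 queries" target in the second assertion could look roughly like the two statements below. This is a sketch under assumptions: `u.name` is an assumed column (the scenario only shows `SELECT *` on users), and the output aliases are invented.

```sql
-- Query A: the 20 newest posts with author name and comment count.
-- LEFT JOIN keeps posts whose author row is missing; the scalar
-- subquery yields 0 for posts with no comments.
SELECT p.*,
       u.name AS author_name,
       (SELECT COUNT(*)
        FROM comments c
        WHERE c.post_id = p.id) AS comment_count
FROM (
    SELECT * FROM posts ORDER BY created_at DESC LIMIT 20
) AS p
LEFT JOIN users u ON u.id = p.author_id
ORDER BY p.created_at DESC;

-- Query B: all categories for those same 20 posts in one round trip.
-- The application groups the rows by post_id afterwards.
SELECT pc.post_id, c.*
FROM post_categories pc
JOIN categories c ON c.id = pc.category_id
WHERE pc.post_id IN (
    SELECT id FROM posts ORDER BY created_at DESC LIMIT 20
);
```

Fetching categories separately avoids multiplying post rows by category count, which would otherwise distort a `COUNT(*)` over a joined comments table.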

package/courses/postgresql-query-optimization/scenarios/level-1/query-rewriting-basics.yaml ADDED

@@ -0,0 +1,69 @@
+meta:
+  id: query-rewriting-basics
+  level: 1
+  course: postgresql-query-optimization
+  type: output
+  description: "Rewrite queries for performance — transform common slow query patterns into efficient alternatives"
+  tags: [PostgreSQL, query-rewriting, optimization, patterns, beginner]
+state: {}
+trigger: |
+  You've been asked to review and optimize a set of queries from a
+  legacy application. Each query works correctly but performs poorly.
+  The table has 5 million invoices.
+  Query 1 — Correlated subquery (runs per row):
+  SELECT i.*, (SELECT name FROM customers c
+               WHERE c.id = i.customer_id) as customer_name
+  FROM invoices i
+  WHERE i.amount > 1000;
+  (Executes the subquery for each of the 200,000 matching rows)
+  Query 2 — DISTINCT instead of proper JOIN:
+  SELECT DISTINCT i.*
+  FROM invoices i
+  JOIN line_items li ON li.invoice_id = i.id
+  WHERE li.product_id = 42;
+  (DISTINCT removes duplicates caused by the JOIN, very expensive
+  on wide rows)
+  Query 3 — Unnecessary sorting:
+  SELECT * FROM invoices
+  WHERE status = 'overdue'
+  ORDER BY id
+  LIMIT 100;
+  (The ORDER BY id is meaningless if you just want any 100 rows)
+  Query 4 — Counting all to check existence:
+  IF (SELECT COUNT(*) FROM invoices
+      WHERE customer_id = 42 AND status = 'unpaid') > 0 THEN ...
+  (Counts all matching rows just to check if any exist)
+  Query 5 — Multiple queries that could be one:
+  total = SELECT COUNT(*) FROM invoices WHERE status = 'paid';
+  total_amount = SELECT SUM(amount) FROM invoices WHERE status = 'paid';
+  avg_amount = SELECT AVG(amount) FROM invoices WHERE status = 'paid';
+  (Three full table scans for the same filter)
+  Query 6 — Inefficient pagination:
+  SELECT * FROM invoices ORDER BY created_at OFFSET 999980 LIMIT 20;
+  (OFFSET 999980 scans and discards almost 1M rows)
+  Task: Rewrite each query with the optimized version. For each,
+  explain: why the original is slow, what the optimized version does
+  differently, and the expected performance improvement.
+assertions:
+  - type: llm_judge
+    criteria: "All 6 queries are correctly rewritten — correlated subquery becomes a JOIN, DISTINCT becomes EXISTS or proper deduplication, unnecessary ORDER BY is removed, COUNT for existence becomes EXISTS, three queries become one with multiple aggregates, and OFFSET pagination becomes keyset/cursor pagination"
+    weight: 0.35
+    description: "All queries correctly rewritten"
+  - type: llm_judge
+    criteria: "Explanations show why originals are slow — correlated subquery executes N times, DISTINCT sorts/hashes wide rows, COUNT(*) scans all matching rows vs EXISTS stops at first, OFFSET scans and discards rows. Each explanation identifies the specific performance anti-pattern"
+    weight: 0.35
+    description: "Clear explanations of slowness"
+  - type: llm_judge
+    criteria: "Performance improvements are quantified — estimates the improvement for each rewrite (e.g., EXISTS vs COUNT: O(1) vs O(N), keyset pagination: constant time vs linear time with offset). The improvements are realistic for the 5M row table"
+    weight: 0.30
+    description: "Quantified performance improvements"
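
Reviewer note: four of the six rewrites named in the first assertion, sketched in PostgreSQL SQL. The literal values in the keyset example are placeholders standing in for the last row of the previous page, and the keyset form assumes an index on `(created_at, id)`.

```sql
-- Query 2: EXISTS instead of JOIN + DISTINCT, so each invoice appears once
-- and no deduplication of wide rows is needed.
SELECT i.*
FROM invoices i
WHERE EXISTS (
    SELECT 1 FROM line_items li
    WHERE li.invoice_id = i.id AND li.product_id = 42
);

-- Query 4: EXISTS can stop at the first matching row instead of counting all.
SELECT EXISTS (
    SELECT 1 FROM invoices
    WHERE customer_id = 42 AND status = 'unpaid'
) AS has_unpaid;

-- Query 5: one pass over the 'paid' rows computes all three aggregates.
SELECT COUNT(*)    AS total,
       SUM(amount) AS total_amount,
       AVG(amount) AS avg_amount
FROM invoices
WHERE status = 'paid';

-- Query 6: keyset pagination. The id column is added as a tiebreaker so
-- the ordering is total; the two literals are placeholder values.
SELECT *
FROM invoices
WHERE (created_at, id) > ('2026-01-15 10:30:00', 123456)
ORDER BY created_at, id
LIMIT 20;
```

Keyset pagination trades random page access for speed: the client can only move to the next page from a known last row, not jump directly to page 50,000.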

package/courses/postgresql-query-optimization/scenarios/level-1/select-star-problems.yaml ADDED

@@ -0,0 +1,69 @@
+meta:
+  id: select-star-problems
+  level: 1
+  course: postgresql-query-optimization
+  type: output
+  description: "Eliminate SELECT * problems — understand why selecting all columns hurts performance and how to fix it"
+  tags: [PostgreSQL, SELECT, columns, performance, anti-pattern, beginner]
+state: {}
+trigger: |
+  Your API has a /users endpoint that returns user profiles. The
+  endpoint query is:
+  SELECT * FROM users WHERE active = true ORDER BY name LIMIT 50;
+  The users table has 25 columns including:
+  - id, name, email (used by the API response)
+  - password_hash, salt (NEVER should be sent to clients)
+  - profile_photo BYTEA (avg 500KB per row)
+  - resume_pdf BYTEA (avg 2MB per row)
+  - login_history JSONB (avg 100KB per row)
+  - preferences JSONB (avg 10KB per row)
+  - 15 other metadata columns
+  Performance problems caused by SELECT *:
+  1. Data transfer: Each row is ~2.6MB instead of ~200 bytes (name +
+     email). For 50 rows, that's 130MB instead of 10KB.
+  2. Memory usage: The application ORM hydrates all 25 columns into
+     objects, consuming 130MB of heap memory per request.
+  3. Security leak: password_hash and salt are fetched (even if the
+     API doesn't serialize them, they're in memory).
+  4. Index-only scan prevented: SELECT * forces PostgreSQL to visit
+     the heap for every row. A covering index such as (active, name)
+     INCLUDE (id, email) could serve SELECT id, name, email with an
+     index-only scan, but SELECT * cannot be covered in practice.
+  5. Schema evolution risk: When a new column is added (like a 5MB
+     attachment_data), all SELECT * queries automatically get slower
+     without any code change.
+  Other scenarios to address:
+  - COUNT(*) vs COUNT(1) — is there a difference?
+  - SELECT * in subqueries — does the optimizer help?
+  - SELECT * in EXISTS — does it matter?
+  - When is SELECT * actually OK? (ad-hoc queries, CTEs?)
+  Task: Rewrite the query with explicit columns, explain each of the
+  5 problems in detail, address the 4 additional scenarios, and
+  provide a coding guideline for when to use SELECT * vs explicit
+  columns.
+assertions:
+  - type: llm_judge
+    criteria: "All 5 problems are clearly explained — data transfer bloat (130MB vs 10KB), memory waste, security risk (password in memory), prevented index-only scans, and schema evolution risk. Each explanation connects the technical issue to real impact (latency, memory, security)"
+    weight: 0.35
+    description: "All problems clearly explained"
+  - type: llm_judge
+    criteria: "Additional scenarios are correctly answered: COUNT(*) and COUNT(1) return the same result in PostgreSQL, SELECT * in subqueries may or may not be optimized by the planner, SELECT * in EXISTS is fine because only existence is checked, and SELECT * in development/debugging contexts is acceptable"
+    weight: 0.35
+    description: "Additional scenarios correctly answered"
+  - type: llm_judge
+    criteria: "Practical guidelines are provided — when to always use explicit columns (production code, APIs), when SELECT * is acceptable (ad-hoc queries, EXISTS), how to enforce this in code review or linting, and the covering index opportunity when selecting specific columns"
+    weight: 0.30
+    description: "Practical guidelines"
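
Reviewer note: the explicit-column rewrite and the covering-index opportunity mentioned in the last assertion might be written as below. The index name is invented, and `INCLUDE` requires PostgreSQL 11 or later.

```sql
-- Fetch only the columns the API response uses.
SELECT id, name, email
FROM users
WHERE active = true
ORDER BY name
LIMIT 50;

-- Covering index: active and name are key columns (filter + sort order),
-- id and email ride along as payload so the heap need not be visited.
CREATE INDEX CONCURRENTLY idx_users_active_name_cover
    ON users (active, name) INCLUDE (id, email);

-- Confirm the plan; look for "Index Only Scan" and a low "Heap Fetches" count.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, name, email
FROM users
WHERE active = true
ORDER BY name
LIMIT 50;
```

An index-only scan still falls back to heap fetches for pages not marked all-visible, so the benefit depends on the table being vacuumed regularly.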

package/courses/postgresql-query-optimization/scenarios/level-1/slow-query-diagnosis.yaml ADDED

@@ -0,0 +1,63 @@
+meta:
+  id: slow-query-diagnosis
+  level: 1
+  course: postgresql-query-optimization
+  type: output
+  description: "Diagnose slow queries — use pg_stat_statements and query logs to find and prioritize the worst-performing queries"
+  tags: [PostgreSQL, pg_stat_statements, slow-queries, diagnosis, beginner]
+state: {}
+trigger: |
+  Your application's database response times have been creeping up
+  over the last month. Users are complaining about slowness but
+  nobody knows which queries are the problem. Your senior DBA is
+  on vacation and left you a note: "Enable pg_stat_statements and
+  check the slow query log."
+  You enable pg_stat_statements and after 24 hours, here's the
+  top 10 by total_exec_time:
+  | calls    | mean_ms | total_ms   | query (truncated)                |
+  |----------|---------|------------|----------------------------------|
+  | 2,400,000| 0.5     | 1,200,000  | SELECT id FROM sessions WHERE... |
+  | 50,000   | 200     | 10,000,000 | SELECT * FROM orders JOIN...     |
+  | 500,000  | 15      | 7,500,000  | UPDATE products SET view_count...|
+  | 10,000   | 500     | 5,000,000  | SELECT * FROM analytics WHERE...|
+  | 1,000    | 3,000   | 3,000,000  | SELECT * FROM reports WHERE...   |
+  | 100,000  | 25      | 2,500,000  | INSERT INTO events (...)...      |
+  | 5,000    | 400     | 2,000,000  | DELETE FROM sessions WHERE...    |
+  | 800,000  | 2       | 1,600,000  | SELECT 1 FROM users WHERE id=...|
+  | 200,000  | 5       | 1,000,000  | SELECT name FROM categories...   |
+  | 50       | 15,000  | 750,000    | SELECT * FROM users CROSS JOIN...|
+  Questions to answer:
+  1. Which query should you optimize first? (It's not the one with
+     the longest single execution time)
+  2. The session check query (row 1) is fast per call but runs 2.4M
+     times — is this a problem?
+  3. The reports query (row 5) runs only 1,000 times but takes 3
+     seconds each — how do you approach this differently than the
+     orders query?
+  4. The CROSS JOIN query (row 10) takes 15 seconds — what's likely
+     wrong?
+  Task: Analyze the pg_stat_statements data and create an optimization
+  priority list. For each of the top 10 queries, explain: whether it
+  needs optimization, what the likely issue is, how to investigate
+  further, and the expected impact of fixing it. Then explain how to
+  set up ongoing slow query monitoring.
+assertions:
+  - type: llm_judge
+    criteria: "Prioritization is correct — ranks by total_exec_time (not mean or calls alone), identifies the orders query (50K calls × 200ms = 10M ms total) as the highest priority, explains why a fast-but-frequent query can consume more time than a slow-but-rare one"
+    weight: 0.35
+    description: "Correct query prioritization"
+  - type: llm_judge
+    criteria: "Analysis of each query is insightful — identifies likely issues (missing indexes for orders/analytics, CROSS JOIN likely missing a WHERE clause, session checks might be unnecessary with caching, view_count UPDATE might need batching). Investigation steps are practical (EXPLAIN ANALYZE, check indexes, check table sizes)"
+    weight: 0.35
+    description: "Insightful per-query analysis"
+  - type: llm_judge
+    criteria: "Monitoring setup is practical — explains how to configure pg_stat_statements (shared_preload_libraries), log_min_duration_statement for slow query logging, and how to set up ongoing monitoring (periodic snapshots of pg_stat_statements, alerting on regression)"
+    weight: 0.30
+    description: "Practical monitoring setup"
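
Reviewer note: a sketch of how the table in this scenario could be produced and how slow-query logging might be switched on. It assumes PostgreSQL 13 or later (where the columns are named `total_exec_time` and `mean_exec_time`) and that `shared_preload_libraries = 'pg_stat_statements'` is already set in postgresql.conf, which requires a server restart. The 500 ms threshold is an arbitrary example.

```sql
-- Create the extension in the database being monitored.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 statements by cumulative execution time.
-- The casts to numeric are needed because round(double precision, int)
-- does not exist in PostgreSQL.
SELECT calls,
       round(mean_exec_time::numeric, 1) AS mean_ms,
       round(total_exec_time::numeric)   AS total_ms,
       left(query, 60)                   AS query_head
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Log every statement slower than 500 ms, then reload the configuration.
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();
```

Sorting by `total_exec_time` puts the orders query (50,000 calls at 200 ms, 10,000,000 ms total) at the top, which matches the priority the first assertion expects.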

package/courses/postgresql-query-optimization/scenarios/level-1/vacuum-and-statistics.yaml ADDED

@@ -0,0 +1,62 @@
+meta:
+  id: vacuum-and-statistics
+  level: 1
+  course: postgresql-query-optimization
+  type: output
+  description: "Understand VACUUM and statistics — learn why PostgreSQL needs maintenance and how outdated statistics cause bad query plans"
+  tags: [PostgreSQL, VACUUM, autovacuum, statistics, ANALYZE, beginner]
+state: {}
+trigger: |
+  Your orders table has 10 million rows. Yesterday it was fast.
+  Today, a simple query that used to take 50ms now takes 15 seconds:
+  SELECT * FROM orders WHERE status = 'pending'
+  AND created_at >= '2026-02-26';
+  EXPLAIN shows: Seq Scan on orders (estimated rows: 5,000,000)
+  But the actual result is only 150 rows.
+  Investigation reveals:
+  1. The table statistics say 50% of rows have status = 'pending'
+     (true 6 months ago, now only 0.01% are pending)
+  2. A bulk UPDATE last night changed 8M rows from 'pending' to
+     'completed' but ANALYZE hasn't run since
+  3. The table has 8M dead tuples (from the bulk UPDATE) that
+     VACUUM hasn't cleaned up yet
+  4. Table bloat: the table occupies 4GB on disk but only 2GB of
+     that is live data
+  Your DBA explains: "PostgreSQL's query planner uses statistics to
+  estimate how many rows a query will return. If the statistics are
+  wrong, the planner picks the wrong execution plan."
+  Additional scenarios:
+  - Why did autovacuum not run? (It's been blocked by a long-running
+    transaction from a reporting query)
+  - What happens when dead tuples accumulate? (Table bloat, slower
+    sequential scans, index bloat)
+  - What's the difference between VACUUM and VACUUM FULL?
+  - What does ANALYZE do and when should you run it?
+  Task: Explain what happened, fix the immediate problem, and set up
+  ongoing maintenance. Write: why the statistics were wrong and how
+  they led to a bad plan, the immediate fix (ANALYZE + VACUUM), the
+  difference between VACUUM, VACUUM FULL, and VACUUM ANALYZE, the
+  autovacuum configuration for high-update tables, and how to prevent
+  long-running transactions from blocking autovacuum.
+assertions:
+  - type: llm_judge
+    criteria: "Root cause is clearly explained — outdated statistics caused the planner to estimate 5M rows when only 150 existed, leading it to choose Seq Scan instead of Index Scan. ANALYZE updates statistics, fixing the plan. Dead tuples from the bulk UPDATE caused bloat"
+    weight: 0.35
+    description: "Clear root cause explanation"
+  - type: llm_judge
+    criteria: "Maintenance operations are correctly distinguished: VACUUM marks dead-tuple space as reusable without blocking reads or writes (but generally doesn't shrink the file), VACUUM FULL rewrites the table under an ACCESS EXCLUSIVE lock (blocking all access while it runs), ANALYZE updates statistics only, VACUUM ANALYZE does both. Explains when each is appropriate"
+    weight: 0.35
+    description: "Correct maintenance distinctions"
+  - type: llm_judge
+    criteria: "Autovacuum configuration is practical — explains key parameters (autovacuum_vacuum_threshold, autovacuum_vacuum_scale_factor) for high-update tables, how to set per-table autovacuum settings, and how to prevent long-running transactions from blocking autovacuum (statement_timeout, idle_in_transaction_session_timeout)"
+    weight: 0.30
+    description: "Practical autovacuum configuration"
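
Reviewer note: the immediate fix and the per-table autovacuum tuning described in this scenario could be sketched as follows. The scale-factor values and the role name `reporting` are illustrative assumptions, not recommendations from the package.

```sql
-- Immediate fix: reclaim dead tuples and refresh planner statistics.
VACUUM (ANALYZE, VERBOSE) orders;

-- Check dead-tuple counts and when autovacuum/autoanalyze last ran.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Per-table settings so autovacuum triggers after roughly 2% of rows change
-- and autoanalyze after roughly 1%. Values are illustrative.
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor  = 0.02,
    autovacuum_analyze_scale_factor = 0.01
);

-- Find the oldest open transactions, which can hold back dead-tuple cleanup.
SELECT pid,
       now() - xact_start AS xact_age,
       state,
       left(query, 60)    AS query_head
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 5;

-- Limit idle-in-transaction sessions for the reporting role (hypothetical name).
ALTER ROLE reporting SET idle_in_transaction_session_timeout = '5min';
```

Plain `VACUUM` will not return the 2GB of bloat to the operating system; it only makes that space reusable for future rows in the same table.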

package/courses/postgresql-query-optimization/scenarios/level-1/where-clause-optimization.yaml ADDED

@@ -0,0 +1,74 @@
+meta:
+  id: where-clause-optimization
+  level: 1
+  course: postgresql-query-optimization
+  type: output
+  description: "Optimize WHERE clauses — rewrite query filters to use indexes effectively and avoid common pitfalls"
+  tags: [PostgreSQL, WHERE, filters, optimization, SARGable, beginner]
+state: {}
+trigger: |
+  Your team has a users table (2 million rows) with indexes on email,
+  created_at, and status. But several queries are still doing full
+  table scans despite having indexes on the filtered columns.
+  Index definitions:
+  CREATE INDEX idx_users_email ON users(email);
+  CREATE INDEX idx_users_created ON users(created_at);
+  CREATE INDEX idx_users_status ON users(status);
+  Queries that aren't using indexes (and why):
+  Query 1 — Function on indexed column:
+  SELECT * FROM users WHERE LOWER(email) = 'john@example.com';
+  (Index on email exists but LOWER() prevents its use)
+  Query 2 — Implicit type cast:
+  SELECT * FROM users WHERE id = '12345';
+  (id is INTEGER, '12345' is TEXT — implicit cast)
+  Query 3 — Leading wildcard:
+  SELECT * FROM users WHERE email LIKE '%@gmail.com';
+  (Leading wildcard can't use B-tree index)
+  Query 4 — OR with different columns:
+  SELECT * FROM users WHERE email = 'john@example.com'
+  OR status = 'suspended';
+  (OR across columns prevents single index use)
+  Query 5 — Math on indexed column:
+  SELECT * FROM users WHERE created_at + INTERVAL '30 days' > NOW();
+  (Math on the column prevents index use)
+  Query 6 — NOT IN with subquery:
+  SELECT * FROM users WHERE id NOT IN
+  (SELECT user_id FROM deleted_users);
+  (NOT IN with NULL handling issues)
+  Query 7 — Coalesce on indexed column:
+  SELECT * FROM users WHERE COALESCE(status, 'unknown') = 'active';
+  (Function wrapping prevents index use)
+  Query 8 — Comparing to column instead of constant:
+  SELECT * FROM users WHERE created_at > updated_at;
+  (Two columns compared — no single index helps)
+  Task: Rewrite each of the 8 queries to use indexes effectively.
+  Explain the SARGable concept (Search ARGument ABLE) and why each
+  original query violates it. Show the EXPLAIN output before and
+  after each rewrite.
+assertions:
+  - type: llm_judge
+    criteria: "All 8 queries are correctly rewritten — LOWER(email) uses a functional index or citext, type cast is fixed by matching types, leading wildcard uses pg_trgm or reverse index, OR uses UNION, math is moved to the constant side, NOT IN is replaced with NOT EXISTS or LEFT JOIN, COALESCE is eliminated, and column comparison is addressed"
+    weight: 0.35
+    description: "All queries correctly rewritten"
+  - type: llm_judge
+    criteria: "SARGable concept is clearly explained — defines what makes a WHERE clause SARGable (index-usable), the general rule (don't transform the indexed column), and provides the mental model for writing index-friendly filters"
+    weight: 0.35
+    description: "Clear SARGable explanation"
+  - type: llm_judge
+    criteria: "Before/after comparison shows the improvement — demonstrates that the rewritten queries use index scans instead of sequential scans, with approximate row counts and timing to illustrate the performance difference on 2M rows"
+    weight: 0.30
+    description: "Improvement demonstration"
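
Reviewer note: four of the eight rewrites listed in the first assertion, sketched in PostgreSQL SQL. The expression index name is invented. The `UNION` form assumes every column in `users` has a type that supports equality comparison, and the `NOT EXISTS` form deliberately differs from `NOT IN` when `deleted_users.user_id` contains NULLs.

```sql
-- Query 1: an expression index makes LOWER(email) = '...' indexable as written.
CREATE INDEX idx_users_email_lower ON users (LOWER(email));

-- Query 4: split the OR so each branch can use its own index.
-- UNION removes rows matched by both branches.
SELECT * FROM users WHERE email = 'john@example.com'
UNION
SELECT * FROM users WHERE status = 'suspended';

-- Query 5: move the arithmetic to the constant side so the bare column
-- is compared against a computed value.
SELECT * FROM users
WHERE created_at > NOW() - INTERVAL '30 days';

-- Query 6: NOT EXISTS as an anti-join. Unlike NOT IN, a NULL user_id in
-- deleted_users does not cause the whole result to come back empty.
SELECT u.*
FROM users u
WHERE NOT EXISTS (
    SELECT 1 FROM deleted_users d WHERE d.user_id = u.id
);
```

PostgreSQL can sometimes handle the original Query 4 with a BitmapOr over both indexes, so comparing the `EXPLAIN` output of the two forms is worthwhile before committing to the rewrite.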