npm - dojo.md - Versions diffs - 0.1.0 → 0.2.0 - Mend

dojo.md 0.1.0 → 0.2.0

Files changed (243)

package/courses/postgresql-query-optimization/scenarios/level-2/window-function-optimization.yaml ADDED

@@ -0,0 +1,63 @@
+meta:
+  id: window-function-optimization
+  level: 2
+  course: postgresql-query-optimization
+  type: output
+  description: "Optimize window functions — reduce sorting overhead and memory usage in analytical queries"
+  tags: [PostgreSQL, window-functions, analytics, sorting, intermediate]
+state: {}
+trigger: |
+  Your analytics team writes queries with multiple window functions
+  that are extremely slow. The DBA says "window functions cause
+  re-sorting" but the team doesn't understand why.
+  Problem query (takes 45 seconds on 10M rows):
+  SELECT
+    id, department, salary, hire_date,
+    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC)
+      as salary_rank,
+    SUM(salary) OVER (PARTITION BY department) as dept_total,
+    AVG(salary) OVER (PARTITION BY department) as dept_avg,
+    RANK() OVER (ORDER BY salary DESC) as global_rank,
+    LAG(salary) OVER (PARTITION BY department ORDER BY hire_date)
+      as prev_salary,
+    SUM(salary) OVER (ORDER BY hire_date ROWS BETWEEN UNBOUNDED
+      PRECEDING AND CURRENT ROW) as running_total
+  FROM employees;
+  EXPLAIN ANALYZE shows 5 separate Sort operations:
+  - Sort 1: PARTITION BY department ORDER BY salary DESC (for salary_rank)
+  - Sort 2: PARTITION BY department (for dept_total, dept_avg)
+  - Sort 3: ORDER BY salary DESC (for global_rank)
+  - Sort 4: PARTITION BY department ORDER BY hire_date (for prev_salary)
+  - Sort 5: ORDER BY hire_date (for running_total)
+  Each sort processes all 10M rows. Five sorts × 10M rows = slow.
+  Additional challenges:
+  1. The query spills to disk (work_mem too small for 10M-row sorts)
+  2. Can any window functions share a sort operation?
+  3. Is there a way to pre-compute some of these values?
+  4. The running_total window frame is expensive — alternatives?
+  Task: Optimize this query. Show: how to consolidate window
+  definitions to minimize sorts, how to use named WINDOW clauses,
+  which window functions can share sort operations, work_mem tuning
+  for window functions, and alternative approaches for running totals
+  at scale.
+assertions:
+  - type: llm_judge
+    criteria: "Window consolidation reduces sorts — groups window functions with compatible PARTITION BY and ORDER BY into shared window definitions using named WINDOW clauses. dept_total and dept_avg share the same window, salary_rank and prev_salary may be reorganized to share sorts"
+    weight: 0.35
+    description: "Sort consolidation"
+  - type: llm_judge
+    criteria: "work_mem and disk spill are addressed — explains that each Sort node uses work_mem independently, so 5 sorts × work_mem must fit in RAM. Recommends appropriate work_mem for the query size and shows how to set it per-session for analytical queries"
+    weight: 0.35
+    description: "Memory tuning"
+  - type: llm_judge
+    criteria: "Alternative approaches for scale are provided — suggests pre-computing aggregates in materialized views, using CTEs to separate window function groups, or restructuring to avoid global window functions (ORDER BY salary DESC without PARTITION) that require sorting the entire table"
+    weight: 0.30
+    description: "Scalable alternatives"
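
The question "which window functions can share a sort operation?" in the scenario above can be reasoned about with a small model. The sketch below is a simplified illustration, not PostgreSQL's planner logic: it assumes a window can reuse an existing sort whenever its own sort key (PARTITION BY columns followed by ORDER BY columns) is a leading prefix of that sort's key. The function name `minimal_sorts` and the tuple encoding of sort keys are hypothetical.

```python
from typing import List, Tuple

SortKey = Tuple[str, ...]

def minimal_sorts(window_keys: List[SortKey]) -> List[SortKey]:
    """Greedy model: a window reuses a sort whose key starts with its own key."""
    chosen: List[SortKey] = []
    # Visit the longest keys first so shorter keys can attach to them as prefixes.
    for key in sorted(set(window_keys), key=len, reverse=True):
        if not any(existing[:len(key)] == key for existing in chosen):
            chosen.append(key)
    return chosen

# Sort keys for the five windows in the problem query.
windows = [
    ("department", "salary DESC"),  # salary_rank
    ("department",),                # dept_total, dept_avg
    ("salary DESC",),               # global_rank
    ("department", "hire_date"),    # prev_salary
    ("hire_date",),                 # running_total
]

sorts = minimal_sorts(windows)
print(len(sorts), "sorts needed under this model")
```

Under this model the PARTITION BY department window rides along with a (department, ...) sort, while the two unpartitioned windows each still need a full-table sort of their own.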

package/courses/postgresql-query-optimization/scenarios/level-3/advanced-optimization-shift.yaml ADDED

@@ -0,0 +1,71 @@
+meta:
+  id: advanced-optimization-shift
+  level: 3
+  course: postgresql-query-optimization
+  type: output
+  description: "Advanced optimization shift — rescue a failing database under production load with multiple concurrent crises"
+  tags: [PostgreSQL, optimization, shift-simulation, production, advanced]
+state: {}
+trigger: |
+  You're the senior DBA called in to a production emergency. The main
+  database for a fintech platform (processing $10M/day) is failing
+  under load.
+  System: PostgreSQL 16, 64 cores, 512GB RAM, 10TB NVMe
+  Database: 3TB across 200 tables
+  Load: 50,000 queries/second (normally 30,000)
+  Crisis timeline:
+  9:00 AM — Traffic spike (Black Friday sale):
+  - Queries/second jumped from 30K to 50K
+  - Connection pool exhausted (200/200 connections)
+  - New customers getting "connection refused" errors
+  9:15 AM — Table bloat discovered:
+  - The transactions table (500M rows, 200GB) is 60% bloated
+  - Autovacuum has been failing for 3 weeks (blocked by a cron job
+    that opens a transaction and never closes it)
+  - Dead tuples: 300M
+  - Sequential scans on this bloated table are 2.5x slower
+  9:30 AM — Query plan regression:
+  - The pricing query (runs 10,000 times/second) switched from Index
+    Scan to Seq Scan after an ANALYZE ran with stale statistics
+  - This single query is now consuming 40% of CPU
+  - Before regression: 2ms per query
+  - After regression: 50ms per query
+  9:45 AM — Disk space alert:
+  - WAL directory is 500GB (normally 10GB)
+  - A stuck replication slot from a decommissioned replica is
+    preventing WAL cleanup
+  - Disk: 9.5TB / 10TB used (95%)
+  10:00 AM — Deadlock storm:
+  - The payment processing code has a deadlock pattern:
+    Process A: Lock order (accounts → transactions)
+    Process B: Lock order (transactions → accounts)
+  - 100 deadlocks in the last 15 minutes
+  - Each deadlock rolls back one transaction → customer sees error
+  Task: Stabilize the database. Write: the triage order (what to fix
+  first), the immediate actions for each crisis, the root cause fixes,
+  and the post-incident improvements to prevent this combination of
+  issues from recurring.
+assertions:
+  - type: llm_judge
+    criteria: "Triage order is correct — disk space first (95% full is existential: drop replication slot to free WAL), then query plan regression (40% CPU from one query: force correct plan or recreate statistics), then connections (PgBouncer or increase max_connections), then deadlocks (fix lock ordering), then bloat (long-term fix)"
+    weight: 0.35
+    description: "Correct triage order"
+  - type: llm_judge
+    criteria: "Immediate actions are specific — pg_drop_replication_slot() for the stuck slot, SET enable_seqscan = off or pin the pricing query plan, kill the cron job holding the open transaction, PgBouncer for connection pooling, and fix the lock ordering in application code"
+    weight: 0.35
+    description: "Specific immediate actions"
+  - type: llm_judge
+    criteria: "Post-incident improvements prevent recurrence — replication slot monitoring with alerting, idle_in_transaction_session_timeout to prevent autovacuum blocking, query plan monitoring to detect regressions, connection pooling as default architecture, and deadlock detection in CI/CD testing"
+    weight: 0.30
+    description: "Recurrence prevention"

package/courses/postgresql-query-optimization/scenarios/level-3/connection-pooling.yaml ADDED

@@ -0,0 +1,60 @@
+meta:
+  id: connection-pooling
+  level: 3
+  course: postgresql-query-optimization
+  type: output
+  description: "Implement connection pooling — deploy and tune PgBouncer for high-concurrency applications"
+  tags: [PostgreSQL, PgBouncer, connection-pooling, concurrency, advanced]
+state: {}
+trigger: |
+  Your application is hitting PostgreSQL connection limits and
+  performance is degrading. During peak traffic:
+  Symptoms:
+  - max_connections = 200, all slots used
+  - New connections queue and timeout after 5 seconds
+  - Each connection consumes ~10MB of RAM (200 × 10MB = 2GB just for
+    connections)
+  - Connection creation takes 50-100ms (TCP + SSL + auth)
+  - Application creates and destroys connections per request
+    (no pooling)
+  - Serverless functions (Lambda) open hundreds of connections
+    simultaneously
+  Application stack:
+  - 20 application servers (Node.js)
+  - 50 serverless functions (AWS Lambda)
+  - 5 background job workers
+  - 3 analytics/reporting services
+  - 1 admin dashboard
+  Current connection distribution:
+  - App servers: 20 × 10 connections = 200 (at the limit!)
+  - Lambda: 50 × 1-5 connections = 50-250 (spiky, blows past limit)
+  - Workers: 5 × 5 connections = 25
+  - Analytics: 3 × 10 connections = 30 (long-running queries)
+  - Admin: 1 × 5 connections = 5
+  Total potential: 310-510 connections (but limit is 200)
+  Task: Deploy PgBouncer to solve the connection problem. Write:
+  the PgBouncer configuration (pool mode, pool sizes per application
+  type), the architecture diagram (where PgBouncer sits), the
+  migration plan (how to switch applications to PgBouncer), the
+  limitations (what breaks in transaction pooling mode), and the
+  monitoring setup.
+assertions:
+  - type: llm_judge
+    criteria: "PgBouncer configuration is correct — uses transaction pooling mode for most apps (not session mode), configures different pool sizes per database/user (larger for app servers, smaller for Lambda), and the total server-side connections are well under max_connections"
+    weight: 0.35
+    description: "Correct PgBouncer configuration"
+  - type: llm_judge
+    criteria: "Limitations of transaction pooling are addressed — protocol-level prepared statements don't work in transaction mode unless PgBouncer 1.21+ tracks them via max_prepared_statements, session variables are lost between transactions, LISTEN/NOTIFY doesn't work, advisory locks need session mode, and the analytics long-running queries may need a separate pool or direct connection"
+    weight: 0.35
+    description: "Transaction pooling limitations"
+  - type: llm_judge
+    criteria: "Architecture and migration plan are practical — PgBouncer deployment location (same host, sidecar, or separate), application configuration changes (connection string swap), monitoring (pgbouncer SHOW commands, stats database), and a phased rollout that doesn't risk all traffic at once"
+    weight: 0.30
+    description: "Practical architecture and migration"
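
The connection arithmetic in the scenario above, and the check that pooled server-side connections stay under max_connections, can be sketched as follows. The client counts come from the scenario; the per-application pool sizes in `pools` are hypothetical placeholders chosen only to illustrate the check, not recommended values.

```python
MAX_CONNECTIONS = 200  # from the scenario

# (low, high) direct connections each client group can open without a pooler.
clients = {
    "app_servers": (20 * 10, 20 * 10),
    "lambda": (50 * 1, 50 * 5),
    "workers": (5 * 5, 5 * 5),
    "analytics": (3 * 10, 3 * 10),
    "admin": (1 * 5, 1 * 5),
}
low = sum(lo for lo, _ in clients.values())
high = sum(hi for _, hi in clients.values())
print(f"direct connections: {low}-{high} against a limit of {MAX_CONNECTIONS}")

# Hypothetical server-side pool sizes behind the pooler (assumption).
pools = {"app_servers": 60, "lambda": 30, "workers": 15, "analytics": 20, "admin": 5}
server_side = sum(pools.values())
headroom = MAX_CONNECTIONS - server_side
print(f"pooled server connections: {server_side}, headroom: {headroom}")
```

Keeping some headroom leaves slots for superuser access, replication, and maintenance sessions.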

package/courses/postgresql-query-optimization/scenarios/level-3/full-text-search-optimization.yaml ADDED

@@ -0,0 +1,66 @@
+meta:
+  id: full-text-search-optimization
+  level: 3
+  course: postgresql-query-optimization
+  type: output
+  description: "Optimize full-text search — build fast, relevant search using tsvector, GIN indexes, and ranking functions"
+  tags: [PostgreSQL, full-text-search, tsvector, GIN, ranking, advanced]
+state: {}
+trigger: |
+  Your knowledge base has 5M articles and users complain that
+  search is "slow and returns irrelevant results." Current
+  implementation uses LIKE queries:
+  SELECT * FROM articles
+  WHERE title ILIKE '%postgresql performance%'
+  OR body ILIKE '%postgresql performance%'
+  ORDER BY created_at DESC LIMIT 20;
+  Problems:
+  1. ILIKE with leading wildcard: Seq Scan on 5M rows, 8 seconds
+  2. No relevance ranking: Results sorted by date, not relevance
+  3. No stemming: Searching "running" doesn't find "run" or "ran"
+  4. No phrase matching: "postgresql performance" matches articles
+     where "postgresql" and "performance" are in different paragraphs
+  5. No highlighting: Users can't see why a result matched
+  Requirements:
+  - Search must be < 100ms
+  - Results ranked by relevance (not just date)
+  - Support for phrases, stemming, and Boolean operators
+  - Highlight matching text in results
+  - Support for multiple languages (English, Spanish, German)
+  - Boosted ranking for title matches vs body matches
+  Articles table:
+  CREATE TABLE articles (
+    id SERIAL PRIMARY KEY,
+    title TEXT NOT NULL,
+    body TEXT NOT NULL,
+    language VARCHAR(10) DEFAULT 'english',
+    category VARCHAR(50),
+    published_at TIMESTAMP,
+    author_id INTEGER
+  );
+  Task: Migrate from ILIKE to full-text search. Write: the tsvector
+  column and GIN index setup, the search query with tsquery and
+  ranking, the relevance tuning (title boost, recency factor), the
+  multi-language support approach, and the maintenance plan
+  (index updates, dictionary management).
+assertions:
+  - type: llm_judge
+    criteria: "FTS implementation is correct — creates a generated tsvector column combining title (weight A) and body (weight B/C), GIN index on the tsvector column, search uses plainto_tsquery or websearch_to_tsquery, and ts_rank or ts_rank_cd for relevance scoring"
+    weight: 0.35
+    description: "Correct FTS implementation"
+  - type: llm_judge
+    criteria: "Relevance tuning addresses the requirements — title matches are boosted (using tsvector weights A/B/C/D), recency can factor into ranking (combined score of ts_rank + recency bonus), phrase matching uses phraseto_tsquery, and ts_headline provides result highlighting"
+    weight: 0.35
+    description: "Relevance tuning"
+  - type: llm_judge
+    criteria: "Multi-language and maintenance are addressed — uses language-specific text search configurations (english, spanish, german) per article, handles the GIN index maintenance (fastupdate setting, periodic reindex), and performance is estimated to be <100ms for the 5M article dataset with proper indexing"
+    weight: 0.30
+    description: "Multi-language and maintenance"

package/courses/postgresql-query-optimization/scenarios/level-3/jsonb-optimization.yaml ADDED

@@ -0,0 +1,88 @@
+meta:
+  id: jsonb-optimization
+  level: 3
+  course: postgresql-query-optimization
+  type: output
+  description: "Optimize JSONB queries — index and query JSON documents efficiently without sacrificing flexibility"
+  tags: [PostgreSQL, JSONB, JSON, indexing, semi-structured, advanced]
+state: {}
+trigger: |
+  Your SaaS platform stores product configurations in a JSONB column.
+  The schema-less flexibility was great for development, but now
+  queries are slow at scale (10M rows).
+  Table:
+  CREATE TABLE products (
+    id SERIAL PRIMARY KEY,
+    tenant_id INTEGER NOT NULL,
+    name TEXT NOT NULL,
+    config JSONB NOT NULL
+  );
+  Sample config JSONB:
+  {
+    "category": "electronics",
+    "brand": "Samsung",
+    "price": 999.99,
+    "specs": {
+      "weight": 0.5,
+      "dimensions": { "width": 15, "height": 7, "depth": 0.8 },
+      "colors": ["black", "white", "blue"]
+    },
+    "tags": ["premium", "flagship", "5G"],
+    "reviews_count": 2500,
+    "in_stock": true
+  }
+  Slow queries:
+  Q1: Filter by top-level key:
+  WHERE config->>'brand' = 'Samsung'
+  (Returns 500K rows, Seq Scan on 10M)
+  Q2: Filter by nested key:
+  WHERE config->'specs'->>'weight' < '1.0'
+  (Text comparison, not numeric! As text, '10.0' < '9.5' = true)
+  Q3: Array containment:
+  WHERE config->'tags' @> '"premium"'
+  (Check if tags array contains 'premium')
+  Q4: Existence check:
+  WHERE config ? 'warranty'
+  (Does the key 'warranty' exist?)
+  Q5: Complex filter:
+  WHERE config @> '{"brand": "Samsung", "in_stock": true}'
+  AND (config->>'price')::numeric BETWEEN 500 AND 1000
+  Q6: Aggregation on JSONB:
+  SELECT config->>'category', AVG((config->>'price')::numeric)
+  FROM products GROUP BY config->>'category';
+  Performance issues:
+  - Q2 compares strings, not numbers (wrong results)
+  - Full GIN index is 2GB for 10M rows
+  - Some queries can't use GIN (numeric range)
+  - TOAST storage for large JSONB (>2KB) causes slowdown
+  Task: Optimize all 6 queries. For each, write: the correct query
+  syntax, the optimal index (GIN, expression, or partial), and the
+  expected performance. Then discuss when to extract JSONB fields
+  into regular columns vs keeping them in JSONB.
+assertions:
+  - type: llm_judge
+    criteria: "All 6 queries are optimized with correct indexing — GIN with jsonb_ops or jsonb_path_ops for containment queries (Q3, Q4, Q5), expression indexes for specific field queries (Q1: (config->>'brand'), Q2: ((config->'specs'->>'weight')::numeric)), and explains why GIN can't help numeric range queries"
+    weight: 0.35
+    description: "Correct JSONB indexing"
+  - type: llm_judge
+    criteria: "Q2 type comparison bug is fixed — the text comparison issue is identified (string ordering vs numeric ordering), the fix uses explicit casting ((config->'specs'->>'weight')::numeric), and the expression index uses the cast to support the corrected query"
+    weight: 0.35
+    description: "Type comparison bug fixed"
+  - type: llm_judge
+    criteria: "Column extraction guidance is practical — recommends extracting frequently-queried, well-defined fields (brand, price, category) into regular columns with proper types, keeping flexible/rare fields in JSONB, and using generated columns for the hybrid approach. Discusses TOAST storage impact"
+    weight: 0.30
+    description: "Practical extraction guidance"
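
The Q2 bug in the scenario above comes from comparing text instead of numbers. A minimal sketch of the difference, using Python string comparison as a stand-in (an assumption: Python compares by code point, while PostgreSQL text comparison depends on the collation, though digit strings order the same way in common collations). The sample weights and the 9.5 threshold are hypothetical values picked to make the divergence visible.

```python
from decimal import Decimal

# Hypothetical weight values as they would come back from config->'specs'->>'weight'.
weights = ["0.5", "9.5", "10.0", "100"]
threshold = "9.5"

# Text comparison: character by character, so "10.0" and "100" sort before "9.5".
as_text = [w for w in weights if w < threshold]

# Numeric comparison: what a ::numeric cast gives you.
as_number = [w for w in weights if Decimal(w) < Decimal(threshold)]

print("text:   ", as_text)
print("numeric:", as_number)
```

The text filter wrongly admits the two heaviest items, which is why the corrected query and its expression index both need the cast.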

package/courses/postgresql-query-optimization/scenarios/level-3/lock-contention-analysis.yaml ADDED

@@ -0,0 +1,80 @@
+meta:
+  id: lock-contention-analysis
+  level: 3
+  course: postgresql-query-optimization
+  type: output
+  description: "Diagnose lock contention — identify and resolve blocking queries, deadlocks, and lock escalation issues"
+  tags: [PostgreSQL, locks, deadlocks, contention, concurrency, advanced]
+state: {}
+trigger: |
+  Your application is experiencing intermittent slowdowns. Some
+  requests take 30+ seconds when they normally take 50ms. The
+  monitoring shows queries waiting for locks.
+  pg_stat_activity during a slowdown shows:
+  PID 1001: UPDATE orders SET status = 'shipped'
+            WHERE id = 42;
+            wait_event_type: Lock, wait_event: transactionid
+            state: active, waiting: true, duration: 28s
+  PID 1002: UPDATE orders SET status = 'paid'
+            WHERE id = 42;
+            wait_event_type: Lock, wait_event: transactionid
+            state: active, waiting: true, duration: 25s
+  PID 1003: SELECT * FROM orders WHERE id = 42;
+            state: active, waiting: false, duration: 0.1ms
+            (SELECT is not blocked — MVCC!)
+  PID 999:  BEGIN;
+            UPDATE orders SET tracking_number = 'ABC123'
+            WHERE id = 42;
+            -- Transaction started 30 seconds ago, no COMMIT yet
+            state: idle in transaction
+  Diagnosis: PID 999 holds a row lock on orders id=42, PIDs 1001
+  and 1002 are waiting for it. PID 999 is "idle in transaction" —
+  the application opened a transaction, did an UPDATE, then went
+  to do something else (call an external API?) before committing.
+  Additional issues found:
+  Issue 2 — Deadlock:
+  ERROR: deadlock detected
+  Process 2001 waits for ShareLock on transaction 5000;
+  blocked by process 2002.
+  Process 2002 waits for ShareLock on transaction 4999;
+  blocked by process 2001.
+  (Two processes updating the same rows in opposite order)
+  Issue 3 — DDL blocking DML:
+  ALTER TABLE orders ADD COLUMN notes TEXT;
+  This takes an AccessExclusive lock, blocking ALL queries on the
+  orders table for 45 seconds during the ALTER.
+  Issue 4 — VACUUM blocked by long transaction:
+  Autovacuum on orders table can't reclaim dead tuples because
+  PID 3001 has a transaction open for 2 hours (analytics query).
+  Dead tuples: 5M and growing.
+  Task: Resolve all 4 lock issues. For each, explain: the lock type
+  involved, why the contention occurs, the immediate fix, and the
+  preventive measure. Then write the monitoring queries to detect
+  each issue proactively.
+assertions:
+  - type: llm_judge
+    criteria: "All 4 issues are correctly diagnosed — idle-in-transaction holding row locks (fix: idle_in_transaction_session_timeout), deadlock from inconsistent lock ordering (fix: application-level ordering), DDL blocking (fix: lock_timeout + retry, or use CREATE INDEX CONCURRENTLY approach), VACUUM blocked by long transaction (fix: old_snapshot_threshold or statement_timeout for analytics)"
+    weight: 0.35
+    description: "All issues correctly diagnosed"
+  - type: llm_judge
+    criteria: "Lock types are correctly identified — row-level locks (FOR UPDATE, FOR SHARE), table-level locks (AccessExclusive for DDL, RowExclusiveLock for DML), and transaction ID locks. Explains MVCC's role (SELECTs don't block, only writes contend)"
+    weight: 0.35
+    description: "Correct lock type identification"
+  - type: llm_judge
+    criteria: "Monitoring queries are practical — uses pg_stat_activity to find idle-in-transaction sessions, pg_locks to detect blocking chains, log_lock_waits for automatic logging, and includes alerting thresholds (alert when any transaction is idle-in-transaction for > 60 seconds)"
+    weight: 0.30
+    description: "Practical monitoring queries"
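
The deadlock in Issue 2 above arises because two processes take the same locks in opposite order. The sketch below is an in-process analogy using Python threads, not database code: the two `threading.Lock` objects are hypothetical stand-ins for row locks, and the row ids 42 and 43 are illustrative. It shows that sorting the lock targets before acquiring them lets both workers finish even though they list the rows in opposite order.

```python
import threading

# Hypothetical stand-ins for row locks on two order rows.
row_locks = {42: threading.Lock(), 43: threading.Lock()}
completed = []

def update_rows(row_ids, label):
    """Lock rows in ascending id order, whatever order the caller listed them in."""
    ordered = sorted(row_ids)
    for row_id in ordered:
        row_locks[row_id].acquire()
    try:
        completed.append(label)  # the "update" happens while both locks are held
    finally:
        for row_id in reversed(ordered):
            row_locks[row_id].release()

def worker(row_ids, label, rounds=200):
    for _ in range(rounds):
        update_rows(row_ids, label)

a = threading.Thread(target=worker, args=([42, 43], "A"))
b = threading.Thread(target=worker, args=([43, 42], "B"))
a.start(); b.start()
a.join(); b.join()
print(len(completed), "updates completed without deadlock")
```

In PostgreSQL the same idea is usually applied by touching rows in a consistent key order, for example by locking them with `SELECT ... ORDER BY id FOR UPDATE` before updating.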

package/courses/postgresql-query-optimization/scenarios/level-3/materialized-view-optimization.yaml ADDED

@@ -0,0 +1,73 @@
+meta:
+  id: materialized-view-optimization
+  level: 3
+  course: postgresql-query-optimization
+  type: output
+  description: "Optimize with materialized views — design refresh strategies, concurrent refresh, incremental maintenance, and query routing for expensive aggregations"
+  tags: [PostgreSQL, materialized-views, refresh, aggregation, caching, advanced]
+state: {}
+trigger: |
+  Your analytics dashboard queries aggregate data from a 500M-row
+  events table. The dashboard loads in 45 seconds because every page
+  view re-executes expensive aggregations. Your team wants to use
+  materialized views but has concerns about data freshness and refresh
+  cost.
+  The expensive dashboard query (45 seconds):
+  SELECT
+    date_trunc('hour', created_at) AS hour,
+    event_type,
+    COUNT(*) AS event_count,
+    COUNT(DISTINCT user_id) AS unique_users,
+    AVG(EXTRACT(EPOCH FROM duration)) AS avg_duration_secs
+  FROM events
+  WHERE created_at >= NOW() - INTERVAL '7 days'
+  GROUP BY 1, 2
+  ORDER BY 1 DESC, 2;
+  Current state:
+  - events table: 500M rows, 200GB, partitioned by month
+  - 50 new events/second (4.3M/day)
+  - Dashboard accessed by 30 analysts, 200 page views/day
+  - Data freshness requirement: within 15 minutes
+  - Full refresh of a materialized view takes 8 minutes
+  - During refresh, the old data must remain queryable
+  Challenges:
+  1. REFRESH MATERIALIZED VIEW takes an ACCESS EXCLUSIVE lock by default
+     — blocks all reads during the 8-minute refresh
+  2. REFRESH MATERIALIZED VIEW CONCURRENTLY requires a UNIQUE INDEX and
+     is slower (12 minutes) but doesn't block reads
+  3. No built-in incremental refresh — the entire view is recomputed
+  4. Multiple materialized views for different aggregations (hourly,
+     daily, weekly) — refresh coordination is needed
+  5. The view must include the last 15 minutes of data, but refreshing
+     every 15 minutes means the view is always being refreshed
+  Additional materialized view patterns to evaluate:
+  A. Single matview + CONCURRENTLY refresh every 15 minutes
+  B. Two matviews (swap pattern): refresh one while serving from the other
+  C. Matview for historical data + live query for recent 15 minutes (UNION)
+  D. pg_ivm extension for incremental materialized views
+  E. Continuous aggregation (TimescaleDB approach)
+  Task: Design the materialized view strategy. Write: the recommended
+  approach with justification, the refresh scheduling architecture,
+  how to handle the freshness requirement, the indexing strategy for
+  the materialized view, and monitoring for refresh health.
+assertions:
+  - type: llm_judge
+    criteria: "Recommended approach handles the freshness-vs-performance trade-off — the hybrid approach (option C: matview for historical + live query for recent data via UNION ALL) or a swap pattern (option B) is recommended with justification. Explains why simple CONCURRENTLY refresh alone isn't sufficient (12-minute refresh for 15-minute freshness leaves only 3 minutes of fresh data)"
+    weight: 0.35
+    description: "Sound matview strategy"
+  - type: llm_judge
+    criteria: "Refresh architecture is production-ready — includes scheduling (pg_cron or external scheduler), CONCURRENTLY to avoid blocking reads, unique index requirement for concurrent refresh, error handling for failed refreshes, and monitoring (track refresh duration, detect stale views, alert on refresh failures)"
+    weight: 0.35
+    description: "Production-ready refresh architecture"
+  - type: llm_judge
+    criteria: "Indexing and query routing are addressed — materialized view has appropriate indexes for dashboard query patterns, explains how to route queries to use the matview (direct query or view rewriting), and discusses trade-offs of each approach (storage cost, refresh CPU, staleness window)"
+    weight: 0.30
+    description: "Indexing and query routing"
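
The freshness arithmetic behind the scenario above can be sketched with a simple model. It assumes (this is a modeling assumption, not a statement from the source) that a refresh reads data as of the moment it starts, that its result becomes visible when it finishes, and that refreshes start on a fixed interval. Under that model the oldest data a reader can see is one interval plus one refresh duration old. The function names are hypothetical.

```python
def worst_case_staleness(interval_min: float, refresh_min: float) -> float:
    """Age of the data just before the next refresh result becomes visible."""
    return interval_min + refresh_min

def meets_freshness(interval_min: float, refresh_min: float, target_min: float) -> bool:
    # Refreshes must not overlap, and the worst case must stay within the target.
    return (refresh_min <= interval_min
            and worst_case_staleness(interval_min, refresh_min) <= target_min)

# Option A: CONCURRENTLY refresh (12 minutes) every 15 minutes, 15-minute target.
print(worst_case_staleness(15, 12))
print(meets_freshness(15, 12, 15))

# Even the blocking 8-minute refresh on the same schedule misses the target.
print(meets_freshness(15, 8, 15))
```

This is one way to see why a periodic full refresh alone struggles with a 15-minute target and why the hybrid of historical matview plus live recent data (option C) is attractive.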

package/courses/postgresql-query-optimization/scenarios/level-3/parallel-query-execution.yaml ADDED

@@ -0,0 +1,74 @@
+meta:
+  id: parallel-query-execution
+  level: 3
+  course: postgresql-query-optimization
+  type: output
+  description: "Optimize parallel query execution — configure and tune PostgreSQL parallel workers for analytical workloads"
+  tags: [PostgreSQL, parallel-query, workers, analytics, advanced]
+state: {}
+trigger: |
+  Your analytics database has a 64-core server but analytical queries
+  only use 1 core. EXPLAIN shows no parallel workers despite large
+  table scans.
+  Server: 64 cores, 512GB RAM, NVMe storage
+  PostgreSQL 16
+  Database: 2TB across 50 tables
+  Current settings (defaults):
+  max_parallel_workers = 8
+  max_parallel_workers_per_gather = 2
+  max_worker_processes = 8
+  parallel_setup_cost = 1000
+  parallel_tuple_cost = 0.1
+  min_parallel_table_scan_size = 8MB
+  min_parallel_index_scan_size = 512kB
+  Queries that should parallelize but don't:
+  Q1: Full table aggregate (100M rows):
+  SELECT COUNT(*), AVG(amount), SUM(amount)
+  FROM transactions
+  WHERE created_at >= '2026-01-01';
+  EXPLAIN shows: Seq Scan (no Gather node). Why?
+  Q2: GROUP BY with many groups (100M rows):
+  SELECT customer_id, SUM(amount)
+  FROM transactions
+  GROUP BY customer_id;
+  Uses parallel workers but is still slow — partial aggregate
+  + final aggregate creates bottleneck at Gather.
+  Q3: Complex JOIN (50M × 10M):
+  SELECT t.*, c.name
+  FROM transactions t
+  JOIN customers c ON c.id = t.customer_id
+  WHERE t.amount > 1000;
+  Parallel scan on transactions but nested loop JOIN is serial.
+  Q4: Partitioned table scan (500M rows, 60 partitions):
+  SELECT region, COUNT(*)
+  FROM events
+  GROUP BY region;
+  Should scan partitions in parallel but scans sequentially.
+  Task: Tune parallel query settings for this 64-core server. For
+  each query, diagnose why parallelism isn't working and fix it.
+  Write the optimal configuration, the expected speedup, and the
+  limitations of parallel queries in PostgreSQL.
+assertions:
+  - type: llm_judge
+    criteria: "Configuration is correct for 64-core server — max_parallel_workers increased to 32+ (not 64, need cores for other work), max_parallel_workers_per_gather increased to 8-16, parallel_setup_cost reduced for more aggressive parallelization, and explains why max_worker_processes must be >= max_parallel_workers"
+    weight: 0.35
+    description: "Correct parallel configuration"
+  - type: llm_judge
+    criteria: "Each query's parallelism issue is diagnosed — Q1 may not parallelize due to cost estimate (reduce parallel_setup_cost), Q2 has Gather bottleneck (increase max_parallel_workers_per_gather), Q3 needs parallel-aware JOIN (hash join can parallelize, nested loop may not), Q4 needs enable_parallel_append for partitioned tables"
+    weight: 0.35
+    description: "Per-query diagnosis"
+  - type: llm_judge
+    criteria: "Limitations are honestly stated — not all operations parallelize (correlated subqueries, some functions, writes), Gather node is a bottleneck for result assembly, parallel workers consume memory (work_mem per worker), and adding workers has diminishing returns (Amdahl's law)"
+    weight: 0.30
+    description: "Honest limitations"
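
The diminishing returns mentioned in the limitations criterion above follow from Amdahl's law. A minimal sketch, assuming a hypothetical query in which 90% of the work parallelizes; the 0.9 fraction is an illustrative assumption, not a measured value, and `processes` here counts every participating process.

```python
def amdahl_speedup(parallel_fraction: float, processes: int) -> float:
    """Amdahl's law: the serial part caps the speedup however many processes run."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processes)

PARALLEL_FRACTION = 0.9  # hypothetical share of the query that parallelizes

for n in (2, 8, 16, 64):
    print(f"{n:>2} processes: {amdahl_speedup(PARALLEL_FRACTION, n):.2f}x")

# Upper bound as the process count grows without limit.
print(f"limit: {1.0 / (1.0 - PARALLEL_FRACTION):.1f}x")
```

Going from 16 to 64 processes adds far less than going from 2 to 8, which is one reason to cap max_parallel_workers_per_gather well below the core count.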

package/courses/postgresql-query-optimization/scenarios/level-3/partitioning-strategies.yaml ADDED

@@ -0,0 +1,71 @@
+meta:
+  id: partitioning-strategies
+  level: 3
+  course: postgresql-query-optimization
+  type: output
+  description: "Design partitioning strategies — partition large tables for query performance, maintenance efficiency, and data lifecycle management"
+  tags: [PostgreSQL, partitioning, range, list, hash, advanced]
+state: {}
+trigger: |
+  Your IoT platform stores sensor data in a single events table that
+  has grown to 5 billion rows (2TB). Performance has degraded to the
+  point where even simple queries take minutes.
+  Table:
+  CREATE TABLE events (
+    id BIGSERIAL,
+    device_id INTEGER NOT NULL,
+    event_type VARCHAR(50) NOT NULL,
+    payload JSONB NOT NULL,
+    region VARCHAR(10) NOT NULL,
+    created_at TIMESTAMP NOT NULL
+  );
+  Current indexes: (device_id, created_at), (event_type, created_at)
+  Index sizes: 800GB total (larger than the data!)
+  Query patterns:
+  Q1 (80% of queries): Recent events for a device
+      WHERE device_id = ? AND created_at >= NOW() - INTERVAL '7 days'
+  Q2 (10%): Regional aggregation
+      WHERE region = ? AND created_at BETWEEN ? AND ?
+  Q3 (5%): Event type analysis
+      WHERE event_type = ? AND created_at >= ?
+  Q4 (5%): Data archival
+      DELETE FROM events WHERE created_at < NOW() - INTERVAL '2 years'
+  Partitioning options to evaluate:
+  A. Range partition by created_at (monthly)
+  B. List partition by region (5 regions)
+  C. Hash partition by device_id (64 partitions)
+  D. Composite: Range by created_at, then List by region
+  E. Composite: Range by created_at, then Hash by device_id
+  Constraints:
+  - Q1 must be < 100ms
+  - Monthly archival must be < 1 minute (currently takes 6 hours)
+  - No more than 500 total partitions (planner overhead)
+  - Must migrate from monolithic table with < 1 hour downtime
+  Task: Evaluate all 5 partitioning options. Recommend the best one
+  with justification. Write: the CREATE TABLE DDL, the partition
+  creation automation (pg_partman or custom), the migration plan from
+  the monolithic table, the impact on each query pattern, and the
+  ongoing maintenance (partition creation, archival, VACUUM per
+  partition).
+assertions:
+  - type: llm_judge
+    criteria: "Partitioning evaluation analyzes all 5 options — range by date enables fast archival (DETACH PARTITION followed by DROP TABLE) and prunes for time-based queries, hash by device_id distributes load evenly, composite enables multi-dimensional pruning. The chosen strategy balances all 4 query patterns, not just Q1"
+    weight: 0.35
+    description: "Thorough option evaluation"
+  - type: llm_judge
+    criteria: "Migration plan minimizes downtime — uses pg_partman or logical replication to migrate data, creates partitions in advance, uses concurrent index builds on partitions, and handles the transition from monolithic to partitioned with <1 hour downtime"
+    weight: 0.35
+    description: "Low-downtime migration plan"
+  - type: llm_judge
+    criteria: "Ongoing maintenance is automated — partition creation for future months/weeks, old partition archival via DROP (instead of DELETE), per-partition VACUUM (much faster than table-wide), and monitoring for partition imbalance or missing partitions"
+    weight: 0.30
+    description: "Automated ongoing maintenance"
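
The 500-partition constraint in the scenario above can be checked with quick arithmetic. The sketch assumes 24 monthly partitions (from the two-year archival window) and counts leaf partitions only; pre-created future partitions would add to these totals. The dictionary labels are shorthand for options A through E.

```python
RETENTION_MONTHS = 24   # two-year archival window from Q4
MAX_PARTITIONS = 500    # constraint from the scenario
REGIONS = 5
HASH_PARTITIONS = 64

leaf_counts = {
    "A: range by month": RETENTION_MONTHS,
    "B: list by region": REGIONS,
    "C: hash by device_id": HASH_PARTITIONS,
    "D: range by month, then list by region": RETENTION_MONTHS * REGIONS,
    "E: range by month, then hash by device_id": RETENTION_MONTHS * HASH_PARTITIONS,
}

for option, count in leaf_counts.items():
    verdict = "ok" if count <= MAX_PARTITIONS else "over the limit"
    print(f"{option}: {count} ({verdict})")

# Largest hash fan-out per month that keeps option E within the limit.
max_hash_per_month = MAX_PARTITIONS // RETENTION_MONTHS
print("max hash partitions per month:", max_hash_per_month)
```

With 64 hash buckets per month, option E exceeds the limit, so a composite design would need a much smaller hash fan-out or coarser time ranges.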