npm - dojo.md - Versions diffs - 0.1.0 → 0.2.0 - Mend

dojo.md 0.1.0 → 0.2.0

Files changed (243)

package/courses/rest-api-error-handling/scenarios/level-2/rfc7807-problem-details.yaml ADDED

@@ -0,0 +1,60 @@
+meta:
+  id: rfc7807-problem-details
+  level: 2
+  course: rest-api-error-handling
+  type: output
+  description: "Implement RFC 7807 Problem Details — adopt the standard error format for consistent, machine-readable API error responses"
+  tags: [REST, API, RFC-7807, problem-details, standard, intermediate]
+state: {}
+trigger: |
+  Your company has 12 microservices, each with its own error format.
+  The API gateway team is frustrated because they can't build unified
+  error handling. The CTO mandates adopting RFC 7807 (Problem Details
+  for HTTP APIs) as the company standard.
+  Current service error formats:
+  - User service: { "error": { "code": "USR001", "msg": "..." } }
+  - Order service: { "status": "error", "message": "..." }
+  - Payment service: { "errors": [{ "field": "...", "message": "..." }] }
+  - Inventory service: { "fault": { "type": "...", "detail": "..." } }
+  - Notification service: plain text error messages
+  You need to migrate all services to RFC 7807. The standard defines:
+  - type (URI): identifies the error type
+  - title: short human-readable summary
+  - status: HTTP status code
+  - detail: human-readable explanation specific to this occurrence
+  - instance (URI): identifies this specific occurrence
+  Challenges to address:
+  1. How to define the "type" URI — what namespace? Public or internal?
+  2. How to handle validation errors (multiple fields) within RFC 7807
+  3. How to extend the standard with custom fields (error codes,
+     timestamps, retry-after)
+  4. How the API gateway should aggregate errors from multiple services
+  5. How to version error types when error semantics change
+  6. Backward compatibility — how existing API consumers handle the
+     format change
+  Task: Design the RFC 7807 implementation for the company. Write:
+  the error type URI scheme, example Problem Details responses for
+  5 common error scenarios (validation, not found, auth, rate limit,
+  server error), the extension strategy for custom fields, the
+  migration plan for the 5 services, and the API gateway error
+  aggregation approach.
+assertions:
+  - type: llm_judge
+    criteria: "RFC 7807 implementation is technically correct: the standard members (type, title, status, detail) are present, type URIs are well-designed (resolvable or documented namespace), instance URIs uniquely identify error occurrences, and the Content-Type is application/problem+json"
+    weight: 0.35
+    description: "Technically correct RFC 7807"
+  - type: llm_judge
+    criteria: "Extension strategy is practical — handles validation errors with an 'errors' array extension, adds useful custom fields (error_code, timestamp, request_id) without conflicting with standard fields, and the type URI scheme supports versioning and categorization"
+    weight: 0.35
+    description: "Practical extension strategy"
+  - type: llm_judge
+    criteria: "Migration plan is realistic — phases the 5 services by complexity, addresses backward compatibility (content negotiation, dual-format period), handles API gateway aggregation of errors from services at different migration stages, and includes client SDK updates"
+    weight: 0.30
+    description: "Realistic migration plan"
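
As an illustration of the kind of answer this scenario asks for, here is a minimal Python sketch of a Problem Details builder with extension members. The `TYPE_BASE` namespace, the `problem` helper, and the extension names (`errors`, `error_code`, `request_id`) are assumptions made for the example, not part of the package or of RFC 7807 itself.

```python
import json
from typing import Any

# Hypothetical namespace for error type URIs; the scenario leaves this choice open.
TYPE_BASE = "https://errors.example.com/v1"

STANDARD_MEMBERS = {"type", "title", "status", "detail", "instance"}


def problem(slug: str, title: str, status: int, detail: str,
            instance: str, **extensions: Any) -> dict[str, Any]:
    """Build a problem object; extra keyword arguments become extension members."""
    clash = STANDARD_MEMBERS.intersection(extensions)
    if clash:
        raise ValueError(f"extensions shadow standard members: {sorted(clash)}")
    body: dict[str, Any] = {
        "type": f"{TYPE_BASE}/{slug}",
        "title": title,
        "status": status,
        "detail": detail,
        "instance": instance,
    }
    body.update(extensions)
    return body


# Validation failure with several fields, carried in an "errors" extension array.
validation = problem(
    "validation-error", "Request validation failed", 422,
    "2 fields failed validation.", "/orders/requests/7f3a",
    errors=[
        {"field": "email", "message": "must be a valid address"},
        {"field": "quantity", "message": "must be at least 1"},
    ],
    error_code="ORD-VALIDATION",
    request_id="7f3a",
)
print(json.dumps(validation, indent=2))
```

A response carrying this body would be served with `Content-Type: application/problem+json`. Putting a version segment in the type URI (`/v1/` here) is one possible way to handle challenge 5.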

package/courses/rest-api-error-handling/scenarios/level-2/webhook-error-handling.yaml ADDED

@@ -0,0 +1,55 @@
+meta:
+  id: webhook-error-handling
+  level: 2
+  course: rest-api-error-handling
+  type: output
+  description: "Handle webhook delivery errors — design retry, dead letter queues, and error reporting for outbound webhook systems"
+  tags: [REST, API, webhooks, retry, delivery, dead-letter, intermediate]
+state: {}
+trigger: |
+  Your SaaS platform sends webhooks to customer endpoints when events
+  occur (order created, payment received, subscription changed). The
+  webhook system is unreliable and customers are losing data.
+  Current problems:
+  1. If a customer's endpoint returns 500, you retry 3 times with no
+     delay and then give up permanently — events are lost
+  2. If a customer's endpoint is slow (>5s), you timeout and retry,
+     causing duplicate deliveries
+  3. Customer's endpoint returns 301 redirect — you don't follow it
+     and mark the delivery as failed
+  4. Customer's SSL certificate expired — you fail silently, no
+     notification to the customer
+  5. Customer's endpoint returns 200 but the response body says
+     { "error": "processing failed" } — you mark it as delivered
+  6. Events arrive out of order (order.paid before order.created)
+     because retries for the earlier event haven't completed
+  7. A customer's endpoint is down for 3 days — you've queued 50,000
+     events and when they come back online, you flood them
+  Customer complaints:
+  - "We missed 200 payment events last month and our accounting is off"
+  - "We get duplicate events and our system processes them twice"
+  - "We had no idea our endpoint was failing until we lost data"
+  Task: Redesign the webhook error handling system. Write: the retry
+  policy (backoff schedule, max attempts, retry-after logic), the
+  dead letter queue design, the customer notification system for
+  delivery failures, the duplicate prevention mechanism, and the
+  endpoint health monitoring and automatic disable logic.
+assertions:
+  - type: llm_judge
+    criteria: "Retry policy is robust — uses exponential backoff with jitter (not fixed intervals), retries over hours/days (not seconds), has a maximum retry window (e.g., 72 hours), and handles different failure types appropriately (4xx vs 5xx vs timeout vs DNS failure)"
+    weight: 0.35
+    description: "Robust retry policy"
+  - type: llm_judge
+    criteria: "Dead letter queue and notification system prevents data loss — failed events after max retries go to a DLQ that customers can inspect and replay, customers are notified of delivery failures (email/dashboard), and the system tracks delivery status with full history per event"
+    weight: 0.35
+    description: "Data loss prevention"
+  - type: llm_judge
+    criteria: "Addresses all 7 problems — duplicate prevention (idempotency keys in webhook payloads), ordering guarantees or handling (sequence numbers), flood protection (rate limiting replays), SSL/redirect handling, response body validation, and automatic endpoint disabling with re-enable flow"
+    weight: 0.30
+    description: "All problems addressed"
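
The retry policy named in the first assertion could be sketched as below. This is an assumption-laden illustration: the 30-second base, 6-hour cap, and the `should_retry` classification are example choices, and redirects and response-body checks from the scenario are left out.

```python
import random
from typing import Optional


def should_retry(status: Optional[int]) -> bool:
    """Decide whether a delivery attempt is worth retrying.

    `status` is None when no HTTP response arrived (timeout, connection failure).
    """
    if status is None:
        return True
    if status in (408, 429):  # transient client-side statuses
        return True
    return 500 <= status <= 599  # other 4xx are treated as permanent


def backoff_delay(attempt: int, base: float = 30.0, cap: float = 6 * 3600.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay in seconds before retry number `attempt` (1-based)."""
    rng = rng or random.Random()
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return rng.uniform(0, ceiling)


# Print the upper bound of the delay for the first few retries.
for attempt in range(1, 6):
    print(attempt, min(6 * 3600.0, 30.0 * 2 ** (attempt - 1)))
```

Drawing the delay uniformly between zero and the exponential ceiling spreads retries from many failed deliveries over time, which also softens the flood described in problem 7.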

package/courses/rest-api-error-handling/scenarios/level-3/advanced-error-shift.yaml ADDED

@@ -0,0 +1,72 @@
+meta:
+  id: advanced-error-shift
+  level: 3
+  course: rest-api-error-handling
+  type: output
+  description: "Advanced error handling shift — manage a distributed system failure with data consistency implications"
+  tags: [REST, API, error-handling, shift-simulation, distributed, advanced]
+state: {}
+trigger: |
+  You're the principal engineer on-call for a financial services
+  platform. At 11:45 PM on a Friday (month-end processing), you
+  receive a cascade of alerts.
+  The system:
+  - 30 microservices processing $50M in daily transactions
+  - PostgreSQL primary with 4 read replicas
+  - Redis cluster (6 nodes) for caching and distributed locking
+  - RabbitMQ for async event processing
+  - API gateway serving 500 requests/second
+  11:45 PM — Alert: Redis cluster node 3 failed
+  - Redis cluster reshards, 15-second disruption
+  - Distributed locks are lost during reshard
+  - Two instances of the payment processor run simultaneously
+    (lock was protecting against this)
+  11:46 PM — Alert: Duplicate transactions detected
+  - 23 payments processed twice (total $47,000 in duplicates)
+  - Some customers charged twice, some merchants paid twice
+  - The idempotency keys were stored in the Redis node that failed
+  11:48 PM — Alert: Read replica lag at 45 seconds
+  - Month-end batch job is hammering the primary
+  - API reads from replicas are returning stale data
+  - Account balance checks are approving payments they shouldn't
+    (stale balance data)
+  11:50 PM — Alert: RabbitMQ dead letter queue growing
+  - 5,000 events in DLQ — all "balance update" events
+  - These events normally update cached balances in Redis
+  - But Redis was resharding, so events failed delivery
+  - Now balances in Redis don't match balances in PostgreSQL
+  11:55 PM — Business impact:
+  - 23 duplicate transactions ($47K)
+  - Unknown number of over-approved payments (stale balances)
+  - Event backlog growing (15,000 now in DLQ)
+  - Month-end reconciliation report will be wrong
+  - Customer-facing error rate: 8% and climbing
+  Task: Navigate this crisis from the error handling perspective.
+  Write: the triage prioritization, the immediate containment
+  actions, the data consistency recovery plan (duplicate transactions,
+  stale balances, DLQ replay), the API error responses during
+  recovery (what clients see), and the post-incident architectural
+  changes to prevent this cascade.
+assertions:
+  - type: llm_judge
+    criteria: "Triage is prioritized correctly — stops the bleeding first (halt new payments to prevent more duplicates), then contains (identify all affected transactions), then recovers (resolve duplicates, replay DLQ, reconcile balances). Financial integrity takes precedence over availability"
+    weight: 0.35
+    description: "Correct triage prioritization"
+  - type: llm_judge
+    criteria: "Data consistency recovery is thorough — identifies all 23 duplicates for reversal, detects over-approved payments from stale balances, plans DLQ replay with idempotency checks, and includes a full reconciliation process to verify PostgreSQL and Redis are consistent before resuming"
+    weight: 0.35
+    description: "Thorough data consistency recovery"
+  - type: llm_judge
+    criteria: "Post-incident changes prevent cascade — moves idempotency keys to PostgreSQL (not Redis), implements read-your-writes consistency for balance checks, adds Redis cluster monitoring with automatic payment pause, and designs the DLQ replay to handle out-of-order processing safely"
+    weight: 0.30
+    description: "Cascade prevention changes"
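
The third assertion suggests keeping idempotency keys in the durable database rather than in Redis. A minimal sketch of that idea follows, using the standard-library `sqlite3` module as a stand-in for PostgreSQL; the table layout and the `record_payment` helper are hypothetical.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store; a real system would use the primary database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments ("
        " idempotency_key TEXT PRIMARY KEY,"
        " amount_cents INTEGER NOT NULL)"
    )
    return conn


def record_payment(conn: sqlite3.Connection, key: str, amount_cents: int) -> bool:
    """Return True if this call recorded the payment, False if the key was already used."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO payments VALUES (?, ?)", (key, amount_cents))
        return True
    except sqlite3.IntegrityError:
        # The primary-key constraint rejects a second attempt with the same key.
        return False


conn = open_store()
first = record_payment(conn, "pay-001", 2_500)
second = record_payment(conn, "pay-001", 2_500)  # duplicate attempt, rejected
print(first, second)
```

Because the uniqueness check lives in the same transactional store as the payment record, losing a cache node cannot silently remove it.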

package/courses/rest-api-error-handling/scenarios/level-3/api-gateway-errors.yaml ADDED

@@ -0,0 +1,71 @@
+meta:
+  id: api-gateway-errors
+  level: 3
+  course: rest-api-error-handling
+  type: output
+  description: "Design API gateway error handling — manage errors at the gateway layer for routing, transformation, and aggregation"
+  tags: [REST, API, gateway, routing, transformation, aggregation, advanced]
+state: {}
+trigger: |
+  You're architecting the error handling for an API gateway that
+  fronts 40 microservices. The gateway handles authentication, rate
+  limiting, routing, request/response transformation, and response
+  aggregation.
+  Current problems:
+  1. Gateway errors vs backend errors are indistinguishable:
+     When the gateway itself fails (routing error, transform error),
+     it returns the same format as backend errors. Clients can't tell
+     if their request never reached the backend or if the backend
+     failed.
+  2. Error format translation:
+     - Service A returns RFC 7807
+     - Service B returns { "error": "...", "code": 123 }
+     - Service C returns XML errors
+     - Service D returns plain text
+     Gateway currently passes through whatever the backend returns.
+  3. Aggregated endpoint errors:
+     GET /dashboard calls 5 services. If 2 fail:
+     - Currently returns 500 (entire dashboard fails)
+     - Wanted: partial response with errors for failed sections
+  4. Authentication at gateway vs service level:
+     Gateway validates JWT, service validates permissions. If the
+     service returns 403, should the gateway override, pass through,
+     or add context?
+  5. Timeout ambiguity:
+     The gateway timeout is 30s, but Service A has an internal timeout
+     of 10s. When Service A times out and returns 504, the gateway
+     passes it on. Clients can't tell whether the gateway or the
+     service timed out, and the two cases need different handling.
+  6. Circuit breaker at gateway level:
+     When a service is down, the gateway's circuit breaker returns
+     503 — but the response lacks the backend's normal error format,
+     confusing clients.
+  Task: Design the gateway error handling architecture. Write: the
+  gateway error vs backend error distinction (with different error
+  envelope), the error format normalization layer, the partial
+  failure response format for aggregated endpoints, the auth error
+  handling strategy, and the circuit breaker error responses.
+assertions:
+  - type: llm_judge
+    criteria: "Gateway vs backend errors are clearly distinguished — gateway errors have a different envelope or indicator (e.g., error source field), clients can programmatically determine whether to retry at the same endpoint or if the backend is down, and each error includes enough context for debugging"
+    weight: 0.35
+    description: "Clear gateway vs backend distinction"
+  - type: llm_judge
+    criteria: "Error format normalization handles all 4 service formats — transforms RFC 7807, custom JSON, XML, and plain text into a unified format, preserves original error details in a nested field for debugging, and the aggregated endpoint returns partial success with per-section error details"
+    weight: 0.35
+    description: "Format normalization"
+  - type: llm_judge
+    criteria: "Advanced scenarios are handled — auth error layering (gateway vs service-level), timeout disambiguation, circuit breaker responses match the expected backend format, and the design considers caching error responses to reduce backend load during outages"
+    weight: 0.30
+    description: "Advanced scenario handling"
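
One way to picture the normalization layer from this scenario is a function that maps each backend format into a single envelope with a `source` field and the original body preserved. The envelope shape, the function names, and the assumed `<message>` element in Service C's XML are illustrative guesses; the scenario does not specify them.

```python
import json
import xml.etree.ElementTree as ET
from typing import Any, Optional


def normalize_error(status: int, content_type: str, body: str,
                    service: str) -> dict[str, Any]:
    """Convert a backend error body into one envelope, keeping the original."""
    envelope: dict[str, Any] = {
        "source": "backend",
        "service": service,
        "status": status,
        "title": "Upstream error",
        "detail": None,
        "original": {"content_type": content_type, "body": body},
    }
    ctype = content_type.split(";")[0].strip().lower()
    if ctype in ("application/problem+json", "application/json"):
        try:
            data = json.loads(body)
        except json.JSONDecodeError:
            data = None
        if isinstance(data, dict):
            if ctype == "application/problem+json":
                envelope["title"] = data.get("title", envelope["title"])
                envelope["detail"] = data.get("detail")
            else:
                # Service B's shape from the scenario: {"error": "...", "code": 123}
                envelope["detail"] = data.get("error")
    elif ctype in ("application/xml", "text/xml"):
        try:
            # Assumes a <message> child element under the root.
            envelope["detail"] = ET.fromstring(body).findtext("message")
        except ET.ParseError:
            pass
    elif ctype == "text/plain":
        envelope["detail"] = body.strip() or None
    return envelope


def gateway_error(status: int, title: str, detail: Optional[str]) -> dict[str, Any]:
    """Envelope for failures that happened in the gateway itself."""
    return {"source": "gateway", "service": None, "status": status,
            "title": title, "detail": detail, "original": None}


print(normalize_error(500, "application/json", '{"error": "boom", "code": 123}', "b"))
print(gateway_error(503, "Circuit open", "Service orders is unavailable"))
```

With a `source` field on every envelope, a client can branch on whether the request reached a backend at all.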

package/courses/rest-api-error-handling/scenarios/level-3/async-api-errors.yaml ADDED

@@ -0,0 +1,67 @@
+meta:
+  id: async-api-errors
+  level: 3
+  course: rest-api-error-handling
+  type: output
+  description: "Handle async API errors — design error handling for long-running operations, polling, and callback-based APIs"
+  tags: [REST, API, async, long-running, polling, callbacks, advanced]
+state: {}
+trigger: |
+  Your document processing API handles large file uploads that take
+  1-30 minutes to process. You've implemented an async pattern:
+  1. POST /documents → 202 Accepted { "job_id": "abc123" }
+  2. GET /documents/jobs/abc123 → { "status": "processing" }
+  3. GET /documents/jobs/abc123 → { "status": "completed", "result": {...} }
+  But error handling for this async flow is broken:
+  Problem 1 — Immediate validation vs deferred validation:
+  User uploads a 500MB file. Should you validate the file format
+  immediately (synchronous, but adds latency) or defer validation
+  to the background job (fast acceptance, but error arrives later)?
+  Currently you accept everything and fail later.
+  Problem 2 — Job failure notification:
+  When processing fails (corrupted PDF, unsupported format, OCR
+  failure), the job status just shows "failed" with no details.
+  Polling clients have to guess what went wrong.
+  Problem 3 — Partial failure:
+  A 100-page document: 95 pages processed successfully, 5 pages had
+  OCR errors. Is this a success or failure? Currently it's marked
+  as "failed" and the 95 good pages are thrown away.
+  Problem 4 — Timeout and abandonment:
+  Job has been "processing" for 2 hours (expected max: 30 min).
+  Is it stuck? Dead? Still running? No way to tell.
+  Problem 5 — Callback errors:
+  Client registered a callback URL. The callback fails (404, timeout).
+  The result is ready but the client doesn't know. No retry.
+  Problem 6 — Concurrent operations:
+  Two jobs on the same document submitted accidentally. They race
+  and produce conflicting results.
+  Task: Design the async error handling system. Write: the sync vs
+  async validation strategy, the job failure response format (with
+  error details, partial results, progress), the timeout detection
+  and recovery mechanism, the callback retry and dead-letter system,
+  and the concurrency control approach.
+assertions:
+  - type: llm_judge
+    criteria: "Validation strategy balances immediacy and thoroughness — validates file size, format header, and auth synchronously (fail fast on obvious issues), defers content validation to async processing (OCR quality, page parsing), and communicates which validations are immediate vs deferred"
+    weight: 0.35
+    description: "Balanced validation strategy"
+  - type: llm_judge
+    criteria: "Job failure format is detailed and handles partial success — includes error type, affected items (which pages failed), partial results (the 95 good pages), progress percentage, and distinguishes between retryable failures (temporary resource issues) and permanent failures (unsupported format)"
+    weight: 0.35
+    description: "Detailed job failure format"
+  - type: llm_judge
+    criteria: "Timeout, callback, and concurrency issues are all addressed — stuck job detection with heartbeats or deadlines, callback retry with exponential backoff and dead letter queue, and duplicate job prevention (idempotency key or resource locking)"
+    weight: 0.30
+    description: "All async issues addressed"
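
A job status payload that covers partial success and stuck jobs might be derived as in the sketch below. The `Job` fields, the status names (`stalled`, `partially_completed`), and the 5-minute heartbeat timeout are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any


@dataclass
class Job:
    job_id: str
    total_pages: int
    last_heartbeat: datetime
    succeeded_pages: list[int] = field(default_factory=list)
    failed_pages: dict[int, str] = field(default_factory=dict)  # page -> error code


def job_status(job: Job, now: datetime,
               heartbeat_timeout: timedelta = timedelta(minutes=5)) -> dict[str, Any]:
    """Derive the polling response from job state instead of a bare 'failed'."""
    done = len(job.succeeded_pages) + len(job.failed_pages)
    if done < job.total_pages:
        # No heartbeat within the timeout means the worker is presumed stuck.
        stalled = now - job.last_heartbeat > heartbeat_timeout
        status = "stalled" if stalled else "processing"
    elif not job.failed_pages:
        status = "completed"
    elif job.succeeded_pages:
        status = "partially_completed"
    else:
        status = "failed"
    progress = round(100 * done / job.total_pages) if job.total_pages else 100
    return {
        "job_id": job.job_id,
        "status": status,
        "progress": progress,
        "succeeded_pages": len(job.succeeded_pages),
        "errors": [{"page": p, "code": c} for p, c in sorted(job.failed_pages.items())],
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
partial = Job("abc123", 100, now, list(range(1, 96)),
              {p: "OCR_FAILED" for p in range(96, 101)})
print(job_status(partial, now)["status"])
```

In this sketch the 95 good pages stay available to the client, and the per-page error list tells a polling client what went wrong.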

package/courses/rest-api-error-handling/scenarios/level-3/caching-error-scenarios.yaml ADDED

@@ -0,0 +1,65 @@
+meta:
+  id: caching-error-scenarios
+  level: 3
+  course: rest-api-error-handling
+  type: output
+  description: "Handle caching error scenarios — manage stale data, cache failures, and thundering herds in API caching layers"
+  tags: [REST, API, caching, Redis, stale-data, thundering-herd, advanced]
+state: {}
+trigger: |
+  Your API has a Redis caching layer that serves 80% of read traffic.
+  Recent incidents revealed that your caching error handling is
+  inadequate.
+  Incident 1 — Cache stampede (thundering herd):
+  Popular product page cache expired. 10,000 concurrent requests
+  all missed the cache simultaneously and hit the database. Database
+  connection pool exhausted, API returned 503 for 45 seconds.
+  Incident 2 — Stale data served during outage:
+  Redis cluster failed over (30 seconds). During failover, the API
+  fell back to direct database queries. When Redis came back, it
+  served stale cached data (pre-outage prices) for products whose
+  prices were updated during the outage. Customers saw wrong prices.
+  Incident 3 — Negative caching trap:
+  A product was temporarily out of stock. The "out of stock" response
+  was cached for 1 hour. When inventory was restocked 10 minutes
+  later, customers still saw "out of stock" for 50 minutes.
+  Incident 4 — Cache poisoning:
+  A downstream service returned an error (500), which was cached as
+  if it were a valid response. The cached error was served to all
+  users for the TTL duration (15 minutes).
+  Incident 5 — Inconsistent cache state:
+  User updated their profile. The profile cache was updated, but the
+  user list cache still showed the old data. Different endpoints
+  returned different data for the same user.
+  Incident 6 — Cache memory exhaustion:
+  Redis ran out of memory. Eviction policy kicked in and removed
+  frequently-accessed keys. Error rates spiked because "cache misses"
+  on critical data overwhelmed the database.
+  Task: Design the caching error handling strategy. For each of the
+  6 incidents, write: the root cause, the prevention mechanism, the
+  API's behavior when the cache fails (degrade gracefully vs error),
+  and the cache headers the API should return to communicate data
+  freshness to clients.
+assertions:
+  - type: llm_judge
+    criteria: "All 6 incidents have prevention mechanisms — stampede protection (singleflight/lock, stale-while-revalidate), stale data detection (version tracking or invalidation), negative caching with short TTL, error response exclusion from cache, cache invalidation patterns (event-driven), and memory management (eviction policies, monitoring)"
+    weight: 0.35
+    description: "All incidents prevented"
+  - type: llm_judge
+    criteria: "Graceful degradation strategy is clear — defines when to serve stale data vs return an error vs fall back to the database, uses Cache-Control headers to communicate freshness, and handles the Redis failure scenario with circuit breakers and fallback behavior"
+    weight: 0.35
+    description: "Clear degradation strategy"
+  - type: llm_judge
+    criteria: "Cache headers communicate data state — uses Cache-Control, ETag, Age, and custom headers to tell clients if data is fresh, stale, or from a fallback source. The strategy prevents clients from caching error responses and handles CDN caching layers"
+    weight: 0.30
+    description: "Communicative cache headers"
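
The stampede protection mentioned in the first assertion (singleflight/lock) can be sketched in-process as follows. This is a simplified stand-in: a real deployment would coordinate across API instances, and the `SingleFlightCache` class is hypothetical.

```python
import threading
import time
from typing import Any, Callable, Optional


class SingleFlightCache:
    """In-process cache where only one caller per key runs the loader on a miss."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._values: dict[str, tuple[float, Any]] = {}  # key -> (expiry, value)
        self._key_locks: dict[str, threading.Lock] = {}
        self._guard = threading.Lock()

    def _fresh(self, key: str) -> Optional[tuple[float, Any]]:
        entry = self._values.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry
        return None

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have loaded the value while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[1]
            value = loader()  # exceptions propagate and are never cached
            self._values[key] = (time.monotonic() + self._ttl, value)
            return value


cache = SingleFlightCache(ttl_seconds=60)
print(cache.get("product:1", lambda: {"price": 1999}))
```

Because a failing loader raises before anything is stored, this shape also avoids caching error responses as in incident 4.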

package/courses/rest-api-error-handling/scenarios/level-3/chaos-engineering-apis.yaml ADDED

@@ -0,0 +1,62 @@
+meta:
+  id: chaos-engineering-apis
+  level: 3
+  course: rest-api-error-handling
+  type: output
+  description: "Apply chaos engineering to APIs — design experiments that validate error handling under controlled failure conditions"
+  tags: [REST, API, chaos-engineering, fault-injection, resilience, advanced]
+state: {}
+trigger: |
+  Your API platform claims to handle failures gracefully, but you've
+  never tested it. After a major outage where circuit breakers didn't
+  fire, retries caused thundering herds, and error messages were
+  wrong, the CTO mandates chaos engineering.
+  Your API architecture:
+  - 15 microservices
+  - API gateway (Kong)
+  - PostgreSQL (primary + 2 replicas)
+  - Redis (caching + rate limiting)
+  - RabbitMQ (async processing)
+  - 3 external APIs (Stripe, SendGrid, Twilio)
+  Failure modes to test:
+  1. Latency injection: What happens when a service responds in 5s
+     instead of 50ms? Do timeouts fire? Do circuit breakers trip?
+  2. Error injection: What if a service returns 500 for 10% of
+     requests? Is the error budget consumed correctly?
+  3. Connection failure: What if Redis goes down? Does rate limiting
+     fail open or closed? Do cached error responses work?
+  4. Partial network partition: Service A can reach B but not C.
+     Does the error propagation handle this?
+  5. Data corruption: What if the database returns truncated data?
+     Does the API validate response data from downstream?
+  6. Clock skew: What if JWT validation fails because of 30-second
+     clock drift between services?
+  7. Resource exhaustion: What if thread pools or connection pools
+     fill up? Does the API respond with 503 or just hang?
+  8. Cascading failure: What if a failure in service D causes A, B,
+     and C to degrade?
+  Task: Design the chaos engineering program for your APIs. Write:
+  the experiment design for each of the 8 failure modes (hypothesis,
+  injection method, success criteria, abort conditions), the safety
+  guardrails (blast radius limits, kill switches), the progressive
+  rollout (start in staging, expand to production), and a template
+  for recording the error handling improvements discovered.
+assertions:
+  - type: llm_judge
+    criteria: "Experiment designs are rigorous — each has a clear hypothesis (e.g., 'circuit breaker should trip within 10s of 50% error rate'), specific injection method (proxy, sidecar, library), measurable success criteria, and abort conditions that prevent customer impact"
+    weight: 0.35
+    description: "Rigorous experiment designs"
+  - type: llm_judge
+    criteria: "All 8 failure modes are tested with practical approaches — latency injection via proxy, error injection via feature flags, connection failures via network policies, and resource exhaustion via controlled load. Each experiment validates specific error handling behavior (not just 'does it survive')"
+    weight: 0.35
+    description: "All failure modes tested"
+  - type: llm_judge
+    criteria: "Safety guardrails are comprehensive — blast radius limits (percentage of traffic, specific services), kill switches (automatic and manual), staging-first progression, and the results template captures what error handling worked, what didn't, and the remediation plan"
+    weight: 0.30
+    description: "Comprehensive safety guardrails"

package/courses/rest-api-error-handling/scenarios/level-3/database-error-handling.yaml ADDED

@@ -0,0 +1,79 @@
+meta:
+  id: database-error-handling
+  level: 3
+  course: rest-api-error-handling
+  type: output
+  description: "Handle database errors in APIs — translate database failures into appropriate API responses without leaking internals"
+  tags: [REST, API, database, PostgreSQL, errors, translation, advanced]
+state: {}
+trigger: |
+  Your API uses PostgreSQL and you're auditing how database errors
+  surface to API consumers. You find that many database errors leak
+  directly into API responses or are handled incorrectly.
+  Problematic error translations found in your codebase:
+  1. Unique constraint violation:
+     DB: ERROR 23505: duplicate key value violates unique constraint
+     "users_email_key"
+     API returns: 500 { "error": "duplicate key value violates unique
+     constraint \"users_email_key\"" }
+     Should be: 409 Conflict with user-friendly message
+  2. Foreign key violation:
+     DB: ERROR 23503: insert or update on table "orders" violates
+     foreign key constraint "orders_user_id_fkey"
+     API returns: 500 with the raw DB error
+     Should be: 400 or 422 with "Referenced user does not exist"
+  3. Connection pool exhausted:
+     DB: TimeoutError: acquiring connection from pool timed out
+     API returns: hangs until client timeout
+     Should be: 503 with Retry-After
+  4. Deadlock detected:
+     DB: ERROR 40P01: deadlock detected
+     API returns: 500 { "error": "deadlock detected" }
+     Should be: automatically retried, then 503 if retries exhausted
+  5. Check constraint violation:
+     DB: ERROR 23514: new row for relation "products" violates check
+     constraint "products_price_check" (price must be > 0)
+     API returns: 500 with the constraint name
+     Should be: 422 with "Price must be greater than zero"
+  6. Read replica lag:
+     User creates a post (writes to primary), immediately fetches
+     their posts (reads from replica) — post is missing
+     API returns: 200 { "posts": [] } (incorrect but no "error")
+     Should be: consistency-aware response
+  7. Long query timeout:
+     Complex search query exceeds statement_timeout (30s)
+     DB: ERROR 57014: canceling statement due to statement timeout
+     API returns: 504 with no context
+     Should be: 408 or 504 with search refinement suggestions
+  Task: Design the database error translation layer. Write: the
+  mapping from PostgreSQL error codes to HTTP status codes, the
+  error message translation strategy (constraint names to human
+  messages), the automatic retry logic for transient errors
+  (deadlocks, connection issues), the read replica consistency
+  handling, and the implementation architecture (middleware vs
+  per-query vs ORM-level).
+assertions:
+  - type: llm_judge
+    criteria: "Error code mapping is comprehensive — maps PostgreSQL error classes (23xxx constraint violations, 40xxx transaction errors, 53xxx resource errors, 57xxx timeout errors) to appropriate HTTP status codes, with different handling for client-caused vs server-caused database errors"
+    weight: 0.35
+    description: "Comprehensive error code mapping"
+  - type: llm_judge
+    criteria: "Error message translation is secure and helpful — constraint names are translated to user-friendly messages without revealing table/column names, unique violations identify the conflicting field, and the translation is configurable (not hardcoded). Internal database details never leak to the API response"
+    weight: 0.35
+    description: "Secure helpful message translation"
+  - type: llm_judge
+    criteria: "Transient error handling is robust — deadlocks are auto-retried with backoff, connection pool exhaustion triggers 503 with monitoring alerts, read replica lag is detected and handled (read-your-writes consistency or causal consistency), and the architecture centralizes translation without coupling business logic to database specifics"
+    weight: 0.30
+    description: "Robust transient error handling"
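
The translation layer this scenario describes could start from a lookup like the one below. The SQLSTATE codes and constraint names come from the scenario text (plus 53300, PostgreSQL's `too_many_connections`); the `ApiError` type, the message tables, and the class-level fallbacks are assumptions for the sketch.

```python
from typing import NamedTuple, Optional


class ApiError(NamedTuple):
    status: int
    message: str
    retryable: bool


# Constraint name -> client-safe message (hypothetical, configurable table).
CONSTRAINT_MESSAGES = {
    "users_email_key": "An account with this email already exists",
    "orders_user_id_fkey": "Referenced user does not exist",
    "products_price_check": "Price must be greater than zero",
}

# Exact SQLSTATE matches.
SQLSTATE_MAP = {
    "23505": ApiError(409, "Resource already exists", False),        # unique_violation
    "23503": ApiError(422, "Referenced resource does not exist", False),
    "23514": ApiError(422, "Value violates a validation rule", False),
    "40P01": ApiError(503, "Temporary conflict, please retry", True),  # deadlock
    "57014": ApiError(504, "The query took too long; try narrowing the search", False),
}

# Fallbacks keyed by the two-character SQLSTATE class.
CLASS_MAP = {
    "23": ApiError(422, "Request conflicts with existing data", False),
    "40": ApiError(503, "Temporary conflict, please retry", True),
    "53": ApiError(503, "Service temporarily overloaded", True),
    "57": ApiError(503, "Service temporarily unavailable", True),
}

DEFAULT = ApiError(500, "Internal server error", False)


def translate(sqlstate: str, constraint: Optional[str] = None) -> ApiError:
    """Map a database error to a client-safe API error without leaking internals."""
    error = SQLSTATE_MAP.get(sqlstate) or CLASS_MAP.get(sqlstate[:2], DEFAULT)
    if constraint in CONSTRAINT_MESSAGES:
        error = error._replace(message=CONSTRAINT_MESSAGES[constraint])
    return error


print(translate("23505", "users_email_key"))
print(translate("40P01"))
```

Errors flagged `retryable` would be retried by the caller before the 503 is ever returned, as the scenario requires for deadlocks.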

package/courses/rest-api-error-handling/scenarios/level-3/distributed-error-propagation.yaml ADDED

@@ -0,0 +1,63 @@
+meta:
+  id: distributed-error-propagation
+  level: 3
+  course: rest-api-error-handling
+  type: output
+  description: "Design distributed error propagation — manage error context across microservice call chains without information loss"
+  tags: [REST, API, distributed, microservices, error-propagation, advanced]
+state: {}
+trigger: |
+  Your e-commerce platform has a 6-service call chain for placing an
+  order:
+  API Gateway → Order Service → Inventory Service → Payment Service
+                                                   → Shipping Service
+                                                   → Notification Service
+  A customer reports: "I got an error 'Something went wrong' when
+  placing my order." Your support team traces the request and finds:
+  - API Gateway returned: 500 { "error": "Internal server error" }
+  - Order Service logged: "PaymentError: upstream service failed"
+  - Payment Service logged: "HTTP 503 from Stripe"
+  - Stripe's actual error: "Card issuer declined — insufficient funds"
+  The real error (insufficient funds) is a 402 client error, but by
+  the time it reached the customer, it became a 500 server error.
+  Nobody in the chain translated it correctly.
+  Other propagation problems:
+  1. Inventory Service returns 409 (conflict: item reserved by another
+     user) → Order Service converts to 500 → customer sees "server
+     error" instead of "item no longer available"
+  2. Shipping Service returns 422 (address validation failed) → gets
+     lost in the chain → customer's order fails with no useful message
+  3. Two services fail simultaneously (payment and shipping) → only
+     one error reaches the customer
+  4. A service returns 200 with an error in the body (legacy service)
+     → downstream treats it as success
+  5. Timeout at layer 3 of 6 → unclear if the operation partially
+     completed
+  Task: Design the error propagation framework. Write: the error
+  context object that flows through the service chain (preserving
+  original error while adding context at each hop), the error
+  translation rules (when to propagate vs wrap vs replace errors),
+  the distributed tracing integration, the multi-error aggregation
+  strategy, and the client-facing error derivation logic.
+assertions:
+  - type: llm_judge
+    criteria: "Error context preservation is thorough — errors carry the original cause through the chain (structured error chain, not string concatenation), each service adds its context without losing upstream detail, and the full chain is available in logs/traces while only safe information reaches the client"
+    weight: 0.35
+    description: "Thorough error context preservation"
+  - type: llm_judge
+    criteria: "Error translation rules are correct — client errors (4xx) from downstream are translated appropriately for the current context (402 from Stripe becomes a meaningful payment error for the customer), server errors (5xx) are not exposed to the client as the downstream service's error, and the 5 specific problems are all addressed"
+    weight: 0.35
+    description: "Correct translation rules"
+  - type: llm_judge
+    criteria: "Distributed tracing integration is practical — uses correlation IDs (trace ID, span ID) to link errors across services, integrates with standard tracing (OpenTelemetry), and enables support to trace a customer's error back through the full service chain in minutes rather than hours"
+    weight: 0.30
+    description: "Practical tracing integration"
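
A structured error chain, as opposed to string concatenation, might look like the sketch below. The `Hop` and `ErrorContext` types, the error codes, and the client message table are hypothetical; only the Stripe example itself comes from the scenario.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Hop:
    service: str
    status: int
    code: str
    message: str  # internal detail, intended for logs and traces only


@dataclass
class ErrorContext:
    trace_id: str
    hops: list[Hop] = field(default_factory=list)  # origin first

    def wrap(self, service: str, status: int, code: str, message: str) -> "ErrorContext":
        """Add this service's view of the failure without dropping upstream hops."""
        self.hops.append(Hop(service, status, code, message))
        return self

    @property
    def root(self) -> Hop:
        return self.hops[0]


# Root-cause code -> (client status, client-safe message).
CLIENT_MESSAGES = {
    "card_declined": (402, "Your card was declined. Please use another payment method."),
    "item_reserved": (409, "This item is no longer available."),
    "address_invalid": (422, "The shipping address could not be validated."),
}


def client_response(ctx: ErrorContext) -> dict[str, Any]:
    """Derive the customer-facing error from the root cause, not the last hop."""
    status, detail = CLIENT_MESSAGES.get(
        ctx.root.code, (500, "Something went wrong on our side."))
    return {"status": status, "detail": detail, "trace_id": ctx.trace_id}


ctx = ErrorContext("trace-123")
ctx.wrap("stripe", 402, "card_declined", "Card issuer declined: insufficient funds")
ctx.wrap("payment-service", 502, "upstream_failed", "Stripe call failed")
ctx.wrap("order-service", 500, "payment_error", "PaymentError: upstream service failed")
print(client_response(ctx))
```

The full hop list stays available for logs and traces under the same `trace_id`, while the client only sees the mapped message.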

package/courses/rest-api-error-handling/scenarios/level-3/error-budgets-sre.yaml ADDED

@@ -0,0 +1,61 @@
+meta:
+  id: error-budgets-sre
+  level: 3
+  course: rest-api-error-handling
+  type: output
+  description: "Implement error budgets — design SLO/SLI/SLA framework for API reliability with error budget policies"
+  tags: [REST, API, SRE, error-budgets, SLO, SLI, SLA, advanced]
+state: {}
+trigger: |
+  Your API platform serves 50M requests/day across 200 endpoints for
+  3,000 API consumers. The VP of Engineering wants to move from
+  "fight every error" to "error budget-based reliability management."
+  Current state:
+  - Overall error rate: 0.3% (150,000 errors/day)
+  - But it's unevenly distributed:
+    - GET endpoints: 0.05% error rate (mostly 404s from invalid IDs)
+    - POST /payments: 2.1% error rate (mostly declined cards — 402)
+    - POST /orders: 0.8% error rate (inventory conflicts — 409)
+    - Search endpoints: 0.4% error rate (timeout under load)
+  - P99 latency: 800ms overall (but payment is 2.5s)
+  - Availability: 99.85% (monthly)
+  Questions from the VP:
+  1. "What's our SLO? How do we define 'error' for error budget
+     purposes? Is a 402 Declined Card an 'error'?"
+  2. "How much error budget do we have? When should we stop shipping
+     features and fix reliability?"
+  3. "How do we allocate error budget across teams? The payments team
+     has a higher 'error rate' but most are expected business errors."
+  4. "What's the difference between our SLO (internal target) and SLA
+     (customer promise)? How much buffer do we need?"
+  5. "How do we calculate error budget burn rate? I want an alert when
+     we're burning budget too fast."
+  Your API consumers have different needs:
+  - Enterprise (10 consumers): need 99.99% availability, <200ms p99
+  - Standard (200 consumers): need 99.9% availability, <500ms p99
+  - Free tier (2,790 consumers): best effort
+  Task: Design the complete error budget framework. Write: the SLI
+  definitions (what counts as an error), the SLO targets per tier,
+  the error budget calculation (with the 402/409 classification
+  decision), the burn rate alerting system, and the error budget
+  policy (what happens when budget is exhausted).
+assertions:
+  - type: llm_judge
+    criteria: "SLI/SLO definitions are precise — clearly defines what counts as an error (5xx yes, 4xx depends on type), distinguishes between expected business errors (402 declined, 409 conflict) and unexpected errors (500, 503), and sets different SLOs for different endpoint categories and consumer tiers"
+    weight: 0.35
+    description: "Precise SLI/SLO definitions"
+  - type: llm_judge
+    criteria: "Error budget math is correct: calculates the monthly budget from the SLO (e.g., 99.9% = 0.1% budget = 50,000 errors per day, or about 1.5M over a 30-day month, at 50M req/day), explains burn rate computation (fast burn vs slow burn), and sets alerting thresholds that catch budget exhaustion before it happens"
+    weight: 0.35
+    description: "Correct error budget math"
+  - type: llm_judge
+    criteria: "Error budget policy is actionable — defines what happens when budget is exhausted (feature freeze, reliability sprint), how to allocate budget across teams, the SLO-to-SLA buffer rationale, and how the policy integrates with sprint planning and release decisions"
+    weight: 0.30
+    description: "Actionable budget policy"
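
The budget arithmetic in this scenario can be checked with a few lines of Python. The helper names are hypothetical, and treating every 4xx as outside the budget is only one possible answer to the 402/409 classification question; the raw 0.3% error rate is used purely as an illustration, since it includes those business errors.

```python
def error_budget(slo: float, requests: int) -> float:
    """Allowed failed requests out of `requests` at availability target `slo`."""
    return (1 - slo) * requests


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / (1 - slo)


def is_budget_error(status: int) -> bool:
    """Assumed SLI rule: only 5xx responses consume error budget."""
    return 500 <= status <= 599


daily_requests = 50_000_000
print(f"daily budget:   {error_budget(0.999, daily_requests):,.0f}")
print(f"30-day budget:  {error_budget(0.999, daily_requests * 30):,.0f}")
print(f"burn rate:      {burn_rate(0.003, 0.999):.1f}")
```

A sustained burn rate of 3 would use up a 30-day budget in about 10 days, which is the kind of signal a burn rate alert is meant to catch.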