dojo.md 0.1.0 → 0.2.0 (npm)

Files changed (243)

package/courses/rest-api-error-handling/scenarios/level-1/request-validation-errors.yaml ADDED

@@ -0,0 +1,59 @@
+meta:
+  id: request-validation-errors
+  level: 1
+  course: rest-api-error-handling
+  type: output
+  description: "Handle request validation errors — implement input validation with clear, actionable error messages for API consumers"
+  tags: [REST, API, validation, error-messages, input, beginner]
+state: {}
+trigger: |
+  You're building a user registration endpoint for an e-commerce API.
+  The endpoint POST /users accepts:
+  {
+    "email": "string (required, valid email)",
+    "password": "string (required, 8-64 chars, 1 uppercase, 1 number)",
+    "name": "string (required, 2-100 chars)",
+    "phone": "string (optional, E.164 format)",
+    "date_of_birth": "string (optional, ISO 8601, user must be at least 13 years old)",
+    "referral_code": "string (optional, 8 alphanumeric chars)"
+  }
+  A frontend developer sends this request and gets back a 400 error
+  with the message "Invalid input" — nothing else. They're frustrated:
+  "Which field is wrong? What's wrong with it? How do I fix it?"
+  Test cases to handle:
+  1. Empty body: {}
+  2. Missing required fields: { "email": "test@example.com" }
+  3. Invalid email: { "email": "not-an-email", "password": "Abc12345",
+     "name": "Jo" }
+  4. Weak password: { "email": "a@b.com", "password": "short",
+     "name": "Jo" }
+  5. Multiple errors: { "email": "bad", "password": "x", "name": "",
+     "phone": "12345", "date_of_birth": "not-a-date" }
+  6. Valid request with unknown extra fields:
+     { "email": "a@b.com", "password": "Abc12345", "name": "Jo",
+       "admin": true, "role": "superuser" }
+  Task: Write the validation error responses for all 6 test cases.
+  Each response should tell the consumer exactly what's wrong and
+  how to fix it. Then write the validation strategy: should you fail
+  on the first error or collect all errors? How do you handle unknown
+  fields? What about nested objects?
+assertions:
+  - type: llm_judge
+    criteria: "Error responses are field-specific and actionable — each validation error names the exact field, describes what's wrong, and indicates what's expected. Multiple errors are returned together (not fail-fast) so the consumer can fix everything in one attempt"
+    weight: 0.35
+    description: "Field-specific actionable errors"
+  - type: llm_judge
+    criteria: "All 6 test cases produce appropriate responses — empty body lists all required fields, missing fields identifies which are missing, invalid formats explain the expected format, multiple errors are collected into an array, and unknown fields are handled safely (stripped or rejected with explanation)"
+    weight: 0.35
+    description: "All test cases handled"
+  - type: llm_judge
+    criteria: "Validation strategy is well-reasoned — explains collect-all-errors approach, addresses unknown/extra field handling (security implications of accepting 'admin: true'), and considers validation ordering (format before business logic)"
+    weight: 0.30
+    description: "Well-reasoned validation strategy"
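
The collect-all-errors strategy that this scenario asks for can be sketched as follows. This is a minimal illustration, not the course's reference answer: the function name, the error codes, and the loose email regex are assumptions, and the age check on `date_of_birth` and the `referral_code` check are left out for brevity.

```python
import re
from datetime import date

# Field lists mirror the POST /users contract described in the scenario.
REQUIRED = ("email", "password", "name")
OPTIONAL = ("phone", "date_of_birth", "referral_code")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose (assumption)
E164_RE = re.compile(r"^\+[1-9]\d{1,14}$")


def validate_registration(body: dict) -> list[dict]:
    """Collect every validation error instead of stopping at the first one."""
    errors = []

    def add(field, code, message):
        errors.append({"field": field, "code": code, "message": message})

    for field in REQUIRED:
        if field not in body:
            add(field, "required", f"'{field}' is required.")

    email = body.get("email")
    if email is not None and not (isinstance(email, str) and EMAIL_RE.match(email)):
        add("email", "invalid_format", "Must be a valid email address, e.g. user@example.com.")

    password = body.get("password")
    if password is not None:
        ok = (isinstance(password, str) and 8 <= len(password) <= 64
              and any(c.isupper() for c in password)
              and any(c.isdigit() for c in password))
        if not ok:
            add("password", "weak_password",
                "Must be 8-64 characters with at least 1 uppercase letter and 1 number.")

    name = body.get("name")
    if name is not None and not (isinstance(name, str) and 2 <= len(name) <= 100):
        add("name", "invalid_length", "Must be 2-100 characters.")

    phone = body.get("phone")
    if phone is not None and not (isinstance(phone, str) and E164_RE.match(phone)):
        add("phone", "invalid_format", "Must be in E.164 format, e.g. +14155550123.")

    dob = body.get("date_of_birth")
    if dob is not None:
        try:
            date.fromisoformat(dob)
        except (TypeError, ValueError):
            add("date_of_birth", "invalid_format", "Must be an ISO 8601 date, e.g. 2001-05-17.")

    # Reject unknown fields so 'admin' or 'role' can never be mass-assigned.
    for field in sorted(set(body) - set(REQUIRED) - set(OPTIONAL)):
        add(field, "unknown_field", f"'{field}' is not accepted by this endpoint.")

    return errors
```

An empty list means the request is valid; otherwise the whole list would go into a single 400 response so the consumer can fix everything in one attempt. Rejecting unknown fields is one of the two options the assertions allow; silently stripping them is the other.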

package/courses/rest-api-error-handling/scenarios/level-1/server-error-handling.yaml ADDED

@@ -0,0 +1,55 @@
+meta:
+  id: server-error-handling
+  level: 1
+  course: rest-api-error-handling
+  type: output
+  description: "Handle 5xx server errors — implement safe error handling for unexpected failures without exposing internals"
+  tags: [REST, API, 5xx, server-errors, error-handling, beginner]
+state: {}
+trigger: |
+  You're on-call when the monitoring dashboard lights up. Your API is
+  returning 500 errors with full stack traces in the response body.
+  A customer screenshots one and posts it on Twitter:
+  {
+    "error": "DatabaseError: connection refused to postgres://admin:
+    s3cretP@ss@db-prod-01.internal.company.com:5432/users_db",
+    "stack": "at Pool.connect (/app/node_modules/pg/lib/pool.js:332)\n
+    at UserService.getUser (/app/src/services/user.ts:47)\n
+    at UserController.get (/app/src/controllers/user.ts:23)\n..."
+  }
+  The tweet goes viral. Security is panicking because the response
+  exposed: database credentials, internal hostnames, database name,
+  technology stack (Node.js, pg, TypeScript), and code structure.
+  Your team lead wants you to fix this immediately. The API has these
+  failure modes that can cause 500 errors:
+  1. Database connection failures
+  2. Unhandled null pointer exceptions in business logic
+  3. Third-party API timeouts (payment processor)
+  4. Out of memory errors during large report generation
+  5. File system permission errors when writing uploads
+  Task: Write the global error handler that catches all unhandled
+  errors and returns safe responses. Show: the safe 500 error response
+  format (what the client sees), the server-side error logging format
+  (what goes to your logging system), how to add request correlation
+  IDs so support can link a user's error to the server log, and how
+  each of the 5 failure modes should be handled differently.
+assertions:
+  - type: llm_judge
+    criteria: "Safe error responses never expose internals — no stack traces, database credentials, internal hostnames, technology details, or code structure in client-facing responses. Returns a generic message with a correlation ID that support can use to look up the full error"
+    weight: 0.35
+    description: "Safe error responses"
+  - type: llm_judge
+    criteria: "Server-side logging captures full diagnostic information — logs stack trace, error type, request details, correlation ID, and context. The 5 failure modes are handled differently where appropriate (503 for DB/third-party failures vs 500 for unhandled exceptions)"
+    weight: 0.35
+    description: "Comprehensive server-side logging"
+  - type: llm_judge
+    criteria: "Correlation ID system is well-designed — explains how IDs are generated (UUID), how they flow through the request lifecycle, how they appear in both the error response and server logs, and how support uses them to debug customer-reported errors"
+    weight: 0.30
+    description: "Correlation ID system"
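
A framework-agnostic sketch of the handler this scenario describes is shown below. It assumes the web framework passes every unhandled exception to one function; the function name, the response shape, and the mapping of connection and timeout errors to 503 are illustrative assumptions rather than anything the package prescribes.

```python
import json
import logging
import traceback
import uuid

logger = logging.getLogger("api.errors")

# Assumed mapping: dependency outages are reported as 503, everything else as 500.
DEPENDENCY_ERRORS = (ConnectionError, TimeoutError)


def handle_unexpected_error(exc: BaseException, correlation_id=None):
    """Log full diagnostics server-side and return a safe (status, body) pair."""
    correlation_id = correlation_id or str(uuid.uuid4())
    status = 503 if isinstance(exc, DEPENDENCY_ERRORS) else 500

    # Full detail goes to the log only, keyed by the correlation ID.
    # A real logger should also scrub secrets (such as connection strings)
    # from the message before it is stored.
    logger.error(json.dumps({
        "correlation_id": correlation_id,
        "error_type": type(exc).__name__,
        "message": str(exc),
        "stack": traceback.format_exception(type(exc), exc, exc.__traceback__),
    }))

    # The client sees a generic message plus the ID that support can look up.
    body = {
        "error": {
            "code": "service_unavailable" if status == 503 else "internal_error",
            "message": "Something went wrong on our side. Please try again later.",
            "correlation_id": correlation_id,
        }
    }
    return status, body
```

The same correlation ID would normally be generated at the start of the request and attached to every log line, so that a customer-reported ID leads to the full request history and not just the final error.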

package/courses/rest-api-error-handling/scenarios/level-2/api-versioning-errors.yaml ADDED

@@ -0,0 +1,66 @@
+meta:
+  id: api-versioning-errors
+  level: 2
+  course: rest-api-error-handling
+  type: output
+  description: "Handle API versioning errors — manage error responses during API version transitions and deprecation"
+  tags: [REST, API, versioning, deprecation, migration, intermediate]
+state: {}
+trigger: |
+  Your API is migrating from v1 to v2. The changes include:
+  - Different error response format (custom JSON → RFC 7807)
+  - Some endpoints removed (POST /users/bulk → use /users/import)
+  - Field renames (user.firstName → user.first_name)
+  - Stricter validation (email regex is now RFC 5322 compliant)
+  - New required fields (POST /orders now requires shipping_address)
+  Timeline:
+  - v2 launched 3 months ago
+  - v1 sunset in 6 months
+  - Currently 60% of traffic is v1, 40% is v2
+  Customer issues during the migration:
+  1. Customer sends v1-format request to v2 endpoint:
+     POST /v2/users { "firstName": "John" }
+     Gets 400 but doesn't understand why (field was renamed)
+  2. Customer uses removed v1 endpoint:
+     POST /v1/users/bulk [...]
+     Gets 404 (endpoint already removed from v1 ahead of schedule)
+  3. Customer on v1 gets different error format than v2 customer
+     for the same underlying issue — confusing for shared tooling
+  4. Customer tries to use v2 features via v1:
+     POST /v1/orders { "shipping_method": "express" }
+     (shipping_method only exists in v2)
+  5. v1 validation passes but v2 validation rejects the same input
+     (stricter email regex)
+  6. No version specified:
+     POST /users { ... }
+     Which version should be assumed?
+  Task: Design the versioning error handling strategy. Write: how
+  each of the 6 issues should be handled (status codes, error
+  messages, migration hints), the deprecation warning system (headers,
+  response fields), the sunset timeline communication in error
+  responses, and the version negotiation logic.
+assertions:
+  - type: llm_judge
+    criteria: "All 6 versioning issues have clear solutions — renamed fields get migration hints in the error message, removed endpoints return 410 Gone (not 404) with the replacement URL, version-specific validation differences are documented in errors, and missing version defaults are clearly defined"
+    weight: 0.35
+    description: "All versioning issues solved"
+  - type: llm_judge
+    criteria: "Deprecation communication is proactive — uses Sunset header (RFC 8594) and Deprecation header, includes deprecation warnings on successful v1 responses (not just errors), and provides migration documentation links in responses"
+    weight: 0.35
+    description: "Proactive deprecation communication"
+  - type: llm_judge
+    criteria: "Version negotiation is well-designed — explains how to handle missing version (default policy), URL path vs header vs query param versioning approach, and how the error format itself should be versioned (v1 clients get v1 errors, v2 clients get RFC 7807)"
+    weight: 0.30
+    description: "Well-designed version negotiation"
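
The migration hints and deprecation headers that the assertions mention could look roughly like this. The lookup tables, the `/v2/users/import` replacement path, the problem `type` URL, and all function names are assumptions made for illustration; the header formats follow RFC 8594 (Sunset) and RFC 9745 (Deprecation).

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical lookup tables; a real API would derive these from its schema.
RENAMED_FIELDS = {"firstName": "first_name"}
REMOVED_ENDPOINTS = {("POST", "/v1/users/bulk"): "/v2/users/import"}


def removed_endpoint_response(method: str, path: str):
    """Return (410, problem) for a removed route, or None if the route still exists."""
    replacement = REMOVED_ENDPOINTS.get((method, path))
    if replacement is None:
        return None
    return 410, {
        "type": "https://api.example.com/problems/endpoint-removed",  # placeholder URL
        "title": "Endpoint removed",
        "status": 410,
        "detail": f"{method} {path} has been removed. Use {method} {replacement} instead.",
        "replacement": replacement,
    }


def rename_hints(body: dict) -> list[dict]:
    """Explain v1 field names that show up in a v2 request body."""
    return [
        {"field": old, "code": "field_renamed",
         "message": f"'{old}' was renamed to '{new}' in v2."}
        for old, new in RENAMED_FIELDS.items() if old in body
    ]


def deprecation_headers(deprecated_at: datetime, sunset_at: datetime, migration_url: str) -> dict:
    """Headers to attach to every v1 response, successful or not.

    Both datetimes must be timezone-aware UTC values.
    """
    return {
        "Deprecation": f"@{int(deprecated_at.timestamp())}",  # structured-field date
        "Sunset": format_datetime(sunset_at, usegmt=True),     # HTTP-date
        "Link": f'<{migration_url}>; rel="sunset"',
    }
```

Attaching the headers to successful v1 responses as well as errors gives the 60% of traffic still on v1 a warning before anything breaks.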

package/courses/rest-api-error-handling/scenarios/level-2/batch-request-errors.yaml ADDED

@@ -0,0 +1,61 @@
+meta:
+  id: batch-request-errors
+  level: 2
+  course: rest-api-error-handling
+  type: output
+  description: "Handle batch request errors — design error responses when some items in a batch succeed and others fail"
+  tags: [REST, API, batch, partial-failure, bulk-operations, intermediate]
+state: {}
+trigger: |
+  Your CRM API has a batch endpoint for importing contacts:
+  POST /contacts/batch with an array of up to 1,000 contacts.
+  A customer imports 500 contacts and gets back a 400 error:
+  { "error": "Validation failed" }
+  They're frustrated: "Which contacts failed? The 490 valid ones
+  didn't get created either. I have to fix the 10 bad ones and
+  re-submit all 500?"
+  Scenarios to handle:
+  1. All 500 succeed — easy case
+  2. 10 of 500 fail validation (duplicate emails, missing required
+     fields) — what happens to the other 490?
+  3. 250 succeed, then the database goes down — 250 are created,
+     250 are not. What's the response?
+  4. Request contains 2,000 contacts (over the 1,000 limit)
+  5. One contact has an email that triggers a rate limit on the
+     email verification service — should it block the whole batch?
+  6. Request is 50MB (massive batch) — how to handle before even
+     parsing
+  Design decisions needed:
+  - All-or-nothing (transaction) vs partial success?
+  - What HTTP status code for partial success? (200? 207? 202?)
+  - How to report per-item errors while keeping the response
+     navigable?
+  - How to handle the case where the client needs to retry only
+     the failed items?
+  Task: Design the batch error handling system. Write: the batch
+  response format (showing per-item success/failure), the HTTP status
+  code strategy, the error detail for each of the 6 scenarios, the
+  transaction vs partial-success decision (with trade-offs), and the
+  retry guidance for failed items.
+assertions:
+  - type: llm_judge
+    criteria: "Batch response format is clear — each item in the batch has an individual success/failure status with the item's index or identifier, specific error details for failed items, and a summary (total, succeeded, failed). The format allows clients to identify and retry only failed items"
+    weight: 0.35
+    description: "Clear batch response format"
+  - type: llm_judge
+    criteria: "All 6 scenarios have appropriate responses — full success (200/201), partial failure with a suitable status code (207 Multi-Status or similar), request too large (413), database mid-batch failure is handled (either transactionally or with clear partial status), and the rate limit scenario is isolated to the affected item"
+    weight: 0.35
+    description: "All scenarios handled"
+  - type: llm_judge
+    criteria: "Transaction vs partial-success trade-off is well-analyzed — discusses when all-or-nothing is appropriate (financial operations) vs when partial success is better (data imports), and the chosen approach is justified for the CRM contact import use case. Retry guidance tells clients how to re-submit only failed items"
+    weight: 0.30
+    description: "Well-analyzed trade-offs"
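
One way to picture the partial-success option is the sketch below. It assumes a hypothetical `create` callable that persists a single contact and raises `ValueError` for invalid input; the response field names, the 400 for an over-limit batch, and the choice of 207 are illustrative, and an all-or-nothing design would wrap the loop in a transaction instead.

```python
def import_contacts(contacts: list[dict], create, max_items: int = 1000):
    """Partial-success import: returns (status, body) with one result per input item.

    `create` is a hypothetical callable that persists one contact, returns its
    ID, and raises ValueError with a readable message when the contact is invalid.
    """
    if len(contacts) > max_items:
        return 400, {"error": {
            "code": "batch_too_large",
            "message": f"At most {max_items} contacts per request; got {len(contacts)}.",
        }}

    results = []
    for index, contact in enumerate(contacts):
        try:
            created_id = create(contact)
            results.append({"index": index, "status": "created", "id": created_id})
        except ValueError as exc:
            results.append({"index": index, "status": "failed",
                            "error": {"code": "validation_failed", "message": str(exc)}})

    failed = [r for r in results if r["status"] == "failed"]
    summary = {"total": len(results),
               "succeeded": len(results) - len(failed),
               "failed": len(failed)}
    # 207 Multi-Status tells the client to inspect the per-item results.
    status = 207 if failed else 201
    return status, {"summary": summary, "results": results,
                    "retry_indexes": [r["index"] for r in failed]}
```

The `retry_indexes` list lets a client re-submit only the failed items after fixing them. A 50MB body would be rejected earlier, at the server or gateway level, before this function ever runs.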

package/courses/rest-api-error-handling/scenarios/level-2/circuit-breaker-pattern.yaml ADDED

@@ -0,0 +1,52 @@
+meta:
+  id: circuit-breaker-pattern
+  level: 2
+  course: rest-api-error-handling
+  type: output
+  description: "Implement circuit breaker pattern — prevent cascading failures when downstream services fail"
+  tags: [REST, API, circuit-breaker, resilience, cascading-failures, intermediate]
+state: {}
+trigger: |
+  Your e-commerce API depends on 5 downstream services. Last Tuesday,
+  the recommendation service went down for 30 minutes. Because your
+  API makes a synchronous call to it on every product page request,
+  ALL product pages became unusable — 100% failure rate when only the
+  recommendation service (a non-critical feature) was down.
+  The cascade:
+  1. Recommendation service starts timing out (30s timeout)
+  2. Your API's thread pool fills up waiting for recommendations
+  3. No threads are left to serve product data (even though the catalog itself is healthy)
+  4. Users see 504 Gateway Timeout on all product pages
+  5. Cart and checkout also slow down (shared thread pool)
+  6. Revenue loss: $150,000 in 30 minutes
+  Your downstream services:
+  - Product catalog (critical): 99.95% uptime, 50ms p99
+  - Inventory (critical): 99.9% uptime, 100ms p99
+  - Recommendations (nice-to-have): 99.5% uptime, 200ms p99
+  - Reviews (nice-to-have): 99.7% uptime, 150ms p99
+  - Pricing (critical): 99.95% uptime, 75ms p99
+  Task: Design the circuit breaker system. Write: the circuit breaker
+  configuration for each service (thresholds, timeouts, fallbacks),
+  the state machine (closed → open → half-open), the fallback
+  strategies for each service (what to return when the circuit is
+  open), the error responses when circuit breakers are active, and
+  the monitoring dashboard that shows circuit breaker state.
+assertions:
+  - type: llm_judge
+    criteria: "Circuit breaker design differentiates by criticality — critical services (catalog, inventory, pricing) have different thresholds and fallback strategies than nice-to-have services (recommendations, reviews). Nice-to-have services fail open with degraded responses, critical services may fail closed with proper error responses"
+    weight: 0.35
+    description: "Criticality-aware circuit breaker"
+  - type: llm_judge
+    criteria: "State machine is complete — defines closed (normal), open (failing fast), and half-open (testing recovery) states with specific thresholds (failure count/percentage, timeout window, half-open success count). Explains what triggers each state transition"
+    weight: 0.35
+    description: "Complete state machine"
+  - type: llm_judge
+    criteria: "Fallback strategies are practical — recommendations return empty array or cached data, reviews show 'reviews unavailable', critical service failures return clear error responses. Error responses indicate degraded mode to clients. Monitoring shows real-time circuit state"
+    weight: 0.30
+    description: "Practical fallback strategies"
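
The closed, open, and half-open states can be sketched as a small class. This is a simplified illustration under stated assumptions: it counts consecutive failures instead of a failure percentage, closes again after a single successful trial call, and is not thread-safe; the class name, defaults, and injectable clock are all choices made for the example.

```python
import time


class CircuitBreaker:
    """closed -> open after `failure_threshold` consecutive failures;
    open -> half-open after `reset_timeout` seconds;
    half-open -> closed on one success, or back to open on one failure."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, fallback):
        if self.state == "open":
            return fallback()  # fail fast without touching the downstream service
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success: reset the counter and close the circuit.
        self.failures = 0
        self.opened_at = None
        return result
```

For a nice-to-have service such as recommendations, the fallback could simply return an empty list, e.g. `breaker.call(fetch_recommendations, lambda: [])` with a hypothetical `fetch_recommendations`. For a critical service the fallback would more likely raise an error that the API turns into a 503.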

package/courses/rest-api-error-handling/scenarios/level-2/error-code-taxonomy.yaml ADDED

@@ -0,0 +1,62 @@
+meta:
+  id: error-code-taxonomy
+  level: 2
+  course: rest-api-error-handling
+  type: output
+  description: "Design an error code taxonomy — create a structured system of application-level error codes beyond HTTP status codes"
+  tags: [REST, API, error-codes, taxonomy, documentation, intermediate]
+state: {}
+trigger: |
+  Your fintech API returns HTTP status codes but customers want more
+  granular error identification. They're asking: "I got a 400 — but
+  was it a missing field, an invalid format, a business rule violation,
+  or a duplicate request? I need to handle each differently in my
+  code."
+  Examples of why HTTP status codes alone aren't enough:
+  All of these return 400 Bad Request:
+  - Missing required field (title)
+  - Invalid email format
+  - Amount exceeds daily transfer limit ($10,000)
+  - Transfer to self not allowed
+  - Account is in read-only mode (regulatory hold)
+  - Duplicate idempotency key
+  All of these return 403 Forbidden:
+  - User doesn't have the 'transfers' permission
+  - IP address not in allowlist
+  - Account requires 2FA for this operation
+  - Compliance hold on account
+  - Operation not available in user's country
+  Customer requirements:
+  - "I need machine-readable codes I can switch on in my code"
+  - "I need to show different UI messages for different errors"
+  - "I need to know if the error is permanent or temporary"
+  - "I need error codes to be stable — don't change them without
+    warning"
+  - "I need documentation for every error code"
+  Task: Design the error code taxonomy. Write: the naming convention
+  (format, namespacing by domain), the complete error code catalog
+  for the examples above, the error code documentation template,
+  the governance process (how new codes are added, how existing
+  codes are deprecated), and the client SDK integration (how error
+  codes map to typed exceptions).
+assertions:
+  - type: llm_judge
+    criteria: "Error code system is well-structured — uses a consistent format (e.g., DOMAIN_CATEGORY_SPECIFIC like TRANSFER_LIMIT_EXCEEDED), is hierarchical enough for programmatic handling, and each code maps to a specific HTTP status code. Distinguishes between transient and permanent errors"
+    weight: 0.35
+    description: "Well-structured error code system"
+  - type: llm_judge
+    criteria: "Error catalog is complete and documented — all examples from the trigger are assigned codes, each code has a description, common causes, resolution steps, and whether it's retryable. Documentation template is thorough enough for API consumers to self-serve"
+    weight: 0.35
+    description: "Complete documented catalog"
+  - type: llm_judge
+    criteria: "Governance and SDK integration are practical — defines who can create new codes, versioning policy for codes, deprecation process, and shows how error codes translate to typed exceptions in client SDKs (making switch statements possible)"
+    weight: 0.30
+    description: "Practical governance and SDK integration"
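
A catalog of stable codes and their mapping to typed exceptions might be sketched like this. The code names, the exception class names, and the `SERVICE_TEMPORARILY_UNAVAILABLE` entry (which is not among the scenario's examples) are hypothetical and only illustrate the idea of a `DOMAIN_SPECIFIC` naming scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorCode:
    code: str          # stable, machine-readable identifier
    http_status: int
    retryable: bool    # True for temporary conditions
    message: str


# Hypothetical catalog entries for a few of the examples in the scenario.
CATALOG = {e.code: e for e in [
    ErrorCode("VALIDATION_FIELD_REQUIRED", 400, False, "A required field is missing."),
    ErrorCode("TRANSFER_LIMIT_EXCEEDED", 400, False, "Amount exceeds the daily transfer limit."),
    ErrorCode("TRANSFER_SELF_NOT_ALLOWED", 400, False, "Transfers to the same account are not allowed."),
    ErrorCode("AUTH_PERMISSION_DENIED", 403, False, "The user lacks the required permission."),
    ErrorCode("AUTH_2FA_REQUIRED", 403, False, "This operation requires two-factor authentication."),
    # Not from the scenario; added to show a retryable code.
    ErrorCode("SERVICE_TEMPORARILY_UNAVAILABLE", 503, True, "Please retry later."),
]}


class ApiError(Exception):
    """Base class for all SDK exceptions; carries the catalog entry."""

    def __init__(self, spec: ErrorCode):
        super().__init__(spec.message)
        self.spec = spec


class TransferLimitExceeded(ApiError):
    pass


class TwoFactorRequired(ApiError):
    pass


EXCEPTIONS = {
    "TRANSFER_LIMIT_EXCEEDED": TransferLimitExceeded,
    "AUTH_2FA_REQUIRED": TwoFactorRequired,
}


def to_exception(payload: dict) -> ApiError:
    """Turn an error payload like {"code": ..., "status": ...} into a typed exception."""
    code = payload["code"]
    spec = CATALOG.get(code) or ErrorCode(code, payload.get("status", 500), False, "Unknown error code.")
    return EXCEPTIONS.get(code, ApiError)(spec)
```

Falling back to the base `ApiError` for unknown codes keeps older SDK versions working when the API adds a new code, which matters for the stability requirement customers raised.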

package/courses/rest-api-error-handling/scenarios/level-2/error-monitoring-alerting.yaml ADDED

@@ -0,0 +1,53 @@
+meta:
+  id: error-monitoring-alerting
+  level: 2
+  course: rest-api-error-handling
+  type: output
+  description: "Set up error monitoring and alerting — build an alerting system that catches real issues without alert fatigue"
+  tags: [REST, API, monitoring, alerting, observability, intermediate]
+state: {}
+trigger: |
+  Your API team has a monitoring problem: too many alerts that nobody
+  acts on. Last month, the team received 2,847 alerts. Of those:
+  - 2,100 were "5xx error rate above 0.1%" (most were single errors)
+  - 400 were "response time above 500ms" (during daily batch jobs)
+  - 200 were "disk space above 80%" (on log servers, always at 82%)
+  - 100 were "connection pool exhausted" (recovered in seconds)
+  - 47 were real incidents requiring action
+  The real incidents that were missed or delayed because of alert
+  fatigue:
+  - Payment endpoint returning 500 for 15 minutes (buried in noise)
+  - Database connection leak causing gradual degradation over 3 hours
+  - Authentication service returning wrong 200 responses (not caught
+    by status code monitoring)
+  - Third-party API silently returning stale data
+  Your API handles:
+  - 10M requests/day across 30 endpoints
+  - Baseline error rate: 0.05% (mostly 404s from bots)
+  - P99 latency: 200ms (varies by endpoint)
+  - 5 downstream dependencies
+  Task: Redesign the alerting system. Write: the alert taxonomy
+  (severity levels and routing), the specific alert rules for each
+  error category (with thresholds that avoid false positives), the
+  escalation policy, the dashboard design for real-time error
+  visibility, and the process for tuning alerts as traffic patterns
+  change.
+assertions:
+  - type: llm_judge
+    criteria: "Alert taxonomy reduces noise — defines severity levels (critical/warning/info) with clear criteria, routes alerts appropriately (PagerDuty for critical, Slack for warning, dashboard for info), and the new thresholds would have caught the 47 real incidents while eliminating most of the 2,800 false alerts"
+    weight: 0.35
+    description: "Noise-reducing alert taxonomy"
+  - type: llm_judge
+    criteria: "Alert rules are specific and contextual — uses error rate changes (not absolute thresholds), per-endpoint alerting (not global), anomaly detection for latency, and monitors for correctness issues (like the authentication service returning wrong 200s). Addresses the 4 missed incidents specifically"
+    weight: 0.35
+    description: "Specific contextual alert rules"
+  - type: llm_judge
+    criteria: "Escalation and tuning process is practical — defines who gets alerted at each severity, time-based escalation for unacknowledged alerts, and a regular process for reviewing alert effectiveness (false positive rate, mean time to detect)"
+    weight: 0.30
+    description: "Practical escalation and tuning"
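
A paging rule that ignores single errors and low-traffic blips could be sketched as below. The per-minute bucket format, the function name, and the default multiplier, traffic floor, and duration are assumptions chosen for illustration, not tuned values.

```python
def should_page(window: list[tuple[int, int]], baseline_rate: float,
                multiplier: float = 10.0, min_requests: int = 100,
                sustained_minutes: int = 5) -> bool:
    """Decide whether one endpoint's error rate warrants paging someone.

    `window` holds (requests, errors) per minute, oldest first. The rule fires
    only when each of the last `sustained_minutes` buckets has at least
    `min_requests` requests (keep this >= 1) and an error rate above
    `multiplier` times the baseline.
    """
    recent = window[-sustained_minutes:]
    if len(recent) < sustained_minutes:
        return False  # not enough data yet
    return all(
        requests >= min_requests and errors / requests > baseline_rate * multiplier
        for requests, errors in recent
    )
```

With the scenario's 0.05% baseline, a payment endpoint failing for 15 minutes would trip this rule within the first 5, while a single stray 500 would not. Correctness problems such as wrong 200 responses need a different signal, for example synthetic checks, because status-code rates never change.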

package/courses/rest-api-error-handling/scenarios/level-2/intermediate-error-shift.yaml ADDED

@@ -0,0 +1,69 @@
+meta:
+  id: intermediate-error-shift
+  level: 2
+  course: rest-api-error-handling
+  type: output
+  description: "Intermediate error handling shift — manage a complex multi-service outage with cascading error scenarios"
+  tags: [REST, API, error-handling, shift-simulation, outage, intermediate]
+state: {}
+trigger: |
+  You're the on-call engineer for a travel booking API. It's peak
+  summer booking season and the system processes $2M in bookings daily.
+  At 2:15 PM, alerts start firing.
+  Timeline of events:
+  2:15 PM — Hotel availability service starts returning 503
+  - Error rate: 100% on hotel searches
+  - Impact: Users can't search hotels, but flights still work
+  - Root cause unknown
+  2:20 PM — The booking service starts timing out
+  - Users who already found hotels are trying to book
+  - Booking service calls hotel availability to verify → timeout
+  - Bookings in progress are hanging (payment already charged but
+    booking not confirmed)
+  2:25 PM — Payment service reports "orphaned charges"
+  - 47 customers were charged but bookings weren't created
+  - Refund system requires a booking ID to process refunds
+  - Manual refund takes 3-5 business days
+  2:30 PM — Customer support flooded
+  - "I was charged $500 but have no booking confirmation"
+  - "The search page is spinning forever"
+  - "I booked a flight but can't add a hotel"
+  - API consumers (partner sites) are getting 504s from your gateway
+  2:35 PM — You discover the hotel service's database ran out of
+  connections due to a connection leak in a deploy at 1:45 PM
+  Available actions:
+  - Roll back hotel service to previous version
+  - Enable circuit breaker to bypass hotel availability check
+  - Manually refund the 47 orphaned charges
+  - Return cached hotel data (stale by 2 hours)
+  - Put up a maintenance page
+  Task: Handle this incident from the API error handling perspective.
+  Write: the immediate error response changes for each endpoint
+  during the outage, the orphaned payment recovery plan, the
+  customer communication (API status page updates), the partner
+  API consumer communication, and the post-incident improvements
+  to prevent this error cascade.
+assertions:
+  - type: llm_judge
+    criteria: "Immediate response handles each service appropriately — hotel search returns 503 with Retry-After (or cached data with staleness indicator), booking endpoint returns clear error about hotel unavailability without charging, existing orphaned payments are identified and handled, and flight-only flows remain functional"
+    weight: 0.35
+    description: "Appropriate immediate response"
+  - type: llm_judge
+    criteria: "Orphaned payment recovery is thorough — identifies all 47 affected customers, has a plan to refund without booking IDs (use payment timestamps), communicates proactively to affected customers, and designs a reconciliation process to catch any missed orphans"
+    weight: 0.35
+    description: "Thorough payment recovery"
+  - type: llm_judge
+    criteria: "Post-incident improvements prevent cascade — circuit breakers to isolate hotel service failures, payment-before-booking flow redesigned (or compensating transactions), connection pool monitoring and alerting, and deploy-time health checks that would have caught the connection leak"
+    weight: 0.30
+    description: "Cascade prevention improvements"
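
The reconciliation step for orphaned charges can be illustrated with a small matching function. The record shapes (`id`, `status`, `captured_at`, `charge_id`) are hypothetical; the idea is only that charges captured during the incident window with no booking pointing at them are refund candidates, which sidesteps the missing booking IDs.

```python
from datetime import datetime


def find_orphaned_charges(charges: list[dict], bookings: list[dict],
                          start: datetime, end: datetime) -> list[dict]:
    """Return captured charges inside [start, end] that no booking references.

    Assumed shapes: a charge has 'id', 'status', and 'captured_at';
    a booking has 'charge_id'.
    """
    booked_charge_ids = {b["charge_id"] for b in bookings}
    return [
        c for c in charges
        if c["status"] == "captured"
        and start <= c["captured_at"] <= end
        and c["id"] not in booked_charge_ids
    ]
```

Running the same check on a schedule after the incident would catch any orphans beyond the 47 already reported.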

package/courses/rest-api-error-handling/scenarios/level-2/pagination-errors.yaml ADDED

@@ -0,0 +1,66 @@
+meta:
+  id: pagination-errors
+  level: 2
+  course: rest-api-error-handling
+  type: output
+  description: "Handle pagination errors — design error responses for cursor-based and offset-based pagination edge cases"
+  tags: [REST, API, pagination, cursor, offset, edge-cases, intermediate]
+state: {}
+trigger: |
+  Your social media API supports both offset-based and cursor-based
+  pagination. Users are hitting confusing edge cases:
+  Bug reports this sprint:
+  1. Offset out of range: GET /posts?offset=999999&limit=20
+     Returns: 200 { "data": [], "total": 500 }
+     User says: "Is this an error? Should I get 404? How do I know
+     I've gone past the end?"
+  2. Negative offset: GET /posts?offset=-5&limit=20
+     Returns: 500 Internal Server Error (database throws)
+  3. Limit too large: GET /posts?limit=10000
+     Returns: 200 (but takes 45 seconds and OOMs the server)
+  4. Invalid cursor: GET /posts?cursor=abc123invalid
+     Returns: 200 { "data": [] }
+     User says: "Is the cursor expired or invalid? Same response
+     as 'no more data'."
+  5. Expired cursor: GET /posts?cursor=eyJ0...  (valid format but
+     references a deleted post)
+     Returns: 500 with database error
+  6. Both offset and cursor provided:
+     GET /posts?offset=10&cursor=eyJ0...
+     Returns: uses offset, ignores cursor (confusing)
+  7. Data changes during pagination: User fetches page 1, new posts
+     are added, page 2 has duplicates from page 1
+  8. Zero limit: GET /posts?limit=0
+     Returns: 200 { "data": [], "total": 500 }
+     (Wastes a database query)
+  Task: Design the error handling for all 8 pagination edge cases.
+  For each, decide: is it an error or expected behavior? What status
+  code? What response? Then write pagination error handling guidelines
+  that cover both offset-based and cursor-based approaches, including
+  the response format that helps clients paginate correctly.
+assertions:
+  - type: llm_judge
+    criteria: "Each of the 8 edge cases has a clear, reasoned response — 400 for invalid parameters (negative offset, zero limit, conflicting params), enforced max limit to prevent OOM, distinct responses for invalid vs expired cursors (not the same as 'no more data'), and offset-out-of-range returns empty array with metadata showing total"
+    weight: 0.35
+    description: "All 8 edge cases handled"
+  - type: llm_judge
+    criteria: "Pagination response format prevents client confusion — includes total count, has_more flag, next_cursor or next_offset, and clear indication when the client has reached the end of results. The format makes it impossible to confuse 'no more data' with 'invalid request'"
+    weight: 0.35
+    description: "Clear pagination format"
+  - type: llm_judge
+    criteria: "Guidelines cover both pagination styles — explains trade-offs between offset and cursor pagination, how each handles data consistency (the duplicate data problem), and recommends cursor-based for feeds/timelines and offset-based for stable datasets with known totals"
+    weight: 0.30
+    description: "Comprehensive pagination guidelines"
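
Several of the edge cases above (negative offset, oversized or zero limit, offset combined with cursor) come down to validating query parameters before touching the database. A minimal sketch follows; the cap of 100, the default limit of 20, the error codes, and the function name are assumptions.

```python
MAX_LIMIT = 100     # assumed cap to prevent the OOM case
DEFAULT_LIMIT = 20  # assumed default


def parse_pagination(params: dict):
    """Validate raw query parameters.

    Returns (parsed, problems): `parsed` is None when `problems` is non-empty,
    in which case the caller would respond with 400 and the problem list.
    """
    problems = []

    if "offset" in params and "cursor" in params:
        problems.append({"field": "cursor", "code": "conflicting_parameters",
                         "message": "Provide either 'offset' or 'cursor', not both."})

    def parse_int(name, default, minimum, maximum=None):
        try:
            value = int(params.get(name, default))
        except (TypeError, ValueError):
            problems.append({"field": name, "code": "not_an_integer",
                             "message": f"'{name}' must be an integer."})
            return default
        if value < minimum or (maximum is not None and value > maximum):
            bound = (f"between {minimum} and {maximum}" if maximum is not None
                     else f"at least {minimum}")
            problems.append({"field": name, "code": "out_of_range",
                             "message": f"'{name}' must be {bound}."})
        return value

    limit = parse_int("limit", DEFAULT_LIMIT, 1, MAX_LIMIT)
    offset = parse_int("offset", 0, 0)

    if problems:
        return None, problems
    return {"limit": limit, "offset": offset, "cursor": params.get("cursor")}, []
```

An offset past the end of the data is deliberately not a problem here: per the assertions it stays a 200 with an empty array, and the response metadata (total, `has_more`) tells the client it has reached the end.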

package/courses/rest-api-error-handling/scenarios/level-2/retry-and-idempotency.yaml ADDED

@@ -0,0 +1,60 @@
+meta:
+  id: retry-and-idempotency
+  level: 2
+  course: rest-api-error-handling
+  type: output
+  description: "Design retry and idempotency patterns — build resilient API error recovery without duplicate side effects"
+  tags: [REST, API, retry, idempotency, resilience, intermediate]
+state: {}
+trigger: |
+  Your payment processing API has a critical bug: when the network
+  drops between your server and the payment processor, the client
+  retries and customers get charged twice. This has happened 47 times
+  in the last month, costing $12,000 in refunds and significant
+  customer trust.
+  The flow that's failing:
+  1. Client: POST /payments { amount: 100, card: "..." }
+  2. Server: calls Stripe to charge $100 → succeeds
+  3. Server: tries to respond 200 to client → network timeout
+  4. Client: sees timeout, retries POST /payments { same data }
+  5. Server: calls Stripe again → charges another $100
+  6. Customer charged $200 instead of $100
+  Your API has these endpoints that need retry safety:
+  - POST /payments — charge a customer
+  - POST /orders — create an order (generates order number)
+  - POST /refunds — issue a refund
+  - PUT /orders/:id/status — update order status
+  - POST /notifications/send — send push notification
+  - DELETE /orders/:id — cancel an order
+  Questions from the team:
+  1. "Which endpoints need idempotency keys and which are naturally
+     idempotent?"
+  2. "Where do we store the idempotency keys? Redis? Database?"
+  3. "How long do we keep them? Forever?"
+  4. "What if two concurrent requests have the same idempotency key?"
+  5. "What status codes should trigger a client retry vs not?"
+  Task: Design the idempotency and retry system. Write: which
+  endpoints need idempotency keys (with reasoning), the idempotency
+  key implementation (storage, TTL, concurrency handling), the client
+  retry policy (which status codes to retry, backoff strategy), and
+  the API error responses that communicate retry safety to clients.
+assertions:
+  - type: llm_judge
+    criteria: "Idempotency analysis is correct: POST /payments, POST /orders, and POST /refunds need idempotency keys, while PUT and DELETE are naturally idempotent. Explains why GET is safe, why PUT and DELETE are idempotent, and why POST is dangerous for retries"
+    weight: 0.35
+    description: "Correct idempotency analysis"
+  - type: llm_judge
+    criteria: "Implementation handles edge cases — addresses concurrent duplicate requests (locking), idempotency key TTL (not forever, with rationale), storage choice (Redis vs DB with trade-offs), what to return when a duplicate request arrives (cached response vs 409), and how to handle in-flight requests"
+    weight: 0.35
+    description: "Edge case handling"
+  - type: llm_judge
+    criteria: "Client retry policy is specific — lists retryable status codes (503, 429, 408, network errors) and non-retryable codes (400, 401, 403, 404, 409, 422), includes exponential backoff with jitter, sets max retry count, and the error responses include Retry-After headers where appropriate"
+    weight: 0.30
+    description: "Specific client retry policy"
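
The two halves of this scenario, replaying a stored response for a repeated idempotency key and backing off between client retries, can be sketched as follows. The in-memory store stands in for Redis or a database table and has no TTL or locking, so it does not address concurrent duplicates; all names and default values are assumptions.

```python
import random


class IdempotencyStore:
    """In-memory stand-in for Redis or a database table (no TTL, no locking)."""

    def __init__(self):
        self._responses = {}

    def run_once(self, key: str, handler):
        """Run `handler` the first time `key` is seen; replay the stored response afterwards.

        If the handler raises, nothing is stored, so a later retry runs it again.
        """
        if key in self._responses:
            return self._responses[key]
        response = handler()
        self._responses[key] = response
        return response


# Status codes the scenario's assertions list as retryable.
RETRYABLE_STATUS = {408, 429, 503}


def backoff_delays(max_retries=4, base=0.5, cap=8.0, rng=random.random):
    """Exponential backoff with full jitter.

    Delay i is a random fraction of min(cap, base * 2**i) seconds.
    """
    return [rng() * min(cap, base * 2 ** i) for i in range(max_retries)]
```

In the double-charge flow above, the client would send the same key on the retry in step 4, and `run_once` would return the stored result of step 2 without calling the payment processor a second time.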