npm - dojo.md - Versions diffs - 0.1.0 → 0.2.0 - Mend

dojo.md 0.1.0 → 0.2.0


Files changed (243)

package/courses/rest-api-error-handling/scenarios/level-3/error-correlation.yaml ADDED

@@ -0,0 +1,58 @@
+meta:
+  id: error-correlation
+  level: 3
+  course: rest-api-error-handling
+  type: output
+  description: "Build error correlation across services — connect related errors across distributed systems to identify root causes"
+  tags: [REST, API, correlation, distributed-tracing, root-cause, advanced]
+state: {}
+trigger: |
+  Your platform has 25 microservices generating 50,000 errors per
+  day. Most are correlated (one root cause triggers errors across
+  many services), but your current monitoring treats each error as
+  independent. This creates noise and delays root cause analysis.
+  Recent incident example:
+  - 3:00 PM: Database connection pool exhausted on User Service
+  - 3:01 PM: 500 errors spike on Order Service (calls User Service)
+  - 3:01 PM: 500 errors spike on Cart Service (calls User Service)
+  - 3:02 PM: 500 errors spike on Payment Service (calls Order Service)
+  - 3:02 PM: Timeout errors on API Gateway (all downstream failing)
+  - 3:03 PM: 50 PagerDuty alerts fire simultaneously
+  - 3:15 PM: On-call engineer still investigating which service is
+    the root cause (15 minutes wasted)
+  The 50 alerts were all caused by ONE root cause (DB connection pool).
+  But the engineer had to manually trace through 5 service dashboards
+  to figure that out.
+  Your error data sources:
+  - Application logs (JSON, in Elasticsearch)
+  - Distributed traces (Jaeger, using OpenTelemetry)
+  - Metrics (Prometheus, Grafana)
+  - Error tracking (Sentry, per-service)
+  - Kubernetes events
+  - Cloud provider health (AWS CloudWatch)
+  Task: Design the error correlation system. Write: the correlation
+  algorithm (how to group related errors across services), the root
+  cause identification logic (dependency graph + error timing), the
+  unified error view (single dashboard showing correlated errors),
+  the alert deduplication strategy (50 alerts → 1 incident), and
+  the automated root cause suggestion system.
+assertions:
+  - type: llm_judge
+    criteria: "Correlation algorithm is sound — uses trace IDs to link errors across services, uses temporal proximity and dependency graphs to group errors without trace IDs, and correctly handles the case where one root cause generates errors in 5 dependent services"
+    weight: 0.35
+    description: "Sound correlation algorithm"
+  - type: llm_judge
+    criteria: "Root cause identification is effective — leverages the service dependency graph to identify upstream causes, uses error timing (first service to error is likely the cause), and the automated suggestion system would have identified the DB connection pool issue in minutes rather than 15+"
+    weight: 0.35
+    description: "Effective root cause identification"
+  - type: llm_judge
+    criteria: "Alert deduplication is practical — groups related alerts into a single incident, identifies the root service, suppresses downstream noise, and the unified dashboard shows the error cascade visually with the root cause highlighted"
+    weight: 0.30
+    description: "Practical alert deduplication"
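
The root-cause logic this scenario asks for (dependency graph plus error timing) can be sketched as follows. This is a minimal illustration, not the package's reference answer: the `ErrorEvent` type, the `CALLS` graph, the service names, and the 5-minute window are all assumptions chosen to mirror the incident timeline in the trigger. It covers only the case where trace IDs are unavailable.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ErrorEvent:
    service: str
    minute: int  # minutes after 3:00 PM in the example incident


# Hypothetical dependency graph: caller -> services it calls.
CALLS: Dict[str, List[str]] = {
    "api-gateway": ["order", "cart", "payment"],
    "payment": ["order"],
    "order": ["user"],
    "cart": ["user"],
    "user": [],
}


def correlate(events, calls, window_minutes=5):
    """Group errors that start close together and pick the likely root service."""
    if not events:
        return None
    start = min(e.minute for e in events)
    # Temporal proximity: keep only errors inside the window after the first one.
    first_seen = {}
    for e in events:
        if e.minute - start <= window_minutes:
            first_seen[e.service] = min(e.minute, first_seen.get(e.service, e.minute))
    failing = set(first_seen)
    # A root candidate is a failing service none of whose dependencies are failing.
    candidates = [
        s for s in failing
        if not any(dep in failing for dep in calls.get(s, []))
    ]
    if not candidates:  # e.g. a dependency cycle; fall back to timing alone
        candidates = list(failing)
    # Tie-break on error timing: the earliest failing candidate wins.
    root = min(candidates, key=lambda s: (first_seen[s], s))
    return {
        "root": root,
        "started_minute": first_seen[root],
        "suppressed": sorted(failing - {root}),
    }
```

Feeding in the five events from the timeline yields a single incident rooted at the user service, with the other four services listed as suppressed downstream noise.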

package/courses/rest-api-error-handling/scenarios/level-3/graphql-vs-rest-errors.yaml ADDED

@@ -0,0 +1,73 @@
+meta:
+  id: graphql-vs-rest-errors
+  level: 3
+  course: rest-api-error-handling
+  type: output
+  description: "Compare GraphQL and REST error handling — design error strategies for APIs that serve both protocols"
+  tags: [REST, API, GraphQL, error-comparison, dual-protocol, advanced]
+state: {}
+trigger: |
+  Your company exposes the same backend through both REST and GraphQL
+  APIs. The error handling is inconsistent between them, and teams
+  building both interfaces need a unified strategy.
+  Current inconsistencies:
+  1. Auth errors:
+     REST: 401 { "error": "Invalid token" }
+     GraphQL: 200 { "data": null, "errors": [{ "message": "Invalid
+     token" }] }
+     Problem: GraphQL always returns 200, making monitoring tools
+     think everything is fine.
+  2. Validation errors:
+     REST: 400 { "errors": [{ "field": "email", "message": "..." }] }
+     GraphQL: 200 { "data": null, "errors": [{ "message": "...",
+     "extensions": { "field": "email" } }] }
+     Problem: Different shapes for the same validation failure.
+  3. Partial failures:
+     REST aggregated endpoint: returns 500 if any sub-call fails
+     GraphQL: returns partial data with errors for failed fields
+     Problem: GraphQL handles partial failure better, but how to
+     achieve similar in REST?
+  4. Rate limiting:
+     REST: 429 with X-RateLimit headers
+     GraphQL: all queries hit one endpoint — per-field or per-query
+     rate limiting?
+     Problem: A single GraphQL query can be equivalent to 50 REST
+     requests.
+  5. Error codes:
+     REST: well-defined (HTTP status codes)
+     GraphQL: no standard error code system (just "message" strings)
+     Problem: Clients can't programmatically handle GraphQL errors.
+  6. Deprecation:
+     REST: version URL + Sunset header
+     GraphQL: field-level @deprecated directive
+     Problem: How to communicate the same deprecation through both?
+  Task: Design the unified error handling strategy for dual-protocol
+  APIs. Write: the error mapping between REST and GraphQL for each
+  scenario, the GraphQL error extension standard (to match REST's
+  machine-readability), the monitoring strategy that works for both
+  (catching errors when GraphQL always returns 200), and the shared
+  error catalog that both protocols use.
+assertions:
+  - type: llm_judge
+    criteria: "Error mapping is thorough — maps each REST error pattern (status code + body) to its GraphQL equivalent (200 + errors array), handles the 200-problem for monitoring (custom error extensions with severity/code), and addresses partial failure handling for both protocols"
+    weight: 0.35
+    description: "Thorough error mapping"
+  - type: llm_judge
+    criteria: "GraphQL error extensions are well-designed — adds machine-readable codes, severity levels, retry guidance, and field-level error attribution. The extensions make GraphQL errors as actionable as REST errors for programmatic handling. Addresses rate limiting for complex queries (query cost analysis)"
+    weight: 0.35
+    description: "Well-designed GraphQL extensions"
+  - type: llm_judge
+    criteria: "Shared error catalog is practical — defines errors once (shared between protocols), translates automatically to REST format or GraphQL format, and the monitoring strategy correctly identifies errors in both protocols (using error extensions, not HTTP status codes, for GraphQL)"
+    weight: 0.30
+    description: "Practical shared catalog"
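
One way to picture the shared error catalog from this scenario is a single table of error definitions with two renderers. The sketch below is an assumption-laden illustration: the catalog codes, the `httpStatus` and `retryable` extension keys, and the function names are hypothetical and do not come from the package. Only `message`, `path`, and `extensions` are standard GraphQL error fields.

```python
# Hypothetical catalog: each error is defined once and shared by both protocols.
CATALOG = {
    "AUTH_INVALID_TOKEN": {
        "http_status": 401, "message": "Invalid token", "retryable": False,
    },
    "VALIDATION_FAILED": {
        "http_status": 400, "message": "Validation failed", "retryable": False,
    },
}


def to_rest(code, field=None):
    """Render a catalog entry as (HTTP status, JSON body) for the REST API."""
    entry = CATALOG[code]
    error = {"code": code, "message": entry["message"]}
    if field is not None:
        error["field"] = field
    return entry["http_status"], {"errors": [error]}


def to_graphql(code, field=None, path=None):
    """Render the same entry as a GraphQL response; the transport status stays 200."""
    entry = CATALOG[code]
    extensions = {
        "code": code,
        "httpStatus": entry["http_status"],
        "retryable": entry["retryable"],
    }
    if field is not None:
        extensions["field"] = field
    error = {"message": entry["message"], "extensions": extensions}
    if path is not None:
        error["path"] = path
    return 200, {"data": None, "errors": [error]}


def is_failed_graphql_response(body):
    """Monitoring check that inspects the body instead of the HTTP status."""
    return bool(body.get("errors"))
```

Because both renderers read the same entry, a client can branch on `code` regardless of protocol, and monitoring can count GraphQL failures from the `errors` array even though the status is 200.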

package/courses/rest-api-error-handling/scenarios/level-4/compliance-error-handling.yaml ADDED

@@ -0,0 +1,65 @@
+meta:
+  id: compliance-error-handling
+  level: 4
+  course: rest-api-error-handling
+  type: output
+  description: "Design compliance-aware error handling — satisfy PCI DSS, HIPAA, GDPR, and SOC 2 requirements for API error responses and logging"
+  tags: [REST, API, compliance, PCI-DSS, HIPAA, GDPR, SOC-2, expert]
+state: {}
+trigger: |
+  Your healthcare fintech company processes both medical records
+  (HIPAA) and payment data (PCI DSS), serves European users (GDPR),
+  and undergoes annual SOC 2 audits. A compliance audit found 14
+  critical findings related to API error handling.
+  Critical findings:
+  PCI DSS violations:
+  1. API error logs contain full credit card numbers (not just last 4)
+  2. Error responses include cardholder name in validation messages
+  3. Debug mode was left on in production, exposing internal error
+     details including payment processor credentials
+  4. No audit trail for failed payment attempts
+  HIPAA violations:
+  5. API 500 errors return patient medical record IDs in stack traces
+  6. Error logs are stored unencrypted and accessible to all engineers
+  7. Patient-identifiable data appears in error correlation IDs
+     (using patient email as request ID)
+  8. No access logging for PHI-containing API endpoints
+  GDPR violations:
+  9. Error messages include user email addresses
+  10. Error logs retain personal data beyond the consent period
+  11. No mechanism to purge error logs when a user exercises right to
+      erasure (Article 17)
+  12. Cross-border error log storage (EU user data in US logs)
+  SOC 2 violations:
+  13. No evidence of error monitoring and response procedures
+  14. Incomplete audit trail — gaps in error logging during an outage
+  Remediation deadline: 90 days
+  Budget: $200K
+  Task: Design the compliance remediation plan for all 14 findings.
+  Write: the error response sanitization rules per regulation, the
+  error log data classification and handling policy, the audit trail
+  design that satisfies all 4 frameworks, the data residency solution
+  for error logs, and the ongoing compliance monitoring system.
+assertions:
+  - type: llm_judge
+    criteria: "All 14 findings are remediated — PCI DSS: card data masked/removed from logs and responses, debug mode disabled, payment audit trails added. HIPAA: PHI removed from error outputs, logs encrypted and access-controlled, proper request IDs. GDPR: PII removed from errors, log retention aligned with consent, erasure mechanism built, data residency addressed. SOC 2: monitoring procedures documented, audit trail gaps prevented"
+    weight: 0.35
+    description: "All 14 findings remediated"
+  - type: llm_judge
+    criteria: "Unified compliance framework avoids conflicting rules — handles cases where regulations conflict (HIPAA requires audit trails but GDPR requires erasure), proposes a single error handling policy that satisfies all 4 frameworks simultaneously, and classifies data by sensitivity level"
+    weight: 0.35
+    description: "Unified compliance framework"
+  - type: llm_judge
+    criteria: "Implementation is feasible within 90 days and $200K — prioritizes critical findings, phases the work, and includes ongoing compliance monitoring (automated scans for PII/PHI in error logs, data residency checks, retention policy enforcement)"
+    weight: 0.30
+    description: "Feasible remediation plan"
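
Several of the findings above (1, 7, and 9) come down to sanitizing text before it reaches a log or a response. The following is a minimal pattern-based sketch, assuming simple regular expressions are acceptable as a first pass. The patterns, the `[redacted-email]` placeholder, and the function names are illustrative assumptions, and a real remediation would need far more than this (for example, structured logging and field-level classification).

```python
import re
import uuid

# Assumed patterns: 13 to 19 digits with optional space or dash separators,
# and a loose email matcher. Both are illustrative, not exhaustive.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def _mask_card(match):
    """Keep only the last 4 digits of a matched card number."""
    digits = re.sub(r"\D", "", match.group())
    return "*" * (len(digits) - 4) + digits[-4:]


def sanitize(message):
    """Mask card numbers and redact email addresses in a log or error message."""
    message = CARD_RE.sub(_mask_card, message)
    return EMAIL_RE.sub("[redacted-email]", message)


def new_request_id():
    """Random correlation ID, so no patient identifier is embedded in it."""
    return str(uuid.uuid4())
```

For instance, `sanitize("Card 4111 1111 1111 1111 declined for jane@example.com")` returns `"Card ************1111 declined for [redacted-email]"`.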

package/courses/rest-api-error-handling/scenarios/level-4/enterprise-error-governance.yaml ADDED

@@ -0,0 +1,62 @@
+meta:
+  id: enterprise-error-governance
+  level: 4
+  course: rest-api-error-handling
+  type: output
+  description: "Design enterprise API error governance — standardize error handling across a large organization with multiple API teams"
+  tags: [REST, API, governance, enterprise, standardization, expert]
+state: {}
+trigger: |
+  You're the VP of Platform Engineering at a company with 2,000
+  engineers across 40 API teams. An audit revealed catastrophic
+  inconsistency in error handling:
+  Audit findings:
+  - 40 teams use 23 different error response formats
+  - 15 teams expose internal stack traces in production
+  - 8 teams have no rate limiting
+  - Error codes are duplicated across teams (INVALID_INPUT means
+    different things in 12 services)
+  - No team logs errors the same way (impossible to build unified
+    dashboards)
+  - 6 teams return 200 with error bodies (breaking monitoring)
+  - Customer support can't debug cross-service issues (no correlation)
+  - API documentation lists errors inconsistently
+  Previous standardization attempts that failed:
+  - 2024: Published error handling guidelines (PDF) — 5% adoption
+  - 2024: Added error format to API style guide (Confluence) — 12%
+  - 2025: Mandatory review checklist item — reviewers don't check
+  - 2025: Error linting in CI — teams added bypass flags
+  The CEO is frustrated: "We've been trying to standardize for 2
+  years. Why can't 40 teams agree on error formats?"
+  Constraints:
+  - Can't break existing API consumers (backward compatibility)
+  - Teams have autonomy (can't mandate tools)
+  - Migration must be gradual (can't rewrite everything)
+  - Must show progress to the board quarterly
+  Task: Design the governance program that will actually succeed.
+  Write: the error handling standard (technical specification), the
+  adoption strategy (why previous attempts failed and how this one
+  differs), the enforcement mechanism (automated, not manual), the
+  migration path (per-team, backward-compatible), and the success
+  metrics (tracked quarterly for the board).
+assertions:
+  - type: llm_judge
+    criteria: "Governance strategy addresses why previous attempts failed — identifies that PDFs, wikis, and checklists don't work, proposes automated enforcement (API gateway validation, CI/CD error format checks without bypass), and uses incentives rather than mandates (making compliance easier than non-compliance)"
+    weight: 0.35
+    description: "Strategy that overcomes past failures"
+  - type: llm_judge
+    criteria: "Technical standard is comprehensive but adoptable — defines the error format, error code registry, logging standard, and correlation requirements, but provides libraries/middleware that teams can drop in rather than implement from scratch. Backward compatibility is preserved via content negotiation or dual-format period"
+    weight: 0.35
+    description: "Comprehensive adoptable standard"
+  - type: llm_judge
+    criteria: "Success metrics are board-ready — tracks adoption percentage, error format compliance rate, cross-service debugging time, support ticket resolution time, and incident detection latency. Quarterly milestones show concrete progress toward full standardization"
+    weight: 0.30
+    description: "Board-ready success metrics"

package/courses/rest-api-error-handling/scenarios/level-4/error-analytics-platform.yaml ADDED

@@ -0,0 +1,65 @@
+meta:
+  id: error-analytics-platform
+  level: 4
+  course: rest-api-error-handling
+  type: output
+  description: "Build an API error analytics platform — design the data architecture for error pattern detection, anomaly identification, and trend analysis"
+  tags: [REST, API, analytics, error-patterns, anomaly-detection, expert]
+state: {}
+trigger: |
+  Your API platform processes 500M requests/day across 200 services.
+  The current error handling infrastructure can tell you THAT errors
+  are happening but not WHY patterns form or WHAT to do about them.
+  The CEO's questions you can't answer today:
+  1. "Which errors cost us the most revenue?" (No link between errors
+     and business impact)
+  2. "Are errors getting better or worse over time?" (No trend data)
+  3. "Which teams need the most help with error handling?" (No
+     team-level aggregation)
+  4. "Can we predict outages before they happen?" (No anomaly
+     detection)
+  5. "What's our total cost of API errors?" (Engineering time + lost
+     revenue + support costs)
+  Data sources available:
+  - Application logs: 2TB/day (Elasticsearch, 30-day retention)
+  - Distributed traces: 500GB/day (Jaeger, 7-day retention)
+  - Metrics: Prometheus (90-day retention)
+  - Error tracking: Sentry (all services)
+  - Business metrics: Revenue per API call, conversion rates
+  - Support tickets: Zendesk (tagged by API endpoint)
+  - Incident records: PagerDuty
+  Desired capabilities:
+  - Error-to-revenue impact mapping
+  - Automatic error pattern classification
+  - Anomaly detection (detect novel error patterns)
+  - Team error budget scorecards
+  - Root cause suggestion engine
+  - Predictive alerting (predict errors before they spike)
+  - Error cost attribution (per team, per service, per endpoint)
+  Budget: $500K/year for the platform
+  Task: Design the error analytics platform. Write: the data
+  architecture (ingestion, storage, processing), the analytics models
+  (pattern classification, anomaly detection, prediction), the
+  business impact calculation model, the team scorecards design,
+  and the executive dashboard that answers the CEO's 5 questions.
+assertions:
+  - type: llm_judge
+    criteria: "Data architecture handles the scale — 2TB+/day ingestion with appropriate storage tiers (hot/warm/cold), connects errors to traces, metrics, business data, and support tickets. The pipeline is cost-effective within the $500K budget"
+    weight: 0.35
+    description: "Scalable data architecture"
+  - type: llm_judge
+    criteria: "Analytics models are practical — error pattern classification groups similar errors automatically, anomaly detection identifies novel patterns beyond threshold-based alerting, and the prediction model uses leading indicators (latency trends, error rate derivatives) to forecast incidents"
+    weight: 0.35
+    description: "Practical analytics models"
+  - type: llm_judge
+    criteria: "Executive dashboard answers all 5 CEO questions — maps errors to revenue impact (lost transactions, support costs), shows trends over time, provides team-level scorecards with error budgets, includes anomaly/prediction alerts, and calculates total cost of API errors"
+    weight: 0.30
+    description: "CEO-answering dashboard"
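
As a point of reference for the anomaly detection capability, the simplest baseline-relative check is a z-score against recent history. The sketch below is only that baseline, under the assumption that per-interval error counts are available. The function name and the default threshold of 3 standard deviations are illustrative choices, and a production model would likely need seasonality handling and pattern-level features on top.

```python
from statistics import mean, pstdev


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it deviates from the recent baseline by more than
    `threshold` population standard deviations."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return current != mu
    return abs(current - mu) / sigma > threshold
```

Unlike a fixed alert threshold, this adapts to each service's own normal error volume, which matters when 200 services have very different baselines.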

package/courses/rest-api-error-handling/scenarios/level-4/error-cost-optimization.yaml ADDED

@@ -0,0 +1,63 @@
+meta:
+  id: error-cost-optimization
+  level: 4
+  course: rest-api-error-handling
+  type: output
+  description: "Optimize API error costs — reduce the financial impact of errors through prevention, faster resolution, and smarter handling"
+  tags: [REST, API, cost-optimization, ROI, economics, expert]
+state: {}
+trigger: |
+  Your CFO asks: "How much do API errors actually cost us?" You
+  conduct an analysis and discover the total is $8.5M per year.
+  Cost breakdown:
+  Direct costs — $3.2M/year:
+  - Failed transactions: $1.8M (declined payments that were
+    actually valid — errors in fraud detection API)
+  - Duplicate charges requiring refunds: $400K (retry without
+    idempotency)
+  - SLA penalty credits: $600K (error rate SLA breaches)
+  - Compliance fines: $400K (PII in error logs, audit trail gaps)
+  Indirect costs — $5.3M/year:
+  - Engineering time debugging: $2.1M (800 engineer-hours/month
+    × $220/hour)
+  - Customer support for API errors: $1.2M (40% of Tier 2 tickets)
+  - Customer churn attributed to reliability: $1.5M (3 enterprise
+    customers cited error handling in exit interviews)
+  - Delayed feature delivery: $500K (30% of sprint capacity goes
+    to error-related work)
+  The CFO's challenge: "Reduce this by 50% in 12 months. Show me
+  the investment required and the projected ROI."
+  Available levers:
+  - Error prevention (better validation, testing, monitoring)
+  - Error detection speed (faster alerts, better correlation)
+  - Error resolution speed (runbooks, automated remediation)
+  - Error handling quality (better messages, retry logic)
+  - Error architecture (circuit breakers, graceful degradation)
+  Task: Design the error cost optimization program. Write: the
+  prioritized investment plan (highest ROI items first), the cost
+  model (how to measure error cost reduction over time), the
+  specific initiatives for each cost category with projected
+  savings, the implementation roadmap (quarterly milestones), and
+  the executive report template for tracking progress.
+assertions:
+  - type: llm_judge
+    criteria: "Investment plan prioritizes by ROI — identifies quick wins (idempotency for duplicate charges saves $400K with small investment), medium-term improvements (correlation and debugging tools save $2.1M in engineering time), and strategic initiatives (error architecture prevents churn). Each initiative has projected cost and savings"
+    weight: 0.35
+    description: "ROI-prioritized investment plan"
+  - type: llm_judge
+    criteria: "Cost model is measurable — defines how to track error cost reduction per category (failed transactions measured by payment success rate, engineering time measured by debugging hours, support costs measured by error ticket volume), and includes baseline measurements for before/after comparison"
+    weight: 0.35
+    description: "Measurable cost model"
+  - type: llm_judge
+    criteria: "Roadmap is realistic — phases the work quarterly over 12 months, identifies dependencies between initiatives, allocates team capacity realistically, and the projected 50% reduction ($4.25M) is justified by the specific initiative savings"
+    weight: 0.30
+    description: "Realistic 12-month roadmap"
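
The cost figures in this scenario can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the trigger. The dictionary keys and the `simple_roi` helper are naming assumptions added for illustration.

```python
# Annual costs in USD, taken from the scenario's cost breakdown.
DIRECT = {
    "failed_transactions": 1_800_000,
    "duplicate_charges": 400_000,
    "sla_penalty_credits": 600_000,
    "compliance_fines": 400_000,
}
INDIRECT = {
    "engineering_debugging": 2_100_000,
    "customer_support": 1_200_000,
    "customer_churn": 1_500_000,
    "delayed_delivery": 500_000,
}

direct_total = sum(DIRECT.values())        # 3_200_000
indirect_total = sum(INDIRECT.values())    # 5_300_000
total = direct_total + indirect_total      # 8_500_000
reduction_target = total // 2              # 4_250_000, the CFO's 50% goal

# Cross-check of the debugging line item: 800 hours/month at $220/hour.
debugging_from_hours = 800 * 220 * 12      # 2_112_000, rounded to $2.1M above


def simple_roi(annual_savings, investment):
    """Return first-year ROI as a ratio: (savings - investment) / investment."""
    return (annual_savings - investment) / investment
```

The subtotals match the stated $3.2M and $5.3M, and the 50% target matches the $4.25M figure in the third assertion.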

package/courses/rest-api-error-handling/scenarios/level-4/error-executive-communication.yaml ADDED

@@ -0,0 +1,60 @@
+meta:
+  id: error-executive-communication
+  level: 4
+  course: rest-api-error-handling
+  type: output
+  description: "Communicate API errors to executives — translate technical error data into business impact language for leadership"
+  tags: [REST, API, executive, communication, board, business-impact, expert]
+state: {}
+trigger: |
+  You're preparing for the quarterly board meeting. The board wants
+  to understand the company's API reliability story after a
+  high-profile incident last quarter where a 4-hour API outage
+  cost $2M in revenue and made the tech press.
+  Board members and their concerns:
+  - CEO: "Is our API platform a competitive advantage or liability?"
+  - CFO: "What's the ROI of our reliability investments?"
+  - CTO (board member): "Are we architecturally sound or is this
+    held together with duct tape?"
+  - Board member (former CISO): "Are error handling gaps creating
+    compliance or security risk?"
+  - Board member (investor representative): "How does our reliability
+    compare to competitors?"
+  Data you have:
+  - DORA metrics: deployment frequency 15/day, lead time 2 hours,
+    MTTR 45 minutes, change failure rate 8%
+  - Error rate trend: 0.5% → 0.3% over 6 months
+  - Revenue impact of errors: $8.5M/year (down from $12M)
+  - Uptime: 99.95% (target: 99.99%)
+  - Customer NPS for API: 42 (industry avg: 38)
+  - Reliability investment: $3M this year
+  - Competitor comparison: Stripe (99.999%), Twilio (99.95%),
+    Plaid (99.9%)
+  You need to present this in 15 minutes max, with 5 slides, in
+  language that non-technical board members understand.
+  Task: Write the board presentation. Include: the 5-slide structure
+  (with specific content per slide), the narrative arc (from problem
+  to progress to plan), how to translate DORA metrics and error
+  rates into business language, how to handle tough questions from
+  each board member, and the specific asks from the board (budget,
+  support, decisions).
+assertions:
+  - type: llm_judge
+    criteria: "Board presentation translates technical data to business impact — error rates become revenue impact dollars, MTTR becomes customer experience minutes, DORA metrics become competitive positioning. Non-technical board members can understand every slide without technical background"
+    weight: 0.35
+    description: "Business-language translation"
+  - type: llm_judge
+    criteria: "Narrative arc is compelling — acknowledges the incident honestly, shows concrete progress (error cost down from $12M to $8.5M), presents a clear plan to reach the target, and positions reliability as competitive advantage (not just risk mitigation). The 5-slide structure is tight and focused"
+    weight: 0.35
+    description: "Compelling narrative arc"
+  - type: llm_judge
+    criteria: "Handles each board member's concerns — has prepared answers for the CEO (competitive positioning), CFO (ROI of $3M investment), CTO (architectural soundness), CISO (compliance risk), and investor (competitor comparison). Includes specific asks (budget, headcount, decisions)"
+    weight: 0.30
+    description: "Addresses all board concerns"
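
One translation the scenario asks for, from uptime percentages to business language, is plain arithmetic: an uptime figure implies a yearly downtime allowance. A small sketch follows, assuming a 365-day year. The function name is an illustrative choice.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def downtime_minutes_per_year(uptime_percent):
    """Convert an uptime percentage into allowed downtime minutes per year."""
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100
```

Under that assumption, the current 99.95% corresponds to roughly 263 minutes (about 4.4 hours) of downtime per year, while the 99.99% target allows roughly 53 minutes. Framed this way, the single 4-hour outage consumed nearly all of a 99.95% year's allowance.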

package/courses/rest-api-error-handling/scenarios/level-4/error-handling-architecture.yaml ADDED

@@ -0,0 +1,67 @@
+meta:
+  id: error-handling-architecture
+  level: 4
+  course: rest-api-error-handling
+  type: output
+  description: "Architect enterprise error handling — design the 4-layer error handling architecture for a Fortune 500 API platform"
+  tags: [REST, API, architecture, enterprise, Fortune-500, layers, expert]
+state: {}
+trigger: |
+  You're the Chief Architect designing the error handling
+  architecture for a Fortune 500 company's API platform. The platform
+  serves 1B requests/day across 300 microservices, with $500M in
+  annual transaction volume flowing through the APIs.
+  The architecture must support:
+  - 300 microservices across 60 teams
+  - 5 compliance frameworks (PCI DSS, HIPAA, SOC 2, GDPR, FedRAMP)
+  - 10,000 API consumers (external)
+  - 99.99% availability SLA
+  - Multi-region (US-East, US-West, EU-West, AP-Southeast)
+  - Multi-cloud (AWS primary, GCP secondary)
+  Design the 4-layer error handling architecture:
+  Layer 1 — Error Generation:
+  How errors are created, classified, and enriched at the service
+  level. Includes error types, severity, retryability, and context.
+  Layer 2 — Error Propagation:
+  How errors flow through service chains, API gateways, and load
+  balancers. Includes translation, aggregation, and enrichment.
+  Layer 3 — Error Observation:
+  How errors are logged, traced, alerted, and analyzed. Includes
+  the data pipeline, storage, and analytics.
+  Layer 4 — Error Communication:
+  How errors are presented to different audiences (API consumers,
+  developers, ops, executives, regulators). Includes formatting,
+  redaction, and documentation.
+  Cross-cutting concerns:
+  - Multi-region consistency (same error in US and EU)
+  - Compliance (data never crosses wrong borders)
+  - Performance (error handling adds <1ms latency)
+  - Cost (within $2M/year budget for the platform)
+  Task: Design all 4 layers with their interactions. For each layer,
+  write: the detailed design, the technology choices, the failure
+  modes (what happens when the error handling itself fails), and the
+  interfaces between layers.
+assertions:
+  - type: llm_judge
+    criteria: "4-layer design is coherent — layers build on each other, interfaces are well-defined, and each layer handles the scale (1B requests/day, 300 services). Error generation is standardized via shared libraries, propagation handles multi-hop chains, observation handles the data volume, and communication tailors output by audience"
+    weight: 0.35
+    description: "Coherent 4-layer design"
+  - type: llm_judge
+    criteria: "Cross-cutting concerns are addressed — multi-region error consistency (same format, same codes globally), compliance-aware error handling (data residency for logs, PII redaction), performance budget (<1ms added latency), and the error handling system itself has failure modes and fallbacks"
+    weight: 0.35
+    description: "Cross-cutting concerns addressed"
+  - type: llm_judge
+    criteria: "Technology choices are justified — selects specific tools for each layer (error libraries, tracing systems, log platforms, API gateway config) with build-vs-buy rationale, and the total cost fits within the $2M/year budget"
+    weight: 0.30
+    description: "Justified technology choices"
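
To make Layer 1 (generation) and Layer 4 (communication) concrete, the sketch below shows one possible error envelope that carries severity, retryability, and context, and renders a redacted view for API consumers. Every name here (`ApiError`, `Severity`, `for_consumer`, `for_ops`, the field names) is a hypothetical design choice, not something the scenario prescribes.

```python
import uuid
from dataclasses import asdict, dataclass, field
from enum import Enum


class Severity(str, Enum):
    LOW = "low"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class ApiError:
    code: str
    message: str
    severity: Severity
    retryable: bool
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Internal details (hosts, stack traces) that must never leave the platform.
    internal_context: dict = field(default_factory=dict)

    def for_consumer(self):
        """Redacted view for external API consumers."""
        return {
            "code": self.code,
            "message": self.message,
            "retryable": self.retryable,
            "correlation_id": self.correlation_id,
        }

    def for_ops(self):
        """Full view for logs and dashboards, including internal context."""
        data = asdict(self)
        data["severity"] = self.severity.value
        return data
```

Keeping redaction in the envelope itself, rather than in each service, is one way to get the same consumer-facing shape in every region.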

package/courses/rest-api-error-handling/scenarios/level-4/error-org-design.yaml ADDED

@@ -0,0 +1,68 @@
+meta:
+  id: error-org-design
+  level: 4
+  course: rest-api-error-handling
+  type: output
+  description: "Design the API reliability organization — structure teams, roles, and processes for enterprise-scale error management"
+  tags: [REST, API, org-design, SRE, platform-engineering, expert]
+state: {}
+trigger: |
+  Your company has grown from 200 to 2,000 engineers in 3 years.
+  API error handling was manageable when everyone knew each other,
+  but now it's chaos. The VP of Engineering asks you to design the
+  organizational structure for API reliability.
+  Current state:
+  - No dedicated reliability team (every team does their own thing)
+  - On-call rotation is dreaded (engineers lack error handling skills)
+  - Incident response is ad-hoc (whoever's online)
+  - Error handling standards exist but nobody enforces them
+  - Knowledge is siloed (team A's error patterns are invisible to B)
+  - New hires take 3 months before they can diagnose API errors
+  Options to consider:
+  Option A — Centralized SRE team:
+  One 20-person SRE team owns all API reliability.
+  Pro: Consistency, expertise concentration
+  Con: Bottleneck, teams don't learn, "throw it over the wall"
+  Option B — Embedded SREs:
+  Each of the 40 teams gets 0.5 SRE (20 SREs distributed).
+  Pro: Close to product teams, context-aware
+  Con: Inconsistency, isolation, career path unclear
+  Option C — Platform + embedded hybrid:
+  Central platform team (10) builds tools and standards.
+  Embedded reliability champions (1 per team, 40) enforce and adapt.
+  Pro: Best of both worlds
+  Con: Complex coordination, champion role ambiguity
+  Option D — Full team ownership (you build it, you run it):
+  No dedicated SREs. Every team owns their error handling end-to-end.
+  Pro: Full ownership, fast iteration
+  Con: Inconsistency, no expertise concentration, on-call burden
+  Budget: 20 headcount for reliability roles
+  Task: Recommend and design the organizational model. Write: the
+  team structure (with specific roles and responsibilities), the
+  interaction model between platform and product teams, the career
+  path for reliability engineers, the on-call rotation design, and
+  the knowledge sharing system that prevents siloing.
+assertions:
+  - type: llm_judge
+    criteria: "Organizational model is well-reasoned — analyzes all 4 options with trade-offs, recommends one (likely the hybrid) with clear justification, and the 20 headcount is allocated across roles with specific responsibilities. Addresses why pure centralized and pure distributed models fail at this scale"
+    weight: 0.35
+    description: "Well-reasoned org model"
+  - type: llm_judge
+    criteria: "Interaction model is practical — defines how the platform team and product teams collaborate on error handling (shared standards, tooling, escalation paths), how reliability champions bridge the gap, and how knowledge flows between teams. The on-call rotation is sustainable (not just dumping on SREs)"
+    weight: 0.35
+    description: "Practical interaction model"
+  - type: llm_judge
+    criteria: "Career path and knowledge sharing are addressed — reliability engineers have growth opportunities (IC and management tracks), the champion role has recognition and development, and the knowledge sharing system captures error patterns, runbooks, and incident learnings across teams"
+    weight: 0.30
+    description: "Career path and knowledge sharing"

package/courses/rest-api-error-handling/scenarios/level-4/error-sla-design.yaml ADDED

@@ -0,0 +1,65 @@
+meta:
+  id: error-sla-design
+  level: 4
+  course: rest-api-error-handling
+  type: output
+  description: "Design API error SLAs — create enforceable service level agreements for error rates, response times, and error resolution"
+  tags: [REST, API, SLA, contracts, reliability, penalties, expert]
+state: {}
+trigger: |
+  Your API platform serves 500 enterprise customers. Three major
+  customers are threatening to leave because of error-related
+  issues, and they're demanding formal SLAs.
+  Customer A (payments processor, $5M ARR):
+  "We need guarantees on error rates. Last month your payment
+  endpoint had a 2% error rate for 4 hours. That cost us $200K
+  in failed transactions. We want financial penalties if errors
+  exceed thresholds."
+  Customer B (healthcare platform, $3M ARR):
+  "When errors occur, your error messages don't help us debug.
+  We need SLAs on error message quality — every error must include
+  an actionable message and a correlation ID we can reference with
+  your support team."
+  Customer C (e-commerce aggregator, $2M ARR):
+  "Your error response times are inconsistent. Sometimes errors
+  return in 50ms, sometimes in 30 seconds (timeout). We need
+  SLAs on error response latency so our circuit breakers work
+  predictably."
+  Your sales team wants to create three SLA tiers:
+  - Platinum ($50K+/mo): strictest SLAs, dedicated support
+  - Gold ($10K+/mo): standard SLAs, priority support
+  - Silver (<$10K/mo): best-effort, standard support
+  Questions from legal:
+  1. "How do we measure error rate? Does a client-caused 400 count?"
+  2. "What are appropriate financial penalties? Credits or cash?"
+  3. "How do we handle force majeure (upstream provider outages)?"
+  4. "How do we prevent SLA gaming (customers triggering errors to
+     claim credits)?"
+  Task: Design the complete SLA framework. Write: the SLA tiers
+  with specific metrics and thresholds for each customer concern
+  (error rate, error quality, error response time), the measurement
+  methodology (how to calculate without disputes), the penalty
+  structure (credits, escalation), the exclusions (what doesn't
+  count), and the internal engineering requirements to meet the SLAs.
+assertions:
+  - type: llm_judge
+    criteria: "SLA metrics are precisely defined — error rate excludes client errors (4xx) and counts only server errors (5xx) and timeouts, error quality is measurable (correlation ID present, actionable message, correct status code), and error response latency has clear P99 targets per tier. All three customer concerns are addressed"
+    weight: 0.35
+    description: "Precisely defined SLA metrics"
+  - type: llm_judge
+    criteria: "Penalty structure is fair and enforceable — service credits are proportional to impact, excludes force majeure and client-caused errors, has anti-gaming provisions, and the measurement methodology prevents disputes (shared monitoring dashboard, agreed-upon measurement points)"
+    weight: 0.35
+    description: "Fair enforceable penalties"
+  - type: llm_judge
+    criteria: "Internal engineering requirements are realistic — identifies what the platform team must build to meet the SLAs (per-customer error tracking, error quality validation, guaranteed timeout behavior), and the SLA tiers are achievable without over-committing"
+    weight: 0.30
+    description: "Realistic engineering requirements"
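
The measurement rule in the first assertion (count 5xx responses and timeouts, exclude client-caused 4xx) can be written down precisely. The sketch below is one possible reading. It assumes 4xx responses stay in the denominator as served requests, which is a design choice the SLA would need to state explicitly, and the credit schedule values are hypothetical placeholders rather than recommended terms.

```python
def server_error_rate(status_counts, timeouts=0):
    """Share of requests that failed server-side (5xx or timeout).

    `status_counts` maps HTTP status code -> request count. 4xx responses
    are never counted as failures but remain in the denominator.
    """
    total = sum(status_counts.values()) + timeouts
    if total == 0:
        return 0.0
    failures = sum(n for status, n in status_counts.items() if 500 <= status <= 599)
    return (failures + timeouts) / total


def service_credit_percent(error_rate, schedule):
    """Return the credit for the highest threshold the error rate exceeds.

    `schedule` is a list of (error_rate_threshold, credit_percent) pairs.
    """
    credit = 0
    for threshold, percent in sorted(schedule):
        if error_rate > threshold:
            credit = percent
    return credit


# Hypothetical schedule: above 0.1% -> 5% credit, above 1% -> 10%, above 5% -> 25%.
EXAMPLE_SCHEDULE = [(0.001, 5), (0.01, 10), (0.05, 25)]
```

With 9,700 successes, 100 client errors, and 200 server errors, the rate is 0.02, the same 2% Customer A reported, and the example schedule would yield a 10% credit.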