dojo.md 0.2.0 → 0.2.1

Files changed (222)

package/courses/aws-lambda-debugging/scenarios/level-3/kinesis-stream-processing.yaml ADDED

@@ -0,0 +1,64 @@
+meta:
+  id: kinesis-stream-processing
+  level: 3
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Lambda Kinesis stream processing — diagnose shard iterator issues, throughput limits, and enhanced fan-out for high-volume stream consumers"
+  tags: [AWS, Lambda, Kinesis, streams, shards, fan-out, advanced]
+state: {}
+trigger: |
+  Your Lambda function processes a Kinesis data stream for real-time
+  analytics. Suddenly, processing lag increases from seconds to hours:
+  CloudWatch Metric — IteratorAge:
+  10:00 - 500ms (normal)
+  11:00 - 30,000ms (30 seconds behind)
+  12:00 - 3,600,000ms (1 hour behind!)
+  Investigation:
+  1. The stream has 4 shards but traffic doubled. Each shard supports
+     1MB/s or 1,000 records/s input. At 2,000 records/s, shards are
+     at capacity. Records back up.
+     Fix: increase shard count (UpdateShardCount) or use on-demand mode.
+  2. Lambda processes one batch per shard concurrently. With 4 shards,
+     maximum 4 concurrent Lambda invocations. Each takes 2 seconds.
+     Throughput: 4 batches × 100 records / 2 seconds = 200 records/s.
+     But 2,000 records/s are arriving!
+     Fix: increase parallelization factor (up to 10 per shard):
+     $ aws lambda update-event-source-mapping --uuid abc-123 \
+       --parallelization-factor 10
+     Now: 4 shards × 10 parallel = 40 concurrent Lambda invocations.
+  3. After fixing throughput, one shard has a "poison pill" record
+     that causes the Lambda to crash. The shard is blocked — the
+     same bad record retries indefinitely.
+     Fix: configure BisectBatchOnFunctionError, MaximumRetryAttempts,
+     and OnFailure destination (same as DynamoDB Streams).
+  4. Multiple consumers on the same stream compete for read throughput
+     (2MB/s per shard shared). Enhanced fan-out gives each consumer
+     a dedicated 2MB/s pipe.
+  Task: Explain Kinesis + Lambda debugging. Write: how Kinesis event
+  source mapping works (shards, iterators, checkpointing), throughput
+  optimization (parallelization factor, shard splitting), error
+  handling for stream records, enhanced fan-out for multiple consumers,
+  and monitoring stream processing health.
+assertions:
+  - type: llm_judge
+    criteria: "Kinesis processing mechanics are explained — Lambda polls each shard independently. One batch per shard at a time by default. Parallelization factor: run up to 10 Lambda invocations per shard concurrently (records within a partition key remain ordered). Shard throughput: 1MB/s write, 2MB/s read per shard. IteratorAge: time between record written and processed — critical metric for lag detection. Checkpointing: Lambda checkpoints after successful batch processing. Failed batches retry from the last successful checkpoint"
+    weight: 0.35
+    description: "Processing mechanics"
+  - type: llm_judge
+    criteria: "Error handling and throughput are covered — poison pill: one bad record blocks the entire shard. Fix: BisectBatchOnFunctionError (split batch to isolate bad record), MaximumRetryAttempts (stop retrying after N attempts), MaximumRecordAgeInSeconds (skip old records), OnFailure destination (send failed records to SQS). Throughput: increase parallelization factor for more concurrency, split shards for more capacity, use on-demand mode for auto-scaling shards. Enhanced fan-out: dedicated 2MB/s per consumer (avoids shared throughput limits)"
+    weight: 0.35
+    description: "Errors and throughput"
+  - type: llm_judge
+    criteria: "Monitoring is practical — key metrics: IteratorAge (processing lag, most important), GetRecords.IteratorAgeMilliseconds, IncomingBytes/IncomingRecords (input rate), ReadProvisionedThroughputExceeded (consumer hitting read limits). Alert on: IteratorAge > threshold (minutes, not hours), ReadProvisionedThroughputExceeded > 0 (need enhanced fan-out or more shards). Lambda metrics: Errors, Duration, ConcurrentExecutions per stream function. Capacity planning: records/second × average record size must be less than shard capacity × number of shards"
+    weight: 0.30
+    description: "Monitoring"
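
The throughput arithmetic in the kinesis-stream-processing trigger can be checked with a short calculation. This is a minimal sketch that assumes every batch is full and takes a fixed time to process; the function name and its parameters are illustrative and are not part of the package.

```python
def lambda_drain_rate(shards, parallelization_factor, batch_size, batch_seconds):
    """Records per second the consumer can process, assuming full batches."""
    concurrent_batches = shards * parallelization_factor
    return concurrent_batches * batch_size / batch_seconds


# Figures from the scenario: 4 shards, 100-record batches, 2 seconds per batch.
default_rate = lambda_drain_rate(4, 1, 100, 2.0)
tuned_rate = lambda_drain_rate(4, 10, 100, 2.0)

print(default_rate)  # 200.0 records/s with one batch per shard
print(tuned_rate)    # 2000.0 records/s with parallelization factor 10
```

Under these assumptions a factor of 10 only matches the 2,000 records/s arrival rate, so an existing backlog would not shrink without more shards or faster batches.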

package/courses/aws-lambda-debugging/scenarios/level-3/lambda-at-edge.yaml ADDED

@@ -0,0 +1,64 @@
+meta:
+  id: lambda-at-edge
+  level: 3
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Lambda@Edge and CloudFront Functions — diagnose edge function failures, replication delays, and CloudFront integration issues"
+  tags: [AWS, Lambda, Lambda@Edge, CloudFront, edge, CDN, advanced]
+state: {}
+trigger: |
+  Your Lambda@Edge function for URL rewriting intermittently fails.
+  CloudFront returns 502 errors to some users but not others:
+  CloudFront error response:
+  502 ERROR — The request could not be satisfied.
+  Lambda@Edge function execution error.
+  Debugging Lambda@Edge is harder than regular Lambda:
+  1. Logs are in the REGION where the function executed, not where
+     it was deployed. A user in Tokyo triggers logs in ap-northeast-1,
+     a user in London triggers logs in eu-west-2.
+  2. You deployed the function in us-east-1 (required for Lambda@Edge)
+     but the error happens in eu-west-1. You must check CloudWatch
+     Logs in eu-west-1.
+  3. The function exceeds Lambda@Edge limits:
+     - Viewer request/response: 5 seconds timeout, 128MB memory,
+       40KB response size
+     - Origin request/response: 30 seconds timeout, memory up to the
+       standard Lambda limit, 1MB response size
+  Your function takes 6 seconds on some requests (cold start +
+  external API call) — exceeding the 5-second viewer request limit.
+  4. After a code fix, replication takes 5-15 minutes. You can't
+     test immediately — the old version runs until replication
+     completes across all edge locations.
+  Alternative: CloudFront Functions (simpler, faster, cheaper):
+  - Sub-millisecond execution
+  - Runs at edge locations (not regional)
+  - Limited to simple request/response manipulation
+  - JavaScript only, no network access
+  Task: Explain Lambda@Edge debugging. Write: Lambda@Edge vs
+  CloudFront Functions (when to use which), deployment and
+  replication process, finding logs across regions, edge-specific
+  limits and constraints, common failures, and testing strategies.
+assertions:
+  - type: llm_judge
+    criteria: "Lambda@Edge vs CloudFront Functions are compared. Lambda@Edge: Node.js/Python, up to 5s (viewer) or 30s (origin), network access, no VPC configuration, deployed in us-east-1 and replicated. CloudFront Functions: JavaScript only, sub-millisecond, no network access, runs at edge locations, cheaper (1/6th the cost). Use CloudFront Functions for: simple header manipulation, URL rewrites, redirects. Use Lambda@Edge for: complex logic, network calls, authentication, dynamic content generation"
+    weight: 0.35
+    description: "Edge comparison"
+  - type: llm_judge
+    criteria: "Debugging challenges are covered. Logs are in the execution region (not us-east-1). To find logs: check CloudFront access logs for x-edge-location, then check CloudWatch Logs in that region. Replication delay: 5-15 minutes after publishing, so you can't test immediately. Must use published versions (not $LATEST). Limits are stricter than regular Lambda: for viewer events, 128MB memory max, 5s timeout, and 40KB response size. Test with CloudFront's test event feature before deploying. Use CloudWatch Logs Insights cross-region query to find errors"
+    weight: 0.35
+    description: "Debugging challenges"
+  - type: llm_judge
+    criteria: "Testing and best practices are practical — test locally: sam local invoke with CloudFront event samples. Test in staging: create a staging CloudFront distribution. Monitor: CloudFront 5xx error rate metric (catches Lambda@Edge failures), Lambda@Edge metrics in us-east-1 (invocations, errors, duration). Keep functions fast: minimize cold starts (small packages), avoid external network calls in viewer events. Cache responses when possible. Use CloudFront Functions for simple tasks — they're faster, cheaper, and easier to debug. Always have a rollback plan (revert to previous version/alias)"
+    weight: 0.30
+    description: "Testing and practices"
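
The limits quoted in the lambda-at-edge scenario lend themselves to a pre-deployment check. The sketch below is an assumption-laden illustration: the table holds only the timeout and response-size figures from the scenario text, and `limit_violations` is a hypothetical helper, not a CloudFront or Lambda API.

```python
# Limits as quoted in the scenario (timeout in seconds, response size in bytes).
EDGE_LIMITS = {
    "viewer": {"timeout_s": 5, "max_response_bytes": 40 * 1024},
    "origin": {"timeout_s": 30, "max_response_bytes": 1024 * 1024},
}


def limit_violations(trigger, duration_s, response_bytes):
    """Return the names of the limits a measured invocation would exceed."""
    limits = EDGE_LIMITS[trigger]
    violations = []
    if duration_s > limits["timeout_s"]:
        violations.append("timeout")
    if response_bytes > limits["max_response_bytes"]:
        violations.append("response_size")
    return violations


# The 6-second request from the scenario, measured against both trigger types.
slow_viewer = limit_violations("viewer", duration_s=6, response_bytes=2_000)
same_on_origin = limit_violations("origin", duration_s=6, response_bytes=2_000)
print(slow_viewer)     # ['timeout']
print(same_on_origin)  # []
```

The same 6-second request fits an origin trigger, which is one reason slow logic tends to be moved from viewer events to origin events.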

package/courses/aws-lambda-debugging/scenarios/level-3/lambda-extensions-debugging.yaml ADDED

@@ -0,0 +1,67 @@
+meta:
+  id: lambda-extensions-debugging
+  level: 3
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Lambda Extensions — diagnose extension initialization failures, performance impact, and third-party extension integration issues"
+  tags: [AWS, Lambda, extensions, monitoring, third-party, performance, advanced]
+state: {}
+trigger: |
+  After adding a Datadog monitoring extension to your Lambda function,
+  cold starts doubled from 800ms to 2.2 seconds, and some invocations
+  timeout:
+  CloudWatch REPORT (before extension):
+  REPORT Duration: 150.00 ms Init Duration: 800.00 ms
+  CloudWatch REPORT (after extension):
+  REPORT Duration: 350.00 ms Init Duration: 2200.00 ms
+  EXTENSION Name: datadog-agent State: Ready
+  The Datadog extension adds 1,400ms to cold start and 200ms to
+  each invocation. For a function with a 3-second timeout, this
+  pushes invocations that were already close to the limit over it.
+  Investigation:
+  1. Extension lifecycle: Extensions run in the INIT, INVOKE, and
+     SHUTDOWN phases. They add overhead to each phase.
+     INIT: extension initializes (downloads config, opens connections)
+     INVOKE: extension runs alongside the function (collects metrics)
+     SHUTDOWN: extension flushes data (sends metrics/logs to backend)
+  2. Extensions share the function's memory and timeout:
+     Function memory: 256MB, extension uses 50MB → only 206MB for
+     your code. Extension INIT counts toward the 10-second init limit.
+  3. The extension makes outbound HTTPS calls during INVOKE to send
+     telemetry. In a VPC Lambda without internet access, these calls
+     timeout, adding the full HTTP timeout to each invocation.
+  4. Multiple extensions stack:
+     Layer 1: Datadog monitoring (+1.4s cold start)
+     Layer 2: Secrets Manager cache (+300ms cold start)
+     Layer 3: Custom logging (+200ms cold start)
+     Total: +1.9s cold start from extensions alone!
+  Task: Explain Lambda Extensions debugging. Write: how extensions
+  work (lifecycle phases, resource sharing), performance impact
+  (cold start, duration, memory), common extension issues (timeout,
+  VPC, memory pressure), popular extensions (Datadog, New Relic,
+  Secrets Manager), and when the overhead is worth it.
+assertions:
+  - type: llm_judge
+    criteria: "Extension lifecycle is explained — extensions participate in three phases: INIT (initialize alongside runtime, counts toward 10s init timeout), INVOKE (runs concurrently with function handler, shares CPU/memory), SHUTDOWN (cleanup, up to 2 seconds). Internal extensions: run in the same process as the function. External extensions: run as separate processes (Lambda Layers). Extensions share the function's configured memory and timeout. Extension errors don't crash the function but degraded extensions may cause issues"
+    weight: 0.35
+    description: "Extension lifecycle"
+  - type: llm_judge
+    criteria: "Performance impact and debugging are covered — cold start: each extension adds initialization time (100ms to 2+ seconds). Duration: extensions running during INVOKE add latency. Memory: extensions share configured memory — reduce available memory for function code. Debug: check Init Duration increase after adding extension, monitor Duration increase on warm invocations, check Max Memory Used for memory pressure. VPC Lambda: extensions making outbound calls need internet access or VPC endpoints. Consider: is monitoring worth 200ms+ per invocation? For user-facing APIs, maybe not"
+    weight: 0.35
+    description: "Performance impact"
+  - type: llm_judge
+    criteria: "Trade-offs and recommendations are practical — worth it: production functions where observability is critical (revenue-impacting APIs, compliance requirements). Not worth it: high-frequency, low-latency functions where every millisecond matters. Alternatives: CloudWatch native metrics (zero extension overhead), X-Ray (minimal overhead, no extension needed), Powertools (in-process, no extension). If using extensions: increase memory to compensate, increase timeout, test performance before and after. Use CloudWatch Lambda Insights (lighter than third-party alternatives). Minimize number of extensions — each one adds overhead"
+    weight: 0.30
+    description: "Trade-offs"
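
The stacking effect described in the lambda-extensions-debugging trigger is simple addition, which a few lines make explicit. This is a rough sketch using only the numbers from the scenario; the dictionary keys and function names are made up for illustration.

```python
# Cold-start overhead per extension in milliseconds, as listed in the scenario.
EXTENSION_COLD_START_MS = {
    "datadog-monitoring": 1400,
    "secrets-manager-cache": 300,
    "custom-logging": 200,
}


def init_duration_ms(base_init_ms, extensions):
    """Estimated Init Duration once every listed extension is attached."""
    return base_init_ms + sum(extensions.values())


def memory_left_for_code_mb(configured_mb, extension_mb):
    """Extensions share the configured memory, so the handler gets the rest."""
    return configured_mb - extension_mb


extension_overhead = sum(EXTENSION_COLD_START_MS.values())
estimated_init = init_duration_ms(800, EXTENSION_COLD_START_MS)
handler_memory = memory_left_for_code_mb(256, 50)

print(extension_overhead)  # 1900
print(estimated_init)      # 2700
print(handler_memory)      # 206
```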

package/courses/aws-lambda-debugging/scenarios/level-3/powertools-observability.yaml ADDED

@@ -0,0 +1,79 @@
+meta:
+  id: powertools-observability
+  level: 3
+  course: aws-lambda-debugging
+  type: output
+  description: "Implement Lambda observability with Powertools — use structured logging, custom metrics, and tracing for production-grade monitoring"
+  tags: [AWS, Lambda, Powertools, observability, structured-logging, metrics, advanced]
+state: {}
+trigger: |
+  Your team has 50+ Lambda functions in production. Debugging is
+  painful: logs are unstructured text, no custom metrics, X-Ray
+  tracing is inconsistent. When an incident occurs, you spend
+  30+ minutes correlating logs across functions.
+  Current logging:
+  console.log("Processing order " + orderId);
+  console.log("Error: " + err.message);
+  Searching CloudWatch Logs:
+  fields @timestamp, @message
+  | filter @message like /Error/
+  | sort @timestamp desc
+  (returns hundreds of unrelated errors, no context)
+  Implementing AWS Lambda Powertools (TypeScript):
+  import { Logger, Metrics, Tracer } from '@aws-lambda-powertools/...'
+  const logger = new Logger({ serviceName: 'order-api' });
+  const metrics = new Metrics({ namespace: 'OrderService' });
+  const tracer = new Tracer({ serviceName: 'order-api' });
+  export const handler = async (event, context) => {
+    logger.addContext(context);  // Adds requestId, coldStart, etc.
+    logger.info('Processing order', { orderId, customerId });
+    metrics.addMetric('OrderProcessed', MetricUnit.Count, 1);
+    metrics.addMetric('OrderAmount', MetricUnit.NoUnit, amount);
+    const subsegment = tracer.getSegment()?.addNewSubsegment('processPayment');
+    // ... business logic
+    subsegment?.close();
+    metrics.publishStoredMetrics();  // Flush EMF metrics to the log
+  };
+  Now logs are structured JSON with correlation IDs:
+  {
+    "level": "INFO",
+    "message": "Processing order",
+    "service": "order-api",
+    "timestamp": "2024-12-01T10:00:00.000Z",
+    "xray_trace_id": "1-abc-def",
+    "cold_start": false,
+    "function_name": "order-api",
+    "orderId": "ORD-123",
+    "customerId": "CUST-456"
+  }
+  Task: Explain Lambda Powertools observability. Write: structured
+  logging (Logger), custom metrics (Metrics with EMF), distributed
+  tracing (Tracer), how to correlate across functions, CloudWatch
+  Logs Insights queries for structured logs, custom dashboards, and
+  alerting on business metrics.
+assertions:
+  - type: llm_judge
+    criteria: "Powertools components are explained — Logger: structured JSON logging with automatic Lambda context (requestId, functionName, coldStart, xrayTraceId). Supports log levels, child loggers, sensitive data masking. Metrics: publishes CloudWatch metrics via EMF (Embedded Metric Format) — no API calls, written as structured log lines. Supports dimensions for filtering. Tracer: wraps X-Ray SDK, automatic capture of AWS SDK calls, support for custom subsegments. Available for Python, TypeScript, Java, .NET"
+    weight: 0.35
+    description: "Powertools components"
+  - type: llm_judge
+    criteria: "Correlation and querying are covered — correlation: trace ID connects logs across functions (X-Ray trace propagated through Lambda invocations). Include business context (orderId, customerId) in all log entries for business-level tracing. CloudWatch Logs Insights: query structured JSON fields directly. Example: fields orderId, @timestamp, level | filter level = 'ERROR' | filter service = 'order-api'. Custom dashboards: visualize custom metrics (order count, error rate, processing time) alongside Lambda system metrics"
+    weight: 0.35
+    description: "Correlation and querying"
+  - type: llm_judge
+    criteria: "Business metrics and alerting are practical — EMF metrics: define custom metrics like OrderProcessed, PaymentFailed, OrderAmount. Add dimensions: customer tier, product category. Alert on business metrics: OrderErrors > 5 in 5 minutes, PaymentFailureRate > 2%. Create CloudWatch dashboards per service showing: invocations, errors, duration, custom business metrics. Combine with X-Ray service map for full observability. Cost: Powertools itself is free. CloudWatch metrics: $0.30/metric/month. Logs: $0.50/GB ingested. Budget: set log retention to limit costs"
+    weight: 0.30
+    description: "Business metrics"
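
The structured log entry shown in the powertools-observability trigger can be approximated without any library, which helps clarify what the Logger contributes. The sketch below uses only the Python standard library and is not the Powertools API; `structured_log` and its arguments are hypothetical.

```python
import json


def structured_log(level, message, service, timestamp, **context):
    """Build one JSON log line with fixed fields plus business context."""
    record = {
        "level": level,
        "message": message,
        "service": service,
        "timestamp": timestamp,
    }
    record.update(context)  # e.g. orderId, customerId for later filtering
    return json.dumps(record)


line = structured_log(
    "INFO",
    "Processing order",
    "order-api",
    "2024-12-01T10:00:00.000Z",
    orderId="ORD-123",
    customerId="CUST-456",
)
print(line)
```

Because each field is a JSON key, a Logs Insights filter such as `filter orderId = 'ORD-123'` can match it directly instead of pattern-matching on free text.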

package/courses/aws-lambda-debugging/scenarios/level-3/step-functions-debugging.yaml ADDED

@@ -0,0 +1,80 @@
+meta:
+  id: step-functions-debugging
+  level: 3
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Step Functions workflows — diagnose state machine failures, Lambda integration errors, retry policies, and error handling in complex orchestrations"
+  tags: [AWS, Lambda, Step-Functions, orchestration, state-machine, workflow, advanced]
+state: {}
+trigger: |
+  Your order processing workflow uses Step Functions to orchestrate
+  5 Lambda functions. The workflow fails intermittently:
+  State machine definition:
+  StartAt: ValidateOrder
+  States:
+    ValidateOrder:
+      Type: Task
+      Resource: arn:aws:lambda:...:validate-order
+      Next: CheckInventory
+    CheckInventory:
+      Type: Task
+      Resource: arn:aws:lambda:...:check-inventory
+      Next: ProcessPayment
+      Catch:
+        - ErrorEquals: [States.TaskFailed]
+          Next: NotifyOutOfStock
+    ProcessPayment:
+      Type: Task
+      Resource: arn:aws:lambda:...:process-payment
+      Retry:
+        - ErrorEquals: [PaymentGatewayError]
+          IntervalSeconds: 5
+          MaxAttempts: 3
+          BackoffRate: 2.0
+      Next: FulfillOrder
+    FulfillOrder:
+      Type: Task
+      Resource: arn:aws:lambda:...:fulfill-order
+      End: true
+    NotifyOutOfStock:
+      Type: Task
+      Resource: arn:aws:lambda:...:notify-customer
+      End: true
+  Issue 1: ProcessPayment retries 3 times then fails with:
+  "States.TaskFailed" — but there's no Catch for this state!
+  The entire workflow fails. Missing error handler after retries
+  are exhausted.
+  Issue 2: CheckInventory Lambda returns:
+  {"error": "OutOfStock", "item": "SKU-123"}
+  But Step Functions expects the Lambda to THROW an error to
+  trigger the Catch. Returning an error object is not an error —
+  it's a successful invocation with data.
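+  Issue 3: Execution history shows the workflow took 45 minutes.
+  ProcessPayment waited 5s, then 10s, then 20s between retries
+  (exponential backoff), only 35 seconds in total, so the backoff
+  alone does not account for the duration, and no state sets
+  TimeoutSeconds. Meanwhile, the customer waited.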
+  Task: Explain Step Functions debugging. Write: how Task states
+  invoke Lambda (integration patterns), error handling (Catch, Retry),
+  the difference between Lambda errors and successful returns, reading
+  execution history, timeout and heartbeat configuration, and common
+  Step Functions anti-patterns.
+assertions:
+  - type: llm_judge
+    criteria: "Lambda integration is explained. Task states can reference Lambda in two ways: the function ARN directly as Resource, or the optimized arn:aws:states:::lambda:invoke integration; both use the Request Response pattern (synchronous) by default. Lambda must THROW an error (not return an error object) to trigger Catch/Retry. Common mistake: returning {error: message} is a successful invocation. Use callback pattern for long-running tasks (send taskToken to Lambda, Lambda calls SendTaskSuccess/Failure when done). Execution input/output: use InputPath, OutputPath, ResultPath to control data flow between states"
+    weight: 0.35
+    description: "Lambda integration"
+  - type: llm_judge
+    criteria: "Error handling is covered — Retry: ErrorEquals matches error type, IntervalSeconds for delay, MaxAttempts for retry count, BackoffRate for exponential backoff. Catch: ErrorEquals matches error, Next routes to error handling state, ResultPath stores error details. Always add a catch-all Catch after Retry (handles errors after retries are exhausted). Error types: States.ALL (catch everything), States.TaskFailed (Lambda failure), States.Timeout (task timeout), custom errors thrown by Lambda. Without Catch, unhandled errors fail the entire execution"
+    weight: 0.35
+    description: "Error handling"
+  - type: llm_judge
+    criteria: "Debugging and anti-patterns are practical — read execution history: shows each state transition with input/output and timing. Visual workflow view highlights which state failed (red). Anti-patterns: no timeout on Task states (can run indefinitely — set TimeoutSeconds), no Catch after Retry (errors after retries kill the workflow), returning errors instead of throwing them, overly long retry delays for user-facing workflows. Use Step Functions local for development testing. Monitor: ExecutionsFailed, ExecutionThrottled, ExecutionsTimedOut metrics"
+    weight: 0.30
+    description: "Debugging and anti-patterns"
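
Two points from the step-functions-debugging scenario can be illustrated in a few lines: the retry delay schedule implied by IntervalSeconds, MaxAttempts, and BackoffRate, and the difference between returning an error object and raising. This is a sketch under stated assumptions; `retry_delays`, `OutOfStock`, and `check_inventory` are illustrative names, not code from the package.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Wait before each retry: the interval multiplied by the rate each time."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


class OutOfStock(Exception):
    """Raised so the invocation fails and Retry/Catch rules can match it."""


def check_inventory(stock, sku):
    # Returning {"error": "OutOfStock"} here would count as a successful task.
    if stock.get(sku, 0) <= 0:
        raise OutOfStock(sku)
    return {"item": sku, "available": stock[sku]}


delays = retry_delays(5, 3, 2.0)
total_wait = sum(delays)
print(delays)      # [5.0, 10.0, 20.0]
print(total_wait)  # 35.0
```

The ProcessPayment policy therefore adds 35 seconds of waiting at most, which is worth comparing against the total duration shown in the execution history.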

package/courses/aws-lambda-debugging/scenarios/level-4/cost-optimization-strategy.yaml ADDED

@@ -0,0 +1,67 @@
+meta:
+  id: cost-optimization-strategy
+  level: 4
+  course: aws-lambda-debugging
+  type: output
+  description: "Design Lambda cost optimization — implement right-sizing, architecture optimizations, and FinOps practices for serverless applications at scale"
+  tags: [AWS, Lambda, cost, optimization, FinOps, right-sizing, expert]
+state: {}
+trigger: |
+  Your serverless application's AWS bill has grown from $5K to $45K
+  per month. The CFO wants answers. Cost Explorer breakdown:
+  Lambda:           $18,000 (40%)
+  DynamoDB:          $9,000 (20%)
+  API Gateway:       $6,000 (13%)
+  CloudWatch:        $5,000 (11%)
+  S3/Data Transfer:  $4,000 (9%)
+  Other:             $3,000 (7%)
+  Lambda cost analysis:
+  - 50 functions, 120M invocations/month
+  - Average duration: 800ms
+  - Average memory: 512MB (many over-provisioned)
+  - 5 functions with provisioned concurrency: $3,200/month
+  - ARM (Graviton2): not used (could save 20%)
+  Investigation reveals cost drivers:
+  1. Over-provisioned memory: 30 functions at 512MB-1024MB but
+     Max Memory Used is 50-100MB. They were set high "just in case."
+     Right-sizing to actual usage + 20% buffer saves $4,500/month.
+  2. Unnecessary invocations: a "poller" Lambda runs every 1 second
+     checking for SQS messages. SQS event source mapping would
+     eliminate 2.5M unnecessary invocations/month ($800).
+  3. CloudWatch Logs: $5,000/month! Functions log everything at
+     DEBUG level in production. Switching to INFO saves 80% of
+     log volume ($4,000/month).
+  4. Provisioned concurrency: 5 functions with 100 provisioned
+     concurrent executions each, but traffic only needs 100 total.
+     Reduce to 20 per function with auto-scaling.
+  5. Not using ARM: all functions on x86_64. Graviton2 is 20%
+     cheaper and often faster. Simple switch for most functions.
+  Task: Design Lambda cost optimization strategy. Write: memory
+  right-sizing methodology, ARM migration, invocation reduction
+  patterns, CloudWatch cost management, provisioned concurrency
+  optimization, and the FinOps process for ongoing cost governance.
+assertions:
+  - type: llm_judge
+    criteria: "Right-sizing methodology is explained. Use Lambda Power Tuning to test functions at different memory levels. Analyze Max Memory Used from CloudWatch Logs Insights (filter @type = 'REPORT' | stats avg(@maxMemoryUsed), pct(@maxMemoryUsed, 99) by @logStream). Set memory to p99 + 20% buffer. Remember: memory also controls CPU, so some functions need more memory for CPU, not for memory. Monitor after changes: track duration and cost. ARM (Graviton2): 20% cheaper, often faster. Migration: change architecture in function config, rebuild native dependencies for arm64"
+    weight: 0.35
+    description: "Right-sizing"
+  - type: llm_judge
+    criteria: "Invocation and infrastructure optimization are covered — reduce invocations: use event source mappings instead of polling, batch processing (increase SQS batch size), use EventBridge scheduled rules instead of CloudWatch Events for cron. CloudWatch costs: set log level to INFO/WARN in production, set log retention (7-30 days), use sampling for high-frequency functions, consider structured logging with selective field logging. API Gateway: use HTTP API (cheaper than REST API) where possible. DynamoDB: on-demand vs provisioned (on-demand more expensive per request but no over-provisioning)"
+    weight: 0.35
+    description: "Infrastructure optimization"
+  - type: llm_judge
+    criteria: "FinOps process is practical — tag all Lambda functions (team, environment, service) for cost attribution. Use AWS Cost Explorer with tags for per-service cost breakdown. Set up budgets with alerts ($X threshold per service). Monthly cost review: identify top 10 cost drivers, track cost per transaction/user. Compute Savings Plans: up to 17% additional savings for committed usage. Reserved capacity: use for stable base load, on-demand for burst. Track: cost per invocation, cost per request, cost per user as key metrics. Automate: Lambda Power Tuning on schedule to catch optimization opportunities"
+    weight: 0.30
+    description: "FinOps process"
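
The "p99 + 20% buffer" rule from the cost-optimization-strategy criteria can be written as a small helper. This is a minimal sketch: the function name is hypothetical, the 128MB floor reflects the smallest memory size Lambda accepts, and the input is assumed to be a p99 of Max Memory Used already converted to MB.

```python
LAMBDA_MIN_MEMORY_MB = 128  # smallest memory size Lambda allows


def right_sized_memory_mb(p99_used_mb, buffer_percent=20):
    """Return p99 usage plus a percentage buffer, rounded up to a whole MB."""
    with_buffer = -(-p99_used_mb * (100 + buffer_percent) // 100)  # ceiling division
    return max(LAMBDA_MIN_MEMORY_MB, with_buffer)


# A function using 100MB at p99 lands on the 128MB floor.
small_function = right_sized_memory_mb(100)
larger_function = right_sized_memory_mb(400)
print(small_function)   # 128
print(larger_function)  # 480
```

Since memory also sets the CPU share, the result is a starting point to validate against duration, not a final setting.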

package/courses/aws-lambda-debugging/scenarios/level-4/expert-debugging-shift.yaml ADDED

@@ -0,0 +1,62 @@
+meta:
+  id: expert-debugging-shift
+  level: 4
+  course: aws-lambda-debugging
+  type: output
+  description: "Combined expert debugging shift — diagnose and design solutions for an enterprise serverless platform with architecture, security, cost, and operational challenges"
+  tags: [AWS, Lambda, troubleshooting, combined, shift-simulation, expert]
+state: {}
+trigger: |
+  You're hired as the serverless platform architect for a fintech
+  company. The existing Lambda-based platform has grown organically
+  over 3 years and has significant technical debt.
+  Assessment findings:
+  Architecture:
+  - 200+ Lambda functions across 3 AWS accounts (dev, staging, prod)
+  - No naming convention — functions named randomly
+  - 40 functions with inline code (console editor, not IaC!)
+  - Step Functions workflows with 20+ states and no error handling
+  - Mixed deployment methods: SAM, Serverless Framework, CDK, console
+  Security:
+  - 30% of functions share a single IAM role with AdministratorAccess
+  - API Gateway endpoints with no authentication
+  - Secrets stored as plain-text environment variables
+  - No dependency scanning — last audit found 85 CRITICAL CVEs
+  - Docker socket pattern found in 5 CI/CD Lambda functions
+  Operations:
+  - No structured logging — console.log everywhere
+  - CloudWatch Logs retention: unlimited (costing $3K/month alone)
+  - No X-Ray tracing enabled
+  - On-call rotation: 2 people who "know how it works"
+  - MTTR: 4+ hours average
+  Cost:
+  - Monthly bill: $65K (Lambda $28K, CloudWatch $8K, API GW $12K,
+    DynamoDB $10K, other $7K)
+  - 50% of functions over-provisioned (1GB memory, using 100MB)
+  - 15 functions with provisioned concurrency never utilizing it
+  Task: Design the 6-month platform improvement plan. Write: the
+  priority ranking (security first!), IaC standardization approach,
+  security remediation, observability implementation, cost optimization
+  targets, and the team structure needed to sustain the platform.
+assertions:
+  - type: llm_judge
+    criteria: "Priority ranking is justified — Month 1: Security remediation (replace AdministratorAccess roles, add API authentication, move secrets to Secrets Manager, scan dependencies). Month 2: IaC migration (move inline functions to SAM/CDK, standardize on one IaC tool). Month 3: Observability (Powertools Logger + X-Ray + custom metrics, set log retention). Month 4: Cost optimization (right-size memory, remove unused provisioned concurrency, ARM migration). Month 5-6: Architecture improvements (error handling in Step Functions, standardize deployment, documentation)"
+    weight: 0.35
+    description: "Priority ranking"
+  - type: llm_judge
+    criteria: "Security and IaC remediation are covered — security: per-function IAM roles with least privilege (automated with IAM Access Analyzer), Cognito or Lambda authorizers on all APIs, Secrets Manager with automatic rotation, dependency scanning in CI (block on CRITICAL). IaC: standardize on CDK or SAM (pick one), migrate inline functions to IaC (start with critical functions), use infrastructure review in PR process. Naming convention: {service}-{function}-{env}. Tag all resources: team, service, environment, cost-center"
+    weight: 0.35
+    description: "Security and IaC"
+  - type: llm_judge
+    criteria: "Cost and team structure are practical. Cost optimization: right-size memory (target: $10K Lambda reduction), set log retention to 30 days and reduce log volume ($6K CloudWatch savings), remove unused provisioned concurrency ($2K savings), switch to HTTP API where possible ($4K API GW savings). Total target: $22K/month savings (34%). Team: minimum 3-person platform team (1 security-focused, 1 observability, 1 developer experience). On-call: expand rotation to 5+ engineers with runbooks. Success metrics: MTTR < 30 min, security findings < 5, IaC coverage > 95%, cost trend decreasing quarter over quarter"
+    weight: 0.30
+    description: "Cost and team"
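
The savings target in the expert-debugging-shift criteria is a sum over four line items measured against the $65K monthly bill. The sketch below just reproduces that arithmetic; the dictionary keys are illustrative labels, and the dollar figures are the ones stated in the scenario.

```python
MONTHLY_BILL_USD = 65_000

# Monthly savings targets per line item, as stated in the scenario.
SAVINGS_TARGETS_USD = {
    "lambda_right_sizing": 10_000,
    "cloudwatch_logs": 6_000,
    "provisioned_concurrency": 2_000,
    "http_api_migration": 4_000,
}

total_savings = sum(SAVINGS_TARGETS_USD.values())
savings_percent = round(total_savings / MONTHLY_BILL_USD * 100)

print(total_savings)    # 22000
print(savings_percent)  # 34
```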

package/courses/aws-lambda-debugging/scenarios/level-4/incident-management-serverless.yaml ADDED

@@ -0,0 +1,61 @@
+meta:
+  id: incident-management-serverless
+  level: 4
+  course: aws-lambda-debugging
+  type: output
+  description: "Design serverless incident management — implement on-call processes, runbooks, and incident response procedures for Lambda-based applications"
+  tags: [AWS, Lambda, incident-management, on-call, runbooks, SRE, expert]
+state: {}
+trigger: |
+  Your team has had 15 production incidents in the past quarter.
+  Average MTTR (Mean Time To Resolve): 2.5 hours. The engineering
+  VP wants MTTR under 30 minutes.
+  Analysis of past incidents:
+  Incident 1 (3.5 hours):
+  - Alert: "Lambda Errors > 100/minute"
+  - 45 minutes spent finding which function
+  - 30 minutes finding the root cause (DynamoDB throttling)
+  - 1 hour trying different fixes
+  - 1 hour implementing and deploying the fix
+  - 15 minutes verifying
+  Incident 2 (4 hours):
+  - No alert! Customer reported "orders not processing"
+  - 1 hour investigating (SQS DLQ had 10K messages)
+  - 1 hour finding root cause (Lambda VPC ENI limit reached)
+  - 1 hour requesting ENI limit increase
+  - 1 hour waiting for AWS support + reprocessing DLQ
+  Incident 3 (1 hour):
+  - Alert: Step Functions ExecutionsFailed
+  - On-call engineer knew the system, diagnosed in 10 minutes
+  - Fix: increase Lambda timeout from 30s to 60s
+  - Deploy and verify in 50 minutes
+  Patterns: incidents are slow when engineers don't know the system,
+  when alerts don't provide enough context, and when there's no
+  runbook.
+  Task: Design the serverless incident management process. Write:
+  alert design (actionable alerts with context), runbook structure
+  for serverless (per-function and per-workflow), on-call rotation
+  and escalation, automated remediation (Lambda-based auto-fixing),
+  post-incident review process, and SLO/SLI framework for Lambda.
+assertions:
+  - type: llm_judge
+    criteria: "Alert design is actionable — alerts must include: what is broken (function/workflow name), impact (customer-facing? internal?), severity (P1-P3), link to dashboard, link to runbook. Serverless-specific alerts: Lambda Errors by function (not account-wide), Step Functions ExecutionsFailed, SQS ApproximateAgeOfOldestMessage (processing lag), DLQ message count > 0 (something is failing silently), EventBridge FailedInvocations. Reduce noise: use anomaly detection instead of static thresholds, composite alarms for related metrics"
+    weight: 0.35
+    description: "Alert design"
+  - type: llm_judge
+    criteria: "Runbooks and automation are covered — runbook per function: what it does, dependencies, common failure modes, diagnostic commands, fix procedures. Automated runbook steps: link to CloudWatch Logs Insights query (pre-built), link to X-Ray trace search, link to function configuration. Automated remediation: Lambda function triggered by CloudWatch alarm that can: increase concurrency limits, restart event source mappings, reprocess DLQ messages, scale DynamoDB capacity. Guard rails: automated remediation must be safe (idempotent, bounded, logged)"
+    weight: 0.35
+    description: "Runbooks and automation"
+  - type: llm_judge
+    criteria: "SLOs and process are practical — define SLOs: order completion rate > 99.5%, API p99 latency < 2 seconds, payment success rate > 99.9%. SLIs: measure from CloudWatch custom metrics. Error budget: 0.5% error budget for order completion — when budget is consumed, focus on reliability instead of features. On-call rotation: weekly rotation, 2 tiers (primary responds in 5 min, secondary backup). Post-incident: blameless review within 48 hours, identify: timeline, root cause, detection gap, prevention measures. Track: MTTD (detect), MTTR (resolve), incident count, error budget consumption"
+    weight: 0.30
+    description: "SLOs and process"
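
The incident timelines and the error-budget rule in this scenario lend themselves to a small worked calculation. The sketch below is one possible way to do it, not part of the package: the `Incident` class, the split into "diagnosis" and "repair" minutes, and the sample order counts are all assumptions made for illustration. Only the per-phase minutes and the 99.5% SLO come from the scenario text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    name: str
    diagnosis_min: int  # minutes until the root cause was known
    repair_min: int     # minutes from root cause to a verified fix

    @property
    def total_min(self) -> int:
        return self.diagnosis_min + self.repair_min


# Phase durations taken from the trigger text; the grouping is an assumption.
INCIDENTS = [
    Incident("incident-1", 45 + 30, 60 + 60 + 15),
    Incident("incident-2", 60 + 60, 60 + 60),
    Incident("incident-3", 10, 50),
]


def diagnosis_share(incidents) -> float:
    """Fraction of total incident time spent before the root cause was known."""
    total = sum(i.total_min for i in incidents)
    return sum(i.diagnosis_min for i in incidents) / total


def error_budget_consumed(total_events: int, failed_events: int, slo: float) -> float:
    """Fraction of the error budget used; 1.0 means the budget is exhausted."""
    allowed_failures = total_events * (1.0 - slo)
    if allowed_failures <= 0:
        raise ValueError("SLO leaves no error budget for this volume")
    return failed_events / allowed_failures


if __name__ == "__main__":
    for inc in INCIDENTS:
        print(inc.name, inc.total_min, "min")
    print(f"diagnosis share: {diagnosis_share(INCIDENTS):.0%}")
    # Hypothetical volume: 200,000 orders with 400 failures against a 99.5% SLO.
    print(f"budget consumed: {error_budget_consumed(200_000, 400, 0.995):.0%}")
```

The totals come out to 210, 240, and 60 minutes, matching the 3.5 hours, 4 hours, and 1 hour quoted for the three incidents. Under this grouping roughly 40% of the time went to diagnosis, which is the portion that better alert context and runbooks are meant to shrink.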

package/courses/aws-lambda-debugging/scenarios/level-4/multi-region-serverless.yaml ADDED

@@ -0,0 +1,67 @@
+meta:
+  id: multi-region-serverless
+  level: 4
+  course: aws-lambda-debugging
+  type: output
+  description: "Design multi-region serverless architecture — implement active-active deployments, data replication, and failover for globally distributed Lambda applications"
+  tags: [AWS, Lambda, multi-region, global, failover, disaster-recovery, expert]
+state: {}
+trigger: |
+  Your serverless application serves users globally. A us-east-1
+  outage took the entire service down for 4 hours. The board demands
+  multi-region resilience. Current architecture (single region):
+  Route53 → API Gateway (us-east-1) → Lambda → DynamoDB
+  Target: active-active across us-east-1 and eu-west-1 with
+  automatic failover. Users should be routed to the nearest region
+  with < 200ms latency.
+  Design challenges:
+  1. Data replication — DynamoDB Global Tables:
+     Automatically replicates data across regions with ~1 second
+     latency. Last-writer-wins conflict resolution. But: writes are
+     billed in every replica region (~2x write cost for two regions).
+  2. API deployment — same code, multiple regions:
+     Deploy same Lambda functions and API Gateway to both regions.
+     Use CloudFormation StackSets or CDK Pipelines for multi-region deployment.
+     Route53 health checks route traffic to a healthy region.
+  3. Event processing — how to handle events in both regions:
+     SQS queues are regional. S3 events trigger in the bucket's
+     region. EventBridge rules can forward events to a bus in another region.
+     Challenge: prevent duplicate processing when both regions
+     process the same event.
+  4. State management — avoid split-brain:
+     If both regions write to the same DynamoDB item simultaneously,
+     last-writer-wins may lose data. Design for conflict resolution
+     or use conditional writes.
+  5. Deployment coordination — how to deploy across regions safely:
+     Deploy to secondary first, test, then deploy to primary.
+     Canary deployment per region before full rollout.
+  Task: Design multi-region serverless architecture. Write: the
+  active-active vs active-passive trade-off, data replication
+  strategies (DynamoDB Global Tables, S3 Cross-Region Replication),
+  routing and failover (Route53, CloudFront), event processing in
+  multi-region, deployment strategy, and cost analysis.
+assertions:
+  - type: llm_judge
+    criteria: "Active-active architecture is explained — active-active: both regions serve traffic simultaneously, Route53 latency-based routing sends users to nearest region. RPO: ~0 (replicated data), RTO: ~0 (automatic failover). Active-passive: secondary region on standby, Route53 failover routing switches on health check failure. RPO: replication lag, RTO: DNS propagation (60-300 seconds). Cost: active-active costs ~2x for compute and data replication. Decision: active-active for global users or strict availability requirements, active-passive for regional with DR needs"
+    weight: 0.35
+    description: "Active-active design"
+  - type: llm_judge
+    criteria: "Data and event replication are covered — DynamoDB Global Tables: automatic multi-master replication, ~1s lag, last-writer-wins. Design for eventual consistency. S3: Cross-Region Replication for object storage. SQS: regional only, so deploy separate queues per region and route events to the appropriate region. EventBridge: event buses are regional; forward events across regions with rules that target a bus in another region. Idempotency: critical in multi-region because the same event may be processed in both regions. Use idempotency keys and conditional DynamoDB writes. Conflict resolution: design the data model to avoid conflicts (partition by region, use timestamps)"
+    weight: 0.35
+    description: "Data and events"
+  - type: llm_judge
+    criteria: "Deployment and operations are practical — multi-region deployment: CDK Pipelines or CloudFormation StackSets deploy to all regions. Deploy secondary first, run integration tests, then deploy primary (blue-green across regions). Route53 health checks: check API Gateway endpoint in each region, failover if unhealthy. CloudFront: distribute static assets globally, origin failover group for API. Monitoring: centralized dashboard showing both regions (CloudWatch cross-account/cross-region). Cost: estimate additional cost of multi-region (2x DynamoDB, 2x Lambda, Route53 health checks). Justify: compare cost of multi-region to cost of downtime"
+    weight: 0.30
+    description: "Deployment and operations"
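
The duplicate-processing concern raised in this scenario (idempotency keys plus conditional writes) can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: `InMemoryTable` is a hypothetical stand-in for a DynamoDB table, `put_if_absent` mimics a `put_item` call guarded by an `attribute_not_exists` condition, and the `idempotency_key` field name is invented for the example.

```python
class ConditionalCheckFailed(Exception):
    """Raised when the key already exists, like a failed conditional put."""


class InMemoryTable:
    """Hypothetical stand-in for a table that supports put-if-absent."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, key, item):
        if key in self._items:
            raise ConditionalCheckFailed(key)
        self._items[key] = item

    def get(self, key):
        return self._items.get(key)


def process_once(table, event, region, handler):
    """Run handler(event) only if this event's key has not been claimed yet.

    Returns True if the handler ran, False if the event was a duplicate.
    """
    key = event["idempotency_key"]  # assumed field name
    try:
        table.put_if_absent(key, {"claimed_by": region})
    except ConditionalCheckFailed:
        return False  # another consumer already claimed this event
    handler(event)
    return True


if __name__ == "__main__":
    table = InMemoryTable()
    handled = []
    event = {"idempotency_key": "order-1"}
    print(process_once(table, event, "us-east-1", handled.append))
    print(process_once(table, event, "eu-west-1", handled.append))
    print(len(handled))
```

One caveat worth keeping in mind: with an asynchronously replicated table, a conditional write is evaluated against the local replica, so two regions can both claim the same key inside the replication window. The sketch models a single shared store; in a real active-active design the handler's side effects would still need to be safe to repeat, or events would need to be pinned to one owning region.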