npm - dojo.md - Versions diffs - 0.2.0 → 0.2.1 - Mend

dojo.md 0.2.0 → 0.2.1


Files changed (222)

package/courses/aws-lambda-debugging/scenarios/level-1/invocation-errors.yaml ADDED

@@ -0,0 +1,72 @@
+meta:
+  id: invocation-errors
+  level: 1
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Lambda invocation errors — diagnose throttling, concurrency limits, and invocation type differences (sync vs async)"
+  tags: [AWS, Lambda, invocation, throttling, concurrency, sync-async, beginner]
+state: {}
+trigger: |
+  Your Lambda function starts returning errors during a traffic spike:
+  $ aws lambda invoke --function-name process-order output.json
+  {
+    "StatusCode": 429,
+    "FunctionError": "Unhandled"
+  }
+  $ cat output.json
+  {
+    "errorType": "TooManyRequestsException",
+    "errorMessage": "Rate exceeded"
+  }
+  CloudWatch metrics show:
+  - Throttles: 1,500 (these are rejected requests!)
+  - ConcurrentExecutions: 1,000 (account limit)
+  - Errors: 200 (actual function errors)
+  - Invocations: 8,300
+  The account has a default concurrency limit of 1,000 across ALL
+  Lambda functions. Your function is consuming 800 concurrent
+  executions, starving other functions.
+  Important: Throttles are NOT counted in Invocations or Errors
+  metrics! You must monitor the Throttles metric separately.
+  Understanding invocation types:
+  Synchronous (RequestResponse):
+  - API Gateway, CLI invoke, SDK invoke
+  - Caller waits for response
+  - Throttled: returns 429 immediately
+  - Errors: returned to caller
+  Asynchronous (Event):
+  - S3 events, SNS, EventBridge
+  - Caller gets 202 Accepted immediately
+  - Lambda retries twice on failure
+  - Throttled: Lambda retries with backoff for up to 6 hours
+  - Failed after retries: sent to DLQ or destination
+  Task: Explain Lambda invocation debugging. Write: synchronous vs
+  asynchronous invocation differences, throttling (what it is, how
+  to detect, how to fix), concurrency limits (account vs function
+  reserved), the retry behavior for each invocation type, and how
+  to monitor invocation health.
+assertions:
+  - type: llm_judge
+    criteria: "Invocation types are explained — synchronous: caller waits, gets response or error. Used by API Gateway, CLI, SDK. Asynchronous: caller gets 202 immediately, Lambda processes in background. Used by S3, SNS, EventBridge. Lambda retries async failures twice (3 total attempts). Poll-based: Lambda polls event source (SQS, Kinesis, DynamoDB Streams) — behaves differently for each. Understanding invocation type is critical because error handling, retry behavior, and throttling behavior differ for each"
+    weight: 0.35
+    description: "Invocation types"
+  - type: llm_judge
+    criteria: "Throttling and concurrency are covered — default account limit: 1,000 concurrent executions (can be increased via support). Throttling = requests rejected when concurrency limit reached. Throttles metric: must monitor separately (not counted in Invocations or Errors!). Reserved concurrency: guarantee capacity for a function (but limits it too). Provisioned concurrency: pre-warm environments for consistent latency. Fix throttling: request limit increase, use reserved concurrency to protect critical functions, implement backoff in callers, use SQS to buffer requests"
+    weight: 0.35
+    description: "Throttling and concurrency"
+  - type: llm_judge
+    criteria: "Monitoring and retry behavior are practical. Monitor: Invocations, Errors, Throttles, Duration, ConcurrentExecutions. Set alarms on: Throttles > 0, Error rate > 5%, Duration approaching timeout. Retry behavior: sync has no automatic retry (caller must retry); async gets 2 retries with backoff (about 1 minute, then 2 minutes), then DLQ/destination. SQS: retries based on visibility timeout, then DLQ. Kinesis/DynamoDB Streams: retries until record expires or succeeds (can block shard). Configure MaximumRetryAttempts and a DLQ or failure destination for each invocation pattern"
+    weight: 0.30
+    description: "Monitoring and retry"
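
The scenario above notes that throttled synchronous invocations return 429 immediately and that the caller must retry. A minimal sketch of caller-side backoff follows, assuming a hypothetical `ThrottledError` exception and an `invoke` callable supplied by the caller; neither name comes from the package or from an AWS SDK.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 429 TooManyRequestsException."""


def call_with_backoff(invoke, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call invoke(), retrying on ThrottledError with jittered exponential backoff.

    sleep and rng are injectable so the delays can be inspected in tests.
    """
    for attempt in range(max_attempts):
        try:
            return invoke()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttle to the caller
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random fraction of the ceiling.
            sleep(ceiling * rng())
```

Jitter spreads retries from many callers over time, so a traffic spike is less likely to produce synchronized retry waves against the same concurrency limit.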

package/courses/aws-lambda-debugging/scenarios/level-1/lambda-timeout-errors.yaml ADDED

@@ -0,0 +1,65 @@
+meta:
+  id: lambda-timeout-errors
+  level: 1
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Lambda timeout errors — diagnose why functions exceed their configured timeout and how to fix common timeout causes"
+  tags: [AWS, Lambda, timeout, duration, configuration, beginner]
+state: {}
+trigger: |
+  Your Lambda function fails with a timeout error:
+  $ aws lambda invoke --function-name process-order output.json
+  {
+    "StatusCode": 200,
+    "FunctionError": "Unhandled",
+    "ExecutedVersion": "$LATEST"
+  }
+  $ cat output.json
+  {
+    "errorMessage": "2024-12-01T10:00:00.000Z abc-123 Task timed out after 3.00 seconds"
+  }
+  CloudWatch Logs:
+  START RequestId: abc-123 Version: $LATEST
+  2024-12-01T10:00:00.000Z abc-123 Connecting to database...
+  2024-12-01T10:00:03.000Z abc-123 Task timed out after 3.00 seconds
+  END RequestId: abc-123
+  REPORT RequestId: abc-123 Duration: 3001.45 ms
+  Billed Duration: 3000 ms Memory Size: 128 MB Max Memory Used: 65 MB
+  The function has a 3-second timeout (default) but the database
+  connection takes 4+ seconds on cold start. The function never
+  reaches the actual business logic.
+  Fix options:
+  1. Increase timeout: aws lambda update-function-configuration \
+       --function-name process-order --timeout 30
+  2. Optimize: connection pooling, move DB connection outside handler
+  3. Use RDS Proxy for faster connection establishment
+  But setting timeout too high is also a problem — if the function
+  hangs, you pay for the full duration (max 15 minutes).
+  Task: Explain Lambda timeout debugging. Write: what the timeout
+  error means, how to read the REPORT line (Duration, Billed Duration,
+  Memory), common timeout causes (DB connections, API calls, large
+  payloads), how to set appropriate timeouts, and the relationship
+  between Lambda timeout and API Gateway timeout (29 seconds by default).
+assertions:
+  - type: llm_judge
+    criteria: "Timeout mechanics are explained: Lambda timeout is configurable (default 3 seconds, max 15 minutes). When exceeded, Lambda kills the execution and returns a Task timed out error. The REPORT line shows: Duration (actual execution time), Billed Duration (Duration rounded up to the next 1 ms), Memory Size (configured), Max Memory Used (peak). Billed for full timeout duration if it times out. CloudWatch Logs show what happened before timeout"
+    weight: 0.35
+    description: "Timeout mechanics"
+  - type: llm_judge
+    criteria: "Common causes and fixes are covered: database connections on cold start (use connection pooling, RDS Proxy, initialize outside handler for reuse), external API calls without timeout (always set HTTP timeout shorter than Lambda timeout), large S3 file processing (stream instead of loading entirely into memory), cold start initialization (heavy imports, large packages). Set Lambda timeout slightly higher than expected max duration. API Gateway has a 29-second default integration timeout (it can be raised only for Regional and private REST APIs), so Lambda behind API Gateway should normally complete within 29s"
+    weight: 0.35
+    description: "Causes and fixes"
+  - type: llm_judge
+    criteria: "Best practices are practical — set timeout based on p99 duration + buffer (not arbitrary large values). Monitor with CloudWatch Duration metric. Alert when duration approaches timeout. Use X-Ray to identify slow segments. Move initialization code outside handler (runs once per cold start, reused on warm invocations). Set HTTP client timeouts shorter than Lambda timeout to get a proper error instead of Lambda timeout. For long-running tasks: use Step Functions, SQS, or async invocation instead of increasing Lambda timeout"
+    weight: 0.30
+    description: "Best practices"
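
The advice to keep HTTP client timeouts shorter than the Lambda timeout can be sketched as a small budget calculation. The Lambda context object exposes `get_remaining_time_in_millis()`; the helper name, the 500 ms margin, the 10-second ceiling, and the `FakeContext` class are assumptions made for this illustration.

```python
def http_timeout_seconds(context, margin_ms=500, ceiling_s=10.0):
    """Return a client timeout that expires before the Lambda timeout does.

    margin_ms is time reserved for logging and returning a proper error.
    ceiling_s caps the wait even when the function has plenty of time left.
    """
    remaining_ms = context.get_remaining_time_in_millis()
    budget_s = max(0.0, (remaining_ms - margin_ms) / 1000.0)
    return min(budget_s, ceiling_s)


class FakeContext:
    """Local stand-in for the Lambda context object (illustration only)."""

    def __init__(self, remaining_ms):
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        return self._remaining_ms


# With the default 3-second timeout and nothing spent yet:
print(http_timeout_seconds(FakeContext(3000)))  # 2.5
```

Passing the returned value as the timeout of the database or HTTP call lets the function log a specific error for the slow dependency instead of ending with a generic "Task timed out" message.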

package/courses/aws-lambda-debugging/scenarios/level-1/memory-and-oom.yaml ADDED

@@ -0,0 +1,70 @@
+meta:
+  id: memory-and-oom
+  level: 1
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Lambda memory issues — diagnose out-of-memory errors, understand memory-CPU relationship, and right-size Lambda functions"
+  tags: [AWS, Lambda, memory, OOM, right-sizing, configuration, beginner]
+state: {}
+trigger: |
+  Your Lambda function fails intermittently with large payloads:
+  {
+    "errorType": "Runtime.ExitError",
+    "errorMessage": "RequestId: abc-123 Error: Runtime exited with
+    error: signal: killed"
+  }
+  CloudWatch REPORT line:
+  REPORT RequestId: abc-123 Duration: 4521.00 ms
+  Billed Duration: 4521 ms Memory Size: 128 MB
+  Max Memory Used: 128 MB
+  Max Memory Used equals Memory Size — the function ran out of memory
+  and was killed by the OOM killer. "signal: killed" = SIGKILL from
+  the system.
+  The function processes CSV files from S3. Small files work (< 5MB),
+  large files fail because loading the entire file into memory exceeds
+  128MB.
+  Fix options:
+  1. Increase memory:
+     aws lambda update-function-configuration \
+       --function-name process-csv --memory-size 512
+  2. Stream processing — read the file in chunks instead of loading
+     it entirely into memory.
+  Important: Lambda memory also controls CPU allocation!
+  - 128 MB  = ~0.07 vCPU (very slow)
+  - 512 MB  = ~0.3 vCPU
+  - 1769 MB = 1 vCPU
+  - 3538 MB = 2 vCPU
+  - 10240 MB = 6 vCPU (maximum)
+  A function at 128MB runs on a fraction of a CPU. Increasing memory
+  to 256MB doubles the per-ms cost, so it DECREASES total cost only if
+  the function then runs more than 2x faster (e.g. 3x with more CPU).
+  Task: Explain Lambda memory debugging. Write: how to identify OOM
+  errors (signal: killed, Max Memory Used = Memory Size), the
+  memory-CPU relationship, how to right-size memory (AWS Lambda
+  Power Tuning), common memory-hungry operations (loading files,
+  large JSON parsing), and cost optimization through memory tuning.
+assertions:
+  - type: llm_judge
+    criteria: "OOM identification is explained — signal: killed or Runtime.ExitError with killed means OOM. Check REPORT line: if Max Memory Used equals or is very close to Memory Size, function ran out of memory. Memory includes: your code, runtime, libraries, and all variables/data in memory. /tmp directory has separate 512MB-10GB allocation (configurable). Lambda memory range: 128MB to 10,240MB in 1MB increments"
+    weight: 0.35
+    description: "OOM identification"
+  - type: llm_judge
+    criteria: "Memory-CPU relationship is covered — Lambda allocates CPU proportional to memory. 1,769MB = 1 full vCPU. At 128MB you get ~7% of a vCPU — CPU-bound tasks are extremely slow. This means: increasing memory for a CPU-bound function makes it faster AND can make it cheaper (shorter duration offsets higher per-ms cost). Use AWS Lambda Power Tuning tool to find optimal memory: tests your function at different memory sizes and plots cost vs duration. Memory is the ONLY performance lever for Lambda — no separate CPU setting"
+    weight: 0.35
+    description: "Memory-CPU relationship"
+  - type: llm_judge
+    criteria: "Right-sizing is practical — use Lambda Power Tuning (open-source Step Functions tool): runs your function at multiple memory configs, shows cost and duration at each. Look for the 'sweet spot' where increasing memory stops improving duration. Common memory-hungry operations: loading entire files into memory (stream instead), parsing large JSON (use streaming JSON parser), image processing (needs 512MB+). Monitor Max Memory Used metric over time. Set memory to ~1.5x the typical Max Memory Used for headroom. /tmp for large file processing (up to 10GB ephemeral storage)"
+    weight: 0.30
+    description: "Right-sizing"
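
The cost trade-off in this scenario is simple arithmetic: compute cost is memory (GB) times duration (seconds) times a per-GB-second price. The sketch below assumes a price of 0.0000166667 USD per GB-second and ignores the per-request charge; both the constant and the function name are illustrative, so check current pricing for your region and architecture.

```python
# Assumed on-demand price per GB-second (illustrative, not authoritative).
ASSUMED_PRICE_PER_GB_SECOND = 0.0000166667


def invocation_cost(memory_mb, duration_ms,
                    price_per_gb_second=ASSUMED_PRICE_PER_GB_SECOND):
    """Compute cost of one invocation, ignoring the per-request charge."""
    gigabytes = memory_mb / 1024
    seconds = duration_ms / 1000
    return gigabytes * seconds * price_per_gb_second


# Baseline taken from the REPORT line in the scenario: 128 MB, 4521 ms.
baseline = invocation_cost(128, 4521)
# Doubling memory breaks even only if the duration halves.
break_even = invocation_cost(256, 4521 / 2)
# If the function runs 3x faster at 256 MB, it costs two thirds of the baseline.
three_times_faster = invocation_cost(256, 4521 / 3)

print(f"128 MB baseline:     {baseline:.10f} USD")
print(f"256 MB at 2x speed:  {break_even:.10f} USD")
print(f"256 MB at 3x speed:  {three_times_faster:.10f} USD")
```

The same function can be applied to each memory size measured by a tool such as AWS Lambda Power Tuning to find where extra memory stops paying for itself.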

package/courses/aws-lambda-debugging/scenarios/level-2/async-invocation-failures.yaml ADDED

@@ -0,0 +1,72 @@
+meta:
+  id: async-invocation-failures
+  level: 2
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Lambda async invocation failures — diagnose retry behavior, dead-letter queue issues, and Lambda Destination configuration"
+  tags: [AWS, Lambda, async, DLQ, destinations, retry, intermediate]
+state: {}
+trigger: |
+  Your S3-triggered Lambda function processes uploaded images. Some
+  images fail processing but you have no idea which ones or why:
+  $ aws cloudwatch get-metric-statistics --namespace AWS/Lambda \
+    --metric-name Errors --dimensions Name=FunctionName,Value=process-image
+  Errors: 45 in the last hour
+  45 errors but no visibility into which uploads failed. The function
+  is invoked asynchronously (S3 event) — errors don't go back to a
+  caller. Where do they go?
+  Async invocation error flow:
+  1. First attempt: function fails
+  2. Lambda waits 1 minute, retries (attempt 2)
+  3. Lambda waits 2 minutes, retries (attempt 3)
+  4. All 3 attempts failed → event is discarded (lost!)
+  Without a DLQ or destination, failed events simply disappear.
+  Fix 1 — Dead Letter Queue:
+  $ aws lambda update-function-configuration \
+    --function-name process-image \
+    --dead-letter-config TargetArn=arn:aws:sqs:...:image-dlq
+  Now failed events go to SQS after all retries exhausted. But the
+  DLQ only contains the original event, not the error details.
+  Fix 2 — Lambda Destinations (better):
+  $ aws lambda put-function-event-invoke-config \
+    --function-name process-image \
+    --destination-config '{
+      "OnSuccess": {"Destination": "arn:aws:sqs:...:success-queue"},
+      "OnFailure": {"Destination": "arn:aws:sqs:...:failure-queue"}
+    }'
+  Destinations include: the original event, the function response or
+  error, request/response metadata. Much more useful than DLQ.
+  Additional issue: MaximumRetryAttempts set to 0 means no retries.
+  But this also means transient errors (temporary network glitch)
+  are not retried. Tradeoff: retries help with transient failures
+  but tripling processing for permanent errors wastes compute.
+  Task: Explain async invocation debugging. Write: the retry behavior
+  (timing, attempt count), DLQ vs Lambda Destinations (comparison),
+  configuring retry limits and maximum event age, monitoring async
+  failures, and designing robust async processing pipelines.
+assertions:
+  - type: llm_judge
+    criteria: "Async retry behavior is explained — async invocation: Lambda queues event internally, returns 202 immediately. On failure: retries 2 more times (3 total). Delay: ~1 minute between first and second, ~2 minutes between second and third. After all retries: event sent to DLQ/destination or discarded. MaximumRetryAttempts: configure 0-2 retries. MaximumEventAgeInSeconds: discard events older than X seconds (60-21600). Important: retries mean the function must be idempotent — same event may be processed multiple times"
+    weight: 0.35
+    description: "Retry behavior"
+  - type: llm_judge
+    criteria: "DLQ vs Destinations are compared — DLQ: only captures original event on failure, configured on function, supports SQS and SNS only. Lambda Destinations: captures original event + function response/error + metadata, supports SQS/SNS/Lambda/EventBridge, can handle both success and failure. Destinations are preferred (more information, more flexible). DLQ limitation: doesn't tell you WHY it failed. Destination OnFailure: includes error message, stack trace, request context. Use destinations for new functions, DLQ still works for existing setups"
+    weight: 0.35
+    description: "DLQ vs Destinations"
+  - type: llm_judge
+    criteria: "Monitoring and design are practical — monitor: AsyncEventsDropped metric (events lost without DLQ/destination), DeadLetterErrors (failed to send to DLQ). Alert on both! Design pattern: S3 event → Lambda → on failure: SQS DLQ → monitoring Lambda that alerts and logs details. Make functions idempotent: use a deduplication table to prevent reprocessing. For critical events: record event in DynamoDB before processing, mark as processed when done, sweep unprocessed events periodically as backup. Test failure scenarios: deliberately throw errors to verify DLQ/destination works"
+    weight: 0.30
+    description: "Monitoring and design"
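
Because async retries can deliver the same event up to three times, the scenario calls for idempotent handlers backed by a deduplication table. A minimal in-memory sketch follows; the `process_once` name and the dict used as the table are assumptions, and a production version would need an atomic conditional write in a durable store rather than a local dict.

```python
def process_once(event_id, handler, processed):
    """Run handler() at most once per event_id that completes successfully.

    processed maps event_id to the stored result. If handler() raises,
    nothing is recorded, so a later retry of the same event runs it again.
    """
    if event_id in processed:
        # Duplicate delivery: return the earlier result without side effects.
        return processed[event_id]
    result = handler()
    processed[event_id] = result
    return result
```

For an S3-triggered function, a natural `event_id` would combine bucket, key, and object version or sequencer, since that identifies the upload rather than the delivery attempt.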

package/courses/aws-lambda-debugging/scenarios/level-2/cold-start-optimization.yaml ADDED

@@ -0,0 +1,76 @@
+meta:
+  id: cold-start-optimization
+  level: 2
+  course: aws-lambda-debugging
+  type: output
+  description: "Optimize Lambda cold starts — implement provisioned concurrency, SnapStart, and package optimization to reduce initialization latency"
+  tags: [AWS, Lambda, cold-start, provisioned-concurrency, SnapStart, optimization, intermediate]
+state: {}
+trigger: |
+  Your Java Lambda function has cold starts of 8-12 seconds. Users
+  hitting the API after an idle period wait 10+ seconds for the first
+  response. This is unacceptable for a user-facing API.
+  CloudWatch REPORT for cold start:
+  REPORT RequestId: abc-123 Duration: 450.00 ms
+  Billed Duration: 8950 ms Memory Size: 1024 MB
+  Max Memory Used: 256 MB Init Duration: 8500.00 ms
+  Init Duration: 8.5 seconds! The handler only takes 450ms.
+  Breakdown of cold start time:
+  - Download code package: 500ms (50MB jar)
+  - Start JVM: 3,000ms
+  - Class loading & dependency injection: 4,500ms
+  - First database connection: 500ms
+  Optimization approaches:
+  1. SnapStart (Java 11+, .NET 8):
+     Lambda snapshots the initialized function when a version is published.
+     Subsequent cold starts restore from snapshot instead of
+     re-initializing. Reduces cold start from 8.5s to ~200ms.
+     $ aws lambda update-function-configuration \
+       --function-name my-func \
+       --snap-start ApplyOn=PublishedVersions
+     Caveat: must use published versions (not $LATEST).
+  2. Provisioned Concurrency:
+     Pre-initialize N execution environments. Zero cold starts
+     for up to N concurrent requests.
+     $ aws lambda put-provisioned-concurrency-config \
+       --function-name my-func --qualifier prod \
+       --provisioned-concurrent-executions 10
+     Cost: ~$15/month per provisioned environment (always running).
+  3. Package optimization:
+     Reduce jar from 50MB to 15MB (remove unused dependencies).
+     Use GraalVM native image for sub-second JVM start.
+     Use lighter frameworks (Micronaut/Quarkus instead of Spring).
+  4. Warm-up pings:
+     An EventBridge (formerly CloudWatch Events) rule invokes the function every 5 minutes.
+     Keeps one environment warm but doesn't prevent cold starts
+     during scaling.
+  Task: Explain cold start optimization. Write: SnapStart (how it
+  works, requirements, limitations), provisioned concurrency (cost,
+  auto-scaling, when to use), package optimization techniques,
+  runtime-specific advice (Java vs Node.js vs Python), and how to
+  measure cold start impact on users.
+assertions:
+  - type: llm_judge
+    criteria: "SnapStart and provisioned concurrency are explained — SnapStart: snapshots initialized function at publish time, restores on cold start (~200ms instead of seconds). Works with Java 11+ (Corretto), .NET 8. Must publish versions. Cannot combine with provisioned concurrency. Limitations: uniqueness (random/UUID generated during init may be duplicated). Provisioned concurrency: pre-initializes environments, eliminates cold starts entirely for configured capacity. Cost: hourly charge per provisioned unit. Auto-scaling provisioned concurrency via Application Auto Scaling for variable traffic"
+    weight: 0.35
+    description: "SnapStart and provisioned"
+  - type: llm_judge
+    criteria: "Package and runtime optimization are covered — reduce package size: remove unused dependencies, use tree shaking/bundlers. Java: use Micronaut/Quarkus (designed for Lambda), GraalVM native image (sub-100ms starts), avoid Spring Boot (heavy reflection/classpath scanning). Node.js: bundle with esbuild, lazy-load heavy modules. Python: avoid large packages (pandas, numpy) unless needed, use Lambda-optimized versions. General: initialize SDK clients and DB connections outside handler (reused on warm starts). Measure Init Duration in REPORT line"
+    weight: 0.35
+    description: "Package and runtime"
+  - type: llm_judge
+    criteria: "Measurement and decision-making are practical. Measure cold start impact: percentage of invocations that are cold starts (check Init Duration presence in CloudWatch Logs Insights). Query: filter @type = 'REPORT' | stats count() as total, count(@initDuration) as coldStarts. Calculate: if 5% of requests are cold starts with 5s delay, average latency impact is 250ms. Decision: if cold starts affect p99 latency for user-facing APIs → use SnapStart or provisioned concurrency. If async processing (SQS, events) → cold starts usually don't matter. Cost-benefit: provisioned concurrency costs roughly $15/month per unit (varies with memory size and region); compare that to the user impact of cold starts"
+    weight: 0.30
+    description: "Measurement"
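
The cold start measurement described above (share of REPORT lines that carry an Init Duration, and the resulting average added latency) can be sketched offline against exported log lines. The function name and the returned dictionary keys are assumptions for this example.

```python
import re

# Init Duration appears in a REPORT line only when the invocation was a cold start.
INIT_RE = re.compile(r"Init Duration: ([\d.]+) ms")


def cold_start_stats(log_lines):
    """Summarize cold starts from an iterable of Lambda log lines."""
    total = 0
    init_durations = []
    for line in log_lines:
        if not line.startswith("REPORT"):
            continue  # skip START, END, and application log lines
        total += 1
        match = INIT_RE.search(line)
        if match:
            init_durations.append(float(match.group(1)))
    cold = len(init_durations)
    return {
        "total": total,
        "cold_starts": cold,
        "cold_fraction": cold / total if total else 0.0,
        # Init time averaged over ALL invocations, warm ones included.
        "mean_added_ms": sum(init_durations) / total if total else 0.0,
    }
```

With one 5,000 ms cold start among 20 invocations, this yields a cold fraction of 0.05 and a mean added latency of 250 ms, matching the arithmetic in the scenario. Averages hide the tail, so p99 latency remains the better signal for user-facing APIs.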

package/courses/aws-lambda-debugging/scenarios/level-2/dynamodb-streams-debugging.yaml ADDED

@@ -0,0 +1,70 @@
+meta:
+  id: dynamodb-streams-debugging
+  level: 2
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Lambda DynamoDB Streams integration — diagnose stream processing failures, iterator exhaustion, and event ordering issues"
+  tags: [AWS, Lambda, DynamoDB, Streams, event-source, iterator, intermediate]
+state: {}
+trigger: |
+  Your Lambda function processes DynamoDB Streams events (change data
+  capture) to sync order data to Elasticsearch. It stopped processing:
+  $ aws lambda list-event-source-mappings \
+    --function-name sync-to-es --query "EventSourceMappings[].State"
+  ["Enabled"]
+  The mapping is enabled but CloudWatch shows no invocations. Checking
+  the event source mapping details:
+  $ aws lambda get-event-source-mapping --uuid abc-123
+  {
+    "State": "Enabled",
+    "StateTransitionReason": "USER_INITIATED",
+    "LastProcessingResult": "PROBLEM: Function call failed"
+  }
+  LastProcessingResult: "Function call failed" — the function threw
+  an unhandled error. For DynamoDB Streams (and Kinesis), Lambda
+  retries the SAME batch until it succeeds or the record expires
+  (24 hours by default). One bad record blocks the entire shard!
+  CloudWatch Logs (from the failing invocation):
+  Processing INSERT event for order ORD-999
+  Error: Cannot read property 'S' of undefined
+  — A DynamoDB record had an unexpected schema (missing field)
+  The function was stuck retrying the same bad record for 6 hours,
+  blocking all subsequent stream records.
+  Fix — configure error handling:
+  $ aws lambda update-event-source-mapping --uuid abc-123 \
+    --maximum-retry-attempts 3 \
+    --bisect-batch-on-function-error \
+    --destination-config '{"OnFailure":{"Destination":"arn:aws:sqs:..."}}'
+  - maximum-retry-attempts: stop retrying after 3 attempts
+  - bisect-batch-on-function-error: split batch to isolate bad record
+  - destination on failure: send details of the failed batch (stream position metadata, not the record data) to SQS for investigation
+  Task: Explain DynamoDB Streams + Lambda debugging. Write: how
+  stream processing works (shards, iterators, ordering), the blocking
+  behavior on errors (entire shard stops), error handling configuration
+  (retry limits, bisect, on-failure destination), common stream
+  processing patterns, and monitoring stream consumer lag.
+assertions:
+  - type: llm_judge
+    criteria: "Stream processing mechanics are explained: DynamoDB Streams capture INSERT, MODIFY, and REMOVE events, in order for each item. Lambda polls stream shards via event source mapping. Records include OldImage and/or NewImage (depending on StreamViewType). Processing is ordered within a shard, one batch at a time. If Lambda fails, it retries the SAME batch (blocks shard). Different from SQS: stream records must be processed in order, so failed records block subsequent records"
+    weight: 0.35
+    description: "Stream mechanics"
+  - type: llm_judge
+    criteria: "Error handling and blocking are covered. Default behavior: retry until the record expires (24h), which blocks shard processing. Configure: MaximumRetryAttempts (limit retries), BisectBatchOnFunctionError (split batch to isolate bad record), MaximumRecordAgeInSeconds (skip old records), DestinationConfig OnFailure (send details of the failed batch to SQS/SNS for investigation). Without these settings, one bad record stops ALL processing on that shard. Always configure error handling for stream consumers"
+    weight: 0.35
+    description: "Error handling"
+  - type: llm_judge
+    criteria: "Monitoring and patterns are practical — monitor: IteratorAge metric (time between record written and processed — high value means lag). Alert when IteratorAge exceeds acceptable threshold (e.g., 5 minutes). Check: LastProcessingResult in event source mapping for error state. Common patterns: use DynamoDB Streams for CDC (change data capture), search index sync, cross-region replication, audit logging. Design Lambda for idempotency (records may be delivered more than once). Handle schema evolution gracefully (new/missing fields shouldn't crash the function)"
+    weight: 0.30
+    description: "Monitoring and patterns"
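
To see why bisecting a batch isolates a bad record, the idea can be simulated locally. This is a simplified model under stated assumptions, not Lambda's actual implementation: it ignores retry limits and record age, and the `bisect_process` name is made up for the example.

```python
def bisect_process(batch, handler):
    """Simulate bisect-on-error: return (succeeded, failed) record lists.

    The whole batch is attempted first. On any error it is split in half
    and each half is attempted again, until a failing batch holds a single
    record, which is then reported as failed.
    """
    try:
        for record in batch:
            handler(record)
        return list(batch), []
    except Exception:
        if len(batch) == 1:
            return [], list(batch)  # the bad record is now isolated
        mid = len(batch) // 2
        ok_left, bad_left = bisect_process(batch[:mid], handler)
        ok_right, bad_right = bisect_process(batch[mid:], handler)
        return ok_left + ok_right, bad_left + bad_right
```

Records that precede the bad one are handled more than once while the batch is being split, which is one reason the scenario asks for idempotent stream consumers.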

package/courses/aws-lambda-debugging/scenarios/level-2/intermediate-debugging-shift.yaml ADDED

@@ -0,0 +1,71 @@
+meta:
+  id: intermediate-debugging-shift
+  level: 2
+  course: aws-lambda-debugging
+  type: output
+  description: "Combined intermediate debugging shift — diagnose a serverless application with event source, VPC, layer, and concurrency issues simultaneously"
+  tags: [AWS, Lambda, troubleshooting, combined, shift-simulation, intermediate]
+state: {}
+trigger: |
+  You're troubleshooting a serverless order processing system with
+  4 Lambda functions. The system "partially works":
+  Architecture:
+  API Gateway → order-api (Lambda)
+  order-api → SQS queue → process-order (Lambda)
+  process-order → DynamoDB + payment-service (Lambda)
+  payment-service → Stripe API (external)
+  DynamoDB Stream → sync-to-search (Lambda) → OpenSearch
+  Current issues:
+  1. order-api: intermittent 504 errors from API Gateway.
+     REPORT shows Init Duration: 4200ms for cold starts.
+     The function is in a VPC (for RDS access) and uses a heavy
+     Python layer with pandas (unnecessary for this API).
+     Fix: remove pandas layer, add VPC endpoint for DynamoDB,
+     consider provisioned concurrency.
+  2. process-order: messages accumulate in SQS, not being processed.
+     Event source mapping shows LastProcessingResult: "OK" but
+     BatchSize is 1 and function processes one message per second.
+     With 500 messages/minute incoming, it can't keep up.
+     Fix: increase batch size to 10, enable batch window of 5s.
+  3. payment-service: throws errors but retries succeed.
+     The function calls Stripe API with no timeout configured.
+     Occasionally Stripe takes 15+ seconds, exceeding the Lambda's
+     10-second timeout. On retry (async), it succeeds → duplicate
+     payment! No idempotency key used.
+     Fix: set HTTP timeout < Lambda timeout, use idempotency key.
+  4. sync-to-search: completely stuck, IteratorAge growing.
+     One malformed DynamoDB record causes the function to crash,
+     blocking the entire shard. No error handling configured.
+     Fix: configure MaximumRetryAttempts, BisectBatchOnFunctionError,
+     OnFailure destination.
+  5. Account-wide: ConcurrentExecutions at 850/1000. No reserved
+     concurrency configured. A batch job running at the same time
+     could push the account to the limit.
+     Fix: set reserved concurrency for critical functions.
+  Task: Walk through diagnosing all issues. Write: the triage
+  approach for serverless architectures, the fix for each issue,
+  idempotency patterns, and operational best practices.
+assertions:
+  - type: llm_judge
+    criteria: "All five issues are diagnosed and fixed — (1) cold start: remove unnecessary layer, add VPC endpoints, use provisioned concurrency for API-facing function. (2) SQS throughput: increase batch size and batch window for higher processing rate. (3) duplicate payments: set HTTP client timeout < Lambda timeout, use idempotency keys in Stripe API calls. (4) stream blocking: configure retry limits, bisect on error, failure destination for DynamoDB Streams event source. (5) concurrency: set reserved concurrency for critical functions to prevent resource starvation"
+    weight: 0.35
+    description: "All issues fixed"
+  - type: llm_judge
+    criteria: "Idempotency and error handling are covered — idempotency: use unique transaction IDs for payment operations (Stripe supports Idempotency-Key header). Store processing status in DynamoDB before and after processing. Design all Lambda functions to be safely re-invokable. Error handling: different for each invocation pattern — sync (return error to caller), async (retry + DLQ), stream (retry + bisect + destination). Always configure error handling for event sources — defaults are not production-ready"
+    weight: 0.35
+    description: "Idempotency and errors"
+  - type: llm_judge
+    criteria: "Serverless triage approach is systematic — check CloudWatch metrics first: Errors, Throttles, Duration, ConcurrentExecutions. For event sources: check event source mapping state and LastProcessingResult. For streams: check IteratorAge (growing = falling behind). For SQS: check ApproximateAgeOfOldestMessage (lag indicator). Work through the data flow: API → queue → processing → database → stream → search. Fix upstream issues first (they cascade). Test each fix independently. Monitor for 30+ minutes after fixes to confirm stability"
+    weight: 0.30
+    description: "Triage approach"
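
The duplicate-payment issue in this scenario hinges on sending the same idempotency key on every retry of the same order. One way to get that property is to derive the key deterministically from the order rather than generating a random value per attempt. The sketch below only builds the header; the function names and the key format are assumptions, and no payment API is called.

```python
import hashlib


def idempotency_key(order_id, operation="charge"):
    """Derive a stable key so retries of the same operation reuse it."""
    raw = f"{operation}:{order_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def build_payment_headers(order_id):
    """Headers for the payment request; the Idempotency-Key header carries the key."""
    return {"Idempotency-Key": idempotency_key(order_id)}
```

A random key generated inside the handler would change on every retry and defeat the purpose, because the payment provider would see each attempt as a new request.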

package/courses/aws-lambda-debugging/scenarios/level-2/lambda-concurrency-management.yaml ADDED

@@ -0,0 +1,70 @@
+meta:
+  id: lambda-concurrency-management
+  level: 2
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Lambda concurrency issues — diagnose throttling, configure reserved and provisioned concurrency, and protect downstream services"
+  tags: [AWS, Lambda, concurrency, throttling, reserved, scaling, intermediate]
+state: {}
+trigger: |
+  A marketing campaign sends a blast email with a link to your API.
+  Traffic spikes 10x. Multiple functions start throttling:
+  CloudWatch Metrics (account-wide):
+  ConcurrentExecutions: 1,000 (at limit!)
+  Throttles (order-api): 5,200
+  Throttles (user-api): 1,800
+  Throttles (analytics): 450
+  All functions compete for the same 1,000 account concurrency pool.
+  The analytics function (low priority) consumed 600 concurrent
+  executions processing a backlog, leaving only 400 for critical
+  order and user APIs.
+  Fix 1 — Reserved concurrency (free):
+  $ aws lambda put-function-concurrency \
+    --function-name order-api \
+    --reserved-concurrent-executions 300
+  $ aws lambda put-function-concurrency \
+    --function-name user-api \
+    --reserved-concurrent-executions 200
+  $ aws lambda put-function-concurrency \
+    --function-name analytics \
+    --reserved-concurrent-executions 50
+  Now order-api is guaranteed 300 concurrent executions. But:
+  reserved concurrency is also a MAXIMUM — order-api can never
+  exceed 300 even if capacity is available.
+  Unreserved pool: 1000 - 300 - 200 - 50 = 450 for all other functions.
+  Fix 2 — Downstream protection:
+  Your RDS database has a max_connections of 100. If 300 Lambda
+  instances each open a connection, the database is overwhelmed:
+  "too many connections" error.
+  Solution: RDS Proxy pools database connections. Lambda instances
+  connect to RDS Proxy, which maintains a connection pool to the
+  database.
+  Task: Explain Lambda concurrency management. Write: account vs
+  function concurrency limits, reserved concurrency (guarantees AND
+  limits), provisioned concurrency (pre-warmed), protecting downstream
+  services (RDS Proxy, SQS buffering), scaling behavior, and how to
+  request limit increases.
+assertions:
+  - type: llm_judge
+    criteria: "Concurrency model is explained — account default: 1,000 concurrent executions (soft limit, can increase). All functions share the pool unless reserved concurrency is set. Reserved concurrency: guarantees AND limits a function's concurrent executions (free, no additional cost). Unreserved pool: total limit minus all reserved concurrency — must leave at least 100 unreserved. Provisioned concurrency: pre-initialized environments for instant response (charged per provisioned unit per hour). Scaling: each function can add up to 1,000 concurrent executions every 10 seconds until the account limit is reached"
+    weight: 0.35
+    description: "Concurrency model"
+  - type: llm_judge
+    criteria: "Downstream protection is covered — Lambda can scale faster than downstream services. Database: max_connections limit — use RDS Proxy (connection pooling), or limit Lambda concurrency. APIs: use reserved concurrency to match API rate limits. SQS: use as buffer between Lambda and slow services. Pattern: high-concurrency Lambda → SQS → low-concurrency Lambda → database. ElastiCache: use as cache to reduce database load. Always consider downstream capacity when setting Lambda concurrency"
+    weight: 0.35
+    description: "Downstream protection"
+  - type: llm_judge
+    criteria: "Scaling and limits are practical — request limit increase via AWS Support (usually approved within days). Monitor: ConcurrentExecutions, Throttles, UnreservedAccountConcurrency metrics. Alert on throttles (any throttle = degraded service). Scaling behavior: each function scales by up to 1,000 concurrent executions every 10 seconds until the account limit is reached (the older regional burst quotas followed by 500 additional per minute no longer apply). Event source specific: SQS scales up to 1000 batches/minute, Kinesis limited by shard count. Plan capacity: expected peak concurrent executions = (invocations per second) × (average duration in seconds)"
+    weight: 0.30
+    description: "Scaling and limits"
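
The arithmetic in the scenario above (the unreserved pool and the capacity-planning formula) can be sketched in a few lines of Python. This is a minimal illustration, not part of the package: the function names `unreserved_pool` and `peak_concurrency` are hypothetical, the 100-execution floor is taken from the first assertion, and the 400 requests/second at 0.25 s figures are made-up inputs.

```python
import math

ACCOUNT_LIMIT = 1000   # default account concurrency used in the scenario
MIN_UNRESERVED = 100   # floor stated in the scenario's first assertion


def unreserved_pool(account_limit, reserved):
    """Return the concurrency left for functions without a reservation."""
    remaining = account_limit - sum(reserved.values())
    if remaining < MIN_UNRESERVED:
        raise ValueError(
            f"only {remaining} unreserved; at least {MIN_UNRESERVED} required"
        )
    return remaining


def peak_concurrency(requests_per_second, avg_duration_seconds):
    """Estimate concurrent executions as arrival rate times average duration."""
    return math.ceil(requests_per_second * avg_duration_seconds)


# Reservations from the scenario's "Fix 1".
reserved = {"order-api": 300, "user-api": 200, "analytics": 50}
print(unreserved_pool(ACCOUNT_LIMIT, reserved))  # 450, matching the scenario
print(peak_concurrency(400, 0.25))               # 100 (hypothetical workload)
```

Comparing the estimated peak against a function's reservation gives a quick check on whether a planned `--reserved-concurrent-executions` value would itself cause throttling.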

package/courses/aws-lambda-debugging/scenarios/level-2/lambda-layers-debugging.yaml ADDED

@@ -0,0 +1,76 @@
+meta:
+  id: lambda-layers-debugging
+  level: 2
+  course: aws-lambda-debugging
+  type: output
+  description: "Debug Lambda Layer issues — diagnose layer version conflicts, incorrect directory structure, and dependency resolution problems"
+  tags: [AWS, Lambda, layers, dependencies, version-conflicts, intermediate]
+state: {}
+trigger: |
+  Your Lambda function fails after adding a new layer:
+  {
+    "errorType": "Runtime.ImportModuleError",
+    "errorMessage": "Unable to import module 'handler': No module
+    named 'pandas'"
+  }
+  But pandas IS in your layer! Checking the layer:
+  $ aws lambda get-layer-version --layer-name data-layer --version-number 3 \
+    --query "Content.Location" --output text | xargs curl -o layer.zip
+  $ unzip -l layer.zip
+  Archive: layer.zip
+    Length  Name
+    ------  ----
+           pandas/
+           pandas/__init__.py
+           ...
+  Problem: wrong directory structure! Lambda expects Python
+  dependencies in specific paths:
+  - python/
+  - python/lib/python3.x/site-packages/
+  The layer has pandas/ at the root instead of python/pandas/.
+  Lambda can't find it.
+  Correct structure:
+  $ mkdir -p python/lib/python3.12/site-packages/
+  $ pip install pandas -t python/lib/python3.12/site-packages/
+  $ zip -r layer.zip python/
+  After fixing, a new error:
+  "numpy.core._multiarray_umath: undefined symbol: PyFloat_Type"
+  The numpy in the layer was compiled for Python 3.11 but the
+  function uses Python 3.12. Native extensions must match the
+  exact Python version.
+  Another issue — layer order matters:
+  Function has layers: [common-utils:v2, data-layer:v3]
+  Both layers have a different version of the 'requests' library.
+  Layer 2 (data-layer) overrides layer 1 (common-utils) because
+  layers are merged in order, with later layers taking precedence.
+  Task: Explain Lambda Layer debugging. Write: correct layer
+  directory structure for each runtime, version compatibility
+  (runtime, architecture), layer merge order and conflicts, the
+  250MB unzipped limit (function + all layers), and when to use
+  layers vs bundled dependencies.
+assertions:
+  - type: llm_judge
+    criteria: "Layer structure is explained — Python: python/ or python/lib/python3.x/site-packages/. Node.js: nodejs/node_modules/. Both: lib/ or bin/ for shared libraries/executables. Wrong directory structure is the #1 layer issue. Lambda adds layer paths to the runtime's module search path. Must match exactly or imports fail. Check with: unzip -l layer.zip to verify structure before publishing"
+    weight: 0.35
+    description: "Layer structure"
+  - type: llm_judge
+    criteria: "Compatibility and conflicts are covered — runtime version: native extensions (numpy, pandas, Pillow) must match the exact Python version. Build on matching Amazon Linux environment or use Docker. Architecture: layers must match function architecture (x86_64 vs arm64). Layer order: layers merged in order, later layers override earlier ones if they have the same file. Up to 5 layers per function. Total size: function code + all layers ≤ 250MB unzipped. Debugging: print sys.path (Python) or module.paths (Node.js) to see layer paths"
+    weight: 0.35
+    description: "Compatibility"
+  - type: llm_judge
+    criteria: "Usage guidance is practical — use layers for: shared code across multiple functions (utility libraries, SDK wrappers), large dependencies that don't change often (pandas, numpy), custom runtimes. Don't use layers for: function-specific code (bundle with function), frequently updated dependencies (layer versioning adds complexity). Alternative: use esbuild/webpack bundling for smaller packages. Container images eliminate layer complexity entirely. Layer versions are immutable — publish new version for updates, update all functions referencing the old version"
+    weight: 0.30
+    description: "Usage guidance"
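
The two layer pitfalls in the scenario above (wrong directory structure and merge order) can be checked offline before publishing. The sketch below is an illustration under stated assumptions, not part of the package: `layer_entries`, `misplaced_entries`, and `merge_layers` are hypothetical helpers, the check covers Python packages only (it would also flag the `lib/` and `bin/` paths the assertions list as valid for shared libraries and executables), and the merge is modelled as a simple "last writer wins" dictionary.

```python
import zipfile


def layer_entries(zip_path):
    """List the entry names in a layer archive (same data as `unzip -l`)."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()


def misplaced_entries(names):
    """Return entries outside python/, where the Python runtime will not look."""
    return [name for name in names if not name.startswith("python/")]


def merge_layers(layers):
    """Simulate layer merging: a later layer overwrites files from earlier ones."""
    merged = {}
    for layer_name, files in layers:
        for path in files:
            merged[path] = layer_name  # later layers take precedence
    return merged


# The broken archive from the scenario: pandas/ sits at the archive root.
print(misplaced_entries(["pandas/", "pandas/__init__.py"]))

# Both layers ship 'requests'; the layer listed last provides the file.
winner = merge_layers([
    ("common-utils:v2", ["python/requests/__init__.py"]),
    ("data-layer:v3", ["python/requests/__init__.py"]),
])
print(winner["python/requests/__init__.py"])  # data-layer:v3
```

Calling `misplaced_entries(layer_entries("layer.zip"))` on a real archive should return an empty list when every Python package sits under `python/`.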